When Reliable Software Goes Bad

The Software Reliability Paradox

In an earlier article, I described the Software Reliability Paradox:

As software reliability increases, so too does the degree of harm it has the potential to create.

Over time, users begin to take reliable software for granted.  They assume it will work.  They stop questioning it.  Eventually, they may come to believe that the software is infallible.  That's when the real danger sets in.

The Paradox in Action: Therac-25

From Wikipedia:

The Therac-25 was a computer-controlled radiation therapy machine. ... It was involved in at least six accidents between 1985 and 1987, in which patients were given massive overdoses of radiation. Because of concurrent programming errors (also known as race conditions), it sometimes gave its patients radiation doses that were hundreds of times greater than normal, resulting in death or serious injury.

More details from A Gift of Fire: Social, Legal, and Ethical Issues for Computing Technology (emphasis mine):

In the first overdose incident, when the patient told the machine operator that the machine had “burned” her, the operator told her that was impossible. This was one of many indications that the makers and some users of the Therac-25 were overconfident about the safety of the system. The most obvious and critical indication of overconfidence in the software was the decision to eliminate the hardware safety mechanisms. A safety analysis of the machine done by AECL years before the accidents suggests that they did not expect significant problems from software errors. In one case where a clinic added its own hardware safety features to the machine, AECL told them it was not necessary. (None of the accidents occurred at that facility.)

The hospitals using the machine assumed that it worked safely, an understandable assumption.

Combating the Paradox

The trouble with the Reliability Paradox is that it's not enough to reduce the potential for bugs.  In fact, the more reliable you make your software, the more likely it is that user confidence will harden into overconfidence.  To fight back, plan for failure where it will hurt the most:

  1. Identify the most critical areas of your code (where cost of failure is highest)
  2. Identify the most complex areas of your code (where users are least likely to notice failures)
  3. Areas of overlap will be most susceptible to the Reliability Paradox
  4. Build redundancy into the calculations for the most susceptible areas (see the sketch after this list)
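To make step 4 concrete, here is a minimal sketch in Python of one kind of redundant calculation: the accountant's trick of cross-footing, in which you total a grid of numbers by rows and again by columns, and refuse to proceed unless the two grand totals agree.  The function name, the sample data, and the fail-loudly error handling are all illustrative assumptions on my part, not code from the articles below.

```python
def grand_total(grid):
    """Grand total of a rectangular grid of numbers, with a built-in
    double-check: total by rows AND by columns, then compare."""
    by_rows = sum(sum(row) for row in grid)        # primary calculation
    by_cols = sum(sum(col) for col in zip(*grid))  # independent re-calculation
    if by_rows != by_cols:
        # Fail loudly: a mismatch means at least one of the two
        # calculations has a logic error, and someone needs to notice.
        raise AssertionError(f"Cross-foot mismatch: rows={by_rows}, cols={by_cols}")
    return by_rows

# Both paths agree here, so the total comes back normally.
assert grand_total([[1, 2, 3], [4, 5, 6]]) == 21
```

With integers or Decimal values, the two totals must match exactly; with floats, compare within a small tolerance instead, since adding the same numbers in a different order can produce tiny rounding differences.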

For practical tips and ideas on how to do this, refer to the articles below:

Reduce Logic Errors in Critical Code
Software developers can almost eliminate logic errors with this powerful technique.

5 Ways to Reduce Logic Errors Using Automated Double-Checks
Identify the critical functions in your application. Then, apply one or more of these techniques to ensure that if they break, someone will notice.

Defensive Programming
In this series of articles, learn how to design code to turn expensive, hard-to-fix errors into those that are cheaper and easier to fix.

Additional Reading

Killed By A Machine: The Therac-25
The Therac-25 was not a device anyone was happy to see. It was a radiation therapy machine. In layman’s terms it was a “cancer zapper”; a linear accelerator with a human as its target. Using …

Cover image created with Microsoft Designer