When Reliable Software Goes Bad
Back in the 1980s, misplaced user trust in the software of a radiation therapy machine led to at least six severe radiation overdoses, several of them fatal.
The Software Reliability Paradox
In an earlier article, I described the Software Reliability Paradox:
As software reliability increases, so too does the degree of harm it has the potential to create.
Over time, users begin to take reliable software for granted. They assume it will work. They stop questioning it. Eventually, they may come to believe that the software is infallible. That's when the real danger sets in.
The Paradox in Action: Therac-25
The Therac-25 was a computer-controlled radiation therapy machine. ... It was involved in at least six accidents between 1985 and 1987, in which patients were given massive overdoses of radiation. Because of concurrent programming errors (also known as race conditions), it sometimes gave its patients radiation doses that were hundreds of times greater than normal, resulting in death or serious injury.
More details from A Gift of Fire: Social, Legal, and Ethical Issues for Computing Technology (emphasis mine):
In the first overdose incident, when the patient told the machine operator that the machine had “burned” her, the operator told her that was impossible. This was one of many indications that the makers and some users of the Therac-25 were overconfident about the safety of the system. The most obvious and critical indication of overconfidence in the software was the decision to eliminate the hardware safety mechanisms. A safety analysis of the machine done by AECL years before the accidents suggests that they did not expect significant problems from software errors. In one case where a clinic added its own hardware safety features to the machine, AECL told them it was not necessary. (None of the accidents occurred at that facility.)
The hospitals using the machine assumed that it worked safely, an understandable assumption.
Combating the Paradox
The trouble with the Reliability Paradox is that reducing the potential for bugs isn't enough. In fact, the more reliable you make your software, the more likely it is that user confidence will harden into overconfidence. To combat the paradox:
- Identify the most critical areas of your code (where cost of failure is highest)
- Identify the most complex areas of your code (where users are least likely to notice failures)
- Areas of overlap will be most susceptible to the Reliability Paradox
- Build redundancy into the calculations for the most susceptible areas
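One way to build that redundancy is to compute a critical value through two independent code paths and refuse to proceed when they disagree. The sketch below is purely illustrative, not from the original article: the function names, the dose table, and the formula are made-up stand-ins for whatever critical calculation your system performs.

```python
# Illustrative sketch: redundant calculation with a cross-check.
# Two independently written implementations compute the same critical
# value; a disagreement beyond tolerance halts instead of failing silently.
# All names and numbers here are hypothetical.

def dose_from_table(energy_mev: float, seconds: float) -> float:
    """Primary path: dose from a (simplified, made-up) lookup table of Gy/min rates."""
    rate_table = {6.0: 1.2, 10.0: 2.0, 18.0: 3.6}
    return rate_table[energy_mev] * (seconds / 60.0)

def dose_from_formula(energy_mev: float, seconds: float) -> float:
    """Secondary path: the same quantity via an independent (made-up) linear model."""
    rate = 0.2 * energy_mev  # Gy/min
    return rate * (seconds / 60.0)

def checked_dose(energy_mev: float, seconds: float, tolerance: float = 0.01) -> float:
    """Return the dose only if both independent calculations agree."""
    a = dose_from_table(energy_mev, seconds)
    b = dose_from_formula(energy_mev, seconds)
    if abs(a - b) > tolerance * max(abs(a), abs(b)):
        # Fail loudly: in safety-critical code, a disagreement between
        # redundant paths should stop the operation, not pick a winner.
        raise RuntimeError(f"Redundant dose calculations disagree: {a} vs {b}")
    return a
```

The key design choice is that the two paths share no code: a bug (or race condition) in one is unlikely to produce the same wrong answer in the other, so the cross-check catches it before the result is acted on.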
For practical tips and ideas for how to do this, refer to the articles below:
Cover image created with Microsoft Designer