Commentary

The Software Reliability Paradox

The most reliable software holds the potential to cause the greatest harm. Examples abound, from my own $86K mistake to a devastating Russian hack.

Mike Wolfe

Sep 22, 2021

As software reliability increases, so too does the degree of harm it has the potential to create.

Let's break this down because I chose my words with great care. Note that I am not suggesting that more reliable software is more likely to create harm. In fact, the opposite is true. Reliable software is less likely to create harm than unreliable software.

And therein lies the rub.

Over time, users begin to take reliable software for granted. They assume it will work. They stop questioning it. Eventually, they may come to believe that the software is infallible. That's when the real danger sets in.

Users of the most reliable software will assume that any unusual results it produces must be an indication of their own faulty thinking.

Low Expectations Limit Consequences

The concept becomes clearer if we consider the alternative: unreliable software.

Imagine you are a doctoral student and you have to write a several-hundred-page thesis for your PhD. You have an eccentric advisor who insists that you use the open source LaTeX program she wrote. You download it from SourceForge (because where else would you find an ill-conceived hobby software project from the late '90s?). The software crashes three times before you complete a single page of text.

Are you going to trust your thesis to a single large file saved in the proprietary format of this unreliable software? Nope. You're going to save backup copies every ten minutes. You're going to keep all the content in a plain text file. You're going to acquire a manual typewriter from the Russian stockpile and forgo the digital world entirely.

You can't possibly lose all your work because you don't dare trust the software enough to take that chance.

A False Sense of Security

If monsters attacked my kids in the middle of the night, my kids would be woefully unprepared. And it would be my own fault.

I have four kids. They are 16, 12, 9, and 7. When they were younger, they were wary of monsters that might be lurking under their beds or in their closets. I used my special dad powers to ward off the beasts. Nights turned to weeks. Weeks turned to months. Months turned to years. Despite the occasional false alarm, the once-feared monsters never appeared.

Now I'm the one lying awake at night worrying about the monsters because–when they do attack–my kids will never see them coming. I systematically trained that fear out of them. I am a victim of my own success as a father.

FDR was wrong; a little bit of fear is a good thing.

My $86,000 Mistake

I learned this lesson the hard way in my professional career.

I wrote an Access application that calculated accrued interest on large ($1M+) financial securities for a community bank. While refactoring the code to deal with a corner case, I accidentally introduced a logic error into the accrual calculation for a subset of the securities.

There was a manual monthly reconciliation process that should have caught my mistake. But that process was a low priority, especially since it had been years since it had turned up any problems. Long story short, by the time someone manually double-checked my software's results, the bank had over-reported its interest income to the tune of $86,000.
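To make the idea of a reconciliation concrete, here is a minimal sketch of an automated version of that monthly check. It is written in Python rather than Access, and it is an illustration only, not the bank's actual process: the `Security` fields, the simple days/360 interest formula, and the one-cent tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Security:
    """A hypothetical fixed-rate security; field names are illustrative."""
    security_id: str
    principal: Decimal     # face amount in dollars
    annual_rate: Decimal   # e.g. Decimal("0.05") for 5%
    days_accrued: int      # days in the accrual period


def independent_accrual(sec: Security) -> Decimal:
    """Recompute accrued interest with a simple days/360 formula (assumed convention)."""
    interest = sec.principal * sec.annual_rate * sec.days_accrued / Decimal(360)
    return interest.quantize(Decimal("0.01"))  # round to whole cents


def reconcile(securities, reported, tolerance=Decimal("0.01")):
    """Return (security_id, reported, expected) for every accrual that disagrees.

    `reported` maps security_id to the figure produced by the main application.
    A missing figure counts as a mismatch.
    """
    mismatches = []
    for sec in securities:
        expected = independent_accrual(sec)
        got = reported.get(sec.security_id)
        if got is None or abs(got - expected) > tolerance:
            mismatches.append((sec.security_id, got, expected))
    return mismatches
```

The point of the sketch is that the second calculation is deliberately simple and independent of the main code path, so a refactoring mistake in one is unlikely to be mirrored in the other. A check like this only helps if someone actually runs it and reads the output every month, which is exactly the habit that reliable software erodes.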

I came home that night and told my wife that I might be out of a job. (... And yet here I am!)

Now, I know what you may be thinking. My software obviously wasn't that reliable if it contained a logic error that could lead to such a large mistake. My point, though, is that if my software had been less reliable prior to the big mistake, the client would have caught it sooner.

If my software had been less reliable, this section would be titled, "My $7,000 Mistake."

The SolarWinds Hack

This principle holds for large software projects, too.

Before it was hacked, SolarWinds' Orion was a well-respected tool used by managed service providers and cybersecurity firms to manage the devices on their (and their clients') networks. The Russian cyberattack targeted one of the most reliable and mundane features of the software: its automatic update process.

Microsoft, Cisco, and Intel–along with the U.S. Departments of Treasury, Justice, Energy, and Defense–were all victims of the attack. Even the Cybersecurity and Infrastructure Security Agency (CISA), whose mission is to protect U.S. interests from cyberattack (!), was compromised.

For months, the Russian hackers had access to the victims' computer systems. The intrusion went undetected for so long in part because the attackers went to great lengths to ensure the Orion software continued to perform its primary purpose reliably.

SolarWinds Orion was extremely reliable software.

Accordingly, the hack of Orion was extremely harmful.

The Monster, The Cage, and The Software

Let's end with a story about the Monster, the Cage, and the Software.

The Cage represents the reliability of the Software. The Monster is the harm the Software has the potential to create.

Unreliable Software has a small Cage. The Monster need not be very big to break out of the small Cage. In fact, the Cage is so small that as the Monster grows, he can't help but break out of the Cage. If he does break out, though, he's not big enough to do much damage.

Reliable Software has a big Cage. A small Monster will have trouble breaking out of the big Cage. But the big Cage offers plenty of room for the Monster to grow. By the time the Monster is big enough to break out of the big Cage, he's going to be one big Monster.

So, by all means, build a bigger Cage for your Software. Just keep in mind what you might be growing in there...