The Curse of Reliable Software

How does one avoid the reliability paradox? One option is to intentionally write unreliable, buggy software. There's a better option.

Mike Wolfe

Oct 4, 2020

I take a lot of pride in crafting reliable software. I always use version control, which helps prevent changes from "sneaking in" to my codebase. I maintain separate development environments so I can test my code manually, even if I don't create nearly as many automated tests as I know I should. All in all, I think I do a pretty good job.

But I am far from perfect. It's my curse as a human being. I make mistakes. Sometimes I make big mistakes with minor consequences. Sometimes I make little mistakes with major consequences. That's the problem with software; even the tiniest error can have major consequences.

My $86,000 mistake

Many years ago, when I was about four years into my job as an Access developer, I was updating an interest accrual calculation in a loan management program we had built for a client. The calculation involved nested loops. I don't remember the details of the mistake. I think there was an accumulating variable that needed to be reset a certain way for a small subset of loans. I made the change without realizing the impact it would have on a different small subset of loans.

It was a little mistake. But it went unnoticed for more than six months. By the time the client realized the error, they had overreported their interest income by $86,000. I went back into the code hoping against hope that I had not introduced that error, but, alas, it was indeed my fault.

I had many uncomfortable conversations with our client in the days and weeks that followed. He asked at one point how good our business liability coverage was. I was still an employee at the time, so I honestly didn't know. But I think the question had its intended effect. This was a serious issue and he wanted to be sure I understood that.

Their $86,000 mistake

The thing that I often thought--but never said--during this entire ordeal is that I was not the only one to blame. Multiple people in the accounting department had seen the reports and nobody had noticed the problem.

They had techniques for verifying the numbers that our software was calculating. In fact, they did exactly that for several months when they first started using the software. But after a while, it felt like wasted energy. The numbers always matched. Why did they spend money commissioning the software if they weren't going to trust it?

So, over time, our software became a victim of its own reliability. It became infallible in the eyes of our users. Their initial skepticism turned to trust, which eventually turned to complacency.

The immutable truths

What I learned from this episode is that there are a couple of immutable truths:

Over time, users will assume their software is infallible.
Over time, software will prove its users wrong.

It's not enough to warn users that there could be an error and that they need to verify the software's results. They're not going to listen to that warning. You might as well put it in a startup message box that they're not going to read.

It's not enough to simply add more tests and be more careful. You're not going to create flawless software. If you're a one-man shop or a small dev team creating custom software applications, then you're definitely not going to be able to do that.

What can we do?

If we operate from a place of accepting the two truths above, there are still ways to improve our software.

Unit and integration tests are one way to improve the situation, but they have a fatal flaw. No matter how many tests you write, you can't write a test for every unforeseen possibility. Think of it as the reverse pigeonhole principle. Or, as Don Rumsfeld might say, you can't test for the unknown unknowns.

I actually prefer a different method. Just to be pretentious, let's call it "runtime testing." The implementation may vary, but the key concept is that we calculate important numbers at least two different ways.

Runtime testing

Calculating important numbers at least two different ways.

Runtime testing example

The idea is to come up with two different sets of steps to arrive at a single number. If the two results do not match, then we alert the user in some obvious way.

Here is a classic example. Consider a permit office. Builders come in and request permits for the projects they are working on. A builder may pay for five $50 permits with a single $250 check. The user of our software creates five separate $50 entries, one for each permit. The user then creates a sixth entry in a check ledger for $250.

At the end of the day, the user runs a report showing the total of all the checks received that day in the check ledger. The software may even print out a hard copy deposit slip to take to the bank. On that same report, we would also show the total receipts for the day. In this example, both numbers would be $250.

Let's say a different builder comes in the next day and hands over a $100 check for two permits. It turns out that one of those permits is a re-issue, so it only costs $40. Our software prefills $40 for one permit and $50 for the other permit. The clerk then enters the $100 check in the check ledger. The builder walks out having overpaid by $10, but nobody realizes the mistake.

At the end of the day, the user runs our daily balance report. Unlike yesterday, today's numbers don't match. The check total is $100 but the receipt total is $90. The report highlights the $10 difference in bold with a yellow background so the user can't possibly miss it (or at least can't claim so with a straight face).
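Here is a minimal sketch of that daily balance check in Python. The function names (`daily_balance`, `format_report`) and the data shapes are my own illustrative assumptions, not code from the actual permit application, and a line of asterisks stands in for the bold yellow highlight.

```python
from decimal import Decimal


def daily_balance(receipts, checks):
    """Total the day's money two independent ways and compare the results.

    receipts: amounts entered per permit (hypothetical input shape)
    checks:   amounts entered in the check ledger (hypothetical input shape)
    """
    receipt_total = sum(receipts, Decimal("0"))
    check_total = sum(checks, Decimal("0"))
    difference = check_total - receipt_total
    return {
        "receipt_total": receipt_total,
        "check_total": check_total,
        "difference": difference,
        "balanced": difference == 0,
    }


def format_report(result):
    """Build a plain-text report; a mismatch gets a line that is hard to miss."""
    lines = [
        f"Check total:   ${result['check_total']}",
        f"Receipt total: ${result['receipt_total']}",
    ]
    if not result["balanced"]:
        lines.append(f"*** OUT OF BALANCE BY ${abs(result['difference'])} ***")
    return "\n".join(lines)


# Day one: five $50 permits paid with a single $250 check.
day_one = daily_balance([Decimal("50")] * 5, [Decimal("250")])

# Day two: a $40 re-issue and a $50 permit paid with a $100 check.
day_two = daily_balance([Decimal("40"), Decimal("50")], [Decimal("100")])

print(format_report(day_one))
print(format_report(day_two))
```

The important design choice is that the two totals come from data entered in two separate steps. If both totals were derived from the same entries, they would always agree and the check would catch nothing.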

The beauty of this approach is that it can take into account not only potential program errors but also human errors.

Final Thoughts

Writing reliable software presents a paradox. The more reliable the software, the more your users will trust it. The more your users trust it, the less they will question it. The less they question it, the more likely an error will go unnoticed. And if the error goes unnoticed, you suddenly find yourself with unreliable software.

How does one avoid the reliability paradox? One option is to intentionally write unreliable, buggy software. That way your users never trust it and they're always verifying that it's working correctly. Hmmm... maybe all that software I thought was crappy is actually secretly brilliant?!?!

The far better option is to be aware of the paradox and take steps to break out of it. Empower your users to avoid complacency by building runtime testing into all your important calculations.

And buy a good errors and omissions policy. :-)

Image by Gerhild Klinkow from Pixabay