CrowdStrike and Delta: Victims of the Software Reliability Paradox

The recent CrowdStrike debacle likely cost the global economy billions of dollars, roughly half a billion of which was borne by a single company, Delta Air Lines, by its own CEO's estimate.  How can one relatively innocuous mistake have such dire financial consequences?

It's a perfect example of the Software Reliability Paradox:

As software reliability increases, so too does the degree of harm it has the potential to create.

The principle is more intuitive if you think about it in terms of Unreliable Software.  If your software is unreliable, you won't trust it with anything important.  

Conversely, the more reliable your software is, the more you will trust it.  As it continues to prove itself reliable over time, you will continue to trust it with more and more responsibility.

But software is a fickle beast.

Anyone who's spent five hours tracking down a bug that was ultimately fixed by inserting or deleting a single character can tell you that.

Taming the Software Beast

So how does one tame the software beast?  

Ideally, with defenses in depth:

Good processes

Good coding practices

Good dose of humility

A Cage for the Beast

The above practices work together to form a metaphorical cage to control your software beast.  

The bigger and stronger the cage, the safer your users are.  And the safer your users are, the more comfortable they become feeding your beast.  They are happy to let it grow and take on more responsibility.  They trust it.  

Given enough time and a big enough cage, they'll even bet the future of their company on it.

But alas, the curse of reliable software reveals two immutable truths:

1. Over time, users will assume their software is infallible.
2. Over time, software will prove its users wrong.

The Bigger the Cage, the Bigger the Beast

From The Software Reliability Paradox:

The Cage represents the reliability of the Software.  The Monster is the harm the Software has the potential to create.

...

Reliable Software has a big Cage.  A small Monster will have trouble breaking out of the big Cage.  But the big Cage offers plenty of room for the Monster to grow.  By the time the Monster is big enough to break out of the big Cage, it's going to be one big Monster.

So, by all means, build a bigger Cage for your Software.  Just keep in mind what you might be growing in there...

CrowdStrike: "Life Finds a Way"

This brings us back to the recent CrowdStrike meltdown.

CrowdStrike had proved itself so reliable for so long that companies like Delta were comfortable letting it push automated updates for its kernel-level Windows driver to their machines.

As Delta and others learned the hard way, a bug in a Windows driver can prevent the machine from even booting.  And once such a bug gets into the wild, it CANNOT BE FIXED with another automated deployment: a machine that cannot finish booting cannot reliably download the fix.

If you are letting a vendor roll out automated driver updates, you are trusting that vendor with a process that can break your machines en masse but can only be fixed one machine at a time.

It's the ultimate in software trust.

As beasts go, automated Windows driver updates are the software equivalent of a tyrannosaurus rex.

CrowdStrike spent more than a decade convincing its customers that it could build a cage big enough to contain such a beast.

But, as Jeff Goldblum's character, Dr. Malcolm, famously warned us in Jurassic Park, "Life finds a way."

[Dr. Malcolm]: "If there's one thing the history of evolution has taught us it's that life will not be contained.  Life breaks free.  It expands to new territories.  It crashes through barriers, painfully, maybe even dangerously."

[Henry Wu]: "You're implying that a group composed entirely of female animals will breed?"

[Dr. Malcolm]: "No, I'm simply saying that life finds a way."

On July 19, 2024, CrowdStrike's software beast finally broke through its enormous cage.

As Dr. Malcolm would say, "Life found a way."


Referenced Articles

2024 CrowdStrike incident - Wikipedia
Delta CEO lashes out at CrowdStrike: This cost us $500 million and they offered us nothing | CNN Business
The CEO of Delta Air Lines lashed out at cybersecurity firm CrowdStrike and software provider Microsoft over the computer problems behind a five-day service meltdown that he disclosed cost the airline $500 million.
Defensive Programming
Don’t build digital Maginot Lines. Program your defenses in depth.
The Curse of Reliable Software
How does one avoid the reliability paradox? One option is to intentionally write unreliable, buggy software. There’s a better option.
The Software Reliability Paradox
The most reliable software holds the potential to cause the greatest harm. Examples abound, from my own $86K mistake to a devastating Russian hack.