This is one in a series of articles on technical debt.

Before we talk about denormalized tables, we should define normalization itself.

What is database normalization?

"Database normalization" sounds stuffy.  It sounds like the kind of thing that only matters in the ivory towers of academia.  It's the sort of thing you might be tempted to disregard in the "real world" where actual work gets done.  That would be a very bad idea (see, for example, my own "Cautionary Tale" below).

Database normalization is also not as complicated as it sounds, especially if you focus on applying the concepts and ignore the terminology.  Identifying which specific "normal form" your data satisfies can rightly be considered an academic exercise.  Your real-world goal should be to apply the concepts of normalization to achieve the highest practical normal form that makes sense for your data.

Wikipedia provides an excellent step-by-step example of normalizing data.  With few exceptions, I would recommend you normalize your data into at least third normal form.  

But do I really need to normalize my tables?

It can be tempting at the start of a small project to shortcut the database design process.  Maybe all you're doing is converting a couple of Excel workbooks into a tiny Access application to speed up some recurring data processing tasks.  The data in the Excel workbooks is almost certainly denormalized.  Won't it be easier to just migrate the data by creating a table for each worksheet and a field for each column?

Yep.  It will be easier to do that now.  But it will make things more painful for you in the future.  That's why we call it technical debt.

How do denormalized tables contribute to technical debt?

Using denormalized tables results in an:

  1. Inflexible schema full of
  2. Unreliable data that
  3. Scales poorly

Inflexible schema

Failing to normalize your data often leads to a few massive tables with dozens and dozens of fields each.  With such sprawling tables, any change you make will affect many more queries, forms, and reports than a similar change made to a properly normalized table with fewer fields.

Unreliable data

It is harder to enforce constraints and proper relationships among your data when it is packed into a handful of mega tables.  Referential integrity is a powerful tool you can use to place limits on how your data interacts, but it's only useful with normalized data.

Scales poorly

Denormalized data generally requires much more disk storage space than the same information properly normalized.  That's primarily due to the duplication of data inherent to denormalized tables.  This can be an even bigger concern when the backend data resides on a different machine.  Operations that transfer every field from a table will run noticeably slower on tables with many fields.

A Cautionary Tale

Accumulating debt

I was recently updating a program of ours that a client uses to track official checks.  When we first wrote the program, all the checks were drawn from a single account.  There were two tables, Branches and Checks, with a one-to-many relationship between them.

After a few years, the client wanted to start tracking money orders, too.  Money orders were very similar to checks, but they were drawn from a different account.  Rather than create another table, we just added a CheckType field to distinguish between official checks and money orders.

A few more years went by and the official checks account got compromised.  This meant that official checks up to a certain date came out of one account, but all official checks after that date came out of a second account.  Rather than create another table, we just added logic at the application layer to perform an account lookup based on the cutoff date.

After a few months, we got a duplicate entry in the unique key of our Checks table.  It turns out that when the account was compromised, the newly ordered checks had check numbers that overlapped with the previous account.  Suddenly, our clever idea to deal with the account compromise in the application logic didn't look so clever anymore.  

Rather than fix it properly, though, we kicked the can down the road a little more.  We piled up more technical debt by adding an artificial "version" field to the Checks table.  We made the new field part of the unique index and populated it with "1" for every existing record.  The duplicate check number had its "version" field value set to "2".  Future duplicates of the check number would continue receiving successive version numbers.  

This was a horrible kludge.  I knew it at the time, but I didn't care.  It was expedient.  It solved the immediate problem of the duplicate value in the unique index.  If only that were the end of the story, I might not be writing this particular article today.

Paying off the debt

Several more years passed, and the official checks account was compromised a second time.  "No problem," I thought, "my entirely meaningless version field will allow yet more duplicate check numbers to be created if needed."  That's when my client informed me that instead of the new checks being drawn from a single new account, each of their seven branches would be drawing checks from separate accounts.

To their credit, the client asked me first if it would be a problem.  To my detriment, I responded that it would be no problem at all.  Boy was I wrong.

All those years of piling on technical debt were coming back to haunt me.  You see, this particular application included a reconciliation process that happens at the account level.  Seven accounts means seven separate reconciliations.  Unfortunately, years of quick fixes had left my database schema in an ugly state.

I finally realized that I needed to fix things properly, before the data itself became so unreliable that my program started introducing errors into the process.  To do this, I've had to add two new tables:

  • Accounts
  • Branch_Accounts

The Accounts table holds the account number and related information that I was previously hard-coding into my application logic layer.  The Branch_Accounts table is a linking table to support the many-to-many relationship between the new Accounts table and the existing Branches table.  The Checks table is getting a new AccountID field linking each check record to the account to which it belongs.

A word to the wise

Because I waited so long to address the systemic problems of my database design, the current fix requires multiple changes to multiple tables that affect multiple applications.  Coordinating all of these changes without making things worse is a stressful and technically fraught endeavor.  And it's not one I would wish on my worst enemy, let alone you, dear reader. as I say, not as I've done:

Normalize your database tables to avoid technical debt.

Photo by Whitney Greenwell from Pexels