Beware of "Over-Normalizing" Your Database

Sometimes the two primary goals of database normalization--reduced data redundancy and improved data integrity--are at odds with each other.

Beware of "Over-Normalizing" Your Database

One of the most important concepts in relational databases is normalization, where you structure your tables to "reduce data redundancy and improve data integrity."

A common example of normalization is storing customer information in one table (e.g., Customer) and linking to it from a purchase table (e.g., Purchase).

Sample Customer table

CustomerIDFirstNameLastNameAddressPhoneNumber
1HomerSimpson742 Evergreen Terrace555-1212
2Krustythe Clown123 Main Street555-5678
3PrincipalSkinner456 Elm Street555-4321
4NedFlanders10 Evergreen Court555-2345
5OttoMann438 Conch Street555-7890

Sample Purchase table

PurchaseIDCustomerIDProductNameQuantityPricePurchaseDate
13Donut12$24.002021-01-01
25Squishee3$6.002021-01-02
31Duff Beer24$48.002021-01-03
42Box of Lard1$10.002021-01-04
54Bag of Ice2$4.002021-01-05
63Box of Fireworks1$50.002021-07-01
71Box of Chocolates2$20.002021-01-07

Following standard normalization guidelines, the customers' addresses are stored in a different table than the purchases.  This certainly reduces data redundancy, which is one of the goals of normalization, but it may also reduce data integrity.  How so?

Reduced Data Integrity

Notice that Principal Skinner (CustomerID 3) orders a dozen donuts on January 1 and a box of fireworks on July 1.  Let's assume that both items were delivered.  Based on the information above, one would conclude that both items were delivered to his current address at 456 Elm Street.

But what if Principal Skinner moved to a new address between January 1 and July 1?  

By storing the customer address in only one place, we risk losing important information.  At least in this type of scenario.

Avoiding the Information Loss

In this situation, there are (at least) three ways to redesign the database to prevent this information loss:

  1. Less normalization
  2. More normalization
  3. History table

Less Normalization

The less normalized option would be to store a copy of the Customer address on the Purchase table.  

This obviously results in the storage of a lot of duplicate information, but it does guarantee that we won't lose any address history.

More Normalization

With this approach, we would still store a copy of the address in both tables.  Instead of storing the address directly, though, we would create a third table named Address, remove the Address field from the Customer table, and add an AddressID foreign key to both the Customer and Purchase tables.

AddressIDAddress
1742 Evergreen Terrace
210 Evergreen Court
3123 Springfield Avenue
4123 Main Street
5456 Elm Street
6438 Conch Street

History Table

With this final approach, we would maintain some sort of History/Audit/Archive table with records added each time the Customer table gets updated.  

There are two main ways to do this:

  • Store a complete copy of the Customer table's fields plus a DateTime field to mark when the change occurred
  • Create a table with field name, time of change, before value, and (optionally) after value

There are pros and cons to each approach.  However, of the three overall options, the History table will require the most effort to integrate into reporting and queries.  This would be an overly complicated solution to the present problem, but it could provide enough other benefits that it ends up making the most sense.

What's the Takeaway?

Generally speaking, more normalization is better.  

But you have to be careful and think through what historical information might be lost if you only store it in one place.  Maybe you don't need the historical information; maybe you do.  But one thing is for sure: you don't want to find out you need it only to realize you don't have it.


Epilogue: Generating Sample Data Fast

Tip of the hat to Richard Rost for inspiring me to use ChatGPT to generate the sample data in this article.  I wouldn't have thought of it, but it saved me a ton of time–even though I had to really fight with it to not give me duplicate sample data (I didn't want all five customers to be the inhabitants of 742 Evergreen Terrace–Homer, Marge, Bart, Lisa, and Maggie).

Cover image created with Microsoft Designer

All original code samples by Mike Wolfe are licensed under CC BY 4.0