Database Design

Beware of "Over-Normalizing" Your Database

Sometimes the two primary goals of database normalization--reduced data redundancy and improved data integrity--are at odds with each other.

Mike Wolfe

Jan 2, 2023 • 3 min read

One of the most important concepts in relational databases is normalization, where you structure your tables to "reduce data redundancy and improve data integrity."

A common example of normalization is storing customer information in one table (e.g., Customer) and linking to it from a purchase table (e.g., Purchase).

Sample Customer table

CustomerID	FirstName	LastName	Address	PhoneNumber
1	Homer	Simpson	742 Evergreen Terrace	555-1212
2	Krusty	the Clown	123 Main Street	555-5678
3	Principal	Skinner	456 Elm Street	555-4321
4	Ned	Flanders	10 Evergreen Court	555-2345
5	Otto	Mann	438 Conch Street	555-7890

Sample Purchase table

PurchaseID	CustomerID	ProductName	Quantity	Price	PurchaseDate
1	3	Donut	12	$24.00	2021-01-01
2	5	Squishee	3	$6.00	2021-01-02
3	1	Duff Beer	24	$48.00	2021-01-03
4	2	Box of Lard	1	$10.00	2021-01-04
5	4	Bag of Ice	2	$4.00	2021-01-05
6	3	Box of Fireworks	1	$50.00	2021-07-01
7	1	Box of Chocolates	2	$20.00	2021-01-07

Following standard normalization guidelines, the customers' addresses are stored in a different table than the purchases. This certainly reduces data redundancy, which is one of the goals of normalization, but it may also reduce data integrity. How so?

Reduced Data Integrity

Notice that Principal Skinner (CustomerID 3) orders a dozen donuts on January 1 and a box of fireworks on July 1. Let's assume that both items were delivered. Based on the information above, one would conclude that both items were delivered to his current address at 456 Elm Street.

But what if Principal Skinner moved to a new address between January 1 and July 1?

By storing the customer address in only one place, we risk losing important information. At least in this type of scenario.

Avoiding the Information Loss

In this situation, there are (at least) three ways to redesign the database to prevent this information loss:

Less normalization
More normalization
History table

Less Normalization

The less normalized option would be to store a copy of the Customer address on the Purchase table.

This obviously results in the storage of a lot of duplicate information, but it does guarantee that we won't lose any address history.

More Normalization

With this approach, we would still store a copy of the address in both tables. Instead of storing the address directly, though, we would create a third table named Address, remove the Address field from the Customer table, and add an AddressID foreign key to both the Customer and Purchase tables.

AddressID	Address
1	742 Evergreen Terrace
2	10 Evergreen Court
3	123 Springfield Avenue
4	123 Main Street
5	456 Elm Street
6	438 Conch Street

History Table

With this final approach, we would maintain some sort of History/Audit/Archive table with records added each time the Customer table gets updated.

There are two main ways to do this:

Store a complete copy of the Customer table's fields plus a DateTime field to mark when the change occurred
Create a table with field name, time of change, before value, and (optionally) after value

There are pros and cons to each approach. However, of the three overall options, the History table will require the most effort to integrate into reporting and queries. This would be an overly complicated solution to the present problem, but it could provide enough other benefits that it ends up making the most sense.

What's the Takeaway?

Generally speaking, more normalization is better.

But you have to be careful and think through what historical information might be lost if you only store it in one place. Maybe you don't need the historical information; maybe you do. But one thing is for sure: you don't want to find out you need it only to realize you don't have it.

Epilogue: Generating Sample Data Fast

Tip of the hat to Richard Rost for inspiring me to use ChatGPT to generate the sample data in this article. I wouldn't have thought of it, but it saved me a ton of time–even though I had to really fight with it to not give me duplicate sample data (I didn't want all five customers to be the inhabitants of 742 Evergreen Terrace–Homer, Marge, Bart, Lisa, and Maggie).

Cover image created with Microsoft Designer