One of the most important concepts in relational databases is normalization, where you structure your tables to "reduce data redundancy and improve data integrity."
A common example of normalization is storing customer information in one table (e.g., Customer) and linking to it from a purchase table (e.g., Purchase).
Sample Customer table
|1||Homer||Simpson||742 Evergreen Terrace||555-1212|
|2||Krusty||the Clown||123 Main Street||555-5678|
|3||Principal||Skinner||456 Elm Street||555-4321|
|4||Ned||Flanders||10 Evergreen Court||555-2345|
|5||Otto||Mann||438 Conch Street||555-7890|
Sample Purchase table
|4||2||Box of Lard||1||$10.00||2021-01-04|
|5||4||Bag of Ice||2||$4.00||2021-01-05|
|6||3||Box of Fireworks||1||$50.00||2021-07-01|
|7||1||Box of Chocolates||2||$20.00||2021-01-07|
Following standard normalization guidelines, the customers' addresses are stored in a different table than the purchases. This certainly reduces data redundancy, which is one of the goals of normalization, but it may also reduce data integrity. How so?
Reduced Data Integrity
Notice that Principal Skinner (CustomerID 3) orders a dozen donuts on January 1 and a box of fireworks on July 1. Let's assume that both items were delivered. Based on the information above, one would conclude that both items were delivered to his current address at 456 Elm Street.
But what if Principal Skinner moved to a new address between January 1 and July 1?
By storing the customer address in only one place, we risk losing important information. At least in this type of scenario.
Avoiding the Information Loss
In this situation, there are (at least) three ways to redesign the database to prevent this information loss:
- Less normalization
- More normalization
- History table
The less normalized option would be to store a copy of the Customer address on the Purchase table.
This obviously results in the storage of a lot of duplicate information, but it does guarantee that we won't lose any address history.
With this approach, we would still store a copy of the address in both tables. Instead of storing the address directly, though, we would create a third table named Address, remove the Address field from the Customer table, and add an AddressID foreign key to both the Customer and Purchase tables.
|1||742 Evergreen Terrace|
|2||10 Evergreen Court|
|3||123 Springfield Avenue|
|4||123 Main Street|
|5||456 Elm Street|
|6||438 Conch Street|
With this final approach, we would maintain some sort of History/Audit/Archive table with records added each time the Customer table gets updated.
There are two main ways to do this:
- Store a complete copy of the Customer table's fields plus a DateTime field to mark when the change occurred
- Create a table with field name, time of change, before value, and (optionally) after value
There are pros and cons to each approach. However, of the three overall options, the History table will require the most effort to integrate into reporting and queries. This would be an overly complicated solution to the present problem, but it could provide enough other benefits that it ends up making the most sense.
What's the Takeaway?
Generally speaking, more normalization is better.
But you have to be careful and think through what historical information might be lost if you only store it in one place. Maybe you don't need the historical information; maybe you do. But one thing is for sure: you don't want to find out you need it only to realize you don't have it.
Epilogue: Generating Sample Data Fast
Tip of the hat to Richard Rost for inspiring me to use ChatGPT to generate the sample data in this article. I wouldn't have thought of it, but it saved me a ton of time–even though I had to really fight with it to not give me duplicate sample data (I didn't want all five customers to be the inhabitants of 742 Evergreen Terrace–Homer, Marge, Bart, Lisa, and Maggie).
Cover image created with Microsoft Designer