The Single Most Important Concept for Understanding Relational Databases
Unless you are introduced to databases in school, most people come to databases with an existing background in spreadsheets.
This is especially true in the Microsoft ecosystem, where Excel power users often "graduate" to Access when they start bumping into scaling issues with their Excel spreadsheets. This can cause problems for new users because the datasheet view of an Access table looks like an Excel spreadsheet.
The first thing most Excel users want to do when they come to Access is to copy and paste a big spreadsheet full of data into an Access table. Unfortunately, if you do this, you are missing out on the biggest advantage of a relational database: relationships.
Spreadsheets and databases are not the same thing.
And the single most important concept for understanding the difference is the foreign key.
Spreadsheets are "Flat" Files
An Excel spreadsheet is a flat file.
If you've ever used an advanced graphics program, you're familiar with the concept of layers. You can have one or more layers of text, a layer with the background, several layers with related shape objects, etc. With layers, it is very easy to work on one part of the picture (e.g., the text) without affecting a different part of the picture (e.g., the background) even if the two parts overlap on the x- and y-axes.
However, most standard image formats (like .jpg and .png) do not support layers. Thus, if you want to use your layered image outside of your image editor, you first need to "flatten" all the layers onto a single z-axis. This makes the image more portable, but it also makes it nearly impossible to edit the text without also altering the background.
The same concept applies to spreadsheets.
With a customer relationship management system, you need to keep track of several "layers" of information about each customer:
- Customer's name
- Contact information
- Order history
- Payment history
- Correspondence history
- Etc.
With a spreadsheet, all of these pieces of information live at the same level. Each data point is its own column. This is fine when you only have a few columns of data, maybe even a few dozen columns of data. But eventually you get so many columns of data it's no longer manageable. You also start getting a lot of duplicate information, since contact information gets repeated for a customer every time they place an order.
Just like with our image editor, managing all this data would be a lot easier if we could split it into "layers."
Databases are "Related" Tables
With a database, we separate common data into multiple tables, the same way an image editor groups common elements into multiple layers.
This helps avoid storing duplicate data, makes it easier to update existing data, and allows us to combine our data in new and interesting ways. But it's not enough to just separate data into multiple tables. We need some way to bring that data back together. We need a way to link related tables.
We link related tables with a foreign key.
Primary vs. Foreign Keys
Before we can talk too much about foreign keys, we first need to establish what a primary key is.
Primary Keys
A primary key is used to uniquely identify a row of data in a table.
I prefer to use an auto-number field as the primary key in almost all of my tables. An auto-number is a surrogate key. That's a fancy way of saying it has no meaning in the context of all the other data in the row; it's an arbitrary value.
It's a bit like the row numbers on the left side of an Excel spreadsheet. If you are talking to a colleague, you might talk about the data in row 7. The fact that the data in that particular row happens to be part of row 7 is completely arbitrary. The number 7 has no meaning other than to help you communicate to your colleague what data you are referring to.
Auto-number primary keys are like that, but with one important difference: once you assign an auto-number to a row of data, the number never changes.
Foreign Keys
A foreign key is nothing more than the primary key from the first table being stored in a field in the linked table.
This is one reason why it's so important that the primary key never changes. If it did, our related tables would be pointing at the wrong information (or no information at all).
If you have been working with large datasets in Excel, you are no doubt familiar with one or more of the Lookup functions:
VLOOKUP
XLOOKUP
INDEX/MATCH
If you have used one of the above functions to look up data from one Excel "table" and display it in a different Excel "table" then you are already familiar with the foreign key concept. Relational databases take that concept, add a bunch of safety features to it (like referential integrity), and then tack on optimized performance to top it all off.
The ability to store related data in multiple tables–like layers in an image–is what allows databases to scale so far beyond anything you can do in a spreadsheet. The ability to tie the data from multiple tables into a single result set (i.e., "queries") is what allows databases to present the data they store in interesting ways.
And the thing that makes it all possible is the foreign key.
Cover image created with Microsoft Designer