Data Validity Checks

Increase the quality of your existing data by running it through a series of validity checks and showing your users the results.

Data Validity Checks

The only thing worse than missing data is bad data.

If data is missing, that is a known problem.  We can account for that.

If data is bad, that's a problem that may go unnoticed.

A good way to clean up bad data–and keep it clean moving forward–is with data validity checks.

What's a Data Validity Check?

Data validity checks (or simply "validity checks") are nothing more than queries that identify invalid data.

The concept is similar to the test-driven design approach that I described in an earlier article series, "VBA to T-SQL via TDD." In Step 6 of that process, I write about creating a query to return the failing tests in our test table.  Data validity checks serve much the same purpose, but we write them for a different audience: the end user.

A data validity check is a SELECT query where every record returned represents a row of invalid data.

Potential Uses for Validity Checks

Delaying "required" fields

When designing user interfaces, we often have to make tradeoffs between usability and data cleanliness.

For example, in a table of US addresses, we may want the ZIP code to be a required field.  But what happens if our user knows every part of a new address except for the ZIP code?  If we make the ZIP code a requirement, then the user has to enter something.  That something is likely going to be a dummy ZIP code like 00000 or 12345.  

In other words, making the ZIP code a required field actually reduces our data quality because it encourages users to enter bogus information.

A far better approach is to allow the users to create the new address and leave the ZIP code blank.  We would then provide the users with a validity check that identifies all the records with missing ZIP codes:

SELECT * 
FROM tblAddress
WHERE ZipCode Is Null

The user would run the check periodically and insert the missing ZIP codes as they became available.

Legacy database limitations

When you are taking over a new project, often one of the first items on the agenda is to scrub the existing data looking for problems.

In some cases, you can add referential integrity to improve the quality of the data.  Most of the time, though, the data problems you identify will require some level of manual intervention.  In those cases, a good validity check will provide enough context for your users to fix the data themselves.

Multi-step processes

Perhaps your application contains a step-by-step process.  In Step 1, the user is expected to create and populate half the fields of a new record.  In a later step, the user is required to go back and fill in the missing fields of the record created in Step 1.

A validity check in this case would help ensure that no records fall through the cracks between Step 1 and the end of the multi-step process.

Validity Checks Do NOT Replace Referential Integrity

Do not rely on validity checks to enforce relationships among your tables; that is the role of referential integrity.

Rather, write validity checks to help fill the gaps where the referential integrity cannot get the job done on its own.


Referenced articles

VBA to T-SQL via TDD - No Longer Set
In this series of articles, I go step-by-step explaining how you can re-write VBA functions as SQL Server scalar functions to increase performance without the risk of introducing logic errors during the migration.
VBA to T-SQL via TDD: Step 6
Step 6. Build a test query in SQL Server to return failing tests

Image by Engin Akyurt from Pixabay

All original code samples by Mike Wolfe are licensed under CC BY 4.0