Reader Question: How to Match Names from Multiple Systems

It's a common data cleansing challenge: how do you eliminate duplicate data from multiple systems without accidentally deleting unique data?

Reader Lance McGonigal wrote in with the following question (shared with permission):

I have an enterprise-wide project that seeks to consolidate different name sets across about 20 departments for a single customer invoicing experience.  It's quite a challenge because the data is incomplete (names without SSNs or DOBs) and/or inconsistent (addresses are written out differently, e.g., 123 W Main St vs. 123 West Main Street).

I've looked at functions such as Soundex, metaphone/double metaphone, etc., without much joy.  What experience/advice do you have regarding this subject?

That is one juicy question, Lance!

I started composing an email response, and by the time I finished it, I had written the equivalent of an entire article.  Check out my advice below, lightly edited from my response to Lance.


Clarifying Questions

I would have several follow-up questions for you.  You may find the questions themselves help you consider the pros and cons of various solutions (as there is no one right answer anyway).

  • Is this a one-time project or an ongoing challenge?
  • How important is it that there are no duplicates (i.e., multiple records that refer to the same physical person)?
  • How important is it that there are no incorrect merges (i.e., multiple physical people merged into a single record)?
  • How many records are we talking about (hundreds, thousands, millions)?

Assumed Answers

Since you mentioned it is for an enterprise-wide invoicing application, I'll assume the following answers:

  • Ongoing challenge
  • Critically important
  • Critically important
  • Thousands to start, dozens added each day/week

High-Level Approach

Since I'm assuming mistakes are unacceptable--and since you've already pointed out the challenges to a fully automated solution--I think any solution will need to incorporate human oversight.

I'll start by assuming that 10 of the departments are storing contact information in a table named Customer and the other 10 departments are storing data in a table named Contact.  The concepts below would work just as well with a single source table or many source tables, but assuming two tables will simplify the discussion around the concept.

Database Design

Entity Table

I would start by creating a new table, named Entity, to serve as the sole source of truth for the database.  The Entity table would store sanitized contact data only.  

Each record in the Entity table should represent a single physical person (or company, etc.).  The primary key will be an autonumber named EntityID, but it is very important that you also have some other unique combination of fields that will allow you to distinguish among similar entries.  This should be enforced via a unique compound key.

You would then add an optional EntityID field to the existing Customer and Contact tables.  The fields would be NULL until you or the user identifies a positive match in the Entity table.
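As a minimal sketch, the Entity table might look like the following (run here through Python's sqlite3 for illustration; the sanitized field names and the choice of FullName/DoB/SSN as the compound key are assumptions—use whatever combination actually distinguishes entities in your data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entity (
    EntityID INTEGER PRIMARY KEY AUTOINCREMENT,  -- autonumber surrogate key
    FullName TEXT NOT NULL,
    DoB      TEXT,
    SSN      TEXT,
    Address  TEXT,
    -- the unique compound key that distinguishes among similar entries
    UNIQUE (FullName, DoB, SSN)
);
""")
```

The unique constraint is what keeps the matching process honest: any attempt to insert a second Entity row with the same natural key fails loudly instead of silently creating a duplicate.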

EntityMatch Table

Next, I would add an EntityMatch table.  This table would have the following fields:

  • EntityMatchID: autonumber primary key
  • EntityID: foreign key to Entity table
  • SourceTableName: name of the source table (e.g., "Customer" or "Contact")
  • SourceID: foreign "key" to Customer/Contact table (I put "key" in scare quotes because you won't be able to enforce referential integrity if pointing to multiple tables; if there is only a single source table, this could be an enforced foreign key)
  • Confidence: tinyint: number from 0 to 100 to indicate your confidence that the two records are a match, with 100 meaning 100% confidence
  • Notes: nvarchar(255): short description of how the two records appear to be related (e.g., "Soundex match on name but different spelling (Wolf vs. Wolfe)" with a corresponding 20% confidence)
  • SourceEntityMatchID: self-referencing key populated only when user chooses a partial match (see algorithm below for details)
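The field list above translates into DDL along these lines (again via sqlite3 for illustration; note that SourceID deliberately has no enforced foreign key, and the CHECK constraint keeps Confidence in the 0–100 range):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entity (
    EntityID INTEGER PRIMARY KEY AUTOINCREMENT
);
CREATE TABLE EntityMatch (
    EntityMatchID       INTEGER PRIMARY KEY AUTOINCREMENT,
    EntityID            INTEGER NOT NULL REFERENCES Entity (EntityID),
    SourceTableName     TEXT NOT NULL,     -- e.g., 'Customer' or 'Contact'
    SourceID            INTEGER NOT NULL,  -- no enforced FK: may point at either source table
    Confidence          INTEGER NOT NULL CHECK (Confidence BETWEEN 0 AND 100),
    Notes               TEXT,              -- nvarchar(255) equivalent
    SourceEntityMatchID INTEGER REFERENCES EntityMatch (EntityMatchID)  -- self-referencing key
);
""")
```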

Algorithm for Automated Matching

You would then start populating the Entity and EntityMatch tables.  Here's the algorithm for the automated matching:

  • Loop through every record in the Customer table with a NULL EntityID and no existing record in the EntityMatch table where EntityMatch.SourceTableName = "Customer" and EntityMatch.SourceID matches Customer.CustomerID (this allows the process to easily be stopped and restarted)
  • Check for possible matches in the Entity table
    • If there is no possible match:
      • Create a new record in the Entity table
        • Set Entity fields to the fields in the Customer table
        • Set Customer.EntityID to the newly created EntityID autonumber
      • Create a new record in the EntityMatch table
        • Set EntityMatch.EntityID to the newly created EntityID autonumber
        • Set SourceTableName to "Customer"
        • Set SourceID to Customer.CustomerID
        • Set Confidence to 100
        • Set Notes to "Entity record created from Customer record"
    • If there is a 100% match:
      • Get the EntityID of the matching Entity record
      • Set Customer.EntityID to the matching Entity.EntityID value
      • Create a new record in the EntityMatch table
        • Set EntityMatch.EntityID to the matching Entity.EntityID value
        • Set SourceTableName to "Customer"
        • Set SourceID to Customer.CustomerID
        • Set Confidence to 100
        • Set Notes to "Matching Full name, DoB, and SSN" (or whatever criteria were used)
    • If there is a partial match:
      • Get the EntityID of the partially matching Entity record
      • Leave Customer.EntityID set to NULL
      • Create a new record in the EntityMatch table
        • Set EntityMatch.EntityID to the partially matching Entity.EntityID value
        • Set SourceTableName to "Customer"
        • Set SourceID to Customer.CustomerID
        • Set Confidence to something between 1 and 99
        • Set Notes to "Soundex match on name but different spelling (Smith vs. Smyth)" (or whatever criteria were used)
    • If there are additional partial matches:
      • Create new records in the EntityMatch table for every partial match
  • Repeat the loop for the Contact table
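The loop above can be sketched in Python against an in-memory model (the dict field names, the exact-match criteria, and the name-only partial match are placeholders—in practice you would swap in Soundex, metaphone, or whatever comparison functions you settle on):

```python
def process_source_rows(source_rows, source_table, entities, entity_matches):
    """One pass of the automated matcher over a single source table.

    source_rows:    list of dicts with 'ID', 'EntityID', 'FullName', 'DoB', 'SSN'
    entities:       list of dicts with 'EntityID', 'FullName', 'DoB', 'SSN'
    entity_matches: list of dicts accumulating EntityMatch rows
    """
    already_matched = {(m["SourceTableName"], m["SourceID"]) for m in entity_matches}
    for row in source_rows:
        # Skip rows already processed -- lets the job stop and restart safely
        if row["EntityID"] is not None or (source_table, row["ID"]) in already_matched:
            continue
        full = [e for e in entities
                if (e["FullName"], e["DoB"], e["SSN"]) == (row["FullName"], row["DoB"], row["SSN"])]
        if full:
            # 100% match: link the source row and log a Confidence-100 match record
            row["EntityID"] = full[0]["EntityID"]
            entity_matches.append(dict(EntityID=full[0]["EntityID"], SourceTableName=source_table,
                                       SourceID=row["ID"], Confidence=100,
                                       Notes="Matching Full name, DoB, and SSN"))
            continue
        partials = [e for e in entities if e["FullName"] == row["FullName"]]
        if partials:
            # Partial matches: leave EntityID NULL; log one row per candidate for human review
            for e in partials:
                entity_matches.append(dict(EntityID=e["EntityID"], SourceTableName=source_table,
                                           SourceID=row["ID"], Confidence=50,
                                           Notes="Name matches but DoB/SSN differ"))
        else:
            # No match at all: create a new Entity plus a Confidence-100 match record
            new_id = max((e["EntityID"] for e in entities), default=0) + 1
            entities.append(dict(EntityID=new_id, FullName=row["FullName"],
                                 DoB=row["DoB"], SSN=row["SSN"]))
            row["EntityID"] = new_id
            entity_matches.append(dict(EntityID=new_id, SourceTableName=source_table,
                                       SourceID=row["ID"], Confidence=100,
                                       Notes=f"Entity record created from {source_table} record"))
```

Running the same function a second time is a no-op for rows that already have EntityMatch records, which is what makes the process safe to stop and restart.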

Human Oversight

Next, create a form that allows an end user to evaluate all of the partial matches and decide whether any of the partial matches is an actual match or if a new record will need to be created.  This form is based on a record source that is an INNER JOIN between Customer and EntityMatch on EntityMatch.SourceID = Customer.CustomerID (with EntityMatch.SourceTableName = "Customer") where Customer.EntityID Is Null.
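Here is a sketch of that record source against a scratch sqlite3 database (table and field names are the ones assumed throughout; the sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT, EntityID INTEGER);
CREATE TABLE EntityMatch (EntityMatchID INTEGER PRIMARY KEY, EntityID INTEGER,
                          SourceTableName TEXT, SourceID INTEGER,
                          Confidence INTEGER, Notes TEXT);
INSERT INTO Customer VALUES (1, 'Jon Smyth', NULL), (2, 'Ann Wolfe', 7);
INSERT INTO EntityMatch VALUES (1, 3, 'Customer', 1, 20, 'Soundex match (Smith vs. Smyth)'),
                               (2, 7, 'Customer', 2, 100, 'Matching Full name, DoB, and SSN');
""")

# Record source for the review form: unmatched Customer rows with their partial matches
record_source = """
    SELECT c.CustomerID, c.CustomerName, m.EntityID, m.Confidence, m.Notes
    FROM Customer AS c
    INNER JOIN EntityMatch AS m
        ON m.SourceID = c.CustomerID AND m.SourceTableName = 'Customer'
    WHERE c.EntityID IS NULL
    ORDER BY m.Confidence DESC
"""
rows = conn.execute(record_source).fetchall()
```

Because already-matched Customers have a non-NULL EntityID, they drop out of the query automatically, so the review form only ever shows outstanding work.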

For each Customer record, show the user the information from the Customer table above a subform with all the partial matches of the EntityMatch table, sorted in order by descending EntityMatch.Confidence:

  • User selects an existing partial match and marks it as a full match
    • Set Customer.EntityID to the EntityID of the selected match
    • Create a new record in the EntityMatch table (this preserves the original partial match record which will allow us to analyze our confidence levels and adjust as we accumulate data)
      • Set EntityMatch.EntityID to the user-selected match
      • Set SourceTableName to "Customer"
      • Set SourceID to Customer.CustomerID
      • Set Confidence to 100
      • Set Notes to "User-selected match from " & EntityMatch.EntityID
      • Set SourceEntityMatchID to the EntityMatchID of the user-selected match
  • User clicks a button named [Create New Entity]
    • This occurs when the user decides that none of the partial matches is an actual match
    • Create a new record in the Entity table
      • Set Entity fields to the fields in the Customer table
      • Set Customer.EntityID to the newly created EntityID autonumber
    • Create a new record in the EntityMatch table
      • Set EntityMatch.EntityID to the newly created EntityID autonumber
      • Set SourceTableName to "Customer"
      • Set SourceID to Customer.CustomerID
      • Set Confidence to 100
      • Set Notes to "Entity record created from Customer record"
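The two user actions might be sketched like this, using an in-memory dict model (field names are placeholders for the assumed schema; in Access these would be event handlers writing to the real tables):

```python
def accept_partial_match(customer, partial_match, entity_matches):
    """User marks an existing partial match as a full match.

    customer:       dict with 'ID' and 'EntityID' (currently None)
    partial_match:  the EntityMatch dict the user selected
    entity_matches: list accumulating EntityMatch rows
    """
    customer["EntityID"] = partial_match["EntityID"]
    # Log a new Confidence-100 row; the original partial match record is
    # preserved so confidence levels can be audited via SourceEntityMatchID.
    entity_matches.append(dict(
        EntityID=partial_match["EntityID"],
        SourceTableName="Customer",
        SourceID=customer["ID"],
        Confidence=100,
        Notes="User-selected match from " + str(partial_match["EntityID"]),
        SourceEntityMatchID=partial_match["EntityMatchID"],
    ))

def create_new_entity(customer, entities, entity_matches):
    """User decides none of the partial matches apply; mint a new Entity."""
    new_id = max((e["EntityID"] for e in entities), default=0) + 1
    entities.append(dict(EntityID=new_id, FullName=customer["FullName"]))
    customer["EntityID"] = new_id
    entity_matches.append(dict(
        EntityID=new_id, SourceTableName="Customer", SourceID=customer["ID"],
        Confidence=100, Notes="Entity record created from Customer record",
        SourceEntityMatchID=None,
    ))
```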

Advantages of this Approach

This approach has several advantages:

  • It supports both a one-time migration and ongoing maintenance.
  • It involves a human when we can't identify matches with 100% confidence (or mismatches with 100% confidence).
  • The EntityMatch table and its Confidence field allow us to use multiple approaches to identify potential matches (e.g., Soundex, metaphone, machine learning, etc.).
  • New matching algorithms can easily be added in the future.
  • The process can easily be stopped and restarted.
  • The automated matching process will be relatively slow, but can be run during off-peak hours and/or in parallel on multiple machines.
  • The human-involved portion will be very responsive because it avoids complicated joins or calculations.
  • The human-involved portion can be run in parallel on multiple machines with multiple humans.
  • Over time, we can perform analysis to see how accurate our confidence levels are via the self-referencing EntityMatch.SourceEntityMatchID field and tweak those functions as needed.

Additional Reading

Two Ways to Burn
Burning cardboard is a lot like a large data migration. You can do it all at once or slowly over time.

Reader Feedback

How would you tackle this type of challenge?  Have you done something similar in the past?  If so, how did you approach it?  What would you do differently if you could do it again from scratch?  Let me know in the comments below.

Cover image created with Microsoft Designer

All original code samples by Mike Wolfe are licensed under CC BY 4.0