Step 2: Data cleaning

The identifier fields in the new data set are then cleaned. Cleaning standardizes the fields by removing small differences in formatting, so that the linkage programs recognize values that are the same. The process depends on the identifiers available in the file, but examples include:

  • Removing the space from postal codes.
  • Splitting dates into three fields: year, month and day.
  • Blanking out invalid values so that they are treated as missing.
  • Applying several preparation steps to names, given the many names and nicknames a person may use over a lifetime and the frequency with which names are misspelled. Strategies for matching on name include:
    • Converting the letters to all uppercase
    • Standardizing the character set (replacing accented characters)
    • Removing non-alphabetic characters (dashes, quotes)
    • Keeping multiple name fields for maternal/married name
    • Expanding the names to include nicknames
    • Rotating the name order when a person has multiple first and second (or more) names.
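
The cleaning steps listed above can be illustrated with a short Python sketch. It is not the code used by the linkage programs: the function names, the Canadian A1A1A1 postal code pattern, the YYYY-MM-DD input date format and the tiny NICKNAMES table are all assumptions made for illustration.

```python
import re
import unicodedata
from datetime import date
from itertools import permutations

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"BILL": "WILLIAM", "LIZ": "ELIZABETH"}


def clean_postal_code(value):
    """Remove spaces and uppercase; blank out values not shaped like A1A1A1."""
    compact = re.sub(r"\s+", "", value or "").upper()
    # Assumed validity rule: letter-digit-letter-digit-letter-digit.
    return compact if re.fullmatch(r"[A-Z]\d[A-Z]\d[A-Z]\d", compact) else ""


def split_date(value):
    """Split an assumed YYYY-MM-DD string into (year, month, day) fields.

    Invalid or missing dates are blanked out so they are treated as missing.
    """
    try:
        parsed = date.fromisoformat(value)
    except (TypeError, ValueError):
        return ("", "", "")
    return (f"{parsed.year:04d}", f"{parsed.month:02d}", f"{parsed.day:02d}")


def clean_name(value):
    """Uppercase, replace accented characters, drop non-alphabetic characters."""
    decomposed = unicodedata.normalize("NFKD", value or "")
    unaccented = decomposed.encode("ascii", "ignore").decode("ascii")
    letters_only = re.sub(r"[^A-Za-z ]", "", unaccented)
    # Collapse repeated spaces left behind by removed characters.
    return " ".join(letters_only.upper().split())


def expand_nicknames(name):
    """Return the cleaned name plus its formal form, if the table has one."""
    formal = NICKNAMES.get(name)
    return [name, formal] if formal else [name]


def name_rotations(given_names):
    """Return every ordering of a space-separated list of given names."""
    return [" ".join(order) for order in permutations(given_names.split())]


print(clean_postal_code("v6t 1z3"))
print(split_date("1980-02-30"))
print(clean_name("Renée O'Brien-Smith"))
print(expand_nicknames("BILL"))
print(name_rotations("MARY ANN"))
```

Note that removing dashes joins the two halves of a hyphenated surname into a single token. Whether to join or split such names is a design choice that this sketch does not settle.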

Records from the new data file are then linked to the Population Directory using identifiers that are present in both the data file and the Population Directory. The identifiers used depend on which ones are available in the data, and are selected for their ability to identify an individual uniquely and reliably. For example, Personal Health Number (PHN), surname, given names, postal code, birth date and sex are used for linking the Vital Statistics data to the Population Directory, while PHN, birth date, sex, MSP ID and postal code are used for linking the MSP PIM data to the Population Directory.

The goal is to link together records that belong to the same individual while keeping mis-links to a minimum (that is, making as few links as possible between records that actually belong to different individuals).
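
As a rough illustration of linking on common identifiers, the Python sketch below matches records only when every chosen identifier agrees exactly and exactly one Population Directory entry qualifies. This exact-match rule is an assumption made for the example; the text above does not say how the linkage programs compare identifiers. The field names (record_id, directory_id, phn, birth_year, sex) and the sample values are made up.

```python
def link_records(new_records, directory, keys):
    """Return {record_id: directory_id} for unambiguous exact matches on keys."""
    index = {}
    for person in directory:
        key = tuple(person.get(k, "") for k in keys)
        if all(key):  # skip directory entries with a blank identifier
            index.setdefault(key, []).append(person["directory_id"])

    links = {}
    for record in new_records:
        key = tuple(record.get(k, "") for k in keys)
        # A blank (missing) identifier never matches anything.
        candidates = index.get(key, []) if all(key) else []
        # Link only when a single directory entry matches, to limit mis-links.
        if len(candidates) == 1:
            links[record["record_id"]] = candidates[0]
    return links


# Made-up example data; identifiers are assumed to be cleaned already.
directory = [
    {"directory_id": "D1", "phn": "9000000001", "birth_year": "1980", "sex": "F"},
    {"directory_id": "D2", "phn": "9000000002", "birth_year": "1975", "sex": "M"},
]
new_records = [
    {"record_id": "R1", "phn": "9000000001", "birth_year": "1980", "sex": "F"},
    {"record_id": "R2", "phn": "", "birth_year": "1975", "sex": "M"},
]

links = link_records(new_records, directory, ("phn", "birth_year", "sex"))
print(links)
```

Record R2 is left unlinked because its PHN was blanked out, which shows why the choice of identifiers depends on what is reliably present in each data file.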

Page last revised: November 4, 2014