Step 3: Ensuring accuracy
Identifiers may not be unique to an individual (e.g. postal code) or may change over time (e.g. surname). Identifying information may also be recorded inconsistently or incorrectly, or may be missing from some records. Because of this uncertainty, PopData uses two linkage techniques to make linkage as accurate as possible.
- Deterministic Linkage: A linkage technique in which links between records are determined by a perfect match on a set of common identifiers or, under more flexible rules, by a match on a subset of those identifiers. The advantage of this method is that it minimizes mislinks between the two databases; the disadvantage is that, when it is used on its own, every identifier is treated as being of equal importance and quality.
- Probabilistic Linkage: In probabilistic linkage, each identifier is given a weight according to how strong an identifier it is. For example, two records are much more likely to agree on sex than on last name by chance alone. Last name is therefore considered a stronger identifier and is assigned a higher probabilistic agreement weight. The agreements found on the common identifiers, together with the weights given to those identifiers, are used to estimate the likelihood that two records belong to the same individual. The advantage of this method is that linkages are maximized, even where data are incomplete or contain coding errors; the disadvantage is that, unless care is taken, there may be some mislinks.
At Population Data BC, both deterministic and probabilistic linkage techniques are used.
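The idea that a stronger identifier earns a higher agreement weight can be illustrated with a short sketch. It assumes the widely used Fellegi-Sunter formulation, in which the agreement weight is log2(m/u); this is an assumption made for illustration, not a description of PopData's own software. Here m is the chance that a field agrees when two records belong to the same person, u is the chance that it agrees for two different people, and the numeric values are invented.

```python
import math

def agreement_weight(m, u):
    """Weight added when a field agrees (Fellegi-Sunter form, assumed here)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (usually negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))

# Illustrative probabilities only, not PopData figures.
# Sex agrees by chance about half the time; a last name rarely does.
sex_agree = agreement_weight(m=0.99, u=0.5)
last_name_agree = agreement_weight(m=0.95, u=0.001)

# Last name is the stronger identifier, so its agreement weight is larger.
print(last_name_agree > sex_agree)
```

Under these assumed values, agreement on last name contributes roughly ten times as much as agreement on sex.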
Deterministic linkage is performed first, using a computer program that compares records in the new data file with records in the Population Directory on the basis of the common identifiers (also called linkage fields). Because the data files that PopData links often contain upwards of 40 million records, comparing every one of those records with every record in the Population Directory would be inefficient. To reduce this inefficiency, PopData compares records within pockets of data. For example, three different pockets might be used when linking one data set: a NYSIIS (phonetic code) name pocket, a birth year/birth month pocket, and a PHN pocket. For each pocket, only records that match on that linkage field are compared. In a birth year/birth month pocket, for instance, only records in the new data file that match the Population Directory on birth year and birth month are compared using the deterministic linkage program. In the end, the information from all the pockets is put together to determine the best match.
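The pocket approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PopData's program: the record layout, the field names (`id`, `phn`, `birth_year`, `birth_month`) and the function names are hypothetical, and the NYSIIS name pocket is left out because it would need a phonetic encoder.

```python
from collections import defaultdict

def birth_pocket(rec):
    """Pocket key: birth year and birth month (None if either is missing)."""
    if rec.get("birth_year") is None or rec.get("birth_month") is None:
        return None
    return (rec["birth_year"], rec["birth_month"])

def phn_pocket(rec):
    """Pocket key: the PHN itself (None if missing)."""
    return rec.get("phn")

def candidate_pairs(new_records, directory, pocket_fns):
    """Collect (new id, directory id) pairs that share a key in any pocket."""
    pairs = set()
    for key_fn in pocket_fns:
        # Index the directory once per pocket so lookups avoid a full scan.
        index = defaultdict(list)
        for rec in directory:
            key = key_fn(rec)
            if key is not None:
                index[key].append(rec["id"])
        for rec in new_records:
            key = key_fn(rec)
            if key is None:
                continue  # a missing key cannot place a record in this pocket
            for dir_id in index.get(key, []):
                pairs.add((rec["id"], dir_id))
    return pairs

# Invented example records.
directory = [
    {"id": "D1", "phn": "9000000001", "birth_year": 1980, "birth_month": 5},
    {"id": "D2", "phn": "9000000002", "birth_year": 1980, "birth_month": 5},
    {"id": "D3", "phn": "9000000003", "birth_year": 1975, "birth_month": 1},
]
new_file = [
    {"id": "N1", "phn": "9000000001", "birth_year": 1980, "birth_month": 5},
    {"id": "N2", "phn": None, "birth_year": 1975, "birth_month": 1},
]
pairs = candidate_pairs(new_file, directory, [birth_pocket, phn_pocket])
```

Only the pairs that share a pocket key are passed on for field-by-field comparison, so N1 is never compared with D3 and N2 is never compared with D1 or D2.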
For each pocket, the deterministic linkage program produces an outcome string for each potential match. The outcome string records, for each identifying variable, whether there was a perfect match, a partial match, a complete mismatch, or a missing value. If 6 identifying variables are involved in the linkage, the outcome string will have 6 digits, one for each variable, indicating a perfect match (1), a partial match (values 2-6), a complete mismatch (9), or a missing value (0). For example, if there was a match on first name, last name, sex, birth year and birth month, but not on birth day, the outcome string would be 111119. An example of partial agreement is two postal codes in which only the first three characters match.
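A small sketch shows how such an outcome string could be built. The codes 1, 9 and 0 follow the description above; the field names are hypothetical, and using code 2 for a postal code that matches on its first three characters is an assumption, since the meaning of each partial level is not specified here.

```python
FIELDS = ["first_name", "last_name", "sex",
          "birth_year", "birth_month", "birth_day"]

def compare_field(field, a, b):
    """Return one outcome digit for a single linkage field."""
    if a is None or b is None or a == "" or b == "":
        return "0"  # value missing on at least one side
    if a == b:
        return "1"  # perfect match
    if field == "postal_code" and a[:3] == b[:3]:
        return "2"  # assumed partial-agreement level: first three characters
    return "9"      # complete mismatch

def outcome_string(rec_a, rec_b, fields=FIELDS):
    """Concatenate one digit per field, in field order."""
    return "".join(compare_field(f, rec_a.get(f), rec_b.get(f)) for f in fields)

# Invented records that agree on everything except birth day.
new_rec = {"first_name": "ANNA", "last_name": "LEE", "sex": "F",
           "birth_year": 1980, "birth_month": 5, "birth_day": 12}
dir_rec = {"first_name": "ANNA", "last_name": "LEE", "sex": "F",
           "birth_year": 1980, "birth_month": 5, "birth_day": 21}

print(outcome_string(new_rec, dir_rec))  # prints 111119
```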
All candidate matches from the deterministic linkage program are then fed into the probabilistic linkage program along with a set of probabilistic weights. The probabilistic weights are contained in a ‘Link Weight Parameter File’. For each linkage field, the file holds an agreement weight, a disagreement weight, a partial agreement weight for each level of partial agreement, and a missing weight. The Link Weight Parameter File is generated from the actual frequencies of agreement and disagreement on each linkage field in the data, using an iterative process that is usually run on a subset of the actual data file.
In addition to the parameter weights, value-specific weights are generated for some of the linkage fields. These weights are generated using the Population Directory, with one file for each linkage field concerned. Value-specific weights are assigned according to how rare each value of that variable is. For example, a value-specific weight file created for given name would contain all given names found in the Population Directory; common names are given a low weight, while rare names are given a high weight. Not every linkage field has a value-specific weight file. Sex is one example, because its values occur with similar frequency (approximately 50/50 male/female).
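How value-specific weights might be derived from a directory can be sketched as follows. The formula is not given in the description above, so this example assumes one plausible choice: each value's information content, -log2(relative frequency), divided by the frequency-weighted average for the field, so that a typical value has a modifier near 1. The function name and the sample names are hypothetical.

```python
import math
from collections import Counter

def value_specific_weights(values):
    """Map each observed value to a rarity modifier (assumed formula)."""
    counts = Counter(v for v in values if v)  # ignore missing/empty values
    total = sum(counts.values())
    # Rarer values carry more information when two records agree on them.
    info = {v: -math.log2(c / total) for v, c in counts.items()}
    mean_info = sum(info[v] * c for v, c in counts.items()) / total
    if mean_info == 0:
        # Only one distinct value: agreement on it is uninformative.
        return {v: 1.0 for v in counts}
    return {v: info[v] / mean_info for v in counts}

# Invented given names standing in for a directory.
given_names = ["JOHN"] * 6 + ["MARY"] * 3 + ["ZEBEDIAH"]
weights = value_specific_weights(given_names)
```

With this sample, the common name JOHN receives a modifier below 1 and the rare name ZEBEDIAH a modifier well above 1, which matches the low-weight/high-weight behaviour described for given names.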
The output from the probabilistic linkage program contains, for each potential match, a final weight for each linkage field, equal to the parameter weight multiplied by the value-specific weight. The value-specific weight can be thought of as a modifier to the parameter weight. For example, if there is a match on postal code, the agreement weight is multiplied by the value-specific weight, so that agreement on a rare postal code is assigned a higher weight than agreement on a common postal code. The total weight for each potential match is the sum of the final weights for all linkage fields, and reflects the likelihood that the two records refer to the same individual.
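The arithmetic can be made concrete with a short sketch. All numbers, field names and the function name are invented for illustration. Two further assumptions are made: the value-specific modifier is applied only when the field agrees perfectly (outcome digit 1), and a field without a value-specific weight file gets a modifier of 1.

```python
def total_weight(outcome, fields, params, value_weights, record):
    """Sum the per-field final weights for one potential match."""
    total = 0.0
    for digit, field in zip(outcome, fields):
        param = params[field][digit]  # parameter weight for this outcome digit
        modifier = 1.0
        if digit == "1":
            # Assumed: modify agreement weights only; default to 1.0 when the
            # field or value has no value-specific weight.
            modifier = value_weights.get(field, {}).get(record.get(field), 1.0)
        total += param * modifier  # final weight for this field
    return total

fields = ["last_name", "sex", "postal_code"]

# Hypothetical parameter weights keyed by outcome digit.
params = {
    "last_name":   {"1": 6.0, "9": -4.0, "0": 0.0},
    "sex":         {"1": 1.0, "9": -3.0, "0": 0.0},
    "postal_code": {"1": 5.0, "2": 2.0, "9": -2.0, "0": 0.0},
}

# Hypothetical value-specific weights; sex has no file.
value_weights = {"last_name": {"SMITH": 0.5, "ZELENKO": 2.0}}

record = {"last_name": "ZELENKO", "sex": "F", "postal_code": "V6T1Z3"}

# Agreement on a rare last name, agreement on sex, partial postal code match:
# 6.0 * 2.0 + 1.0 * 1.0 + 2.0 = 15.0
score = total_weight("112", fields, params, value_weights, record)
```

Had the shared last name been the common SMITH instead, the same outcome string would give 6.0 * 0.5 + 1.0 + 2.0 = 6.0, showing how the modifier separates agreement on rare values from agreement on common ones.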