Splink: a software package for probabilistic record linkage and deduplication at scale

2:30pm to 3:30pm GMT (6:30am to 7:30am PST) | All sessions will be delivered live and online via the GoToWebinar system.

This webinar is part of the Power of Population Data Science Series

In this seminar, we will introduce Splink, a software package developed for probabilistic record linkage at scale.

This free software provides a toolkit for linking datasets of tens or even hundreds of millions of records. It guides the user through the various stages of linkage, including:

  • automatic profiling of data, to spot data quality issues that may affect linkage, and skewed fields
  • automatic analysis of different potential blocking rules, to understand the computational costs of different approaches
  • user-customisable rules to compare fields that can be used to model names, dates, locations and any other types of fields
  • estimation of m and u probabilities using various approaches, including the expectation maximisation algorithm
  • diagnostic charts that explain model estimates, and help build intuition for how the model works
  • interactive tools to understand and quality assure the results of record linkage
  • accuracy analysis, including ROC and precision-recall curves, for labelled data
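
To build intuition for the m and u probabilities mentioned above, here is a minimal sketch of how they turn into a match score under the standard Fellegi-Sunter model. It is plain Python and does not use Splink's own API; the field names, the m and u values, the prior and the function names are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical parameters, for illustration only:
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
M_U = {
    "first_name": {"m": 0.90, "u": 0.01},
    "dob": {"m": 0.95, "u": 0.001},
    "postcode": {"m": 0.80, "u": 0.0005},
}


def match_weight(m, u):
    """Log2 of the Bayes factor m/u for one comparison outcome."""
    return math.log2(m / u)


def field_weight(field, agrees):
    """Weight for a field: m/u if it agrees, (1-m)/(1-u) if it does not."""
    p = M_U[field]
    if agrees:
        return match_weight(p["m"], p["u"])
    return match_weight(1 - p["m"], 1 - p["u"])


def match_probability(agreement, prior):
    """Combine the prior odds with each field's weight into a probability.

    `agreement` maps field name -> True/False; `prior` is the assumed
    probability that two records picked at random are a match.
    """
    total = math.log2(prior / (1 - prior))
    total += sum(field_weight(f, a) for f, a in agreement.items())
    bayes_factor = 2 ** total
    return bayes_factor / (1 + bayes_factor)


if __name__ == "__main__":
    pair = {"first_name": True, "dob": True, "postcode": False}
    print(match_probability(pair, prior=1e-4))
```

Agreement on a field where chance agreement is rare (low u) adds a large positive weight, while disagreement on a field that true matches almost always share (high m) subtracts one. In practice the m and u values would be estimated from the data, for example with expectation maximisation, rather than set by hand as they are here.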

This tool is developed in Python and uses PySpark to enable its use on massive datasets. It has been developed by analysts at the UK Ministry of Justice (MoJ) as part of the Data First programme, and used to link some of the MoJ's largest datasets. The tool is available at https://github.com/moj-analytical-services/splink

View recorded presentation below.

What did you think of this webinar?

Please take a few minutes to complete our online survey. Your feedback will help shape future webinar series!

Speakers

Robin Linacre

Robin Linacre is a Data Scientist leading work on data linking methodology at the UK Ministry of Justice. He has a background in econometrics but more recently has worked on a variety of open source software and infrastructure.
