## 1. Overview of data linkage

Data linkage is the process of joining together records that pertain to the same entity, such as a person or business. In its most simplistic form, links (or “matches”) can be completed by comparing unique identifier keys specific to each object, for example, a person’s National Insurance number. Connecting data in this way can lead to new insights and discoveries and reduce the need for additional data collection.

As technology advances and computational power grows, it is possible to link larger datasets more efficiently. However, considerable time and effort must be invested in preparing the data prior to linking, to increase the chance of detecting matches while ensuring accuracy is preserved.

This article discusses seven functions that were created to automate part of the data linkage pipeline to improve efficiency and quality. The tools target the preliminary stages of data linkage, however, can also be used for quality evaluation post-linkage.

Each function was coded in PySpark, a Python-based application programming interface (API) appropriate for data hosted on secure distributed computing systems. Five of the functions perform exploratory data analysis and provide linkage-specific metadata, such as extracting common values and providing general descriptive statistics.

The sixth function creates a “candidate links” table from the data; a detailed record of how each data pair compares across every attribute. This is a necessary step in probabilistic data linkage.

The final function employs the Expectation Maximisation (EM) algorithm, an iterative unsupervised machine learning procedure, to estimate the statistical parameters required for probabilistic data linkage.

Back to table of contents## 2. Main approaches to data linkage

Data linkage is a multi-stage process. It is possible to introduce error at each of these stages therefore the potential for compound error is widely acknowledged (see GUILD: Guidance for Information about Linking Data, Matching and record linkage, Quantitative data cleaning for large databases, Data quality and record linkage techniques for more information).

### Identifying matches

To measure and minimise this error, it is first necessary to quantify the number of missed matches and false matches. Missed matches, also known as false negatives or Type II errors, are when records are not identified as a match when in fact they are. Conversely, false matches (false positives, or Type I errors) are when records are incorrectly identified as a match.

Matches can be identified as follows:

where a predicted positive is an actual positive, that is a “true positive”

where a predicted positive is an actual negative, that is a “false positive”

where a predicted negative is an actual negative, that is a “true negative”

where a predicted negative is an actual positive, that is a “false negative”

Depending on the context, one type of error may be more significant than another. For example, if data linkage is being used in relation to a medical diagnosis it may be favourable to obtain a false positive over a false negative; the latter being a missed diagnosis.

There are various statistics that incorporate these errors and can be used to assess the performance of data linkage. To calculate these, it is necessary to count the number of true and false negatives and positives, which can be recorded in a table known as a confusion matrix.

We can calculate common performance statistics such as recall, precision, accuracy and the F score with the following formulas:

### Main approaches

There are three major approaches to data linkage:

clerical

deterministic

probabilistic

#### Clerical approach

The clerical approach is a manual process whereby a person physically examines each combination of records and determines whether they match or not. Even with two relatively small datasets, this is a lengthy and arduous process, rendering it unrealistic as a primary method.

For example, given two datasets each containing 100 records, the total possible pairs to evaluate would be 10,000. It is also liable to subjectivity and human error, therefore is mostly used to confirm ambiguous matches (see Matching for automatic record matching and linkage and their use in national statistics), or assess true positive rates to calculate recall.

#### Deterministic approach

Deterministic methods employ exact matching techniques, where attributes for a given record pair must match according to some rule. The simplest form of deterministic matching is computationally fast: each attribute in a record pair is compared, and the attributes either match or do not match (see Matching for automatic record matching and linkage and their use in national statistics). For example, in person-related data the attributes may consist of date of birth, name, and address. If each of these attributes were identical then the record pair would be classed as a match.

Given the speed of deterministic linkage, it is often used as a first pass step to remove records from the pool of possible matches, thereby reducing the load for subsequent probabilistic matching.

#### Probabilistic approach

Probabilistic methods comprise the focus of this project and involve the calculation of conditional probabilities to determine the likelihood that a record pair is a match. This method was detailed by a seminal paper on probabilistic data linkage from 1969, and formalised the ideas first presented by Newcombe and others in their 1959 article Automatic linkage of vital records.

For a given attribute, where examples of attributes include dates, addresses, names and so on, two conditional probabilities, *m* and *u*, are calculated. The *m* probability is the probability those attributes match given the record pair belongs to the same unit. This allows for some error in the data. For example, if sex was recorded incorrectly in 5% of record pairs, then the *m* probability for sex would be 0.95 (see Challenges in administrative data linkage for research).

The *u* probability is the probability the attributes match given the record pair belongs to different units. This can be thought of as the probability of chance agreement. Using sex again, a reasonable estimate would be *u* equals 0.5. Agreement and disagreement weights, *W _{A}* and

*W*respectively, are then calculated for each attribute using

_{D}*m*and

*u*as follows in equations 1 and 2.

#### Equation 1

#### Equation 2

The calculated weights are summed for each record to produce a score, where the choice of weight for each attribute is determined by the match status for that attribute. For instance, if date of birth matches for a record pair then the agreement weight is used, if it does not match the disagreement weight is used. An example is presented in Tables 1a and 1b.

#### Table 1a: Example agreement and disagreement weights

Parameter | Forename | Surname | Date of birth | Post code |
---|---|---|---|---|

W_{A} |
7.17 | 7.11 | 6.01 | 4.03 |

W_{D} |
-1.87 | -1.66 | -6.06 | -1.26 |

#### Table 1b: Calculating match scores from agreement and disagreement weights

Record pair | Forename | Surname | Date of birth | Post code | Score |
---|---|---|---|---|---|

Pair 1 | Match W _{A} = 7.17 |
Non-match W _{D} = -1.66 |
Non-match W _{D} = -6.06 |
Non-match W _{D} = -1.26 |
-1.81 |

Pair 2 | Non-match W _{D} = -1.87 |
Match W _{A} = 7.11 |
Non-match W _{D} = -6.06 |
Match W _{A} = 4.03 |
3.21 |

Pair 3 | Match W _{A} = 7.17 |
Match W _{A} = 7.11 |
Match W _{A} = 6.01 |
Non-match W _{D} = -1.26 |
19.03 |

Match scores are calculated for records where the match status is already known, and the frequency distributions of the scores are examined to determine thresholds. Above the threshold, the records would be classed as a match, and below the threshold the records would be classed as a non-match (Figure 1).

The thresholds can then be applied to record pairs with an unknown match status to determine if they are a match or not. It is possible to use multiple thresholds, and class record pairs between these bounds as ambiguous matches. The match status can then be confirmed by clerical review. Thresholds can also be moved to tune the sensitivity of the linkage depending on the application.

### Figure 1: Example of how to evaluate match score thresholds

#### Source: Office for National Statistics – Developing standard tools for data linkage

##### Notes:

- This figure is based on example data and is adapted from a similar figure from John 2019, Cardiff University internal document.

#### Download this chart Figure 1: Example of how to evaluate match score thresholds

Image .csv .xlsGiven the size of many datasets, a technique known as blocking is often employed during probabilistic linkage, where pairs of records that are unlikely to match are removed from consideration to reduce the search space. Grannis and others 2003 liken it to "sorting socks by colour before pairing them". A block is created on a given variable, where all records in the block must agree on that variable. For example, a block on name and date of birth would only contain records where both name and date of birth match exactly. In this sense, a block is essentially a preliminary pass of low-fidelity deterministic linkage.

Of the three approaches to data linkage described, probabilistic methods are the most widely used and are therefore the focus of this project. They are the most computationally expensive, but are more robust against errors and more effective with large databases (see The effect of data cleaning on record linkage quality).

Back to table of contents## 3. Expectation maximisation algorithm

Winkler (2000) proposes a method for estimating *m* and *u* probabilities for probabilistic linkage using the Expectation Maximisation (EM) algorithm. EM is an unsupervised, iterative machine learning algorithm that is relevant to many applications where data are incomplete. In the context of data linkage, the "incompleteness" is because of the fact the true match status for a record pair is unknown.

The EM algorithm is used to calculate maximum likelihood estimates for certain parameters, in this case the *m* and *u* probabilities, and an additional parameter *p*, which is the total proportion of records that match in the entire dataset.

The EM algorithm iterates back and forth between two steps: expectation and maximisation.

### Step 1: Expectation

Given initial guesses of *m*, *u* and *p*, a parameter *g* is estimated, which is a binary indicator variable representing the match status for each record pair (Equation 3). *g* = 1 if the record pair is a match, and 0 if not.

### Step 2: Maximisation

Using *g* calculated in the expectation step, the proportion of matches and non-matches are used to update *m*, *u* and *p* (Equations 4, 5 and 6).

*m*, *u* and *p* are returned to the expectation step to re-calculate *g*.

### Formulas

To formulate this mathematically, we define *g _{j}* as a binary indicator variable representing the overall match status of each record pair:

We also define *γ ^{i}_{j}* as a binary indicator variable, representing the match status of each attribute within each record pair:

Where *i* = 1, 2,…, *n*, represents each of *n* attributes, and *j* = 1, 2, …, *N*, represents each of *N* record pairs in the dataset. We define:

*m̂*as the estimate of the_{i}*m*probability for attribute*i**û*as the estimate of the_{i}*u*probability for attribute*i**p̂*as the estimate for the proportion of record pairs*j*that are matches

The equation for the expectation step to calculate an estimate of *g* for each record pair *j*, is as follows (see Data Quality and Record Linkage) in Equation 3.

#### Equation 3

The maximisation step involves the re-calculation of *m̂ _{i}*,

*û*and

_{i}*p̂*using the following equations (see Data Quality and Record Linkage).

#### Equation 4

#### Equation 5

#### Equation 6

The complete data log likelihood (Equation 7) is calculated at each iteration, and iterations continue until successive values for the log likelihood converge.

#### Equation 7

There are many advantages of using the EM algorithm to calculate *m* and *u* probabilities for probabilistic matching. Firstly, it is an unsupervised learning method, which means there is no requirement to provide a set of training data to "teach" the algorithm before it can be used. If training data were required, then a lengthy clerical review process would first be needed to produce a set of true matches and non-matches.

Secondly, it has been shown to perform well on records with minor typographical errors (see Data Quality and Record Linkage). Finally, it is not overly sensitive to the initial parameter estimates of *m*, *u* and *p* (see Data Quality and Record Linkage), which must be provided before running the algorithm.

Regardless of these initial estimates, convergence usually occurs quickly, albeit that there is no guarantee of finding the global optimum (see Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage). The efficiency and accuracy of the EM algorithm highlights it as a valuable tool in the probabilistic linkage pipeline. As such, it formed a target area for this project.

Back to table of contents## 4. Data protection

Data protection and associated ethical obligations are important concepts in record linkage. Some databases may not breach data protection rules in isolation but would post-linkage, for example, after linking a database of medical conditions to name and address details. Confidentiality of personal data is vital and must be strictly observed to maintain trust and comply with legislation. A balance must be sought between protecting privacy while producing statistics at a sufficient resolution to be meaningful. To minimise risk, the tools created as part of this project were developed on synthetic data, which was generated to mimic the statistical properties of real data.

Back to table of contents## 5. Data pre-processing

Data pre-processing, standardisation, or cleaning is the process of forcing data into the same case, format and content, and is an important part of any data analysis. For instance, removal of non-alphanumeric characters, creating consistency in date formats, or parsing (splitting) an attribute into its component parts. This process greatly increases the chances of finding a deterministic match.

For example, comparing the name strings "Smith, Mr Bob", and "Robert Smith" would not result in a match, but if the forenames were standardised, and parsed into forename and surname fields, then the match status would be true (see Data Quality and Record Linkage). There is, however, evidence to suggest that over-standardisation can be detrimental to data linkage (see The effect of data cleaning on record linkage quality). This is because variability can be reduced to the extent that items become indistinguishable from one another, resulting in an increased risk of false positive matches.

Pre-processing includes the identification of missing values. This is particularly important in the context of data linkage because if an attribute is missing from one record then its match status with another record cannot be conclusive. Assumptions can lead to false positives or false negatives, which in turn lead to defective *m* and *u* probabilities in probabilistic linkage.

Derived variables, such as string comparators, are also calculated as part of the pre-processing stage. They provide a numerical "distance" between words based on spelling and are used in data linkage to permit a match even when there are minor differences. For example, an exact comparison between the names "Catherine" and "Katherine" would result in a match status of "false", despite the spellings being almost identical. However, a comparison allowing for a string distance of 1 would result in a match status of "true".

There are many string comparators, but the Levenshtein distance is popular in a data linkage context. This quantifies the number of insertions, deletions, or replacement of characters to transform one string into another. It can be expressed mathematically as follows in Equation 8, given two strings, *a* and *b*, where *lev*_{a,b}(*i*, *j*) is the Levenshtein distance between the *i* characters in *a*, and *j* characters in *b*.

#### Equation 8

*lev*_{a,b}(*i* -1, *j*) + 1 represents a deletion, *lev*_{a,b}(*i*, *j* -1) + 1 represents an insertion, and *lev*_{a,b}(*i* -1, *j* -1)+ 1_{(ai ≠ bj)} represents a match or mismatch.

1_{(ai ≠ bj)} is a binary indicator variable equal to 0 when *a _{i}* =

*b*, and 1 otherwise.

_{j}As an example, the Levenshtein distance between "Amy" and "Aimee" is three. This is because the transition requires two insertions ("i" and "e") and one replacement ("e" for "y").

Another variable that can be derived from string data is a phonetic code. As with string comparators, phonetic codes provide flexibility when comparing strings by allowing for slight differences in the data. This is achieved by re-coding strings according to their pronunciation. There are many types of phonetic codes, but Soundex is widely used.

A Soundex code consists of the first letter of the string followed by three digits, where each digit represents one of six consonant sounds. The first letter of the string is retained, any vowels dropped, then remaining consonants represented by a number between one and six. For example, the Soundex code for "Amy" is A500, which is identical to the Soundex code for "Amiee". Comparing the Soundex code would enable a match on these names, which would not be possible with exact matching.

A final part of data pre-processing essential to probabilistic linkage is the generation of a candidate links dataset. This is a table that records the match status of each attribute for a given record pair.

Derived variables can also be compared, such as string distance measures or phonetic codes. In the case of a string distance measure, the output is an integer, which must then be converted to true or false for inclusion in a candidate links dataset. It is therefore necessary to select a threshold, above which the match status should be considered false. The threshold can be changed depending on the requirements of the given linkage project.

An example candidate links dataset is shown in Table 2. It compares records where derived variables have been used in place of forename and surname, but an exact comparison has been used for date of birth. For instance, Pair 3 may comprise:

forename Soundex codes: A500 and A500 (match status equals true)

surname Levenshtein distance: 0 (match status equals true)

date of birth: 02/11/2020 and 10/11/2020 (match status equals false)

Candidate link pair | Forename (Soundex comparison) | Surname (Levenshtein distance) | Date of birth (exact comparison) |
---|---|---|---|

Pair 1 | False | False | False |

Pair 2 | False | True | True |

Pair 3 | True | True | False |

Pair 4 | False | False | False |

#### Download this table Table 2: Example of candidate link dataset

.xls .csvThe main purpose of creating a candidate links dataset is to record the match status of each attribute to select the correct weight, either agreement or disagreement, when calculating match scores. However, it is also a requirement of the Expectation Maximisation (EM) algorithm when iteratively calculating *m* and *u* probabilities, as the true or false status is represented by the variable *γ ^{i}_{j}* (see Section 3). It is therefore a necessary step in the probabilistic linkage pipeline and so formed another target area for this project.

## 6. Study objectives and approach

The objectives of this study were to increase the efficiency of the probabilistic data linkage pipeline by creating custom software tools to automate certain tasks. The tools were required for use at the Office for National Statistics (ONS). Commercially available software packages were not appropriate given they are often context-specific and provide limited transparency and control for the user. There may also be concerns over data security.

Three essential steps in the probabilistic linkage pipeline were selected for focus where automation is most widely useful:

step one: generating linkage-related metadata

step two: creating candidate link datasets

step three: implementing the Expectation Maximisation (EM) algorithm to estimate

*m*and*u*probabilities and their associated agreement and disagreement weights for use in probabilistic linkage

In relation to the first point, examples of linkage-related metadata include the number of records in each database, the total link space (total number of possible comparisons), quantification of missing data, recognising unique identifiers for use as match keys, and generating general descriptive statistics.

Automating step one delivers consistency and speed in identifying such data features but can also be used to provide quality evaluation post-linkage. Automating the production of a candidate links dataset for step two provides the required input for step three where *m* and *u* probabilities and agreement and disagreement weights are estimated.

## 7. Methodology used

The Office for National Statistics (ONS) collects and stores a vast quantity of data relating to the UK economy, population and society. This necessitates the use of a sophisticated distributed computing system for storage and data analysis. Because of the sensitive nature of the data, it must also be highly secure.

To achieve this end, the ONS uses Apache Spark technology, which enables efficient handling of large datasets by sharing the workload across multiple networked computers. To uphold a high level of security, the system is only accessible from the ONS offices and users must have security clearance.

Additionally, it is not possible to create a user interface for applications running on the system or run code through the command line. Any functions created as part of this project had to adhere to these conditions, and as such outputs were provided as "raw" code in the form of *.py* scripts. However, this has the dual advantage of full code transparency and adaptability for future use.

The traditional language of communication with Apache Spark is Scala; however, Python and R Application Programming Interfaces (API's) can also be used in the form of PySpark and SparklyR. PySpark was selected for this study because it is widely used within the ONS and has a straightforward syntax. The documentation is thorough, and implementation is widely discussed by the online community. A disadvantage, however, is efficiency: operations in Scala are up to 10 times faster than their PySpark equivalents (see Scala vs. Python for Apache Spark). That being said, the syntax of Scala is much more complex than Python, therefore would negate the requirement to provide accessible, transparent code.

All functions created for this project were initially tested on synthetic data, which are fabricated datasets that mimic the structure, content and errors of real data. Such datasets are ideal for testing analytical processes, since they can be "damaged" in a controlled way and then compared with the original version. False positives and false negative rates are known and therefore can be used to accurately calculate precision and recall without the need for clerical matching. Synthetic data also have no risk associated with data protection.

The first step in this study was to undertake a thorough literature review to fully understand the processes involved in data linkage, and to help in selecting which parts of the pipeline to target for automation. Discussions were then had with important stakeholders to further refine requirements, after which pseudocode was developed to map out each function in more detail. The pseudocode was then implemented in PySpark, where each stage was tested to ensure results compared with expectations.

It was then wrapped into functions and further tested on the synthetic datasets, comparing outputs with the relevant non-damaged files where appropriate. Finally, functions were released to the Data Linkage Expert Group at the ONS for user acceptance testing on real data. Any issues were resolved and returned to the user for final testing and checking.

Code was written with speed and efficiency in mind throughout development: for example, all operations were performed on PySpark dataframes, with spurious columns dropped within functions to reduce computational expense. Where possible, function outputs are given in Pandas dataframe format.

Pandas is a flexible Python library that allows manipulation and analysis of data tables. Pandas dataframes offer a wider range of functionality compared with PySpark dataframes, such as extraction of data using row indexing. They can also be rendered to Hypertext Mark-up Language (html), which is more aesthetically pleasing and affords clarity to the output. However, they are unable to deal with extremely large datasets, hence data manipulation is performed on PySpark dataframes and only the output provided in Pandas format.

All functions contain doc strings detailing the purpose, input parameters and outputs to assist the user in running the code. Additionally, a user guide was developed, which explains how to use each function in detail.

Back to table of contents## 8. Main functions

A total of seven functions were developed as part of this project and are summarised in this section.

### General summary function

Purpose: Provide general linkage-specific metadata for multiple dataframes.

Output: Pandas dataframe tabulating metadata for given input dataframes.

### Missing function

Purpose: Summarise the missingness per column. Enables user to make informed decisions on imputation.

Output: Pandas dataframe tabulating missingness of each attribute.

### Unique function

Purpose: Summarise the distinct values per column. Assess suitability of each column for use as identifier keys and blocking variables.

Output: Pandas dataframe tabulating uniqueness of each attribute.

### Common records function

Purpose: Summarise the number of distinct records in common between two input dataframes. Assesses potential bias in linkage.

Output: Pandas dataframe tabulating distinct records and associated counts.

### Descriptive statistics function

Purpose: Provide descriptive statistics per column for each dataframe. Quality evaluation post-linkage.

Output: Pandas dataframe tabulating descriptive statistics.

### Candidate links function

Purpose: Generate a candidate links dataset from random samples of two input dataframes. For deterministic and probabilistic linkage, plus anonymisation.

Output: PySpark dataframe showing Boolean match status per attribute, and Levenshtein distance for strings

*m* *u* probability function

Purpose: Iteratively estimate *m* and *u* probabilities and associated agreement and disagreement weights using the EM algorithm.

Output: Pandas dataframe tabulating *m* and *u* probabilities and agreement and disagreement weights for each attribute.

The prior assumptions for running the functions are as follows:

the PySpark session has been configured (the current version of PySpark, PySpark 3.0.1, requires Python 2.7 and Pandas 0.23.2 as a minimum)

each dataset for linkage has been loaded as a separate PySpark dataframe

the PySpark dataframes have a "flat" structure: they are simple two-dimensional tables consisting of single rows of data with no nested information

Most of the diagnostic functions take an arbitrary number of input databases where the upper limit is restricted by computational power. Larger datasets have slower run time; however, the exact run time will depend on the computational resources such as the configuration of the PySpark session. Run time can be minimised by limiting the number of dataframes passed into functions and dropping any spurious attributes.

Back to table of contents## 9. Details of functions

### Diagnostic functions

Five of the functions can be classed as diagnostic functions, as they provide metadata about the datasets provided for linkage. These five functions are:

General summary function

Missing function

Unique function

Common records function

Descriptive statistics function

These are described in detail in this subsection.

The general summary function gives a broad overview of the datasets passed to it. It can assist in finding compatible variables that could be used as match keys, while identifying areas than may require data cleaning or reformatting. This tool has the added advantage that it can be applied outside data linkage to give a quick summary of the structure and content of any two-dimensional dataframe.

Any comparisons completed by the function, such as searching for common column names, are conducted across all input dataframes. For instance, if four dataframes are passed as inputs, of which only three have identical column names, then no commonality will be detected. The objective of the general summary function is to provide a broad overview of the dataframes of interest.

Subsequent functions can then be used to provide more detail on certain aspects. Two or more dataframes can be passed as arguments, and the output is a single Pandas dataframe containing the following attributes:

the total link space stating the total possible comparisons between all records across all input dataframes

**File_Name**: String – the names assigned to the input dataframes when loaded**Row_Count**: Integer – the number of records in each dataframe**Column_Count**: Integer – the number of columns in each dataframe**Common_Columns**: List – the common column names between all dataframes in alphabetical order**Common_Columns_All_Data_Types**: List – the common column names across all dataframes and their associated data types; each item in the list is a tuple of (column name, data type) in alphabetical order and this allows identification of columns where names may be shared but data types are not, highlighting where data types require changing such that they are compatible for linkage**Common_Columns_Common_Data_Types**: List – the common column names across all dataframes, only where the associated data type is also in common; each item in the list is a tuple of (column_name, data_type) in alphabetical order by column name**Unique_Columns**: List – the column names unique to each dataframe in alphabetical order**Empty_Cells_Count**: Integer – the number of empty cells per dataframe**Empty_Cells_%**: Float – the proportion of empty cells per dataframe as a percentage of all cells in the dataframe

The missing function identifies the quantity and location of missing data and rapidly allows the user to determine imputation requirements. As with the general summary function, it also has applicability outside the field of data linkage.

Imputation is beyond the scope of this project, but this function provides sufficient information for targeting missing fields should this be necessary, in addition to understanding the impact imputation may have on the structure of the data.

One or more dataframes can be passed as arguments in any order, and the output is a single Pandas dataframe, listing the following attributes against each input column name:

**Count_Missing**: Integer – the count of missing cells per column**Percentage_Missing**: Float – the proportion of missing cells per column as a percentage; this is calculated by dividing the count by the total number of rows in the given column and the result is given to six decimal places as this degree of accuracy may be relevant for large datasets with a low number of missing cells**File**: Integer – the corresponding file (dataframe) number for that row; this allows differentiation between column names that may be common to multiple dataframes

The unique function identifies if any variables have distinct values. It can be used to evaluate the potential of each variable for use as a match key in deterministic linkage, and to inform decision-making on blocking criteria.

Most datasets are unlikely to contain unique identifiers, therefore the function supplies information on how close each attribute is to being unique. Fields can be concatenated to derive a new match key from multiple non-unique match keys. These derived variables can be passed to the function a second time to test for uniqueness.

The unique function can also inform on blocking criteria. For example, blocking on a variable that has a high distinct count would result in very small blocks and therefore a higher risk of missing potential matches. Equally, a variable with a low number of distinct values would result in very large blocks, which would not substantially reduce the total link space. One or more dataframes can be passed as arguments in any order.

The output is a single Pandas dataframe providing the following information for each input column:

**File**: Integer – the corresponding input dataframe number for that row; this allows differentiation between dataframes, which share column names**Distinct_Values_Count**: Integer – the count of distinct values per column, if duplicates were removed, for example, day of the month could reasonably be expected to have 31 distinct values, and month of the year to have 12; deviation from this would warrant further investigation or indicate that fields may have been swapped (for instance, days recorded in the months column)**Distinct_Values_%**: Float – the proportion of distinct values per column as a percentage of non-null values only; the count of the distinct values is divided by the number of non-null entries in the given column then multiplied by 100**Distinct_%_Standardised**: Float – The proportion of distinct values per column as a percentage of all values, including nulls; the count of the distinct values is divided by the total number of rows in the dataframe, then multiplied by 100 and this figure is useful in the context of columns with many missing values where Distinct_Values_% would be misleading**Count_Per_Distinct**: Float – the count of non-null values divided by the count of distinct values provides a rough indication of the number of times each distinct value occurs; if Count_Per_Distrinct equals 1, all non-null items are distinct, so each value in the column is unique**Use_As_Identifier**: Boolean – a unique identifier key requires the number of distinct values to equal the number of non-null values; in this case, Use_As_Identifier equals True and this is unlikely with real data, therefore Distinct_Values_% can be used as an indication of how close the column is to being unique

The common records function returns values that only exist in both dataframes for the given column, automatically removing duplicates. It is an extension of the unique function in that it provides the actual distinct values themselves in addition to quantities. It can help establish linkage quality as values that are not shared between datasets can result in linkage bias.

Only two dataframes can be passed to the function as it uses the PySpark intersect method, which requires two positional arguments. The function will only perform on columns that share both name and data type: the general summary function can be used first to identify these columns.

For each column name common to both input dataframes, the output details:

**Common_Distinct_Records**: List of strings – this returns values that are in both dataframes for the given attribute, automatically removing duplicates**Common_Distinct_Records_Count**: Integer – a count of values that are only in both dataframes; this is the length of the list Common_Distinct_Records

The descriptive statistics function is used to provide general statistics on the input dataframes. It can also be used post-linkage for quality evaluation by comparing outputs between individual and linked datasets. If linkage has been conducted adequately, then distributions should be preserved post-linkage.

It is also a useful tool outside the field of data linkage, as many data analysis projects require the extraction of general descriptive statistics as part of exploratory data analysis. The first argument that must be passed to this function is *error*, a float greater than or equal to 0 and less than or equal to 1. This dictates the acceptable error threshold when calculating quantiles.

As error approaches zero, run time increases, which is why the acceptable error is left to the user's discretion. After *error*, the use must pass one or more dataframes in any order. The output is a single Pandas dataframe, which details the following for each column:

**Data_Type**: String – the data type for the named column**Count**: Integer – the count of non-null values per column**Min**: Integer or float – the minimum value for the named column**Max**: Integer or float – the maximum value for the named column; useful for range checks, for example, a value greater than 12 in a month column would warrant further investigation**Mean**: String, integer or float (the output data type will be the same as the input data type) – the mean value for the named column, not applicable for string data**25th_Percentile**: Integer or float – the 25th percentile for the named column, calculated using the acceptable error threshold,*error*, specified by user when calling the function; not applicable for string data**Median**: Integer or float – the median value for the named column, calculated using*error*; not applicable for string data**75th_Percentile**: Integer or float – the 75th percentile for the named column, calculated using*error*; not applicable for string data**Std_Dev**: String, integer or float (the output data type will be the same as the input data type) – the standard deviation for the named column; not applicable for string data**Variance**: String, integer or float (the output data type will be the same as the input data type) – the variance for the named column; not applicable for string data**Frequency_List**: List of tuples of type (string, integer), where the string is the nominal value and the integer is the associated count; not applicable is returned for numerical data and the list is in descending order of frequency, therefore the first item in the list is the mode**StdDev_of_Frequency**: Float – the standard deviation of the frequencies for nominal values; not applicable is returned for numerical data, or data for which there is only one nominal value, for example, if all participants in a survey were male

All functions described are performed directly on the PySpark dataframes to optimise speed, with only the outputs sent to a Pandas dataframe. In Pandas format, row indexing can then be used to extract specific information as required.

### Candidate links function

The candidate links function produces a candidate links dataset as a PySpark dataframe. The purpose of the function is twofold: it is a vital step in both deterministic and probabilistic linkage, but also a method of data anonymisation.

Outputs are Boolean such that they can be used in the *m* *u* probability function for probabilistic matching, given the Boolean outputs directly translate to the binary indicator variable *γ ^{i}_{j}*, necessary for use in the Expectation Maximisation algorithm (see Section 3).

The following arguments must be passed to the function.

#### Sample

Float greater than 0 and less than or equal to 1. The user-defined proportion of records required from each dataframe. Samples are selected by PySpark according to the Bernoulli distribution. Sampling is necessary because the total possible combinations of record pairs for large datasets is beyond the capability even of an advanced distributed computing system.However, if the dataframes for linkage are small then the entire link space can be retained by passing sample equals 1.

#### Similarity

Float greater than 0 and less than or equal to 1. The acceptable similarity between strings to class them as a match, where 1 indicates the strings must be identical. It is used within the function to calculate a Levenshtein distance threshold according to Equation 9, which also accounts for varying string lengths (see Identifying clones in dynamic websites using similarity thresholds).

#### Equation 9

If the Levenshtein distance is less than the threshold, then the agreement status is true. A similarity of one reduces the Levenshtein threshold to zero, which means the strings must be identical to be a match.

#### df1 and df2

PySpark dataframes intended for linkage. Only two must be passed.

#### Identifier

String. Name of column that contains unique identifier (optional argument). If the input dataframes have unique identifiers, then the names of these columns can be passed such that they are retained in the output.

The function runs by extracting a sample from each dataframe according to the proportion specified in the sample argument. This accounts for differences in size between the datasets, however, does mean that the total link space will be a function of the size of the smallest dataset. The results are randomised because of the non-indexed nature of PySpark dataframes.

The function then appends a suffix of either _1 or _2 to the column names of each sample, such that they are distinguishable when comparing the values between each dataframe. Comparisons are recorded in a new column before the original columns are dropped. In this way the output ensures data protection.

Fields for which candidate links are desired must have the same column name and data type in both input dataframes. If not, the function assumes these columns are not required and they are dropped. For example, if *df1* contains a column called "Date_of_Birth" and *df2* contains a column called "DoB" then one name should be selected, and the other column renamed for consistency. The requirement to rename columns and change data types can be efficiently identified using the diagnostics functions. This is not a process that can be automated as column names and data types are context specific.

The output of the candidate links function is a single PySpark dataframe. The column headers depend on the names of the columns of the input dataframes; however, consistent suffixes are applied as follows:

**identifier_1 and identifier_2**: String – if the optional argument*identifier*is passed, then the first two columns of the output will be the unique identifiers from the original dataframes, with the suffixes "_1" and "_2" respectively**Column_bool**: Boolean – the original column name with the suffix "_bool" to indicate whether these columns are an exact match or not**Column_soundex_bool**: Boolean – the original column name with the suffix "soundex_bool" to indicate if there is a match on the Soundex phonetic code; Soundex codes are only calculated for strings and are dropped within the function to retain privacy**Column_lev**: Integer – the original column name with the suffix "_lev", which represents the Levenshtein distance between two strings, which is widely used in data linkage; this is only calculated for string data types**Column_lev_bool**: Boolean – the original column name with the suffix "_lev_bool", which converts the Levenshtein distance to a Boolean, specifying whether the string matches (true) or does not match (false); the threshold Levenshtein distance is controlled by the*similarity*argument when calling the function

Each output row represents a record pair comparison. If either record has a null value, the resulting comparison will also return null to avoid generating false negatives from missing data (it is not safe to assume a false match simply because a value is missing).

*m* *u* probability function

The *m* *u* probability function uses the Expectation Maximisation (EM) algorithm to calculate *m* and *u* probabilities and their associated agreement and disagreement weights, which are essential parameters in probabilistic linkage. This method is complex to implement manually, therefore the function affords significantly improves efficiency for the user. The arguments passed to the function are as follows:

#### Candidates

PySpark dataframe. It is recommended to use the output from the candidate links function from a formatting point of view, and because the candidate links are generated from random samples so are assumed to be representative of the entire link space. Any columns that are not in Boolean format will be dropped by the function. Any rows containing null values are dropped to avoid false negatives. The user is notified with a warning of any data that have been dropped.

#### iterations

Integer. The maximum number of iterations to be performed by the EM algorithm.

#### tol

Float between zero and one. The convergence tolerance required. This is the minimum acceptable difference between the complete data log likelihood at each iteration.

The EM algorithm requires initial parameter estimates for *m*, *u*, and *p* to initiate the expectation step (see Section 3). These are built into the function rather than selected by the user, as the algorithm is not overly sensitive to initial estimates (see Data Quality and Record Linkage). Jaro 1989 recommends that the initial value of *m* > *u* and uses *m* = 0.9 therefore this has been chosen going forward. The initial values of *u* and *p* are 0.2 and 0.5 respectively as per Analysis of a Probabilistic Record Linkage Technique without Human Review.

The EM algorithm is an unsupervised machine learning algorithm, therefore requires every column to contain some combination of both true and false values so that there is variety for the algorithm to "learn" from. For example, if all comparisons in the Forename column are false then *m* = *u* = 0, as the probability of having a match given any conditions will always be 0. This will result in an agreement weight of negative infinity and disagreement weight of 0.

Similarly, if all values are true then *m* = *u* = 1, the corresponding agreement weight would be 0, and disagreement weight would be negative infinity. It is possible that the values for m and u probabilities legitimately approach 0 and 1 as the algorithm runs. Because of limits on precision in any computational system, and subsequent rounding, this can result in 0 or infinite weights as described previously. The m u probability function handles this prior to calculating agreement and disagreement weights by replacing zero with 0.0000000000000001 and one with 0.9999999999999999.

Convergence is based on the complete data log likelihood (equation 7), which is calculated and stored for each iteration such that successive iterations can be compared. The use of complete data log likelihood as a convergence is well recognised (see Probabilistic record linkage with the Fellegi and Sunter framework, Data Quality and Record Linkage, Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida, and Maximum Likelihood from Incomplete Data via the EM Algorithm for more information).

The *m* *u* probability function allows the user to control the precision of the output by providing both an acceptable convergence tolerance and a maximum number of iterations. The function terminates at whichever is reached first.

The output of the function is a single Pandas dataframe providing the following detail for each attribute of the record pair:

**m**: Float – the*m*probability for the given attribute; the probability the attributes match given they belong to the same unit**u**: Float – the u probability for the given attribute; the probability the attributes match given they relate to different units, or the probability of chance agreement**W_a**: Float – the agreement weight for the given attribute,**W_d**: Float – the disagreement weight for the given attribute,

There are two main points that should be taken into consideration when using this function: firstly, the assumption of conditional independence in the Fellegi-Sunter method of probabilistic linkage. This is potentially violated given the user is relied upon to select the columns for use in the function. It is necessary for this to be user controlled, however, given content will vary for different datasets and applications.

Secondly, it is vital that the same attribute is not represented more than once in the input candidate links dataset, or convergence will not occur. This is best illustrated by way of an example: if the candidate links dataset contains surname_boolean, surname_soundex_boolean, and surname_lev_boolean columns, then only one of these columns should be retained as they are all represent the same attribute; in this case surname.

Back to table of contents## 10. Conclusions

Data linkage allows existing data to be re-purposed, which can lead to new discoveries in many fields of research. It is an effective, low-cost approach that can reduce the requirement to collect new data and therefore be less intrusive than traditional survey methods. Robust functions that improve the efficiency and consistency of the process have strong implications for best practise: personnel are no longer required to formulate context-specific code on a case-by-case basis, increasing overall productivity and quality of output. It also makes complicated methods more accessible to non-experts. Additionally, the provision of commented code delivers transparency in approach.

This project successfully achieved the study objective to create functions that automate and improve the preliminary stages of the data linkage pipeline. Seven functions were produced, which generate linkage-related metadata, candidate links and weights for use in the Fellegi-Sunter method of probabilistic record linkage. The latter function involves the practical application of linkage theory widely accepted in the literature (see Data Quality and Record Linkage and Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage) by invoking the Expectation Maximisation (EM) algorithm to estimate *m* and *u* probabilities. This function has potential for expansion into automated machine learning methods of determining matches and non-matches. For instance, the use of clustering techniques to automatically determine match status, rather than manually defining score thresholds.

One limitation of the project functions is that it was not possible to automate certain processes because of a lack of standard formats and naming conventions in different databases. For example, the candidate links function will only execute on columns that share both name and data type. This may necessitate manual re-naming of columns and conversion of data types prior to running the function. Yet this does have the benefit of wider application to a range of different database types, as there is no dependency on context. Another limitation is that the functions assume data are structured and have been cleaned to a certain extent. However, the diagnostic functions can be used to highlight the need for cleaning and rule-based formatting.

Feedback from users of the functions has been positive in respect of expediting the data linkage process. There has also been wider interest in the application of the diagnostic functions for general exploratory data analysis outside the field of data linkage.

Back to table of contents