1. Introduction to the linkage

This report documents the linkage between the 2011 Census and the Service Leavers Database (SLD). The SLD is a dataset provided by the Ministry of Defence (MoD), containing the records of UK armed forces service leavers who left service between 1975 and 2022. However, for this linkage the SLD is restricted to those leaving service prior to census day 2011 (27 March 2011). The linkage and its quality assurance are challenging because of limitations in the SLD variables available for linkage and differences in temporal and geographic coverage between the datasets.

Linkage between the SLD and the 2011 Census has been conducted before, as published in GOV.UK's Working age UK armed forces veterans residing in England and Wales bulletin. However, because of data retention policies, the previously linked data had been deleted. The methods for the previous linkage were reviewed, but it was decided that they should not be replicated: because of developments in the field of linkage, they would not meet current linkage best practice. As such, when re-designing a method to reproduce this linkage, more robust linkage and quality assurance methods were used.

Through a combination of deterministic and probabilistic methods, 40.5% of the deduplicated SLD is linked to the 2011 Census. The linkage quality is deemed to be low, with precision between 60% and 96% (88% as the middle uncertainty tolerance estimate) and recall between 85% and 99% (96% as the middle uncertainty tolerance estimate). The wide ranges in these quality estimates reflect that, even after clerical review, there is uncertainty about whether a record pair represents a match.

2. Dataset quality and coverage

Data quality

The Service Leavers Database (SLD), as a dataset, has limited variables for linkage and has significant missingness in key variables. The variables useful for linkage were limited to forename(s), initials, surname, date of birth, sex and postcode. The level of missingness in forename and postcode (as reported in Table 1) provided a notable challenge for linkage. The missingness in these variables, together with the lack of other geography information, makes linkage on a population level very challenging.

By contrast, the census data had low missingness and was of good quality.

Coverage

In addition to missingness, coverage differences between the SLD and the 2011 Census make this linkage challenging. Firstly, the SLD data is historic (ranging from 1975 onwards). There is a high chance that some people will have died or emigrated between 1975 and the 2011 Census (meaning they will not be present in the census), and a higher chance of name changes (through marriage and divorce). Thus, historic data is harder to link accurately. Secondly, the geographic coverage of the datasets differs: the census covers England and Wales, whereas the SLD covers UK armed forces service leavers, including those in Northern Ireland and Scotland. Because of these temporal and geographic differences, we do not know how many links to expect between the datasets, which adds to the challenge of linkage and quality assurance.

3. Methods

Pre-processing

The Service Leavers Database (SLD) and 2011 Census underwent standardisation. Cleaning steps included case standardisation, removing punctuation and standardising date formatting. In addition:

  • flags were created for records where names included titles

  • a nickname variable was added to the datasets using the Office for National Statistics's (ONS's) nickname dictionary

  • an indicator of unique biography was derived (where a full name and date of birth combination is unique within census)

  • a military indicator (where a given census record contained an occupation or industry code that related to the military) was also derived
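The cleaning steps listed above could be sketched as follows. This is a minimal illustration only, not the code used for the linkage: the function names, the list of titles and the accepted date formats are all assumptions, as the report does not specify them.

```python
import re
from datetime import datetime

# Hypothetical list of titles; the report does not say which titles were flagged.
TITLES = {"mr", "mrs", "miss", "ms", "dr", "sir"}


def standardise_name(raw):
    """Upper-case a name, remove punctuation and collapse whitespace.

    Note: this simple pattern also removes non-ASCII letters.
    """
    cleaned = re.sub(r"[^A-Za-z\s]", "", raw or "")
    return " ".join(cleaned.upper().split())


def has_title(raw):
    """Flag names whose first token is a title such as MR or DR."""
    tokens = standardise_name(raw).lower().split()
    return bool(tokens) and tokens[0] in TITLES


def standardise_date(raw, input_formats=("%d/%m/%Y", "%Y-%m-%d", "%d-%m-%Y")):
    """Return an ISO date string, or None if no assumed format parses."""
    for fmt in input_formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except (ValueError, AttributeError):
            # ValueError: wrong format; AttributeError: raw is not a string.
            continue
    return None
```

Applying the same functions to both datasets is what makes the later comparisons meaningful, since a name or date only agrees if both sides were standardised in the same way.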

Deduplication of SLD

Records with a service exit date after census day 2011 were removed first. Three rules were then applied to deduplicate the remaining data. At each stage of deduplication, the record with the most linkage identifiers present was retained; where records had the same number of identifiers, the most recent record was retained. Rules were applied in the following order:

  • deduplicating on MODID, a variable provided to us by the MoD, which indicates that records belong to the same person across multiple periods of service

  • deduplicating on forenames (including middle names), surname, full date of birth

  • deduplicating on forename (no middle name), initials, surname, full date of birth, and postcode; this rule was added after finding that there were many apparent duplicates where middle name was missing

Following this deduplication, 1,603,782 person-level records remained (with service exit date between 1975 and 2011).
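One deduplication pass of the kind described above might look like the sketch below. The field names (including "exit_date" as an ISO date string) are hypothetical, and the handling of records with a missing key (passed through unchanged) is an assumption; the report does not describe either.

```python
# Hypothetical names for the linkage identifiers described in the report.
IDENTIFIERS = ("forename", "initials", "surname", "dob", "sex", "postcode")


def completeness(record):
    """Count how many linkage identifiers are populated."""
    return sum(1 for field in IDENTIFIERS if record.get(field))


def deduplicate(records, key_fields):
    """Keep one record per key: most identifiers present, then latest exit date."""
    best = {}
    passthrough = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if any(value in (None, "") for value in key):
            # Assumption: records missing part of the key are not grouped.
            passthrough.append(record)
            continue
        # Tuples compare element by element: completeness first, then recency.
        rank = (completeness(record), record["exit_date"])
        if key not in best or rank > best[key][0]:
            best[key] = (rank, record)
    return [record for _, record in best.values()] + passthrough
```

The three rules would then be three successive calls, for example `deduplicate(records, ["modid"])` followed by calls keyed on the name and date of birth fields.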

Deterministic linkage

The 2011 Census data is deterministically linked with the deduplicated SLD data using 27 matchkeys (details in section 6). Each matchkey consists of a set of rules or criteria that must be met to make a link. To account for expected errors in the data, the criteria are loosened on different linkage variables. Matchkeys were developed through trial and error, investigating the quality of links, and making iterative improvements. They are designed to account for transposed data, input errors and partial errors. Matchkeys are applied hierarchically, starting at the strictest matching criteria and becoming looser.

For cases where one census record linked to many SLD records or vice versa, the link with the lowest (first) matchkey was retained. This allowed the strongest link to be retained. Where conflicting links were made on the same matchkey, the links were both broken because of the inability to distinguish which is the correct link.
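The conflict resolution described above can be illustrated with a short sketch. It assumes candidate links are held as (census ID, SLD ID, matchkey number) tuples, which is a hypothetical representation, and it is not the implementation used for this linkage.

```python
from collections import defaultdict


def resolve_links(candidates):
    """Resolve one-to-many links made by hierarchical matchkeys.

    candidates: iterable of (census_id, sld_id, matchkey) tuples, where a
    lower matchkey number means stricter matching criteria.
    """
    candidates = set(candidates)

    def strongest_unique(index):
        # Group links by the record on one side (0 = census, 1 = SLD).
        groups = defaultdict(list)
        for link in candidates:
            groups[link[index]].append(link)
        kept = set()
        for links in groups.values():
            lowest = min(link[2] for link in links)
            winners = [link for link in links if link[2] == lowest]
            if len(winners) == 1:
                kept.add(winners[0])
            # Otherwise the links tie on the same matchkey and are all broken.
        return kept

    # A link survives only if it is the strongest for both of its records.
    return strongest_unique(0) & strongest_unique(1)
```

In this sketch a census record linked on matchkeys 1 and 5 keeps only the matchkey 1 link, while two census records linked to the same SLD record on the same matchkey both lose their link.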

The result of deterministic linkage is a total of 611,066 matches with a match rate of 38% (of the deduplicated, pre-2011, SLD).

Probabilistic linkage

Probabilistic linkage was carried out on all records using Splink 2, a probabilistic linkage package, developed by the Ministry of Justice, which uses the Fellegi-Sunter method. Four local models were used to produce m and u values for each of the linkage variables (first name, surname, postcode, day of birth, month of birth, year of birth and sex). m values (agreement weights) are the probability that a variable agrees on two data sources given that they are a true match, so are a measure of data quality - how accurately the variable is recorded or freedom from error. u values (disagreement weights) are the probability that the variable agrees on both data sources given the pair are not a true match, so are a measure of distinguishing power or likelihood of matching by chance.
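To show how m and u values translate into a score for a record pair, the sketch below computes log2 Fellegi-Sunter weights for a simple agree or disagree comparison. It is a simplified illustration rather than Splink's implementation: the function names are hypothetical, comparisons in practice can have more than two levels, and a full model would also include a prior for the overall chance that two records match, which is omitted here.

```python
import math


def field_weight(m, u, agrees):
    """Log2 weight contributed by one variable.

    m: probability the variable agrees given the pair is a true match
    u: probability the variable agrees given the pair is not a true match
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(params, agreement):
    """Sum the per-variable weights for one record pair.

    params: {variable: (m, u)}; agreement: {variable: True or False}
    """
    return sum(
        field_weight(m, u, agreement[name]) for name, (m, u) in params.items()
    )
```

A variable with a high m and a low u (accurately recorded and unlikely to agree by chance, such as a full date of birth) adds a large positive weight when it agrees, whereas a variable such as sex, which agrees by chance about half the time, adds little.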

Probabilistic matching requires comparing each record on one dataset with all records on the other dataset to find a link. This results in a vast number of comparisons; the search space can be reduced by using blocking passes. Blocking passes mean only records which match on one or more specified variables are compared. This results in fewer comparisons being made, but those which are no longer made are ones unlikely to result in a true match. The blocking rules used in each local model are shown in Section 6. A global model was then constructed using the resulting m and u values and run to carry out the linkage.
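The effect of blocking on the search space can be seen in the following sketch. The record layout and the example rules are hypothetical; the actual blocking rules for each local model are those listed in Section 6.

```python
from collections import defaultdict


def candidate_pairs(census, sld, blocking_rules):
    """Return (census_id, sld_id) pairs that agree on at least one rule.

    blocking_rules: list of tuples of field names; a pair is compared if
    every field in any one rule is populated and equal on both records.
    """
    pairs = set()
    for rule in blocking_rules:
        # Index census records by their values for this rule.
        index = defaultdict(list)
        for rec in census:
            key = tuple(rec.get(field) for field in rule)
            if all(key):
                index[key].append(rec["id"])
        # Look up each SLD record instead of comparing it with every census record.
        for rec in sld:
            key = tuple(rec.get(field) for field in rule)
            if not all(key):
                continue
            for census_id in index.get(key, []):
                pairs.add((census_id, rec["id"]))
    return pairs
```

Using several rules, for example `[("surname", "dob"), ("postcode",)]`, means that a pair missed by one rule because of an error or missing value in a blocking variable can still be picked up by another.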

The Splink results were deduplicated by selecting, for each census ID and SLD ID, the link with the highest match weight. Following deduplication, thresholds for acceptance of links were established by reviewing a small sample of records. Records with a rounded match weight of 24 (all of these had a match probability of greater than 0.98) were accepted; this decision was made in consultation with the client, balancing the tolerance for false positive against false negative errors. This resulted in 456,283 matches and a match rate of 28.5%.
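A rough sketch of this selection step is given below. It assumes scored pairs are held as (census ID, SLD ID, match weight) tuples and that links are accepted when the rounded match weight is at or above the threshold; both are assumptions made for illustration, and the function name is hypothetical.

```python
def best_links_above_threshold(scored_pairs, threshold):
    """Keep links that are the highest-weight link for both of their records.

    scored_pairs: iterable of (census_id, sld_id, match_weight) tuples.
    Where weights tie, the first pair seen is kept (an arbitrary choice).
    """
    best_for_census, best_for_sld = {}, {}
    for pair in scored_pairs:
        census_id, sld_id, weight = pair
        if census_id not in best_for_census or weight > best_for_census[census_id][2]:
            best_for_census[census_id] = pair
        if sld_id not in best_for_sld or weight > best_for_sld[sld_id][2]:
            best_for_sld[sld_id] = pair
    return [
        pair
        for pair in best_for_census.values()
        if best_for_sld[pair[1]] == pair and round(pair[2]) >= threshold
    ]
```

Raising the threshold reduces false positives at the cost of more missed links, which is the trade-off that was agreed with the client.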

Integration of deterministic and probabilistic results

The results of the probabilistic and deterministic linkage were joined together. Conflicting links were removed, as we were unable to determine which link was correct. This resulted in the removal of 4,886 conflicting links.

Following integration, there was a total of 649,186 linked records, with a link rate of 40.5%. Of the links, 413,277 (63.7%) were made both deterministically and probabilistically, 195,320 (30.1%) were made only deterministically and 40,589 (6.3%) were made only probabilistically.
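The figures above can be reproduced from the counts reported in this section. The snippet below is only an arithmetic check; the variable names are illustrative.

```python
# Counts as reported in this section.
deduplicated_sld = 1_603_782
both_methods = 413_277
deterministic_only = 195_320
probabilistic_only = 40_589

total_links = both_methods + deterministic_only + probabilistic_only
link_rate = 100 * total_links / deduplicated_sld

# Share of the final links contributed by each route, as a percentage.
shares = {
    "both": 100 * both_methods / total_links,
    "deterministic_only": 100 * deterministic_only / total_links,
    "probabilistic_only": 100 * probabilistic_only / total_links,
}

print(f"{total_links:,} links, link rate {link_rate:.1f}%")
for route, share in shares.items():
    print(f"{route}: {share:.1f}%")
```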

4. Quality information

Clerical review

The standard approach to estimating error in the linked data is to perform clerical review (manual checking) on a sample of links and rejected record pairs. In linkage, there is a trade-off between two measures of quality: precision and recall. Precision (true positives divided by (true positives plus false positives)) is the proportion of the links made that are true matches, whereas recall (true positives divided by (true positives plus false negatives)) is the proportion of true matches that were found.
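Written as code, the two definitions are as follows. The counts used in the usage note are made up for illustration and are not results from this linkage.

```python
def precision(true_positives, false_positives):
    """Proportion of the links made that are true matches."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Proportion of the true matches that were found."""
    return true_positives / (true_positives + false_negatives)
```

For example, with 90 correct links, 10 incorrect links and 30 missed links, `precision(90, 10)` gives 0.9 and `recall(90, 30)` gives 0.75. Loosening the matching criteria would typically recover some of the missed links (raising recall) while admitting more incorrect ones (lowering precision).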

A sample of record pairs was reviewed using the Data Linkage Hub's Clerical Review Online Widget (CROW) tool. The review aimed to detect, on a pair-wise basis, false positives (incorrect links, where a match has been made that should not have been made) and false negatives (missed links, where a match has not been made that should have been).

As the service leavers data was limited, with a high proportion of the data not containing any geography information, the degree of uncertainty in clerical decisions was high. Therefore, the estimates of precision and recall are themselves limited in their accuracy. To capture this, a three-way review was conducted; each pair was reviewed by three different clerical matchers. The results were analysed to calculate best and worst-case precision and recall, as well as intermediate estimates. The clerical reviewers were all experienced and trained in clerical decision making and linkage. They were also briefed on the limitations of this data and given contextual information about both datasets.

For the clerical review, several extra variables (in addition to the name, date of birth, sex and postcode information used in linkage) were added to the data from census to aid decision making:

  • address: the address string as on census

  • unique biography: a flag to indicate whether, for the given census record, the full name and date of birth combination is unique within the census

  • military indicator: a flag to indicate if census occupation and industry codes indicate the census record was a service member at the time of 2011 Census; matchers were instructed to use this as an indicator of a connection to the military, although were told not to give it too much weight

  • name frequency: the standardised full name percentile frequency within census; it is between zero and one, with zero indicating extremely rare names and one indicating extremely common names on census

False positive review (precision)

A clerical review to estimate true positives (correct links, where a link has been made as it should have been) and false positives (incorrect links) is needed to estimate the precision of the linkage. For this review, pairs of linked records were grouped according to a combination of the deterministic matchkey they were matched on and the probabilistic score the link obtained. The groupings can be seen in Table 2. Using these 15 strata, a total of 10,152 links were selected for clerical review. All links were reviewed three times and were grouped depending on the level of agreement between the clerical reviewers:

  • 3/3 agreement of a match

  • 2/3 agreement of a match

  • 1/3 agreement of a match

  • 0/3 agreement of a match

The classification of records where there was uncertainty depended on the precision bound being calculated. The methods used to calculate the worst case, intermediate and best case for the precision estimates are summarised in Table 3.

The intermediate estimate of precision is 88.1%, where two-thirds of clerical reviewers agreed on either a match or a non-match. However, our lower and upper precision estimate bounds are 60.1% and 96.3%, respectively. This range is large, indicating that our clerical review, and consequently our precision estimate, has a high degree of uncertainty.
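The way reviewer agreement can be turned into a range of precision estimates is sketched below. The cut-offs are an assumption made for illustration (worst case counts only 3/3 agreement as a true match, intermediate counts 2/3 or more, best case counts 1/3 or more); the rules actually used are those in Table 3. The sketch also ignores the weighting of strata by size that a stratified sample would normally need.

```python
def precision_bounds(match_votes):
    """Estimate precision under three readings of reviewer agreement.

    match_votes: for each sampled link, the number of reviewers (0 to 3)
    who judged it to be a match.
    """
    n = len(match_votes)
    return {
        # Assumed rule: only unanimous matches count as true positives.
        "worst": sum(votes == 3 for votes in match_votes) / n,
        # Assumed rule: a majority of reviewers must agree on a match.
        "intermediate": sum(votes >= 2 for votes in match_votes) / n,
        # Assumed rule: any reviewer judging a match is enough.
        "best": sum(votes >= 1 for votes in match_votes) / n,
    }
```

The more links on which reviewers disagree, the further apart the worst and best estimates fall, which is why a wide range signals uncertain clerical decisions.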

False negative review (recall)

A review of true positives and false negatives (missed links) is needed to estimate the recall. For this review, record pairs from below the probabilistic threshold (non-links) were sampled by score region. A total of 7,000 record pairs were reviewed three times. Similarly to the false positive review, results were analysed based on the agreement between clerical reviewers. The results from both the false positive and false negative review were used to estimate recall and uncertainty ranges. 

A total of nine recall estimates were calculated based on the three estimates of true positives from the false positive review and three estimates of false negatives from the false negative review, estimating recall for each combination of the best-case, middle-case and worst-case true positives and false negatives. The method for calculating the worst, intermediate and best-case recall estimate is shown in Table 5.
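The nine combinations can be generated as in the sketch below. The counts in the usage note are invented for illustration and are not results from this linkage; which combinations were reported as the worst, intermediate and best cases is set out in Table 5.

```python
from itertools import product


def recall_grid(true_positive_cases, false_negative_cases):
    """Recall for every pairing of a true positive and a false negative estimate.

    Both arguments map a case label (for example "worst") to an estimated count.
    """
    return {
        (tp_case, fn_case): tp / (tp + fn)
        for (tp_case, tp), (fn_case, fn) in product(
            true_positive_cases.items(), false_negative_cases.items()
        )
    }
```

For example, `recall_grid({"worst": 600, "middle": 880, "best": 960}, {"worst": 100, "middle": 40, "best": 10})` returns nine values, the lowest from pairing the fewest true positives with the most false negatives and the highest from the opposite pairing.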

The intermediate recall estimate was 95.7%, with the worst-case recall being 85.3% and the best case being 99%. This suggests that this linkage has identified between 85.3% and 99% of the links possible within our data.

Uncertainty summary statistics

Table 6 shows there is high uncertainty across both the false positive and false negative review. However, the level of disagreement (and thus uncertainty of decisions) was higher for the false positive review.

Figure 1 shows the uncertainty, broken down by review strata, for the false positive review (for each group's criteria, see Table 2). There was a high level of uncertainty in groups 4, 8 and 12. This is notable, as these groups all matched on deterministic matchkeys but did not have a probabilistic match.

Figure 2 shows the uncertainty, broken down by review strata, for the false negative review (for each group's criteria, see Table 4). The level of uncertainty was relatively consistent across groups, with groups that had the lowest match weight (lowest chance of being a match) having the highest consistency.

Bias analysis

It is important to understand if there is linkage bias occurring within this data. Linkage bias is when the applied linkage method is better at capturing people with particular demographic characteristics, such that certain groups are under or overrepresented in the linked data. If unmitigated, linkage bias can lead to biased analytical conclusions.

In the absence of reference statistics, bias analysis is conducted by comparing the linked data with the source SLD data. However, for this linkage there is no accurate estimate of how many records are expected to link between the datasets. It is difficult to know whether differences between the linked and unlinked data reflect linkage failure (including because of data quality limitations) or differences in the coverage of the data.

Some records will not link because of legitimate reasons such as emigration and death, but it is hard to separate out these records from those which have not linked because of linkage error. Therefore, the interpretation of bias in this linkage is complicated.

To understand the potential biases in our linkage, proportional discrepancy was calculated for different variables. Proportional discrepancy is a measure of whether a particular demographic group is underrepresented or overrepresented in the linked data compared with the raw data. It is proportional to the overall match rate. It is on a scale of negative one to one, where negative one indicates severe underrepresentation in the linked data, zero indicates proportional representation and one indicates severe overrepresentation.

Year of exit and age

An analysis of bias within year of exit was performed, as shown in Figure 3. This shows an underrepresentation of people who left service in the 1975 to 1980 and 1981 to 1986 groups. All other groups are overrepresented in the linked data. This is likely to be because earlier data is of poorer quality and may be less likely to correspond with the 2011 Census. However, this trend could also be because of people having died or migrated prior to census day 2011. Whilst these factors are indistinguishable, extreme caution should be taken when using year of exit, as any patterns by year of exit observed in analysis outcomes could be because of linkage bias and may not reflect true trends or patterns.

Similar patterns were observed when investigating the bias by age group (Figure 4).

Rank

An analysis of bias by rank at exit date revealed a bias towards linking officers, as shown in Figure 5. The rank variable in the SLD denoted the rank at time of exit as either officers (OF1 to 10 including OFD) or other ranks (anyone who is not an officer, OR9 and below).

Other ranks were slightly underrepresented and missingness was highly underrepresented. Missingness may be underrepresented because of a higher likelihood of those records being of inadequate quality for linkage.

This indicates that caution should be used when considering rank within the data and that observed patterns by rank at exit date could be because of linkage bias rather than actual trends.

Rank by age group

As shown in Figure 6, an analysis of bias within rank by age group revealed a bias towards linking other ranks aged 50 years and younger, and officers across all age groups aged younger than 83 years on census day 2011. Thus, there is an interaction between age and rank, such that people of other ranks who were aged over 50 years were notably underrepresented. Officers in the same age group were overrepresented within the linked data (with the exception of the over 83 group). In the under 50 age group, both other ranks and officers were overrepresented in the linked data. Those with missingness in the rank variable were underrepresented across all age groups.

Sex

An analysis of bias by sex was conducted but is excluded from this report because of the low numbers of females in the data, leading to their removal from subsequent analysis.

5. Summary, recommendations, and limitations

In summary, 40.5% of Service Leavers Database (SLD) records were linked to the 2011 Census using deterministic and probabilistic methods. There were severe limitations to the data, and thus severe limitations to the linkage. Intermediate estimates of quality indicate 88.1% precision and 95.7% recall (with large uncertainty around these estimates). There was a bias by age and year of exit, with more recent records having a higher link rate. There was also some bias by rank.

Overall, caution is recommended when considering this linkage, because of the low link rate and precision. Analysts using this data need to be aware of the impact that linkage quality and bias may have on their analytical findings. It is recommended that any resultant publications are transparent about the limitations of the linkage.

7. Cite this methodology

Office for National Statistics (ONS), released 2 August 2023, ONS website, methodology, Service leavers database linkage to 2011 Census

Contact details for this Methodology

Hannah O'Dair
linkage.hub@ons.gov.uk