1. Disclaimer

These Research Outputs are not official statistics. Rather they are published as outputs from a proof of concept feasibility study exploring the use of administrative data linked to the 2011 Census. These outputs should not be used for policymaking or decision-making. This work uses research datasets that may not exactly reproduce National Statistics aggregates.

It is important that the research presented here be read alongside the quality and methodology information in Section 7 to help interpretation and avoid misunderstanding. These outputs must not be used without this disclaimer and warning note.

This work contains statistical data from the Office for National Statistics (ONS), which is Crown Copyright. The use of the ONS statistical data in this work does not imply the endorsement of the ONS in relation to the interpretation or analysis of the statistical data.

2. Main points

  • There is currently a need for research to better understand the barriers and gateways to social mobility to inform public policy targeted at disadvantaged children and young people.

  • This report contains outcomes of analysis and descriptions of the linkage methodology used.

  • This Proof of Concept (PoC) dataset brings together information relating to personal characteristics of children and their family members with their educational attainment; this enables insights to better understand the effect of factors such as personal and familial characteristics, and geography on educational outcomes.

  • The analysis of the PoC datasets was conducted on personal characteristics, educational attainment, household characteristics, vulnerable groups and geography.

  • Deterministic linking was used to match the 2011 Census and a bespoke extract of the feasibility AEDE to form the Growing Up in England (GUIE) dataset; a high linkage rate of 90% was achieved.

  • Prior to linkage, a series of preparatory steps were taken; these are described in further detail in Section 7: Quality and methodology.

  • Any limitations and issues identified will be considered in future iterations of the GUIE dataset.

3. Things you need to know about this release

In our role as the largest producer of independent official statistics in the UK, the Office for National Statistics (ONS) provides the data and analysis that help us understand how people in the UK experience life. This information has traditionally come from surveys or the census; however, these sources often cannot provide enough detail or timeliness. For this reason, we are developing our use of administrative and linked data, and this article presents research into how census-linked educational datasets might be used in the future.

The Growing Up in England (GUIE) dataset was produced in partnership with Administrative Data Research UK (ADR UK) as part of the Data for Children Partnership. The data were accessed for research purposes through the ONS’s Secure Research Service (SRS).

This analysis has been produced in collaboration with the ONS Centre for Equalities and Inclusion. The aim of this centre is to work with other researchers to ensure that the right data are available to address the main social and policy questions about fairness and equity in society.

These research outputs are not official statistics; however, these new data have the potential to give us far better insight into some of the factors affecting educational attainment. It is important to understand that the Proof of Concept (PoC) dataset created from this innovative linkage project into educational attainment and progression has some limitations and issues. Any disparities between groups in the level and progression of attainment presented in this research may be because of other characteristics rather than the one being directly measured and compared. The outputs are published to demonstrate the type of analysis possible using administrative data and the ADR UK investment. As such, these results should not be used to draw conclusions about educational outcomes of children and their corresponding characteristics, and instead are illustrative of the population sizes we expect to capture within the data.

4. The Data for Children Partnership

The Data for Children Partnership is a strategic partnership between Administrative Data Research UK (ADR UK), the Office of the Children’s Commissioner for England (CCO) and other parties such as academics, charities and government departments. The objectives of the partnership are to:

  • ensure policy that concerns children and young people in England is informed by high-quality, relevant data

  • unblock barriers to data sharing through ensuring relevant data collection and raising the profile of the importance of sharing and linking data from across the government

  • deliver impact through data from contributing to intelligence-led policy and achieving demonstrable outcomes from research

ADR UK is a partnership transforming the way researchers access the UK’s wealth of public sector data to enable better informed policy decisions that improve people’s lives; it is formed by three national partnerships (ADR Scotland, ADR Wales, and ADR NI), and the ONS. Ultimately, ADR UK is creating a sustainable body of knowledge about how our society and economy function by linking together data held by different parts of the government, and by facilitating safe and secure access for accredited researchers to these newly joined-up data sets. These are tailored to decision makers’ needs, to provide answers required to solve policy questions.

We provide a range of services for ADR UK, specifically:

  • We acquire permission from data suppliers for their administrative or survey records to be used for research purposes; the legal gateway for suppliers to provide access to their data through the ONS is the Statistics and Registration Service Act (SRSA) 2007 as amended by the Digital Economy Act (DEA) 2017.

  • As a data processor we clean, link and de-identify the data according to specifications defined by the researchers prior to the data being made accessible; key words such as data linkage and de-identified are defined in Annex A.

  • The SRS gives accredited or approved researchers secure access to de-identified, unpublished data in order to work on research projects for the public good and provides a safe setting, as part of the Five Safes Framework, to protect data confidentiality (the framework is a set of principles adopted by a range of secure labs, including the ONS; most datasets are available through secure remote access to the SRS and, in some instances, the data can only be accessed from an approved safe setting).

The Office of the Children’s Commissioner for England (CCO) is responsible for speaking up for children and young people so that policymakers and the people who have an impact on their lives take their views and interests into account when making decisions about them. The CCO is developing a vulnerability database to shine a light on the extent and impact of child vulnerability in England.

The purpose of this report is to present the research outcomes of a feasibility study on a dataset that links a feasibility version of the All Education Dataset for England (AEDE) to the 2011 Census by describing the datasets used for linkage, analysis, linkage methodology, and next steps.

The Proof of Concept (PoC) analysis discussed in this research output was carried out under the third objective of the Data for Children Partnership in collaboration with the Centre for Equalities and Inclusion.

5. What is included in the Proof of Concept (PoC)

For this Proof of Concept, data from the 2011 Census have been linked to education and attainment information from a bespoke extract of the feasibility All Education Dataset for England (AEDE) data from the Department for Education (DfE). Further information on the feasibility AEDE can be found on the published Feasibility AEDE source overview.

Linking the feasibility AEDE data to the Census brings personal, family and household characteristics together with educational attainment information. This is illustrated in Figure 1, where the blue boxes highlight the data included in the dataset.

All Education Dataset for England (AEDE)

Created by the DfE, the feasibility AEDE is a large longitudinal record-level education dataset that covers government-funded education in England up to the academic year 2014/15. The dataset is created from the National Pupil Database (NPD), Further Education (FE) and Higher Education (HE) data.

  • The NPD is an administrative datastore that is held by the DfE and includes English school census and attainment information from the Young Person’s Matched Administrative Dataset (YPMAD); students’ socio-demographic characteristics are obtained from the termly school census, pupil referral unit and alternative provision censuses – these are linked to attainment data recorded by awarding bodies.

  • FE data comprise Individualised Learner Record (ILR) data and include socio-demographic characteristics of individuals in further education and work-based learning in England, together with attainment information.

  • HE data are collected by the Higher Education Statistics Agency (HESA); all government-funded higher education institutes in the UK are required to send data to HESA, as are further education institutes where higher education is delivered; HE data contain information on the socio-demographic characteristics of students and any qualifications obtained.

All personal identifiers in the feasibility AEDE held by the ONS are pseudonymised (made non-identifiable) to ensure confidentiality; the method used ensures identifiable information is not revealed but can be used for data linkage.

For the purposes of this PoC, only data from the NPD have been linked to the 2011 Census. Consequently, the feasibility AEDE extract for this PoC includes only spring English School Census (ESC), Key stage 4 (KS4) and Key stage 5 (KS5) attainment data from the NPD – FE and HE data are not included.

English School Census

The ESC is a collection of pupil- and school-level information. The PoC collection includes:

  • secondary

  • middle-deemed secondary

  • local authority maintained special and non-maintained special schools

  • academies including free schools

  • studio schools, university technical colleges and city technology colleges in England; service children’s education schools may also participate on a voluntary basis

The data are collated by Local Authorities (LAs) into electronic returns and submitted to the DfE via a secure online data transfer system. The School Census is collected each term (that is, three times a year); however, the data used in the PoC relate to the January (spring term) data collection. Further information is published in School Census: Data quality and processing.

The ESC does not include data from independent schools, and only collects information about individuals attending state-funded schools in England. The information on these individuals includes:

To summarise, the ESC extract included in the PoC contains information only on ethnicity, language and mobility and it does not include information on FSM, SEN or exclusions.

Attainment data

Attainment data for KS4 and KS5 are submitted to the DfE from approximately 150 awarding bodies; the attainment data are then linked to student data.

For the purpose of this report, the data used for this linkage will be referred to as the feasibility AEDE, although it is not the full feasibility AEDE but the bespoke extract described above.

2011 Census

The census takes place every 10 years and gives the most accurate estimate of all the people and households in England and Wales. It provides a snapshot of family make-up and relationships within households as well as demographic and socio-economic characteristics for almost the entire population.

The 2011 Census holds much valuable information, including ethnicity, main language spoken, country of birth, and religion, to name just a few. Although the 2011 Census data are nearly 10 years old, the information remains relevant as some personal characteristics (such as country of birth) are generally stable over time. The 2011 Census took place on 27 March 2011, and the population of England and Wales on this day was 56,075,912. There is published information on how ONS processed the information for the 2011 Census.

Proof of Concept (PoC) coverage

The analysis presented in this report focused on the feasibility of using this type of linked data to understand factors associated with educational attainment. The linked PoC dataset contains approximately 2 million children in 2011 and 8 million household members. The populations used in the analysis were slightly smaller than these figures because of issues identified post-linkage. The analysis section below provides more information on the sample sizes and issues identified.

The size of the longitudinal sample allows for multiple disaggregations that would not be possible using traditional survey data, which have more limited sample sizes.

Only pseudonymised data have been used in this analysis and results are shown at an aggregated level, so individuals cannot be identified.

Students are contained in the linked PoC dataset if they were present in the 2011 Census and enrolled in government-funded education in England at any point between the academic years 2010/11 and 2014/15. For 2011, this creates a cohort of children enrolled in school aged between 13 and 18 years in KS4 or KS5. A representation of the age groups included in the PoC dataset is shown in Figure 2.

Further details on linkage methodology, results and quality can be found in Section 7: Quality and Methodology.

6. Feasibility analysis of the Proof of Concept (PoC) dataset

A child’s socioeconomic background is an important determinant of their chances of future life success. The performance gap between advantaged and disadvantaged children develops at an early age and widens throughout pupils’ lives. Research is therefore needed to better understand the barriers and gateways to social mobility to inform public policy targeted at disadvantaged children and young people.

Currently, there are a number of data gaps within the existing evidence base that have prevented research from being conducted into the interaction between characteristics, such as religion and family background, and educational attainment. Where data are available for important characteristics, often sample sizes prevent the ability of researchers to produce robust estimates for the smallest groups within our society, meaning that these groups are not routinely reflected in the statistics produced.

By bringing together information relating to the personal characteristics of children and their family members with their educational attainment, this dataset is uniquely placed to provide insight into the effect of factors such as personal and familial characteristics, and geography, on educational outcomes, increasing our understanding of the nuanced interactions between factors that lead to disadvantage throughout the life course. Early research to demonstrate the potential of administrative data to provide information on educational qualifications, which the census has collected since 1961, was published in October 2019.

The feasibility analysis of this Proof of Concept (PoC) dataset was carried out in collaboration with the Centre for Equalities and Inclusion. The aim of this analysis was to assess whether this new linked dataset added value to the existing evidence base surrounding the educational outcomes of children. To ensure this analysis reflected the requirements of researchers across the field, our analysis was steered by a working group of representatives from across government, academia and third sector organisations. We are grateful to the working group for their input over the course of the analysis.

Because of the limitations of the PoC dataset, the numbers provided in this section are solely illustrative and should only be used to provide researchers intending to conduct analysis on future iterations of the dataset with an estimate of the size of their populations of interest. The results in this section should not be used to make assumptions about the interaction between a child’s personal or familial characteristics and their associated educational attainment.

It is important to note that the characteristics of household members are as reported in Census 2011, while educational attainment information is taken from a longitudinal source and could have been attained in any of the academic years between 2010/11 and 2014/15. It is also possible that the characteristics recorded for children were assigned to them by a parent or other household member, rather than by the child self-identifying.

Throughout this section, numbers have been rounded to the nearest multiple of 10 as per the Secure Research Service’s (SRS’s) standard rules on Statistical Disclosure Control (SDC) to ensure information is not identifiable.
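As a minimal illustration of this rounding rule, the sketch below rounds a non-negative count to the nearest multiple of 10. The function name `round_to_nearest_10` is hypothetical, and the treatment of counts ending in 5 (rounded up here) is an assumption, as the SRS rules are not reproduced in this report.

```python
def round_to_nearest_10(count: int) -> int:
    """Round a non-negative count to the nearest multiple of 10.

    Assumption: counts ending in 5 are rounded up.
    """
    if count < 0:
        raise ValueError("counts must be non-negative")
    return (count + 5) // 10 * 10


# Example using a count quoted later in this report.
rounded = round_to_nearest_10(1_236_071)
print(rounded)
```

Integer arithmetic is used rather than the built-in `round`, which rounds ties to the nearest even value and would therefore treat 15 and 25 differently.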

Measuring educational attainment

For this analysis, educational attainment was measured using the levels defined in Table 1.

Initially the scope of the feasibility analysis included level of attainment by Key stage 4 (KS4), Key stage 5 (KS5), and a measure of progress between KS4 and KS5. However, the attainment variables used are cumulative and do not have an associated key stage indicator. While it would have been possible to use age as a proxy for key stage, it would not have been possible to tell when a qualification of a given level was attained. For example, if a child attained a “Level 3” qualification in KS4, there would be no way to identify whether this same child had attained another “Level 3” qualification in KS5.

In addition, the PoC also does not include attainment for all respondents at KS5, as some children will have completed their education in an educational setting that is not captured in the PoC, for example an independent school or a school outside of England, or may not have gone on to complete KS5.

Given the cumulative nature of the attainment data, it was not possible to tell whether a child had dropped out of the dataset, potentially having achieved higher level qualifications elsewhere, or just had not attained any higher-level qualifications at KS5. For these reasons, the analysis instead covers highest educational attainment within the period the individual is captured within the dataset.
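The measure described above can be sketched as follows. This is an illustration only: the record layout, the pupil identifiers and the integer coding of levels (a larger number meaning a higher level) are assumptions and do not reproduce the levels defined in Table 1.

```python
# Hypothetical records: (pseudonymised pupil ID, academic year, cumulative level).
records = [
    ("p1", "2010/11", 2),
    ("p1", "2012/13", 3),
    ("p2", "2010/11", 1),
    ("p2", "2011/12", 1),
]


def highest_attainment(rows):
    """Return the highest level observed for each pupil across all years."""
    best = {}
    for pupil_id, _year, level in rows:
        if pupil_id not in best or level > best[pupil_id]:
            best[pupil_id] = level
    return best


highest = highest_attainment(records)
```

Because only the maximum is kept, a pupil who leaves the dataset early simply retains the highest level recorded while they were observed.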

Personal characteristics and educational attainment

Because of the way in which the data were linked, the PoC dataset is not considered representative for those aged 13 years as of 31 August 2011, and so this analysis covers only children aged 14 to 18 years as of 31 August 2011. A number of children also had to be excluded from the analysis because of issues that arose post-linkage, which meant it was not possible to combine their educational attainment information with their corresponding characteristics collected from the Census. This resulted in the total population of cohort children within the PoC dataset falling from approximately 1.9 million to approximately 1.7 million children.

Tables 2 to 10 provide information on the population sizes of children within the PoC dataset broken down by a range of personal characteristics and educational attainment.


Familial characteristics and educational attainment

The relationship information used to derive household structure for the familial analysis comes from the 2011 Census. Each census form collected information for a maximum of six people, with households of more than six people having to request supplementary form(s) to provide information on remaining household members. Relationship information captured on supplementary forms links only to the first member of the household, meaning that the family structure of households with more than six members cannot be derived completely, and so these households have been excluded from this analysis.

The scope of this analysis included exploring the differential impact of the sex of the parent and their characteristics on the outcomes of the child. The category “mother” was assigned to all female parents and “father” to all male parents. While deriving whether a family member was the child of interest’s mother or father, there were cases where parents had “invalid” recorded as their sex. As such, it was not possible in these cases to assign a category of “mother” or “father” and so these cases have been excluded from the parent analysis in Tables 20 to 27.

For households with multiple mothers or fathers, the parent analysis includes breakdowns for each mother and father, where Mother 1 is the first recorded mother within the household (based on “Person number” from the Census), Mother 2 is the second recorded mother, and so on.

It should be noted that there were a small number of households containing three or more parents. Because of the low number of such households, these have been removed from the analysis.
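The labelling rules in the preceding paragraphs can be sketched as below. The function name, the `(person_number, sex)` layout and the sex codes `"female"`, `"male"` and `"invalid"` are assumptions made for illustration; whether parents with invalid sex count towards the three-parent threshold is also an assumption.

```python
def label_parents(parents):
    """Label one household's parents in Census person-number order.

    `parents` is a list of (person_number, sex) tuples.
    Returns None for households with three or more parents (removed from
    the analysis), otherwise {person_number: label}.
    """
    if len(parents) >= 3:
        return None
    roles = {"female": "Mother", "male": "Father"}
    counts = {"Mother": 0, "Father": 0}
    labels = {}
    for person_number, sex in sorted(parents):
        role = roles.get(sex)
        if role is None:
            continue  # sex recorded as invalid: no category can be assigned
        counts[role] += 1
        labels[person_number] = f"{role} {counts[role]}"
    return labels
```

For example, a household whose parents are persons 1 and 2, both female, would yield "Mother 1" for person 1 and "Mother 2" for person 2.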

There were also a number of other dataset limitations that resulted in further decreases in the base population for this analysis. Identification issues that arose in household members post-linkage meant that a complete household structure could not be derived for some households, with these households being dropped before analysis began. These identification issues also meant that in some cases it was not possible to join relationship information with the characteristics of household members. Such cases have been excluded from the corresponding analysis.

No-parent households could not be directly identified for this analysis; instead, the following figures refer to cases where the mother and father identities are missing. This could be where the parent’s sex was listed as “invalid” and the parent was therefore excluded from the analysis, or where further identification issues arose.

Summary of exclusions:

  • households of more than six members

  • parents with “invalid” sex

  • where complete household structures could not be derived

  • where joining relationship information and characteristics of household member was not possible

  • cases where both mother and father identities were missing

Household characteristics

Characteristics of mother

Characteristics of father

Characteristics of parents by educational attainment

Vulnerable groups and educational attainment

There are a number of characteristics used by the Office of the Children’s Commissioner for England (CCO) to identify vulnerable children. The characteristics of vulnerable children that are captured within the Proof of Concept (PoC) dataset are children from minority ethnic backgrounds, young carers, and lone-parent families.

Geography and educational attainment

7. Quality and methodology

Dimensions of quality

To support a broad understanding of the quality of this work, we adhere to the Code of Practice for Official Statistics (PDF, 5.77KB) and use the European Statistical System’s Dimensions of Quality to inform users about the quality of the data.

Data preparation of the All Education Dataset for England (AEDE)


To build the dataset, we have taken a series of preparatory steps which included pre-processing of data and hashing of data (described under data preparation). Before describing the linkage methodology in more detail, Figure 3 provides an overview of the linkage work and the datasets referred to in this article.

Data preparation

The pre-processing of data included geo-referencing, variable standardisation and matchkey creation. Geo-referencing involves referencing data to a specific and fixed point, using a geographic classification and a grid of reference. Variable standardisation puts each linkage variable into a common format on all data sources so that values can be compared. For example, if an individual’s forename is recorded as Anne-Marie on one dataset but as Annemarie on the other, standardisation removes non-alphabetic characters and capitalises the remaining letters, so the name appears as ANNEMARIE on both datasets and the forename will link after encryption. Cleaning the linkage variables in this way improves linkage rates. Additional processing is then completed on the linkage variables to build matchkeys.
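The forename example can be expressed as a short sketch. The function name `standardise_name` is hypothetical, and the actual cleaning rules applied to each linkage variable are not described in this report.

```python
def standardise_name(name: str) -> str:
    """Drop non-alphabetic characters and upper-case the remaining letters."""
    return "".join(ch for ch in name if ch.isalpha()).upper()


# Both spellings from the example reduce to the same standardised value.
same = standardise_name("Anne-Marie") == standardise_name("Annemarie")
```

Standardisation only removes formatting differences; genuinely different spellings, such as John and Jon, remain different after this step.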

To protect confidentiality, all personal data used to create the Growing Up in England (GUIE) linked dataset such as names and dates of birth were pseudonymised (made non-identifiable) through a pseudonymisation process. Data pseudonymisation, or hashing, is a one-way, irreversible process where data are pseudonymised by transforming the raw, identifiable data into a unique string of letters and numbers. The nature of the hashing process means that only in cases where two records are identical, where names, dates of birth and addresses are recorded in precisely the same format, will an automatic match be possible on the hashed values. In cases where there are spelling errors or inconsistencies between two records relating to the same individual (for example the names Samantha and Sam or John and Jon), the hash values will not be identifiable as being similar.
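The behaviour described above, where only identical values produce matching hashes, can be demonstrated with a small sketch. The hashing algorithm used for GUIE is not stated in this report; the keyed SHA-256 digest (HMAC) from the Python standard library is used here purely as a stand-in, and the key shown is hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would be held securely.
SECRET_KEY = b"example-key-for-illustration-only"


def pseudonymise(value: str) -> str:
    """Return a keyed SHA-256 digest of a standardised value as hex text."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


identical_match = pseudonymise("SAMANTHA") == pseudonymise("SAMANTHA")
near_match = pseudonymise("SAMANTHA") == pseudonymise("SAM")
```

Here `identical_match` is true and `near_match` is false: the two digests for SAMANTHA and SAM share no usable similarity, which is why variation in spelling has to be handled before hashing, through standardisation and matchkey design.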

Feasibility AEDE data processing

The feasibility AEDE consists of several files of data with information on attainment, pupil attributes and pseudonymised person identifiers all separated. To use the data for linkage, multiple steps were required to pull different aspects of an individual together. The process for building the feasibility AEDE subset for linkage is described next, and illustrated in Figure 4.

First, the feasibility AEDE attributes containing the source and academic year variables were linked to the AEDE index on the Pupil Matching Reference (PMR) number. The PMR gives each pupil a pseudonymised identifier, which is unique to them and allows matching across datasets without giving away their identity. The purpose of this step was to add the Unique ID from the AEDE index to the attributes, as this ID is needed to link to the matchkey file.

Once the feasibility AEDE attributes had been linked to the AEDE index, the AEDE index was subset to the 2010/11 academic year where School Census was identified as the source dataset. Multiple entries for the same individual were then removed; this process, known as de-duplication, was carried out on the PMR using the nodupkey option in SAS, which retains only the first instance of a record.
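The effect of this de-duplication step can be sketched in Python as below. This is not the SAS code used for the build: the function name and the `pmr` field name are hypothetical, and, unlike a SAS sort with nodupkey, the sketch keeps records in their input order rather than sorting them by the key.

```python
def drop_duplicates_keep_first(rows, key="pmr"):
    """Keep only the first record encountered for each key value."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


# Hypothetical index rows: the second "A1" entry is dropped.
rows = [
    {"pmr": "A1", "source": "School Census"},
    {"pmr": "B2", "source": "School Census"},
    {"pmr": "A1", "source": "School Census"},
]
deduplicated = drop_duplicates_keep_first(rows)
```

Which record counts as "first" depends on the order of the input, so any preference between duplicate entries needs to be settled by sorting before this step.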

Having removed all the duplicated records, the final AEDE subset for linkage contained 2,250,655 individuals. These individuals were then linked to the original matchkey file on the PMR to extract the correct matchkey records for linkage.

The resulting AEDE matchkey file, created for linkage, contained over 161 million records covering the academic years 2000/01 to 2014/15. For the purposes of this AEDE-Census linkage, the file was subset to the School Census records for the 2010/11 academic year, because this was the year closest to the 2011 Census. This is important because the information collected in this year, in particular a person’s address details, is the most likely to match that of the Census, which increases the number of records that will link.

Creating matchkeys

Census data and the feasibility AEDE do not contain a single common identifier that could be used to easily link corresponding records from one dataset to the other, for example a unique number for each individual that is common to both datasets. Therefore, a series of matchkeys containing different combinations of pseudonymised person information, including name, date of birth, gender and postcode, was used to link the AEDE to the Census. For example, forename, surname, date of birth and postcode may be combined and for each member of the population would be expected to retain a high level of uniqueness. As previously mentioned, identifiable data were hashed and used to link records between datasets in the anonymous data research environment.

It is expected that administrative data and survey data will contain some level of error in the capture and quality of the information contained therein. These errors can prevent links being made where they should be (these missed matches are known as false negatives). To help reduce the likelihood of false negatives, nine matchkeys were created and used, some of which allow for small amounts of error within the identifier variables, for instance difference in name spellings or where gender may be missing. Each matchkey is designed to gradually eliminate some of the discrepancies that may otherwise prevent automated matching (Figure 4).
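The idea of several matchkeys with different tolerances can be sketched as follows. The three definitions below are hypothetical and are not the compositions listed in Table 38; an unkeyed SHA-256 digest stands in for the pseudonymisation method, which is not specified here.

```python
import hashlib

# Hypothetical matchkey definitions, for illustration only.
MATCHKEYS = {
    1: ("forename", "surname", "dob", "sex", "postcode"),
    2: ("forename_initial", "surname", "dob", "sex", "postcode"),
    3: ("forename", "surname", "dob", "postcode"),  # tolerates missing sex
}


def build_matchkeys(person):
    """Return {matchkey number: hashed key} for one standardised record."""
    fields = dict(person, forename_initial=person["forename"][:1])
    keys = {}
    for number, names in MATCHKEYS.items():
        raw = "|".join(fields[name] for name in names)
        keys[number] = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return keys


john = {"forename": "JOHN", "surname": "SMITH", "dob": "1995-03-14",
        "sex": "M", "postcode": "AB12CD"}
jon = dict(john, forename="JON")
keys_john, keys_jon = build_matchkeys(john), build_matchkeys(jon)
```

In this example the two records disagree on matchkeys 1 and 3 but agree on matchkey 2, because the variation in spelling is removed before hashing. Each matchkey is still compared exactly; the tolerance comes from which variables, or parts of variables, it contains.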

The matchkeys were run in order of strength, by which we mean how able the matchkey is to discern between truly different records. This matchkey ordering differs from the numbering of the matchkeys provided in Table 38. Matchkeys 1 to 11 allow only exact matches on all the selected variables. The standard available matchkeys, developed for use when matching data to the Census, and the information contained within them are shown in Table 38. For more detailed methodology on matchkeys and linking pseudonymised data, see Beyond 2011 data linkage methods (PDF, 319.9KB).

Deterministic linking – Census to feasibility AEDE

Table 39 shows the linkage rates for each matchkey for all AEDE-Census records. Out of 2,250,655 records on the AEDE, 2,035,289 records (90%) linked to a corresponding Census record. Two-thirds (66.38%) of the linked AEDE records linked to the Census on the strongest matchkey (1) followed by 19.75%, which linked on matchkey 3.
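The headline figures can be checked with simple arithmetic, using only the two counts quoted above.

```python
aede_records = 2_250_655    # records in the AEDE subset for linkage
linked_records = 2_035_289  # records that linked to a Census record

unlinked_records = aede_records - linked_records
linkage_rate = 100 * linked_records / aede_records

print(f"Unlinked records: {unlinked_records:,}")
print(f"Linkage rate: {linkage_rate:.1f}%")
```

The difference between the two counts equals the number of unlinked records reported in Section 8, and the rate rounds to the 90% quoted in the text.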

To link the Census to AEDE, matchkeys from each data source are compared and if they agree, a match is established. Each matchkey is applied to the datasets in an order whereby the amount of error allowed between the datasets is increased gradually. In this way the best quality links are formed earlier. To avoid the possibility of creating false positive links (that is, matching records that should not be matched, for example, two different people) records are only linked on a matchkey if it is unique on both datasets. Where multiple records link on the same key, the link is disregarded and the records are passed on as a residual to the next pass. Matches made in the early stages are given priority over those made on later (weaker) links.
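A minimal sketch of this pass-by-pass process is given below. The function names and data layout are hypothetical, and the sketch assumes that uniqueness is assessed among the records still unlinked at each pass; the report does not say whether it is assessed on the residuals or on the full datasets.

```python
from collections import defaultdict


def unique_keys(records, matchkey):
    """Map each key value held by exactly one record to that record's ID."""
    holders = defaultdict(list)
    for record_id, keys in records.items():
        holders[keys[matchkey]].append(record_id)
    return {value: ids[0] for value, ids in holders.items() if len(ids) == 1}


def deterministic_link(aede, census, matchkey_order):
    """Link records pass by pass; returns {aede_id: (census_id, matchkey)}.

    `aede` and `census` map record IDs to {matchkey number: hashed key}.
    """
    links = {}
    aede_left, census_left = dict(aede), dict(census)
    for matchkey in matchkey_order:
        a_unique = unique_keys(aede_left, matchkey)
        c_unique = unique_keys(census_left, matchkey)
        for value in a_unique.keys() & c_unique.keys():
            a_id, c_id = a_unique[value], c_unique[value]
            links[a_id] = (c_id, matchkey)
            del aede_left[a_id], census_left[c_id]
        # Records still in aede_left and census_left are the residuals
        # passed on to the next (weaker) matchkey.
    return links
```

Because linked records are removed before the next pass, a link made on an earlier matchkey can never be replaced by one made on a later matchkey.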

Once the AEDE-Census linkage was complete, a cohort of children aged 13 to 18 years on 31 August 2011 in the Census was extracted from this linked dataset to create the initial “index” of 1,920,091 records for the Growing Up in England (GUIE) build. These records were selected as they represent the analytical cohort. This index consists of the unique Census person identifiers and household identifiers, and the unique identifier numbers provided within the feasibility AEDE. The index was used as a base on which to build the Census and feasibility AEDE attributes tables. The unique IDs were replaced with pseudo IDs on all tables provided to researchers to protect the identity of individuals within the datasets.
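The age rule used to select the cohort can be sketched as below. The function names are hypothetical, and the sketch assumes that an unhashed date of birth (or an age derived from it) is available for this step, which the report does not detail.

```python
from datetime import date

REFERENCE_DATE = date(2011, 8, 31)


def age_on(reference: date, date_of_birth: date) -> int:
    """Age in completed years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return reference.year - date_of_birth.year - (0 if had_birthday else 1)


def in_cohort(date_of_birth: date) -> bool:
    """True for people aged 13 to 18 years on 31 August 2011."""
    return 13 <= age_on(REFERENCE_DATE, date_of_birth) <= 18
```

Under this rule, someone born on 31 August 1998 is aged 13 on the reference date and falls inside the cohort, while someone born a day later is aged 12 and falls outside it.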

The next step was to link in other household information collected in the Census, and pupil attributes and attainment from the feasibility AEDE, to the cohort to produce the Census and feasibility AEDE attributes tables for analysis. Figure 5 shows which tables of data were linked and produced.

8. Linkage results

Quality checks

On the creation of the AEDE attainment tables, an anomaly was discovered: of the 1,920,091 records in the AEDE-Census linked cohort, only 1,236,071 linked to a record from academic year 2010/11 attainment data. Further investigation revealed that a number of individuals in the AEDE who had linked to the Census were present in AEDE attainment data from academic year 2011/12 onwards but were not recorded in the 2010/11 academic year. The reason for this is unclear; it may be a quality issue in the AEDE or a result of lag in the recording of data in the AEDE. This anomaly means that the number of attainment records increases through the years. For the purposes of consistency, only cohort records that linked to an attainment record in academic year 2010/11 were retained in the attainment records for each of the academic years.
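The consistency rule in the final sentence can be sketched as a simple filter. The function name and the layout of the attainment tables (one dictionary per academic year, keyed by a pseudonymised ID) are assumptions made for illustration.

```python
def retain_consistent(cohort_ids, attainment_by_year, base_year="2010/11"):
    """Keep, in every year's table, only cohort IDs present in the base year."""
    base_ids = set(cohort_ids) & set(attainment_by_year[base_year])
    return {
        year: {pid: rec for pid, rec in table.items() if pid in base_ids}
        for year, table in attainment_by_year.items()
    }
```

A record that first appears in 2011/12 is therefore dropped from every year's table, even though it linked to the Census, so that each year describes the same set of individuals.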

To evaluate the quality of the linkage, a series of age distributions was created (Figure 6).

Sample sizes:

  • Feasibility AEDE index 2010/11 – 7,143,303
  • Feasibility AEDE subset for linkage – 2,250,655
  • Feasibility AEDE-Census linked cohort – 2,035,212
  • Unlinked feasibility AEDE – 215,366

The age distributions show that, aside from the AEDE index 2010/11, there is consistency between the feasibility AEDE subset for linkage, feasibility AEDE-Census linked cohort and the unlinked feasibility AEDE, although the proportion of 14-year-olds in the unlinked AEDE is slightly higher.

The lower numbers of 13-year-olds in the linkage outputs reflect the lower proportion of this age group within the AEDE index dataset. The numbers of 17- and 18-year-olds, although consistent across the linkage outputs, are much lower than those in the AEDE index dataset; the reason is unclear and may need further investigation. Overall, the consistency of the age distributions across the linkage outputs indicates that there was no age bias in the linkage.
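
One simple way to put a number on such a comparison is sketched below. The report compares the distributions visually in Figure 6; the summary measure used here (largest absolute difference in proportion at any single age) is an assumption made for illustration, and the ages are invented toy data.

```python
from collections import Counter


def age_proportions(ages):
    """Proportion of records at each age."""
    counts = Counter(ages)
    total = sum(counts.values())
    return {age: counts[age] / total for age in sorted(counts)}


def max_abs_difference(dist_a, dist_b):
    """Largest absolute gap in proportion at any single age."""
    ages = set(dist_a) | set(dist_b)
    return max(abs(dist_a.get(a, 0.0) - dist_b.get(a, 0.0)) for a in ages)


# Invented toy ages, not taken from the GUIE data.
linked_ages = [14] * 3 + [15] * 3 + [16] * 3 + [17] * 1
unlinked_ages = [14] * 4 + [15] * 3 + [16] * 2 + [17] * 1

linked_dist = age_proportions(linked_ages)
unlinked_dist = age_proportions(unlinked_ages)
largest_gap = max_abs_difference(linked_dist, unlinked_dist)
```

A small largest gap between the linked and unlinked distributions would be consistent with the absence of age bias, although what counts as "small" is a judgement the sketch does not make.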

Unlinked feasibility AEDE records

A total of 215,366 AEDE records did not link to a Census record, which is approximately 10% of the records extracted for the linkage. Because there are no sizeable differences in the age distributions shown in Figure 6, age is unlikely to be the reason these records did not link.

Boarders were of particular interest because they may be listed at a different address on the feasibility AEDE to their usual address on the Census. The number of boarders within the unlinked feasibility AEDE records was therefore examined for any linkage bias (Table 40). Of the 215,366 unlinked records, 212,449 (over 98%) were not boarders. Boarder status therefore cannot explain more than a small share of the records that did not link to the Census.
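
The shares quoted for the unlinked records can be reproduced from the published counts. The variable names below are hypothetical; the counts are those given in this section.

```python
subset_for_linkage = 2_250_655      # feasibility AEDE subset for linkage
unlinked = 215_366                  # unlinked feasibility AEDE records
unlinked_non_boarders = 212_449     # unlinked records that were not boarders

unlinked_share = unlinked / subset_for_linkage
non_boarder_share = unlinked_non_boarders / unlinked

# Remainder: boarders, plus any other categories in Table 40 (not shown here).
unlinked_remainder = unlinked - unlinked_non_boarders

print(f"Unlinked share of subset: {unlinked_share:.1%}")
print(f"Non-boarder share of unlinked: {non_boarder_share:.1%}")
print(f"Remaining unlinked records: {unlinked_remainder:,}")
```

The remainder is small relative to the 215,366 unlinked records, which is why boarder status alone cannot account for the unlinked records.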

One of the limitations of linking pseudonymised data is that it is difficult to identify where specific errors in the recording of variables affect the ability of the algorithm to make matches. Without being able to clerically review identifiable record-level data, it is difficult to correct for error in the variables and to calculate false positive and false negative error rates (that is, the rates of records that did match but should not have, and of records that should have been matched but were missed).

9. Lessons learned and next steps

This report showed that a sufficiently high proportion of records linked between these datasets, and that the majority of the links were made on strong matchkeys, giving confidence that those links are correct; this demonstrates the feasibility of the linkage. It is therefore recommended that research under the Growing Up in England (GUIE) theme should continue and should consider linkage as an important part of that work.

In this specific case, it would be useful to investigate the linkage quality further using clerical samples, and to answer outstanding questions such as why the numbers of 17- and 18-year-olds in the linkage outputs are much lower than in the AEDE index dataset.

Where individuals in the feasibility AEDE attainment data were not found in the 2010/11 academic year but appeared in later years, the linked attainment tables were re-run to provide consistency. This involved linking the cohort to the 2010/11 academic year of attainment and bringing only those cohort members into the longitudinal file. This step can be incorporated into the methodology for the future GUIE build in the new linkage environment.

These lessons, together with the issues identified during the analysis stage, will help inform future iterations of the GUIE dataset. This PoC dataset is a feasibility study on the potential of linked administrative data.

The next steps for the GUIE project include the re-creation of this linked dataset, called GUIE Wave 1, in a new linkage environment at the Office for National Statistics (ONS). This will allow for a better matching exercise, as the quality of matches can be reviewed when linkage is conducted. This opens the potential for creating additional matchkeys that can consider more and different types of error. The GUIE Wave 1 dataset will be made available to accredited researchers with approved projects to use via the Secure Research Service (SRS). The expected delivery of this dataset into the SRS is late 2020.

10. Annex A: Glossary

Data linkage

The act of bringing two or more datasets from different sources together, creating associations between the data. Data linkage can provide new statistical insights not possible with information from a single source.

Data processing

Data processing is the method applied to convert data into a format that can be interpreted, analysed and used for a variety of purposes.

Data quality

An essential characteristic that determines the reliability of data for making decisions. High-quality data are complete, accurate, available and timely.


De-identified data

De-identified data do not contain any personally identifiable information, such as name, address or postcode. The identifiers are removed from the records before the de-identified microdata are securely transferred to the Secure Research Service (the secure environment where access is controlled).

11. Annex B: Highest educational attainment of child by Local Authority 2011

Contact details for this Methodology

Diana Airimitoaie
Telephone: +44 (0) 1329 44 7871