Policy for social survey microdata

Social survey microdata are a useful research tool for a range of users. Microdata files of different degrees of disclosure risk are made available to other government departments and to academic and other researchers. This is in accordance with Principle 1 of the Code of Practice for Official Statistics (CoP).

This policy provides guidance on releasing microdata in accordance with the CoP, specifically Principle 5: Confidentiality. It also conforms to the Statistics and Registration Service Act (SRSA) (2007), specifically Section 39.

The main section discusses how the CoP and SRSA affect the release of microdata and includes examples of our good practice.

Appendix 1 contains guidance on preparing microdata so that they are not personal information. Further appendices contain reference information.

This document has been approved as GSS Policy by the GSS Statistical Policy and Standards Committee.

Guidance for the release of an open data file

Introduction

This guidance is designed for quick reference for public sector organisations when producing and publishing an open data file, the equivalent of a public use dataset. The aim of this guidance is to show you the basic concepts behind the production of a dataset where the likelihood of disclosure is negligible. It is not a formal Government Statistical Service (GSS) paper, but aims to guide you through the main points of the process.

These ideas can be applied to both business and social surveys to assist in producing record-level data where individual units will not be identifiable (this is a legal requirement for GSS data released under an Open Licence).

There is increasing demand for record-level data to be made available to the public without the restriction of licensing agreements. For this to happen, statistical disclosure control (SDC) will almost certainly need to be applied to the data to ensure that the dataset has sufficient protection against the risk of an individual, household, business or other statistical unit being identified. This guidance includes the steps that should be followed when producing an open dataset. To place open data in context, consider the current microdata release options.

There are three main tiers of data release (one of which is open data):

Open data released under an Open Government Licence (OGL)

For our releases, these datasets contain no personal information as defined under the Statistics and Registration Service Act (SRSA 2007) and all GSS releases should also not be personal data under the Data Protection Act (DPA 1998). For further background see Section 39 of the SRSA for the definition of personal information and Section 1 of the DPA for that of personal data. Individuals, households and businesses will not be identifiable from the released data; consequently there are few restrictions to use. Registration is not required to use these data.

Uses: For distribution to students as a teaching dataset. Basic exploratory analysis may also be feasible. The data structure may also be useful in developing code for testing, or for developments in methodology.

Safeguarded data released under an End User Licence (EUL)

These datasets are not personal information but may be personal data. Users will be required to sign a declaration before gaining access to the data. One condition of an EUL is that the user must maintain the confidentiality of the data and should not attempt to identify any individual, household or business in the data, nor to claim to have made an identification (in the case of spontaneous recognition).

Uses: For straightforward but not excessively detailed data analysis. Modelling using standard techniques can be applied, but the dataset may not be suitable for the most thorough analysis.

Restricted or controlled access

In the Office for National Statistics (ONS) this access would be via a legal gateway such as to approved researchers. These data are identifiable and potentially disclosive. Data are usually accessible through a secure data laboratory following a successful application and an introductory training session. Outputs are checked by the data manager prior to release to the user.

Uses: For the most detailed analysis. This dataset should be suitable for all analytical requirements.

More details on data access can be found on the UK Data Service website.

Note that for non-public sector organisations, one could publish open data under a Creative Commons Licence (CCL).

The remainder of this guidance discusses the open data option. The term “open data” is often used interchangeably with “public use data” and these are often used as teaching or training datasets. They can also allow code to be tested and checked before it is run on a more complete dataset that has been released under licence.

Depending on the level of statistical disclosure control (SDC) applied the open data may be of limited use to researchers. Data might be recoded and/or only a limited number of variables released.

Public sector information published under an Open Government Licence gives the user considerable freedom in how the data are used. Users are allowed to publish and adapt the data, and to combine them with other data, as long as the resulting information is not personal data. The only conditions are that the user must acknowledge the data source and must not misrepresent the data. Producing open data is therefore a difficult balancing act: the data must be of some use, yet protected to a suitable level even when combined with other data sources.

Dataset background

Prior to publishing the file it will be useful to understand the data, understand the users and understand their main needs. The aim of this is to ensure that the data are of maximum utility and that time is not wasted preparing records and variables that do not interest the user.

If it is possible, talk to potential users of the data. What type of research will the data be used for? In particular:

what variables are they most interested in?
what level of detail is required for these variables?
what level of geography is required?

Is the original dataset from a sample survey, an administrative dataset or a census? If it is a sample survey then there is already some protection, as there is doubt as to whether a member of the population appears in the sample. However, it is possible that a user may have some response knowledge – that is, they may be aware through conversation that one of their friends, relatives or acquaintances took part in a specific survey. One might consider this scenario within the risk assessment. If it is an administrative dataset or from a census then there is the option of releasing a sample from the complete data.

These factors need to be taken into consideration when looking at the more detailed following steps. This should ensure that the most appropriate open dataset is published. Data users will have different requirements but if there are common themes then attempts can be made to ensure that particular variables are published with as much detail as possible.

Key variables

These are the variables most likely to lead to confidential information being found in a dataset. They are typically visible variables (those that an intruder might know through observation) or sensitive variables (those which, if known by an intruder, would be likely to assist in an immediate identification).

An initial step is to consider the level of geography at which the data are to be published. Open data are unlikely to be very detailed, and the standard level of geography within the UK is likely to be region or country, depending on the other variables in the data and the total number of cases within the various levels of geography.

Familiarity with the data should enable a short-list of additional key variables for each dataset to be drawn up. A small selection of common key variables from a range of datasets is:

age (individual or grouped)
sex
health indicator (more likely to be a key variable if a specific condition)
size or composition of household
income (household or individual)
occupation or industry
ethnic group
religion
country of birth
marital status

There will be other key variables unique to particular datasets. These could include such variables as:

house type
house age
floor size
college course
course provider
number of dependent children in household

Often in a published open dataset there is a “response” variable for each record, the most important “outcome” from each respondent record that relates to the specific purpose of the collection. Examples are:

income
total amount of benefit received by the household
household expenditure
domestic gas or electricity consumption
exam grade

Note that these lists are not prescriptive; the risks of each dataset should be considered separately with reference to statistical disclosure control issues. There may be variables specific to the topic where as much detail as possible should be retained.

The published dataset should be protected to ensure that these individual attributes cannot be associated with a particular individual, household or business. Combinations of key variables should be tabulated to look for unique and rare records with respect to these key variable combinations.

Some datasets might not be suitable for release as open data because of the sensitivity of the subject matter. A guide to some of the variables considered more sensitive can be taken from Section 2 of the Data Protection Act (DPA). The level of disclosure control required would almost certainly limit the datasets’ usefulness, even as a teaching resource. Examples could include:

Personal Health Survey
Social Attitudes Survey (personal views on ethnicity, religion, drug use)
Victim of Crime Survey

Now consider how a dataset could be shown to be disclosive by looking at a number of intruder scenarios. These include combining the open data with other data sources.

Intruder scenarios

An intruder (or attacker) is somebody who attempts to discover personal information about an individual, household or business in the dataset. Intruder scenarios are specific situations in which this may occur; they typically involve assumptions about the intruder's knowledge of one or more members of the dataset, usually with respect to a number of variables known as key variables.

An intruder may:

attempt to match a number of individuals or other statistical units in the data with other sources; these will share a characteristic of interest such as ethnic group, household composition or occupation
attempt to find a specific individual, household or business as they have knowledge of information unique to this record (including, perhaps, response knowledge)
attempt to find an arbitrary record in order to demonstrate that the published data are not secure
attempt to find a record in the survey and use this information to find out more about this record externally; this is the reverse of the previous types of attack

Scenarios relating to open data follow.

Scenario 1 – Use of published datasets

An intruder in possession of published datasets can use key variables to match these against the published open data. This may enable identification of an individual or other statistical unit if the published data include direct identifiers.

Examples of such published information are:

the Electoral Register or equivalent
Land Registry and other property databases
192.com and similar searchable websites for people and businesses
commercial datasets, such as consumer profile databases available for the public to purchase
personal or business websites, for example, for the self-employed

These datasets will include a wide range of variables for the household and the adults within it. With this level of detail it might not be difficult to link them with a distinctive record in the published open dataset.

Typical variables on these datasets are as follows. It is likely that some of these will be banded or aggregated in some form:

name
address
postcode
age
sex
ethnicity
number of cars
number of children
size of household
tenure
house type
number of rooms
occupation
working status
income
qualifications (academic or vocational)

This scenario assumes that an intruder would link published information with the released microdata using a selection of these variables. They could then have the direct identifiers, such as name and address, linked with all the other information in the released dataset and there would be a high probability of these being correct matches. If it is possible for an intruder to use a previously published dataset A to identify an individual in the new dataset B, the dataset B is said to be personal information. In ONS, publication of B as an open dataset would breach the Statistics and Registration Service Act (SRSA) and, for all public sector bodies, would be a breach of the Code of Practice and Data Protection Act.
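The linkage described in this scenario can be sketched in a few lines of Python. This is a minimal illustration only: the two small datasets, the field names and the choice of matching variables are all invented for the example and do not come from any real source. The sketch treats a match as a risk only when the combination of key variables is unique in both datasets.

```python
from collections import defaultdict

# Hypothetical published dataset A: includes a direct identifier (name).
published_a = [
    {"name": "Person 1", "region": "North East", "age_band": "35-54",
     "sex": "F", "occupation": "Teacher"},
    {"name": "Person 2", "region": "North East", "age_band": "35-54",
     "sex": "F", "occupation": "Nurse"},
    {"name": "Person 3", "region": "North East", "age_band": "55+",
     "sex": "M", "occupation": "Farmer"},
]

# Hypothetical open dataset B: no direct identifiers, but a sensitive value.
open_b = [
    {"region": "North East", "age_band": "35-54", "sex": "F",
     "occupation": "Teacher", "income": 31000},
    {"region": "North East", "age_band": "55+", "sex": "M",
     "occupation": "Farmer", "income": 18000},
    {"region": "North East", "age_band": "55+", "sex": "M",
     "occupation": "Farmer", "income": 42000},
]

# Assumed set of key variables shared by both datasets.
MATCH_KEYS = ("region", "age_band", "sex", "occupation")


def key_of(record):
    """Return the tuple of key-variable values for one record."""
    return tuple(record[k] for k in MATCH_KEYS)


def one_to_one_matches(dataset_a, dataset_b):
    """Return (a_record, b_record) pairs whose key is unique in both datasets."""
    by_key_a = defaultdict(list)
    by_key_b = defaultdict(list)
    for rec in dataset_a:
        by_key_a[key_of(rec)].append(rec)
    for rec in dataset_b:
        by_key_b[key_of(rec)].append(rec)
    return [
        (recs_a[0], by_key_b[key][0])
        for key, recs_a in by_key_a.items()
        if len(recs_a) == 1 and len(by_key_b.get(key, [])) == 1
    ]


matches = one_to_one_matches(published_a, open_b)
for a_rec, b_rec in matches:
    print(a_rec["name"], "links to income", b_rec["income"])
```

In this invented example only one record links: the second published record has no counterpart in the open data, and the third matches two open records, so the intruder cannot tell which is correct.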

Scenario 2 – Spontaneous recognition

An intruder may spontaneously recognise an individual or individuals or business in the microdata by means of published information. This can occur for instance when a respondent has unusual characteristics and the individual is in the public domain or a business is known to the intruder. This may not necessarily be malicious and could occur without the intruder attempting to identify the individual case. They may still be referred to as an intruder, despite the lack of intent.

The variables that the intruder might know about a friend, family member or work colleague (and thus use to help in identifying the individual for this scenario) are, for individuals:

name – this must always be removed from datasets to be published
age (possibly banded in the dataset)
sex
marital status
income (possibly top and bottom coded)
occupation (possibly not detailed in the dataset)
address (if a geography variable is included in the dataset)
housing variables, such as accommodation type and tenure
ethnic group (where unusual)

While for businesses the intruder might know:

industry sector
location

Scenario 3 – Nosy neighbour

This has some similarities to spontaneous recognition except that here the knowledge does not involve published information. Instead the scenario relies exclusively on private knowledge that only a small number of people such as neighbours or close friends would know. In this scenario it is more likely that one particular individual will be targeted. The key variables could include those in Scenario 2 plus some specific household variables such as:

number of cars – external visible information
number of children – likely to be known to neighbours
size of household – likely to be known to neighbours
accommodation type – external visible information

It is sometimes a challenge to protect against Scenarios 2 and 3 as there is a wide range of variables that could potentially be used by an intruder. If this is thought likely to be a serious risk in a number of cases (but one that would require privately held knowledge about the individual), some protection through safeguarding would be appropriate; this is equivalent to release under the End User Licence (EUL) mentioned in the introduction. The GSS Guidance for Microdata from Social Surveys is a formal document that discusses microdata largely in terms of safeguarded (EUL) outputs.

Results from variable combinations

Select a set of key variables (5 or 6 for each dataset) and tabulate combinations to create a series of 2, 3 and 4 dimension tables. These combinations should be plausible, that is, likely to be similar to tables required by researchers.

Example combinations for social survey and census data are:

region by sex by age (individual and/or grouped)
region by age by marital status
region by sex by marital status
region by age by occupation
region by income by age by sex

For business surveys combinations could include:

region by industry sector
region by number of employees

In general, those with extensive knowledge of the data should take the lead when deciding on a suitable range of combinations for selection. It should be noted that creating tables with a large number of variables will be counterproductive as patterns may emerge that would not be noticed by a researcher. Most records are unique if a large number of variables are combined.

When creating these variable combinations, it is important to always consider what an intruder is likely to know (and the extra information they want to find out from the data) when deciding on which variable combinations to consider.
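As an illustration, the tabulation of key variable combinations could be sketched in Python as follows. The records and variable names are hypothetical. For brevity the sketch builds every 2, 3 and 4 dimension table from the chosen key variables; in practice the list would be restricted to the plausible combinations discussed above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records; in practice these would be the rows of the microdata.
records = [
    {"region": "A", "sex": "F", "age_band": "16-34", "marital": "single"},
    {"region": "A", "sex": "F", "age_band": "16-34", "marital": "single"},
    {"region": "A", "sex": "M", "age_band": "35-54", "marital": "married"},
    {"region": "B", "sex": "M", "age_band": "35-54", "marital": "married"},
    {"region": "B", "sex": "F", "age_band": "55+", "marital": "widowed"},
]
key_variables = ["region", "sex", "age_band", "marital"]


def tabulate(rows, variables):
    """Frequency table: one count per observed combination of the variables."""
    return Counter(tuple(row[v] for v in variables) for row in rows)


# Build every 2-, 3- and 4-way table from the key variables.
tables = {
    combo: tabulate(records, combo)
    for size in (2, 3, 4)
    for combo in combinations(key_variables, size)
}

for combo, table in tables.items():
    uniques = sum(1 for count in table.values() if count == 1)
    print(" by ".join(combo), "-> cells:", len(table), "uniques:", uniques)
```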

The following steps should be considered for each combination. Note that there could be some data quality issues within some of the variables, for example, where prone to capture or coding error, but it is not necessary to isolate these from the testing. Even in poor quality data, there are still likely to be many records with perfectly accurate information or data recorded.

Step 1: Examine the outputs

Look for low counts in the tables. In particular:

sparse tables with a low average cell frequency may be problematic; one option is to set a minimum average frequency
a count of 1 shows that there is a unique combination in the population or sample
other low counts show that the combination is rare
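These checks can be expressed as a short Python sketch. The frequency table is invented, and the threshold below which a count is treated as rare is an assumption chosen for illustration; this guidance does not prescribe a value. Note that the average is taken over observed (non-empty) cells only.

```python
from collections import Counter

# Hypothetical frequency table for region by sex by age band.
table = Counter({
    ("North East", "F", "16-34"): 120,
    ("North East", "M", "16-34"): 95,
    ("North East", "F", "85+"): 2,
    ("North East", "M", "85+"): 1,
})

RARE_THRESHOLD = 3  # assumed: counts above 1 but below this are "rare"


def summarise(freq_table, rare_threshold=RARE_THRESHOLD):
    """Count unique and rare cells and compute the average cell frequency."""
    counts = list(freq_table.values())
    return {
        "cells": len(counts),
        "uniques": sum(1 for c in counts if c == 1),
        "rare": sum(1 for c in counts if 1 < c < rare_threshold),
        "average_cell_frequency": sum(counts) / len(counts),
    }


summary = summarise(table)
print(summary)
```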

Combinations with low counts (especially counts of 1) could allow an intruder with knowledge of an individual or business with these characteristics in the data to make an identification. This identification may not be correct due to a number of factors (record is in the population but not the sample, intruder incorrectly remembering the characteristics, data transcribed incorrectly) but the perception of disclosure is still present, which could cause the data provider reputational issues.

Where rows and/or columns have all the frequencies in a single cell, or have an uneven distribution of frequencies, those combinations ought to be considered closely. They could be characteristic of the data or indicate that the sample (if it is a sample) is not a satisfactory representation of the population.

Step 2: Apply statistical disclosure control

If there are any combinations with the characteristics described in Step 1, particularly unique combinations, there are a number of options. A single option can be followed or they can be used in combination.

Recode some of the key variables so there are fewer categories. Examples are:

use a high level of geography; country or region rather than local authority (using a high geography is likely to reduce the risk considerably)
recode age into 5 or 10 year age-groups
top-code annual income at an appropriate level, for example, £100,000
recode country of birth to UK or non-UK (or to a small number of categories)
reduce the number of categories in the marital status variable

These are specific examples. Familiarity with the data will enable the data provider to decide on the best recoding policy. If a user representative can be involved at this stage it could help in ensuring that the published dataset is still of reasonably high utility for the intended purpose.
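A minimal Python sketch of three of these recodes is given below. The function names, the band width, the income ceiling and the country labels are illustrative assumptions; the appropriate values depend on the dataset and its users.

```python
# Assumed labels for UK countries of birth in a hypothetical source file.
UK_COUNTRIES = {"England", "Scotland", "Wales", "Northern Ireland"}


def band_age(age, width=10):
    """Recode single year of age into bands such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def top_code(value, ceiling=100_000):
    """Cap a continuous value, such as annual income, at the ceiling."""
    return min(value, ceiling)


def recode_country_of_birth(country):
    """Collapse country of birth to UK or Non-UK."""
    return "UK" if country in UK_COUNTRIES else "Non-UK"


record = {"age": 37, "income": 250_000, "country_of_birth": "France"}
recoded = {
    "age_band": band_age(record["age"]),
    "income": top_code(record["income"]),
    "country_of_birth": recode_country_of_birth(record["country_of_birth"]),
}
print(recoded)
```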

Carry out other SDC techniques on the data. These include:

Perturbation

Change the categories of values in some records including all the unique or rare records. For continuous variables such as income one could add or subtract a random amount from the reported value. This could be carried out on a selection of records and not just those that are potentially identifiable.
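For a continuous variable, perturbation could be sketched in Python as below. The maximum shift, the share of records perturbed and the seed are hypothetical parameters chosen for illustration, and the uniform noise used here is only one of several possible choices.

```python
import random


def perturb_income(incomes, max_shift=500, share=0.5, seed=12345):
    """Add or subtract a random whole amount from a share of the values.

    Values are assumed to be non-negative and are never pushed below zero.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    perturbed = []
    for value in incomes:
        if rng.random() < share:
            value = max(0, value + rng.randint(-max_shift, max_shift))
        perturbed.append(value)
    return perturbed


original = [18_000, 25_500, 31_000, 42_000, 250_000]
protected = perturb_income(original)
for before, after in zip(original, protected):
    print(before, "->", after)
```

Perturbing a selection of records, rather than only the unique or rare ones, means that an intruder cannot assume that any particular value is the reported one.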

Suppression

Suppress values in some records including in all the unique or rare records. Suppressed values are likely to stand out in a dataset and it is important to ensure that these values do not lead to an inference as to the reason for suppression and therefore the suppressed value itself.

Remove one or more key variables

This is an option if a particular variable or variables are especially disclosive, for example, if a detailed occupation variable could be revealing when combined with other key variables. If there is a related variable such as industry in the data then occupation could be removed rather than recoded. This should only be done if it can be shown that dropping the variable will not have a great effect on the resulting data utility. In particular one should consider whether the removal of a variable would affect any analysis that the user might hope to carry out. It is useful to discuss the best options with the business area and/or key users before any variables are removed from the data.

Remove records with unique or rare combinations

This will protect the data but may damage any analysis. If this method is applied, remove a small selection of records and carry out before-and-after analysis to see if there is a difference in the results. It is possible that the removed records will be of most interest to users of the data. There may also be a conflict with high-profile published tables based on the original data. User consultation should take place if there are plans to remove a large number of records.

If the dataset covers the whole population, take a sample

This will increase the level of doubt for an intruder. They may know that an individual or business is in the population but publishing a sample provides uncertainty that a record with known characteristics is actually the individual they are looking for. Any uniques in the sample may or may not be population uniques. An intruder is unlikely to know this so there will be an element of uncertainty in any identification. Additional protection can be given by perturbing some of the variables in records that are population uniques. The doubt will increase as the sample size decreases. A small sample such as those used for social surveys (1% or 2%) gives considerable protection to the data, whereas if a decision is made to produce a cut-down version of a population dataset (a 50% sample, for example) the level of risk will be higher. For an open dataset, a sample of around 1 to 10% may be suitable in many cases.
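The relationship between sample uniques and population uniques can be explored with a sketch such as the following. The population is randomly generated and entirely artificial, and the variable names, seeds and 1% sampling fraction are assumptions made for the example.

```python
import random
from collections import Counter

KEYS = ("region", "age_band", "sex")  # assumed key variables


def build_population(size, seed):
    """Generate an artificial population with three categorical variables."""
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "region": rng.choice(["A", "B", "C"]),
            "age_band": rng.choice(["0-15", "16-34", "35-54", "55+"]),
            "sex": rng.choice(["F", "M"]),
        }
        for i in range(size)
    ]


def draw_sample(rows, fraction, seed):
    """Simple random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * fraction))


def unique_keys(rows):
    """Key combinations that occur exactly once in the rows."""
    counts = Counter(tuple(r[k] for k in KEYS) for r in rows)
    return {key for key, n in counts.items() if n == 1}


population = build_population(10_000, seed=1)
sample = draw_sample(population, fraction=0.01, seed=2)

sample_uniques = unique_keys(sample)
population_uniques = unique_keys(population)
print(len(sample), "records sampled")
print(len(sample_uniques), "sample uniques, of which",
      len(sample_uniques & population_uniques), "are also population uniques")
```

A data provider holding the full population can make this comparison; an intruder who sees only the sample cannot, which is the source of the uncertainty described above.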

Swap records in the dataset

Records that are the same with respect to the key variables but in different geographical locations can be swapped. Open data are usually produced at region level (or higher) so it makes sense to swap between areas at that level of geography. In addition, swapping a small number of the most risky records is a practical approach, which often avoids the need to reduce the detail of many other variables. It is advisable to swap only a small number of records, and this should really be a last resort after a number of variables have already been collapsed or removed, and just a few uniques remain.
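A simple form of record swapping could be sketched as follows. The records, field names and matching variables are hypothetical, and the sketch takes the first available partner rather than applying any more refined selection rule. Because the two records in each pair agree on the matching variables, the counts of region by those variables are unchanged; only the other values attached to each region change.

```python
from collections import Counter


def swap_regions(rows, risky_ids, match_on, region_field="region"):
    """Swap the region of each risky record with that of a partner record.

    A partner agrees on every variable in match_on but is in a different
    region. Returns a modified copy of the rows and the list of swapped pairs.
    """
    swapped = [dict(row) for row in rows]  # leave the input untouched
    by_id = {row["id"]: row for row in swapped}
    already_swapped = set()
    pairs = []
    for risky_id in risky_ids:
        if risky_id in already_swapped:
            continue
        target = by_id[risky_id]
        for candidate in swapped:
            if candidate["id"] == risky_id or candidate["id"] in already_swapped:
                continue
            same_keys = all(candidate[v] == target[v] for v in match_on)
            if same_keys and candidate[region_field] != target[region_field]:
                target[region_field], candidate[region_field] = (
                    candidate[region_field], target[region_field])
                already_swapped.update((risky_id, candidate["id"]))
                pairs.append((risky_id, candidate["id"]))
                break
    return swapped, pairs


rows = [
    {"id": 1, "region": "North East", "age_band": "55+", "sex": "F", "income": 95000},
    {"id": 2, "region": "South West", "age_band": "55+", "sex": "F", "income": 21000},
    {"id": 3, "region": "North East", "age_band": "16-34", "sex": "M", "income": 18000},
]
protected, pairs = swap_regions(rows, risky_ids=[1], match_on=("age_band", "sex"))
print(pairs)
```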

All of these methods of statistical disclosure control will damage the data. The level of disclosure control required for an open data release may make the resulting file too restrictive for some researchers, bearing in mind that the data are only likely to be useful for either teaching purposes, testing code or developing methodology. For example, they might need a more detailed country of birth variable or more information on high-level incomes. Releasing an EUL dataset alongside the open data file may be beneficial for such cases. Ideally researchers should use a dataset most suitable for their requirements.

Step 3: Recreate the tables

After applying the changes to the data, repeat the table combinations from earlier. The outputs should be viewed in two ways.

How many unique and rare combinations remain in the tables? How many of these are genuine combinations? If they have been created by the implementation of SDC (for example, perturbation) then they are not “true” uniques and can be published.

Compare the outputs from the protected data with those from the original data. Are the tables noticeably different? Formal statistical tests, such as that for goodness of fit, can be applied to see if any differences are significant.
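One way to make this comparison is a Pearson chi-square statistic computed cell by cell, sketched below with invented counts. The critical value of 7.815 is the standard 5% point of the chi-square distribution with 3 degrees of freedom (four cells minus one). Because the protected data are derived from the original rather than sampled independently, the result is perhaps best read as an indicative utility measure rather than a strict significance test.

```python
def chi_square_statistic(original, protected):
    """Pearson chi-square of protected counts against original counts.

    Expected counts are the original counts rescaled to the protected total.
    Cells that appear only in the protected table are ignored in this sketch.
    """
    total_original = sum(original.values())
    total_protected = sum(protected.values())
    statistic = 0.0
    for cell, original_count in original.items():
        expected = original_count * total_protected / total_original
        observed = protected.get(cell, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical age-band counts before and after disclosure control.
original = {"0-15": 200, "16-34": 300, "35-54": 300, "55+": 200}
protected = {"0-15": 198, "16-34": 305, "35-54": 297, "55+": 200}

CRITICAL_5_PERCENT_DF3 = 7.815  # chi-square, 3 degrees of freedom, 5% level

statistic = chi_square_statistic(original, protected)
print(round(statistic, 4), "noticeably different:", statistic > CRITICAL_5_PERCENT_DF3)
```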

If the data are regarded as having negligible risk of disclosure (and the utility is as high as could be achieved for an open data file) then proceed to Step 4. If it is thought that more disclosure control is required then go back to Step 2. Once a dataset has passed through Steps 1 to 3, it has been designed to be almost ready for public release as an open dataset. An additional step is advisable – though it can be resource-intensive – to test the disclosure risk empirically.

Step 4: Risk assessment

The risk can be assessed by carrying out intruder testing. This does provide practical and empirical evidence of the protection employed, and is acknowledged as good practice by the Information Commissioner’s Office. Depending on the time and money available the data provider can choose one of the following options.

Internal testing

A small number of volunteers with knowledge of the data (for example, those working in the government department that will publish the data, those working in another department that uses the data, graduate students who may use the data in the future) and with internet access spend a period of time (around a half day is normal) trying to identify records in the data. Each identification claim is given a percentage of confidence by the “intruder”.

External testing

An organisation with experience of intruder testing (for example, a university) carries out a formal test to see what can be identified. They could be given a list of people or businesses in the data and asked to identify them. Their success will be measured in terms of the number of correct identifications. As above, each identification claim will have an associated confidence percentage.

Intruder testing is a relatively new technique and there are no defined rules stating how the outcome of the test should be measured. A possible interpretation is:

if there are many correct identifications along with incorrect identifications then go to Step 2 to protect the data further, followed by Steps 3 and 4
if there are a small number of correct identifications along with incorrect identifications then go to Step 2 for minor additional protection, followed by Steps 3 and 4
if there are no correct identifications and no or a small number of incorrect identifications then SDC may have been applied too stringently; the data provider may decide to reduce the amount of SDC applied in Step 2 and then repeat Steps 3 and 4

For a successful conclusion to the intruder testing there should be no more than a small number of correct identifications (or possibly none) and a large percentage of claims should be incorrect. Ideally identifications should have been made on a wide range of records (for example, many different ages, marital statuses, household compositions for individuals) and with a wide range of confidences. In addition if there are a large number of claims (correct or incorrect) on a specific combination of variables, these should be examined closely to determine if they are protected sufficiently. If some of the more confident responses are incorrect this suggests that the data have been protected successfully. Once this position has been reached then go to Step 5.
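The claims from an intruder test could be summarised with a sketch along the following lines. The claims, the field names and the confidence level treated as "confident" are all hypothetical; as noted, there are no defined rules for how the outcome should be measured.

```python
# Hypothetical claims: each has the intruder's stated confidence (per cent)
# and whether the data provider found the identification to be correct.
claims = [
    {"record_id": 101, "confidence": 90, "correct": False},
    {"record_id": 102, "confidence": 40, "correct": False},
    {"record_id": 103, "confidence": 60, "correct": True},
    {"record_id": 104, "confidence": 20, "correct": False},
]


def summarise_claims(claims, high_confidence=80):
    """Count correct and incorrect claims, overall and at high confidence."""
    correct = [c for c in claims if c["correct"]]
    incorrect = [c for c in claims if not c["correct"]]
    return {
        "claims": len(claims),
        "correct": len(correct),
        "incorrect": len(incorrect),
        "share_incorrect": len(incorrect) / len(claims) if claims else 0.0,
        "confident_and_correct": sum(
            1 for c in correct if c["confidence"] >= high_confidence),
        "confident_but_incorrect": sum(
            1 for c in incorrect if c["confidence"] >= high_confidence),
    }


summary = summarise_claims(claims)
print(summary)
```

In this invented example most claims are incorrect, including the most confident one, and the single correct claim was made with only moderate confidence.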

There are no hard and fast rules as to what constitutes “some” or “many” claims. The context of the dataset is important here, especially relating to sensitivity and public perception, but it should not be possible for correct claims to be made with certainty or near certainty where the data are to be released publicly. A data provider might consider the line of defence they would take if a correct claim were made by an intruder after release of such data.

This is only one possible approach. In general the outcome of an intruder test should be based on the accuracy of, and confidence in, the claim along with the sensitivity and visibility of the variables being used. It is these factors that ought to determine which variables (if any) should be recoded or removed from the dataset.

Step 5: Publish data

Once the data have been tabulated, protected and tested then publish under an Open Data Licence for public sector information.

Along with the data, state clearly that disclosure control has been applied to the data. The utility measure comparing the data before and after the application of SDC could be mentioned for specific datasets. This will give an indication of the usefulness of the data.

Summary of steps

Step 1: From the tabulation of the selected variable combinations look for low counts or other distinctive patterns
Step 2: If there are unique or rare combinations in the data apply disclosure control on the problematic variable(s); these could include the removal of variables or records, recoding or record swapping
Step 3: Tabulate the variable combinations again from the “protected” microdata; if there are no obvious disclosure issues go to Step 4, otherwise repeat Step 2 and apply more disclosure control
Step 4: Carry out intruder testing; if there are many successful claims then Steps 2 to 4 will need to be repeated
Step 5: Publish the data under an Open Data Licence

Appendix: Example scenario

A worked example that follows the steps previously described is shown in this appendix.

Table A1 is based on the Census Microdata Teaching File release, a large file of over half a million individuals, and shows the first 10 records. All additional information is completely artificial, and the scenario hypothetical. Variables that are likely to be included in an identification key are highlighted. This dataset is a 1% sample of the population. Note that the ID variable should not replicate, or be a function of, any other dataset and is used purely for reference while using and manipulating this dataset.

Table A1: Original microdata

ID	Region	Residence	Family.Type	Sex	Age	M.Status	C.O.B	Health	Ethnic.Group	Religion	Occupation	Industry	Qualifications
7394816	E12000001	H	2	2	56	2	1	2	1	2	8	2	12
7394745	E12000001	H	5	1	38	1	1	1	1	2	8	6	14
7395066	E12000001	H	3	2	43	1	1	1	1	1	6	11	15
7395329	E12000001	H	3	2	24	1	1	2	1	2	7	7	15
7394712	E12000001	H	3	1	50	4	1	1	1	2	1	4	16
7394750	E12000001	H	2	1	55	2	1	2	1	1	9	2	14
7394871	E12000001	H	5	2	38	3	1	2	1	1	6	11	15
7394832	E12000001	H	3	2	15	1	1	2	1	1	-9	-9	XX
7394719	E12000001	H	2	1	72	2	1	1	1	2	8	2	16
7394840	E12000001	H	1	2	61	4	1	3	1	2	9	5	10
...	...	...	...	...	...	...	...	...	...	...	...	...	...
Source: Office for National Statistics


Step 1: Check combinations of key variables

The key variables are the set of variables that are most likely to be used by an intruder, being either “visible” or in the public domain, or both. These are characteristics that an acquaintance might expect to know about a friend, family member or neighbour (one could extend this to work colleagues, depending on the nature of the dataset). This set of variables is the identification key.

Several combinations of variables are tested to look for records that are very rare or unique on these variables. These combinations should be those that researchers are likely to use.

For example the combination of age, sex, marital status (m.status) and ethnic group should be tested, as well as occupation/qualifications/age, family.type/sex/age and any other likely combinations.

In this case unique values are common for these combinations, due to the presence of a variable showing single year of age. That is an obvious target for an intruder and a good candidate for protection.
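As a small illustration, the check can be run on the ten records displayed in Table A1 for the combination of sex, age, marital status and ethnic group. Ten rows are far too few to draw conclusions about the full file; the sketch only shows the mechanics.

```python
from collections import Counter

# (Sex, Age, M.Status, Ethnic.Group) for the ten records shown in Table A1.
rows = [
    (2, 56, 2, 1),
    (1, 38, 1, 1),
    (2, 43, 1, 1),
    (2, 24, 1, 1),
    (1, 50, 4, 1),
    (1, 55, 2, 1),
    (2, 38, 3, 1),
    (2, 15, 1, 1),
    (1, 72, 2, 1),
    (2, 61, 4, 1),
]

counts = Counter(rows)
uniques = sum(1 for n in counts.values() if n == 1)
print(uniques, "of", len(rows), "displayed records are unique on this combination")
```

With single year of age in the key, every one of the ten displayed records is unique within the excerpt.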

Step 2: Apply SDC techniques

In order to reduce the number of unique records the “age” variable was top and bottom-coded and banded into four age bands, as shown in Table A2.

Table A2: Banded age variable

ID	Region	Residence	Family.Type	Sex	Age	M.Status	C.O.B	Health	Ethnic.Group	Religion	Occupation	Industry	Qualifications
7394816	E12000001	H	2	2	4	2	1	2	1	2	8	2	12
7394745	E12000001	H	5	1	3	1	1	1	1	2	8	6	14
7395066	E12000001	H	3	2	3	1	1	1	1	1	6	11	15
7395329	E12000001	H	3	2	2	1	1	2	1	2	7	7	15
7394712	E12000001	H	3	1	3	4	1	1	1	2	1	4	16
7394750	E12000001	H	2	1	4	2	1	2	1	1	9	2	14
7394871	E12000001	H	5	2	3	3	1	2	1	1	6	11	15
7394832	E12000001	H	3	2	1	1	1	2	1	1	-9	-9	XX
7394719	E12000001	H	2	1	4	2	1	1	1	2	8	2	16
7394840	E12000001	H	1	2	4	4	1	3	1	2	9	5	10
...	...	...	...	...	...	...	...	...	...	...	...	...	...
Source: Office for National Statistics
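
The banding can be sketched as follows. The cut-points are not stated in this guidance, so the values in `BAND_STARTS` are assumptions, chosen only so that the result agrees with the rows shown in Tables A1 and A2. The function name `band_age` is also hypothetical.

```python
from bisect import bisect_right

# Assumed lower bounds of bands 2, 3 and 4; the true cut-points are not given.
BAND_STARTS = [16, 25, 55]


def band_age(age):
    """Map single year of age to a band code 1-4.

    Band 1 is bottom-coded (everything below the first cut-point) and
    band 4 is top-coded (everything at or above the last cut-point).
    """
    return bisect_right(BAND_STARTS, age) + 1


# Ages from the rows of Table A1 shown above, in order.
ages = [24, 50, 55, 38, 15, 72, 61]
print([band_age(a) for a in ages])
```

Under these assumed cut-points the codes match the “Age” column for the corresponding records in Table A2.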

It was also determined that there were a number of rare or unique records for combinations involving the “qualifications” variable, particularly for certain occupations and ages. Several ways of dealing with this were considered. For example:

perturbation: the values for qualifications could be perturbed where the records would be unique or rare
suppression: the value (or record) could be removed where the record would be unique or rare
variable removal: the entire variable could be removed
swapping: the rare or unique records could be swapped with similar others in different geographies; as geography is at region level this may result in a high level of data damage, however, it can be a useful strategy if there were just a small number of cases, where swapping may help to prevent reduction in detail for other variables

It was determined that the user requirement for a qualifications variable was not very high. Given that perturbation may result in false conclusions and suppression would complicate analysis it was decided to remove the qualifications variable, producing Table A3.
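
To illustrate the difference between two of these options, the sketch below applies local suppression and variable removal to a few hypothetical records. The function names, the records, the threshold `k` and the use of -9 as a code for suppressed cells are all assumptions made for illustration.

```python
from collections import Counter

MISSING = -9  # assumed code for a suppressed cell


def suppress_rare(rows, key_vars, target, k=2):
    """Local suppression: blank `target` where the key_vars combination
    occurs fewer than k times. Returns new records; input is unchanged."""
    counts = Counter(tuple(r[v] for v in key_vars) for r in rows)
    return [
        {**r, target: MISSING} if counts[tuple(r[v] for v in key_vars)] < k else dict(r)
        for r in rows
    ]


def remove_variable(rows, target):
    """Variable removal: drop `target` from every record."""
    return [{name: val for name, val in r.items() if name != target} for r in rows]


# Hypothetical records.
rows = [
    {"age_band": 4, "occupation": 8, "qualifications": 12},
    {"age_band": 3, "occupation": 8, "qualifications": 14},
    {"age_band": 3, "occupation": 8, "qualifications": 14},
    {"age_band": 4, "occupation": 9, "qualifications": 10},
]
key = ("age_band", "occupation", "qualifications")

print(suppress_rare(rows, key, "qualifications"))
print(remove_variable(rows, "qualifications"))
```

Suppression leaves a variable with gaps that analysts must handle, whereas removal leaves a smaller but complete dataset, which is the trade-off weighed in the decision above.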

Table A3: Banded age variable and removed Qualifications variable

ID	Region	Residence	Family.Type	Sex	Age	M.Status	C.O.B	Health	Ethnic.Group	Religion	Occupation	Industry
7394816	E12000001	H	2	2	4	2	1	2	1	2	8	2
7394745	E12000001	H	5	1	3	1	1	1	1	2	8	6
7395066	E12000001	H	3	2	3	1	1	1	1	1	6	11
7395329	E12000001	H	3	2	2	1	1	2	1	2	7	7
7394712	E12000001	H	3	1	3	4	1	1	1	2	1	4
7394750	E12000001	H	2	1	4	2	1	2	1	1	9	2
7394871	E12000001	H	5	2	3	3	1	2	1	1	6	11
7394832	E12000001	H	3	2	1	1	1	2	1	1	-9	-9
7394719	E12000001	H	2	1	4	2	1	1	1	2	8	2
7394840	E12000001	H	1	2	4	4	1	3	1	2	9	5
...	...	...	...	...	...	...	...	...	...	...	...	...
Source: Office for National Statistics

Step 3: Recreate tables and re-test

Testing with the new “age” variable showed a large decrease in the number of rare and unique records, suggesting that the banding was effective for reducing disclosure risk. The removal of the qualifications variable also greatly reduced rare records for related variable combinations.
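
A re-test of this kind might look like the sketch below, which counts rare records before and after banding. The records, the rarity threshold `k=3` and the band cut-points in `to_band` are assumptions for illustration only.

```python
from collections import Counter


def rare_count(rows, key_vars, k=3):
    """Count records whose key_vars combination is shared by fewer than k records."""
    counts = Counter(tuple(r[v] for v in key_vars) for r in rows)
    return sum(n for n in counts.values() if n < k)


def to_band(age):
    """Assumed four-band coding with cut-points at 16, 25 and 55."""
    return 1 + (age >= 16) + (age >= 25) + (age >= 55)


# Hypothetical (age, sex) pairs.
pairs = [(24, 2), (26, 2), (31, 2), (50, 1), (52, 1), (38, 1), (72, 1)]
original = [{"age": a, "sex": s} for a, s in pairs]
banded = [{"age": to_band(r["age"]), "sex": r["sex"]} for r in original]

before = rare_count(original, ("age", "sex"))
after = rare_count(banded, ("age", "sex"))
print(before, after)
```

The same comparison would be repeated for each key combination tested in Step 1.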

As soon as the dataset is judged feasible for public release, move to Step 4.

Step 4: Intruder testing

The resultant microdata set was given to a number of volunteers who had undergone the required security clearances, with the intention that they attempt to identify individuals. The choice of volunteers is often purposive or opportunistic, but it is useful for the group to cover a range of skills (each volunteer need not have all of these, but should have some):

familiarity with the topic(s) within the data; this is likely to result in individuals who work either in the department producing the data or one that uses the data
analytical and statistical skills
knowledge of other datasets in the public domain
knowledge of relevant websites with links to the data
ability to search and manipulate data
knowledge of a specific area of the country (especially if different to those known by other intruders)

External experts in the field of intruder testing could also be involved. These could be post-graduate students at a university specialising in disclosure control who are familiar with intruder scenarios and capable of targeting any weaknesses in the data. They could be presented with a list of records in the data and asked to identify them.

Individuals were also provided with unrestricted internet access (for a specified length of time such as a half day) in order that they could attempt to match to information that may be in the public domain. Typically, this could be a personal or business website for an individual who is self-employed, or an academic, or to information placed on social media websites, such as Facebook. Upon an apparent identification the volunteers marked down the names of the individuals they believed that they had identified and the matched ID. They also noted down how confident they were that a real identification had been made. Intruders were also asked to describe the logic in making their claim, highlighting the variables and values that they used and how they had deduced the identification. The results follow in Table A4. This table is purely for the use of the data providers and will not be released to those involved in the intruder testing.

Table A4: Results of first intruder testing exercise

Confidence	Correct Identifications	Incorrect Identifications
High	0	1
Medium	0	2
Low	0	4
Source: Office for National Statistics

The low number of incorrect identifications, combined with the absence of any correct ones, probably means that the risk in the dataset is low enough for it to be considered suitable as “open data”. However, this result might also indicate that disclosure control has been applied too strictly. The ideal situation is to have a fair number of incorrect identifications and very few (not necessarily nil) correct identifications. For this example it was decided to reattempt intruder testing, but to use eight age bands instead of four age bands; results are shown in Table A5.

The question is then one of risk appetite. The level of risk that one would normally accept for an open dataset (one that is publicly available) is usually very low. Always remember that it would be perfectly legal for an intruder or researcher to attempt to identify an individual and to claim to have made a correct identification. It is therefore important for a data provider to have a strategy in place for that eventuality. This is typically a set of questions and answers that might include an explanation as to why there may be considerable uncertainty in any claim, highlighting in broad terms the disclosure protection measures that have been carried out.

With the second run of intruder testing, we have 3 correct identifications and 28 incorrect. That does appear to be a reasonable level of uncertainty, particularly since those claims that are most confident (and so those where a formal claim is more likely) are actually incorrect. Note once again that we would never disclose whether individual claims are correct or not.

An additional step is to look at the logic of the claims made. A combination of variables that is commonly used might point to the most likely way in which intruders will try to effect a disclosure and (perhaps) the most likely route to a correct claim. So the logic and variables used for the 31 claims should be summarised.

If Age – Sex – Marital Status (with Occupation) was used 15 times (1 correct), one should consider whether there should be some additional protection for one or more of those variables. These measures would normally be of greatest effect if targeted towards the variable(s) with greatest detail. For example, a reduction in the number of categories in occupation or marital status is likely to be of more impact on disclosure risk than a removal of the sex variable. Treatment may even be targeted further by considering the categories of marital status or occupation that gave rise to the most claims and collapsing those categories with appropriate others.
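
The summary of claims could be produced along the lines of the sketch below. Only the totals (31 claims, 3 correct) and the counts for the first key (used 15 times, 1 correct) come from the text; the other keys, their counts and the function `summarise` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical claim log of (variables used, claim was correct).
# KEY_A matches the example in the text; KEY_B and KEY_C are invented.
KEY_A = ("age", "sex", "m_status", "occupation")
KEY_B = ("age", "sex", "family_type")
KEY_C = ("occupation", "industry", "age")
claims = (
    [(KEY_A, True)] + [(KEY_A, False)] * 14
    + [(KEY_B, True)] * 2 + [(KEY_B, False)] * 8
    + [(KEY_C, False)] * 6
)


def summarise(claim_log):
    """Return {key: (times used, correct claims)}, most frequently used key first."""
    tally = defaultdict(lambda: [0, 0])
    for key, is_correct in claim_log:
        tally[key][0] += 1
        tally[key][1] += int(is_correct)
    ordered = sorted(tally.items(), key=lambda item: -item[1][0])
    return {key: tuple(counts) for key, counts in ordered}


for key, (used, correct) in summarise(claims).items():
    print("/".join(key), used, correct)
```

Keys that appear at the top of such a summary are the natural candidates for further recoding or collapsing of categories.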

Note that any of the other variables (for example, industry, religion) might be used by the intruder to provide greater confidence as confirmatory variables, or even apparently refute a claim. For example, looking across the other variables, seeing that religion is “no religion” might be in conflict with private knowledge that the intruder holds on the intended target. In contrast, seeing that industry equals “manufacturing” may be consistent with private knowledge. From an intruder’s point of view, at regional level, one would imagine that the information in the dataset record would all have to be consistent (or at least not inconsistent) with private knowledge for an intruder to have high confidence.

Step 5: Data release

Given the low number of correct identifications compared with incorrect identifications, it was decided that these microdata were safe enough to release as an open dataset. Documentation released along with the open data will include:

the standard OGL conditions
metadata to include summary of the variables and categories in the dataset
statement that disclosure control has been applied (and intruder testing carried out)
purposes of the dataset; to be used for teaching or training purposes in most cases
links to other suitable datasets if more detailed analysis is required

Send queries or comments to the Statistical Disclosure Control team: sdc.queries@ons.gov.uk.
