1. Scope

This policy outlines the considerations and requirements for producing and using synthetic data for statistical research. Its scope includes the production, use, dissemination and sharing of synthetic data at the Office for National Statistics (ONS). It relates to data owned by the ONS (where the ONS is the data controller) as well as data shared with the ONS (where the ONS is the data processor).

2. Background

To protect the confidentiality of individuals, access to detailed statistical data is restricted. Restrictions may limit the ways in which data can be used, prevent the sharing of data, or require data to be accessed within a secure environment. These limits can inhibit research.

Synthetic data are artificial data that do not refer to real individuals (persons, households, businesses, or other statistical units), intended to enable greater access to and use of data. These could be microdata records, aggregates, or any other data types.

Synthetic data are typically produced from statistical models based on the real data. They aim to replicate certain features of the real data while carrying a lower disclosure risk, which potentially allows much wider access. Synthetic data can be used for testing, research, processing, and statistical purposes, particularly when accessing or sharing real data is difficult.

This policy sets out considerations for using synthetic data in the Office for National Statistics (ONS), and the steps to follow when producing, sharing, or using synthetic data.

3. Policy statement

Synthetic data can be used as a substitute in testing or research when real data are inaccessible. Users should expect that synthetic data will not accurately reflect all properties of the real data.

Synthetic data should be produced in a way that is unlikely to reproduce disclosive aspects of real data. If a synthetic version is to be exported outside of a secure environment, or be shared more widely than the real data, then it must be checked for disclosure issues.

When working with data, disclosure risk can never be fully eliminated, but synthetic data can offer a lower-risk alternative that enables greater data access.

4. Policy detail

Uses of synthetic data

Synthetic data can be used for any purpose, but users should consider how well the data fit that purpose given how they were produced. For example, a simple synthetic dataset could match the number of rows, the number of columns, and the file size of the real data, which may provide a good estimate of how long code or a process would take to run. Such a dataset could also be used for code or system development while access to the real data is arranged. More complicated methods may be able to preserve important statistical properties, for example the correct sizes of sub-population groups, and may provide sufficiently accurate analytical results.
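As an illustration of the simplest case described above, the following Python sketch generates records that match only the structure of a dataset (column names, value types, and row count) and use no real values. The function name, the schema format, and the example columns are all hypothetical and are not part of this policy or of any ONS tool. Data produced this way would suit run-time estimates or code development, not analysis.

```python
import random


def make_structural_synthetic(schema, n_rows, seed=0):
    """Return n_rows of random records that follow a schema.

    schema maps a column name to one of these (hypothetical) specs:
      ("int", low, high), ("float", low, high), ("category", [labels]).
    No real data are read, so only the structure is preserved.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, spec in schema.items():
            kind = spec[0]
            if kind == "int":
                row[name] = rng.randint(spec[1], spec[2])
            elif kind == "float":
                row[name] = rng.uniform(spec[1], spec[2])
            elif kind == "category":
                row[name] = rng.choice(spec[1])
            else:
                raise ValueError(f"unknown column kind: {kind!r}")
        rows.append(row)
    return rows


# Hypothetical schema: column names and ranges are invented for illustration.
schema = {
    "age": ("int", 0, 100),
    "income": ("float", 0.0, 100000.0),
    "region": ("category", ["North", "South", "East", "West"]),
}
synthetic = make_structural_synthetic(schema, n_rows=1000)
```

Because every value is drawn independently and uniformly, relationships between columns and sub-population sizes are not preserved; a model-based method would be needed for that.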

Quality standards

Synthetic data will not preserve all features of the real data they represent. They should be expected to contain errors and differences. If high-quality data are required, using the real data may be unavoidable.

Producing synthetic data

There are many different ways a synthetic dataset could be produced, any of which could be valid depending on the ultimate use of the data. An overview of several tools for producing synthetic data can be found in the synthetic data working paper.

The purpose of the synthetic data should be considered, as this could determine the method chosen. There are few limitations on how synthetic data can be produced, though the method must be unlikely to accurately reproduce real data. For example, randomly sampled rows from a dataset would represent real individuals and would not be considered synthetic.
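One crude way to screen for the problem in the example above is to count how many candidate records are identical to a real record. The Python sketch below does this; the function name and the example records are hypothetical, and an exact-match count is only a first check under the assumption that records are flat dictionaries of hashable values. It is not a substitute for a disclosure risk assessment, since near matches and unique combinations of values can also be disclosive.

```python
def count_replicated_rows(real_rows, candidate_rows):
    """Count candidate records that are identical to some real record.

    Each record is a dict of column name to a hashable value.
    """
    # Convert each record to a sorted tuple of items so it can go in a set.
    real_set = {tuple(sorted(row.items())) for row in real_rows}
    return sum(tuple(sorted(row.items())) in real_set for row in candidate_rows)


# Invented records standing in for real data (illustration only).
real = [
    {"age": 34, "region": "North"},
    {"age": 51, "region": "South"},
    {"age": 29, "region": "East"},
]

# Sampling rows from the real data copies real records unchanged.
sampled = real[:2]

# Records generated without copying real rows.
made_up = [
    {"age": 40, "region": "West"},
    {"age": 34, "region": "South"},
]

print(count_replicated_rows(real, sampled))
print(count_replicated_rows(real, made_up))
```

Here every sampled record matches a real record, whereas neither made-up record does, even though the second reuses values that each appear somewhere in the real data.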

When producing synthetic data, guidance should also be written that outlines how the data were produced and the uses for which they are or are not appropriate.

Sharing synthetic data

Synthetic versions of data can be shared with less restriction than real data, as they are likely to be much less disclosive.

How widely synthetic data can be shared is a decision for the data controller and the Information Asset Owner. They should make this decision based on how the synthetic data were produced, and the balance of perceived disclosure risk of the synthetic data versus the potential benefits of sharing.

If data are to be shared publicly, a detailed disclosure risk assessment must be carried out. It should be borne in mind that once data have been shared publicly, it is hard to ensure that all copies are deleted. If the disclosure risk of synthetic data is unclear, the disclosure control expert group can be contacted for consultation.

Data can be shared on the condition that they are used for a specific purpose or, if appropriate, they could be shared publicly for any purpose.

5. Roles and responsibilities

Information Asset Owner

The Information Asset Owner will decide how widely the synthetic data can be shared and under which conditions, in line with the agreement with the data controller(s).

Statistical Disclosure Control Expert Group

The disclosure control expert group can be contacted at sdc.queries@ons.gov.uk for advice on assessing the disclosure risk of synthetic data, if required.

Office for National Statistics Legal Services

Legal Services can provide advice on current and evolving legal issues if required.

Data Ethics team

The Data Ethics team can provide advice on ethical issues and refer any projects with high ethical risks to the National Statistician's Data Ethics Advisory Committee (NSDEC).

National Statistician's Data Ethics Advisory Committee

The National Statistician's Data Ethics Advisory Committee (NSDEC) will provide advice on ethical issues if required.
