1. Web-scraping Policy

Date in force: 1 October 2017
Data expires: N/A
Last review date: 12 September 2017
Next review date: 12 September 2018
Responsible Officer: Sarah Henry
Author: Matthew Greenaway
Approved by: Data Governance Committee
Scope: Office for National Statistics
Policy Owner: Head of Commercial Data team, Data as a Service
Version No: v.7.6
Main contact for queries: matthew.greenaway@ons.gsi.gov.uk

2. Revision history

This policy document will be reviewed annually. Table 1 provides a summary of the changes made to this Web-scraping policy document.


3. Introduction

This policy sets out the practices and procedures that Office for National Statistics (ONS) staff will follow when carrying out web-scraping or using web-scraped data.

For the purposes of this policy, web-scraping is defined as the collection of data automatically from the internet. For a more detailed definition of web-scraping, see the glossary in Appendix A.


4. Background

We are committed to using new sources of data to produce statistics, analysis and advice, which help Britain make better decisions. Driven by this strategic imperative, Office for National Statistics (ONS) staff may use web-scraping as a data collection mechanism.

The following examples illustrate web-scraping applications in an official statistics context, although not all of these applications are currently being taken forward at ONS. These applications are referred to throughout this document to help contextualise and clarify this policy.

Example 1

Data: food price data scraped from supermarket websites. Use: to produce timely measures of food-price inflation.

Example 2

Data: jobs vacancy data scraped from jobs portals. Use: to produce timely jobs vacancy statistics and provide a richer source of labour market information.

Example 3

Data: data related to second or holiday homes scraped from holiday lettings and room-sharing websites. Use: to help inform census and social survey design and estimation.

Example 4

Data: text scraped from large numbers of business websites. Use: to produce research and statistics on the digital economy, for example, research into business classification systems.


5. Scope

This policy applies to all Office for National Statistics (ONS) staff activities that involve web-scraping or use web-scraped data. When obtaining or procuring web-scraping services from a third party, ONS will seek to ensure that the main principles set out in section 7 have been followed. The policy does not cover the use of Application Programming Interfaces (APIs) and is not applicable to government departments outside of ONS. For more detail about APIs, see the glossary in Appendix A.


6. Objectives

The purpose of this policy is to ensure that web-scraping at Office for National Statistics (ONS) is carried out transparently, consistently and ethically, in compliance with all relevant legislation, and that web-scraped data are used in an appropriate and ethical manner.


7. Principles

Office for National Statistics (ONS) will use web-scraped data solely for the purpose of producing statistics and analysis with clear benefit for users. Our overarching principle is to maximise this benefit for users while minimising the risk and potential impacts of scraping.

To ensure we achieve this, ONS will adopt the following principles when web-scraping:

  • seek to minimise burden on website owners
  • honour requests made by website owners to refrain from scraping their website
  • protect all personal data in all statistics and research outputs and seek ethical advice when scraping data that may identify individuals
  • apply scientific principles in the production of statistics and research based on web-scraped data and consider other sources of data
  • abide by all applicable legislation and monitor the evolving legal situation

Section 8 sets out the practices we will follow to ensure we respect these principles. The procedure that ONS staff will follow when carrying out web-scraping is set out in Appendix B.


8. Practices

Seek to minimise burden on website owners

Office for National Statistics (ONS) web-scraping will minimise any burden on the owners of the websites we scrape, consistent with the Code of Practice for Statistics. The practices we will follow include, where applicable:

  • delaying accessing pages on the same domain
  • limiting the depth of crawl within the given domain
  • when scraping multiple domains, parallelising the crawling operation to minimise successive requests to the same domain
  • avoiding scraping potentially sensitive areas of a website
  • scraping at a time of day when the website is unlikely to be experiencing heavy traffic – for example, early in the morning for a UK-based website

Example – scraping supermarket websites

The ONS supermarket web-scrapers are set to run at 5am and have a delay of one second between accessing pages. They also identify ONS as the operator of the scraper in the user-agent string.
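
The sketch below shows, for illustration only, how such settings might look in a Python scraper built on the requests library. The product pages, user-agent text and information-page link are hypothetical, and in practice the 5am start time would typically come from an external scheduler rather than from the scraper itself.

# Illustrative sketch only: a scraper that pauses between page requests and
# identifies its operator in the user-agent string. All URLs are hypothetical.
import time

import requests

USER_AGENT = (
    "ONS-price-scraper/1.0 "
    "(+https://www.ons.gov.uk/hypothetical-scraper-information-page)"
)

PRODUCT_PAGES = [
    "https://www.example-supermarket.co.uk/groceries/bread",
    "https://www.example-supermarket.co.uk/groceries/milk",
]


def scrape_pages(urls, delay_seconds=1.0):
    """Fetch each page in turn, pausing between requests to limit server load."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    pages = {}
    for url in urls:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        pages[url] = response.text
        time.sleep(delay_seconds)  # one-second pause between successive requests
    return pages


if __name__ == "__main__":
    html_by_url = scrape_pages(PRODUCT_PAGES)
    print(f"Fetched {len(html_by_url)} pages")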

Honour requests made by website owners to refrain from scraping their website

ONS will cease scraping if the website owner contacts us and asks us to do so. So that website owners can identify ONS as the operator of a scraper, we will identify ourselves in the user-agent string. For regular, long-term scraping applications that are embedded in our statistical production systems, ONS scrapers will also provide a link in the user-agent string to a website containing more detail on:

  • what the web-scraper collects and why it is collecting it
  • how to contact the team that operates the scraper
  • how to opt out and have any collected data deleted
  • whether ONS will share web-scraped data for the purposes of statistics and research, and with whom the data may be shared

In addition, ONS will always respect the robots.txt protocol – a widely-used protocol that allows website owners to indicate whether they are happy for their website to be scraped (for more details, see the glossary in Appendix A). We will respect the wishes of website owners as set out in the terms and conditions of websites whenever it is practical for us to check those terms – more detail is provided later in this section.
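
As an illustration only, the following Python sketch shows one way a scraper could consult a site's robots.txt file before fetching a page, using the robotparser module from the standard library. The target URL and user-agent name are hypothetical.

# Illustrative sketch only: checking the robots.txt exclusion protocol before scraping.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ONS-web-scraper"  # hypothetical user-agent name


def allowed_to_scrape(url, user_agent=USER_AGENT):
    """Return True only if the site's robots.txt permits this user-agent to fetch the URL."""
    parsed = urlparse(url)
    parser = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    parser.read()  # download and parse the site's robots.txt file
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    target = "https://www.example-lettings-site.co.uk/listings/london"  # hypothetical
    if allowed_to_scrape(target):
        print("robots.txt permits scraping this page")
    else:
        print("robots.txt disallows scraping this page; do not proceed")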

Example – scraping holiday lettings and room-sharing websites

Some home-sharing websites indicate in their robots.txt file that they do not allow web-scraping. We will not scrape these data.

Protect all personal data in all statistics and research outputs

ONS is fully committed to compliance with the Data Protection Act 1998 and to following good practice. Data protection is important, not only because it is critical to the work of the organisation, but also because it is about protecting individual privacy and maintaining confidence.

To ensure compliance with the Data Protection Act, ONS will not disclose any personal data, which includes all data that identify or can identify individuals. In addition, we recognise that there are ethical issues related to scraping and using data that identify individuals and we will consult the National Statistician’s Data Ethics Advisory Committee (NSDEC) for ethical advice on projects involving scraping these data.

Example – data scraped from holiday lettings and room-sharing websites

Some of these data may relate to individuals – for example, listings by individuals on a room-sharing website may contain name and address information. These data would be defined as personal data according to the Data Protection Act, and if ONS scraped them, we would handle them in accordance with the Act. We would also seek ethical advice from NSDEC prior to scraping these data.

Section 39 paragraph 1 of the Statistics and Registration Service Act 2007 states that data which identify businesses or “bodies corporate” and which have not been lawfully made public should also be treated as personal data. We will not disclose any data that are covered by this protection. We will judge whether data have been lawfully made public by referring to any clauses in the terms and conditions of the relevant websites pertaining to how we should treat any scraped data.

Example – scraping jobs vacancy data

Jobs portals may have clauses in their terms and conditions pertaining to how we should treat any scraped data. ONS will respect these clauses, which may imply that the data should be treated as personal data according to the Statistics and Registration Service Act.

ONS may consider sharing web-scraped data with other public sector or academic organisations solely for the purpose of producing statistics, analysis and advice in the public good. We will not share any data that is classified as personal data under the Statistics and Registration Service Act and will abide by any clauses in the terms and conditions of websites pertaining to the sharing of the scraped data.

Apply scientific principles in the production of statistics and research based on web-scraped data and consider other sources of data

There are significant methodological challenges involved in producing fit-for-purpose statistics and research using web-scraped data. Prior to web-scraping, ONS will investigate all possible data sources, potentially including commercial data, administrative data, survey data, and Application Programming Interfaces (APIs).

We will only use web-scraped data if it is the best option considering data quality, timeliness, legal issues and any other relevant criteria. If we do use web-scraped data, we will be transparent about any quality issues that this may cause in the resulting statistics or research.

Abide by all applicable legislation and monitor the evolving legal situation

The legal situation surrounding web-scraping in general, and web-scraping at a National Statistics Institute in particular, is complex and still evolving, and there are relatively few relevant legal precedents. ONS will take the following approach:

  • we will cease scraping whenever we are asked to do so by the website owner
  • we will carry out scraping in a manner that does not cause financial detriment to any website owner
  • we will abide by the Data Protection Act and other data sharing legislation, as outlined previously; this includes ensuring that personal data are not revealed in any published statistics or research
  • we will check and abide by the terms and conditions of websites wherever it is practical for us to do so and contact website owners in the event of any uncertainty regarding the terms and conditions
  • where it is not practical for us to check the terms and conditions of a website, for example, where we are scraping large numbers of websites, we may scrape where we can justify that doing so is ethical and in the public good; this decision will be made by weighing the risk and potential negative consequences against the efficacy and public benefit
  • we will continue to monitor the legal situation as it evolves and amend our approach accordingly

Example – scraping supermarket websites

It is practical to check the terms and conditions of these websites, since there are only a small number of them, and we will therefore check and abide by the terms and conditions. In the case of any uncertainty in these terms and conditions, we will contact the website owners.

Example – scraping data from large numbers of business websites

It is not feasible to check the terms and conditions for all websites in this scenario. We may, subject to the considerations set out previously, scrape without checking terms and conditions. We would continue to follow the other guidelines in this policy. In particular, we would respect the robots.txt exclusion protocol and cease scraping any website when asked to do so.


9. Roles and responsibilities

Table 2 provides details of the roles and responsibilities for those involved in web-scraping data.


10. Compliance

All staff, as well as researchers in the wider statistical research community accessing, processing and sharing Office for National Statistics (ONS) data, must comply with this policy.

The Data Governance Committee and the Commercial Data team, supported by the web data group, will monitor this policy as applied to the business. The National Statistician’s Data Ethics Advisory Committee (NSDEC) will also monitor the application of the policy to ensure that all projects approved by NSDEC comply with it.

There are exceptions to this policy for small-scale exploratory web-scraping projects.

Failure to comply may result in disciplinary action in line with the organisation’s Discipline Policy.

Staff making a complaint in relation to the application of this policy should refer to the organisation’s Grievance Policy.


11. Governance

Policy owner: Head of the Commercial Data team in Data as a Service
Policy approval: Data Governance Committee (DGC)
Compliance Monitoring: Web Data Group
Review and amendments: Data Governance Committee (DGC)

12. Appendix A: Glossary

Application Programming Interface (API)

In a web-scraping context, an API can be built by a website owner to allow easy access to data from the website without having to build a web-scraper from scratch.
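
For illustration only, the following Python sketch shows how data might be retrieved from a hypothetical JSON API using the requests library; the endpoint and parameters are invented and do not refer to any real service.

# Illustrative sketch only: retrieving structured data from a hypothetical JSON API
# rather than scraping HTML pages.
import requests

response = requests.get(
    "https://api.example-jobs-portal.co.uk/v1/vacancies",  # hypothetical endpoint
    params={"region": "uk", "page": 1},
    timeout=30,
)
response.raise_for_status()
vacancies = response.json()  # the API returns structured data, so no HTML parsing is needed
print(f"Retrieved {len(vacancies)} vacancy records")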

Depth of crawl

The extent to which a web-crawler crawls pages “deep” within a website. If a website’s homepage is referred to as “level 0”, pages linked to from the homepage are “level 1”, pages linked to from level 1 pages are “level 2”, and so on, then limiting the depth of crawl means limiting the deepest level that the web-crawler will visit.
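
For illustration only, the following Python sketch shows one way a depth-limited, breadth-first crawl could be written. The starting URL is hypothetical, and fetching and link extraction are reduced to placeholder functions.

# Illustrative sketch only: a breadth-first crawl that never follows links
# beyond a chosen maximum depth.
from collections import deque


def fetch(url):
    """Placeholder for an HTTP GET returning the page HTML."""
    return ""


def extract_links(html, base_url):
    """Placeholder: in practice the HTML would be parsed for links."""
    return []


def crawl(start_url, max_depth=2):
    """Visit pages level by level; the homepage is level 0."""
    queue = deque([(start_url, 0)])
    visited = set()
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if depth < max_depth:  # do not queue links found at the maximum depth
            for link in extract_links(html, url):
                queue.append((link, depth + 1))
    return visited


pages_visited = crawl("https://www.example-site.co.uk/", max_depth=2)  # hypothetical URL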

E-commerce

Commercial transactions conducted electronically on the internet, for example, online shopping.

Parallelised crawling

When crawling multiple websites simultaneously, it is possible to “parallelise” the crawl by alternating between websites. For example, if website A and website B each contain multiple pages, then a parallelised crawl might involve capturing a single page from website A, followed by a single page from website B, and continuing to alternate. A non-parallelised crawl might involve crawling website A in its entirety, followed by website B in its entirety.
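
For illustration only, the following Python sketch shows a simple round-robin crawl that alternates between two websites so that successive requests rarely hit the same domain. The page lists are hypothetical and fetching is reduced to a placeholder.

# Illustrative sketch only: a parallelised (interleaved) crawl across two websites.
from itertools import zip_longest


def fetch(url):
    """Placeholder for an HTTP GET returning the page HTML."""
    return ""


site_a_pages = ["https://site-a.example/page1", "https://site-a.example/page2"]
site_b_pages = ["https://site-b.example/page1", "https://site-b.example/page2"]

# Interleave the two sites: A1, B1, A2, B2, ...
for url_a, url_b in zip_longest(site_a_pages, site_b_pages):
    for url in (url_a, url_b):
        if url is not None:
            fetch(url)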

Robots.txt exclusion protocol

A widely-used protocol that allows website owners to prevent any web-scraping, to limit web-scraping to search engines only, or to shield parts of their website from web-scraping. For more details, see the robots.txt website.

User-agent string

When a browser or web-scraper accesses a web-page, it provides a “user-agent string” to the server hosting the website, and this string is then viewable by the website owner. It is possible, when building a web-scraper, to modify this user-agent string so that it contains custom text – for example, to identify the operator or purpose of the web-scraper.

Web-scraper

The software or program used to carry out web-scraping.

Web-scraping

The collection of data automatically from the internet.


13. Appendix B: Web-scraping process

Figure 1 provides a detailed description of the end-to-end process of web-scraping data from websites.
