This policy sets out the practices and principles that Office for National Statistics (ONS) staff will follow when collecting data from websites to produce statistics and conduct statistical research, including exploratory research, that serves the public good.
For the purposes of this policy, web scraping is defined as the collection of data automatically retrieved from the internet. For a more detailed definition of web scraping, see the glossary in Appendix A.
Use of alternative data sources is a key element of ONS’ strategy for delivering statistics, analysis and advice, which helps Britain make better decisions. Driven by this strategic imperative, ONS staff may use web scraping as an alternative data collection mechanism that can complement and improve traditional forms of data collection such as surveys. The purpose of this policy is to ensure that web scraping at ONS is carried out transparently, consistently, ethically, and in compliance with all relevant legislation.
This policy is applicable to all ONS staff activities involving web scraping of non-personal/non-identifiable data. When obtaining or procuring web scraping services from a third party, ONS will seek to ensure that the overarching principles contained in this policy are met. The policy outlines key principles of web scraping and provides practical guidance. This policy does not cover the use of Application Programming Interfaces (APIs) (see Appendix A) and is not applicable to government departments outside ONS.
Section 45 of the Statistics and Registration Service Act (SRSA) 2007, as inserted by Section 79 of the Digital Economy Act (DEA), permits any public authority, large and medium size enterprises, and charity organisations to disclose to the Statistics Board any information they hold in connection with their functions.
Web scraping is only conducted by ONS for the purposes of any one or more of its functions set out in the Statistics and Registration Service Act 2007 and Census Act 1920, which limit the functions of ONS to the production and publication of official statistics that serve the public good.
ONS will adopt the following overarching principles to guide our approach to web scraping:
- Minimise burden on website owners
- Respect the Robots Exclusion Protocol
- Abide by all applicable legislation and monitor the evolving legal situation
5.1 Minimise burden on website owners
ONS' web scraping will minimise burden on the owners of the website, consistent with the Code of Practice for Statistics. The practices we will follow include, where applicable:
- Delaying access to pages on the same domain
- Adding idle time between requests
- Limiting the depth of crawl within the given domain
- When scraping multiple domains, parallelising the crawling operation to minimise successive requests to the same domain
- Scraping at a time of day when the website is unlikely to be experiencing heavy traffic, for example, early in the morning or at night
- Optimising the web scraping strategy to minimise volumes of requests to domains
- Only collecting parts of pages required for the purpose
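The idle time between requests to the same domain can be illustrated with a minimal sketch. It is not ONS' implementation: the class name `DomainThrottle` and the 5-second delay in the usage note are assumptions made for illustration, and the policy does not specify a delay value.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum idle time between requests to the same domain.

    Hypothetical helper for illustration; the policy does not prescribe
    a particular delay or mechanism.
    """

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the most recent request

    def wait(self, url):
        """Pause until the domain of `url` may be contacted again.

        Returns the number of seconds spent waiting.
        """
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(domain)
        pause = 0.0
        if last is not None:
            # Wait only for whatever part of the delay has not yet elapsed.
            pause = max(0.0, self.min_delay - (now - last))
            if pause > 0:
                self._sleep(pause)
        self._last_request[domain] = now + pause
        return pause
```

A scraper would call `throttle.wait(url)` immediately before each request, for example with `throttle = DomainThrottle(5.0)`. Requests to different domains do not delay one another, which is consistent with the parallelised crawling described in the list above.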
If substantial amounts of data are extracted on a regular basis, this will be communicated to website owners in the initial contact (see section 5.2). If web scraping lasts for longer than 3 months, ONS' Data Acquisition team will conduct a periodic review and inform website owners in writing.
The Data Acquisition team in Data as a Service (DaaS) is responsible for acquiring data from external suppliers to ONS and will maintain the central web scraping record (section 10.4), using it to optimise the web scraping strategy and to reduce burden on website owners.
5.2 Respect the Robots Exclusion Protocol
The ONS Data Acquisition team will contact website owners by email 3 weeks before web scraping activities commence, providing information on the purpose and scope of web scraping, the duration of the project, how to identify ONS' web scraper, contact details of ONS' Data Acquisition team, a link to ONS' Web Scraping Policy, how to share data if they feel uncomfortable with web scraping, and how to opt out. The website owners will have 2 weeks to respond to our request. If no reply is received, ONS will take it as no objection and web scraping activities will commence after this period.
There may be an exception to this notice period, when there is a strong case on the basis of national interest. This case will be clearly explained in the first contact with website owners.
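The standard notice timeline described above can be worked through with simple date arithmetic. This is a sketch under stated assumptions: the function name `notice_timeline` and the example dates are hypothetical, and the national-interest exception is not modelled.

```python
from datetime import date, timedelta

NOTICE_BEFORE_START = timedelta(weeks=3)  # email is sent 3 weeks before scraping starts
RESPONSE_WINDOW = timedelta(weeks=2)      # website owner has 2 weeks to respond


def notice_timeline(planned_start):
    """Return the key dates for a planned web scraping start date."""
    email_date = planned_start - NOTICE_BEFORE_START
    response_deadline = email_date + RESPONSE_WINDOW
    return {
        "email_date": email_date,
        "response_deadline": response_deadline,
        "planned_start": planned_start,
    }


if __name__ == "__main__":
    # Hypothetical planned start date, chosen only for illustration.
    for label, day in notice_timeline(date(2020, 6, 22)).items():
        print(f"{label}: {day.isoformat()}")
```

For a planned start of 22 June 2020, the email would be sent on 1 June 2020 and the response window would close on 15 June 2020.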
During web scraping, ONS will only visit publicly accessible parts of the sites. ONS will respect the robots.txt file and will use it to determine which parts of a site may be accessed. To distinguish ONS' web scraper from visits by normal users, the User Agent String will state:
- ONS is the operator of the web scraper
- The contact email address of ONS' Data Acquisition team
- The website link to this policy
Office for National Statistics
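As an illustration of how a scraper might carry this information, the sketch below builds a user agent string and attaches it to a request using Python's standard library. The exact wording and layout of ONS' user agent string are not given in this policy, so the format, the function names and the policy URL placeholder are all assumptions.

```python
from urllib.request import Request


def build_user_agent(operator, contact_email, policy_url):
    """Compose a user agent string naming the operator, contact and policy link.

    The layout of the string is an assumption made for illustration.
    """
    return f"{operator} web scraper (contact: {contact_email}; policy: {policy_url})"


def make_request(url, user_agent):
    """Create a request object that identifies the scraper to the web server."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    user_agent = build_user_agent(
        "Office for National Statistics",
        "Data.Acquisition@ons.gov.uk",
        "https://example.org/web-scraping-policy",  # placeholder, not the real policy URL
    )
    request = make_request("https://example.com/", user_agent)
    # urllib stores header names in capitalised form, hence "User-agent".
    print(request.get_header("User-agent"))
```

No network request is made here; the request object would be passed to `urllib.request.urlopen` only once the target page has been checked against robots.txt.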
5.3 Abide by all applicable legislation and monitor the evolving legal situation
The ONS is fully committed to compliance with the Data Protection Act 2018, ensuring that all processing of data is fair, lawful and transparent. The policy also includes the contact details of the Data Protection Officer and the Information Commissioner's Office (see section 10.3) to provide an independent channel where the data subject can raise their concerns if required.
This policy is applicable to web scraping of non-personal and non-identifiable data. Personal data is defined as any information relating to an identified or identifiable natural person, who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person (GDPR Article 4 (1)).
ONS recognises that there may be ethical and legal issues related to scraping and using data which potentially identifies individuals. ONS respects S.39 of the Statistics and Registration Service Act on confidentiality of personal information, and ONS will not disclose any data which are covered by this protection. All ONS staff who wish to web scrape should complete the ethics self-assessment form, which will be shared with the Data Ethics team. The Data Ethics team will refer projects to the National Statistician's Data Ethics Advisory Committee (NSDEC) in instances where ethical risks are identified as high.
The Data Acquisition team will continuously monitor compliance of web scraping activities and will consult ONS' legal team for advice on projects involving scraping sensitive data and for monitoring the evolving legal situation.
This section outlines the division of responsibility in regard to achieving policy compliance.
|ONS staff who request web scraping||Complying with the web scraping policy; consulting with the Data Acquisition team in Data as a Service (DaaS) before commencing any web scraping activities|
|Data Acquisition, Data as a Service (DaaS)||Advising ONS staff on any alternative and/or existing data sources; ensuring that web scrapers are fully compliant with the web scraping policy; receiving and approving web scraping requests from ONS staff; seeking advice from ONS Legal Services, NSDEC, and/or the Data Governance Committee (DGC) when needed; engaging with website owners' opt-out requests and any enquiries; keeping all records of ONS web scraping activities|
|Legal Services||Providing advice on current and evolving legal issues if required|
|Data Ethics team||Providing advice on ethical issues and referring any projects with high ethical risks to the National Statistician’s Data Ethics Advisory Committee (NSDEC)|
|National Statistician’s Data Ethics Advisory Committee (NSDEC)||Providing advice on ethical issues if required|
|Data Governance Committee (DGC)||Ensuring the consistent application of this policy to all ONS staff and assessing the organisational risk of conducting web scraping|
All ONS staff, as well as accredited researchers accessing, processing and sharing ONS data, must comply with this policy. The Data Governance Committee and Data as a Service will monitor this policy as applied to the business.
Any ONS staff who wish to web scrape should send a request form and ethics self-assessment form to the DaaS Data Acquisition team using the email address provided in this document, and should not commence web scraping activities until approval is given.
Failure to comply may result in disciplinary action in line with the organisation’s Discipline Policy. Staff making a complaint in relation to the application of this policy should refer to the organisation’s Grievance Policy.
|Policy owner||Data as a Service (DaaS)|
|Policy approval||Data Governance Committee (DGC)|
|Compliance Monitoring||Data Acquisition team, Data as a Service|
|Review and amendments||Data Governance Committee|
|Ethics||Data Ethics team|
This policy will be reviewed on 31 March 2021.
10.1 Web scraping steps
1. Web scraping request
- ONS business areas wishing to web scrape should contact firstname.lastname@example.org
- Each business area should complete the Web Scraping Request and ethics self-assessment form (see Section 10.2)
2. Request evaluation
- Data as a Service evaluates the request against the checklist provided in Section 10.2
- Data as a Service shares the completed ethics self-assessment form with the Data Ethics team for evaluation (see Section 5.3)
3. Website owner contact
- Data as a Service will send an email to website owners providing detailed information on web scraping and giving them 2 weeks to respond (Section 5.2)
4. Set up web scraping
- Web scraping commences if there is no written objection from website owners
- Make sure that the user agent string contains the information listed in Section 5.2
5. Web scraping operation
- Any requests from website owners will be dealt with promptly by the Data Acquisition team in Data as a Service, as described in Section 11
- All web scraping activities are recorded in ONS' central web scraping record (see Section 10.4) and monitored by Data as a Service
10.2 Web Scraping Request Checklist
10.3 Point of Contact
Data Acquisition Officer, Data as a Service, Office for National Statistics, Government Buildings, Duffryn, Newport, Wales NP10 8XG
01633 455 055
Data Governance, Legislation and Policy, UK Statistics Authority
Data Protection Officer
Office for National Statistics, Segensworth Road, Titchfield, Fareham, Hampshire PO15 5RR
0845 601 3034
Information Commissioner's Office
Wycliffe House, Water Lane, Wilmslow, Cheshire SK9 5AF
0303 123 1113
10.4 Central Web scraping record
All ONS staff who have used or plan to use web scraping should report their activities to the Data Acquisition team in Data as a Service. This record helps ONS optimise its web scraping strategy and reduce burden on website owners.
11.1 What options do website owners have when they receive an initial request letter from ONS?
As stated in Section 5.2, ONS' Data Acquisition team will contact all website owners by email 3 weeks before web scraping commences. Website owners will also be given the option of sharing data with ONS under a data sharing agreement. On receipt of the email, website owners can request further information and/or express their preference by replying to the sender's email address. Website owners have 2 weeks to refuse the request; to do so, they should reply to the sender's email address, listing their reasons for refusing. If no reply is received within 2 weeks of the email being sent, this will be taken as approval and web scraping will start on the commencement date stated in the original email.
11.2 Would website owners be informed about the progress?
As stated in Section 5.1, if web scraping lasts longer than 3 months, ONS' Data Acquisition team will conduct a periodic review of the activity and will inform the website owners. Website owners can specifically request the review by writing to Data.Acquisition@ons.gov.uk.
11.3 What happens if a website owner wishes to opt out?
If website owners wish to opt out, they can do so by replying to the initial request email sent by Data.Acquisition@ons.gov.uk within 2 weeks of receiving the initial request. In writing they should state why they wish to opt out and ONS' team will engage with them promptly to address any concerns and to negotiate the next step, including other means of sharing the data.
If website owners wish to opt out once web scraping commences, they can do so by writing to Data.Acquisition@ons.gov.uk and the Data Acquisition team will engage with them promptly to solve any issues and/or to organise termination of the web scraping activities. Please note that time until termination may vary depending on operational or statistical requirements. ONS' team will explain such circumstances to website owners in writing.
11.4 What happens if ONS' web scraper is blocked from websites?
If ONS' web scraper is blocked without notice from the website owner, the Data Acquisition team will contact the website owner for more information.
Application Programming Interface (API)
In a web scraping context, an API can be built by a website owner to allow easy access to data from the website without having to build a web scraper from scratch
Depth of crawl
The extent to which a web-crawler crawls pages "deep" within the website. If a website's homepage is referred to as "level 0", pages linked to from the homepage are "level 1", pages linked to from level 1 pages are "level 2" and so on, then limiting the depth of crawl means limiting the level that the web-crawler will penetrate to
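The level numbering above corresponds to a breadth-first traversal that stops following links at a chosen level. The sketch below is illustrative only: it works on a small in-memory link graph rather than live pages, and the function name `crawl_to_depth` is hypothetical.

```python
from collections import deque


def crawl_to_depth(start_page, links, max_depth):
    """Return {page: level} for every page reachable within `max_depth` levels.

    `links` maps each page to the pages it links to; the start page is level 0.
    """
    levels = {start_page: 0}
    queue = deque([start_page])
    while queue:
        page = queue.popleft()
        if levels[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in levels:  # visit each page once, at its shallowest level
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels


if __name__ == "__main__":
    # Hypothetical site structure used only for illustration.
    site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"]}
    print(crawl_to_depth("/", site, max_depth=1))
```

With `max_depth=1`, only the homepage and the pages it links to directly are visited; `/a/1` and anything below it are never requested.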
Parallelisation
When crawling multiple websites simultaneously, it is possible to "parallelise" the crawl by alternating between websites. For example, if website A and website B contain multiple pages, then a parallelised crawl might involve capturing a single page from website A, followed by a single page from website B, and continuing sequentially. A non-parallelised crawl might involve crawling website A in its entirety, followed by website B in its entirety.
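The alternating order in the website A and website B example can be sketched as a round-robin over per-site page lists. The function name `interleave_by_site` and the page labels are assumptions for illustration; a real crawler would discover pages as it goes rather than start from complete lists.

```python
from itertools import zip_longest


def interleave_by_site(pages_by_site):
    """Yield pages in round-robin order: one page per site, then the next round.

    `pages_by_site` maps a site name to the ordered list of its pages.
    """
    missing = object()  # marks sites that have run out of pages
    for round_of_pages in zip_longest(*pages_by_site.values(), fillvalue=missing):
        for page in round_of_pages:
            if page is not missing:
                yield page


if __name__ == "__main__":
    pages = {"A": ["A1", "A2", "A3"], "B": ["B1", "B2"]}
    print(list(interleave_by_site(pages)))
```

For the example input the order is A1, B1, A2, B2, A3, so successive requests to the same website are separated by a request to the other website for as long as both have pages left.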
Robots Exclusion Protocol
A widely used protocol that allows website owners to prevent any web-scraping, to limit web scraping to search engines only, or to shield parts of their website from web scraping. For more details, see the robots.txt website.
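A minimal sketch of checking a URL against a site's robots.txt rules, using Python's standard `urllib.robotparser` module. The robots.txt content, the `ons-web-scraper` user agent token and the URLs are invented for illustration, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all scrapers are kept out of /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_txt):
    """Parse robots.txt text and return a parser that can answer access queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    checker = build_robots_checker(EXAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/public/page", "https://example.com/private/data"):
        allowed = checker.can_fetch("ons-web-scraper", url)
        print(url, "allowed" if allowed else "disallowed")
```

In practice a scraper would load the live file with `set_url` and `read` before crawling, and skip any URL for which `can_fetch` returns `False`.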
User Agent String
When a browser or web scraper accesses a web page it provides a "user agent string" to the server hosting the website and this string is then viewable by the website owner. It is possible, when building a web scraping programme, to modify this user agent string so that it contains custom text, for example, to identify the operator or purpose of web scraping
Web scraping
The automatic collection of data from the internet using software or a program