The coronavirus (COVID-19) pandemic is having a profound impact across the UK. In response to the pandemic, the COVID-19 Infection Survey measures:
how many people across England, Wales, Northern Ireland and Scotland test positive for COVID-19 infection at a given point in time, regardless of whether they report experiencing symptoms
the average number of new positive test cases per week over the course of the study
the number of people who test positive for antibodies, to indicate how many people are ever likely to have had the infection or have been vaccinated
The results of the survey contribute to the Scientific Advisory Group for Emergencies (SAGE) estimates of the rate of transmission of the infection, often referred to as "R". The survey also provides important information about the socio-demographic characteristics of the people and households who have contracted COVID-19.
The Office for National Statistics (ONS) is working with the University of Oxford, University of Manchester, Public Health England, Wellcome Trust, IQVIA and the Lighthouse laboratories at Glasgow and the UK Biocentre Milton Keynes to run the study, which was launched in mid-April 2020 as a pilot in England. We expanded the sample between August and October 2020 and now report headline figures for all four UK nations.
This methodology guide is intended to provide information on the methods used to collect the data, process it, and calculate the statistics produced from the COVID-19 Infection Survey. We will continue to expand and develop methods as the study progresses, and we will publish an updated methodology guide when needed.
It can be read alongside:
the COVID-19 bulletin, which gives weekly headline statistics
the study protocol, which outlines the study design and rationale
At the start of the pilot study at the end of April 2020, the sample for the survey was drawn mainly from the Annual Population Survey (APS), which consists of those who completed the last wave of the Labour Force Survey (LFS) or a local LFS boost and consented to future contact regarding research.
Around 38,000 households respond to the LFS each quarter, making it the largest regular household survey in the UK. The sampling frame for the LFS is the Postal Address File of small users, which contains approximately 26 million addresses. Only private households in the UK are included in the study; people living in care homes, other communal establishments and hospitals are not included.
At the start of the pilot stage of the study, we invited about 20,000 households in England to take part, anticipating that this would result in approximately 21,000 individuals from approximately 10,000 households participating. From the end of May 2020 through to July 2020, additional households were invited to take part in the survey each week (roughly 5,000 a week).
At the pilot stage of the study in England, all respondents to the COVID-19 Infection Survey were individuals who had previously participated in an Office for National Statistics (ONS) social survey, which means the number of ineligible addresses in the sample is substantially reduced. To take part, invited households opted into the survey by contacting IQVIA, a company working on behalf of the ONS, to arrange a visit.
Since the end of July 2020, we have further expanded the survey to invite a random sample of households from the AddressBase, which is a commercially available list of addresses maintained by the Ordnance Survey.
In line with our plans to increase our overall sample size, we prioritised some specific areas under government local restriction due to an outbreak of COVID-19. We invited 40,000 extra households from 14 selected local authorities in Greater Manchester, Lancashire and West Yorkshire to participate in this study.
We also boosted our sample in London, inviting 50,000 extra households to increase household involvement rates in this area.
In August 2020 we announced plans to further expand the study, with the aim of increasing from 28,000 people tested per fortnight in England to 150,000 per fortnight by October 2020, plus 15,000 per fortnight in each of Wales and Scotland and up to 15,000 in Northern Ireland, running until March 2021. The sample for this expansion was invited from a random sample of households from AddressBase.
Following the expansion of fieldwork in England, coverage of the study was extended to include Wales, Northern Ireland and Scotland. Survey fieldwork in Wales began on 29 June 2020, with a sample of 17,329 households who had participated in other ONS studies and agreed to be contacted regarding future research. Since the beginning of October 2020, we have expanded the survey to invite a further 37,845 households randomly sampled from AddressBase. Since 7 August 2020, we have reported headline figures for Wales.
Survey fieldwork began in Northern Ireland on 26 July 2020, with a sample of 14,156 households that had participated in ONS and NISRA surveys and had agreed to be contacted regarding future research. Since 25 September 2020 we have reported headline figures for Northern Ireland.
In Scotland, fieldwork began on 21 September 2020, with an initial sample larger than the initial sample sizes used in England and Wales. As of 31 October 2020, 137,255 households sampled from a list of addresses were invited to participate in the survey. Estimates for Scotland were first published on 30 October 2020.
We include children aged 2 years and over, adolescents and adults in the survey. Children are included because it is essential to understand the prevalence and incidence of symptomatic and asymptomatic infection in children. This is particularly important for informing policy decisions around schools. Further information on the prevalence of coronavirus (COVID-19) in schools can be found in our latest release from the COVID-19 Schools Infection Survey.
Additionally, 20% of adults over 16 years old surveyed within our household sample were asked to provide a blood sample. This is used to test for the presence of antibodies to the coronavirus (COVID-19).
More information about how participants are sampled can be found in the study protocol.
Likelihood of enrolment decreases over time since the original invitation letter was sent, and response rate information for those initially asked to take part at the start of the survey in England can be considered as relatively final. We provide response rates separately for the different sampling phases of the study. These response rates can be found in the datasets provided alongside our weekly bulletin.
Bulletin table 7a
Provides a summary of the total number of households registered and eligible individuals in registered households for the UK.
Bulletin table 7b
Provides a summary of the response rates for England, by the different sampling phases of the survey:
Table A presents response rates for those asked to take part at the start of the survey, sampled from previous ONS studies
Table B presents response rates for those invited from the end of May 2020, sampled from previous ONS studies
Tables A and B can be considered relatively final, as the likelihood of enrolment decreases over time
Table C presents response rates for those invited from the end of July 2020, from a randomly sampled list of addresses, where enrolment is continuing
Bulletin table 7c
Provides a summary of the response rates for Wales by the different sampling phases of the survey:
Table A presents response rates for those invited from the end of June 2020, sampled from previous ONS studies
Table B presents response rates for those asked to take part from the beginning of October 2020, from a randomly sampled list of addresses
Bulletin table 7d
Provides a summary of the response rates for Northern Ireland.
Bulletin table 7e
Provides a summary of the response rates for Scotland.
Bulletin table 7f
Provides information on the number of swabs taken per day since the study began.
Note that response rates from different sampling phases are not comparable. These response rates cannot be regarded as final, since those who are invited are not given a time limit in which to respond, and because we aim to recruit households continuously to meet our fortnightly targets (rather than recruiting everyone who registers immediately). For up-to-date information on our response rates, please see our most recent bulletin.
To produce reliable and generalisable estimates, it is desirable that the survey sample reflects the diversity of the population under investigation. For this reason, it is important we retain sample members who agree to participate for the duration of the study. For various reasons, some sample members are unreachable, withdraw their participation or drop out of the study after the first or second visit. If those who drop out of the sample are significantly different from those who remain, it will affect researchers' ability to produce estimates that are generalisable to the target population. On the Coronavirus (COVID-19) Infection Survey, we monitor the number of people who drop out of the sample to mitigate the potential risks caused by attrition. Attrition rates for those who have provided an initial swab can be worked out using the response rate tables discussed previously. Attrition in the study has never been more than 5%.
Nose and throat swab
We ask everyone aged 2 years or older in each household to have a nose and throat swab. Those aged 12 years and older take their own swabs using self-swabbing kits, and parents or carers use the same type of kits to take swabs from their children aged between 2 and 11 years old. This is to reduce the risk to the study health workers and to respondents themselves. We take swabs from all households, whether anyone is reporting symptoms or not.
We need to know more about how the virus is transmitted in individuals who test positive on nose and throat swabs; whether individuals who have had the virus can be re-infected symptomatically or asymptomatically; and about incidence of new positive tests in individuals who have not been exposed to the virus before.
To address these questions, we collect data over time. Every participant is swabbed at their first visit; participants are also invited to have repeat tests every week for the first five weeks, then monthly, for 12 months in total.
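As a rough sketch, this visit schedule can be generated from an enrolment date. The "monthly" interval is approximated as 28 days here; actual scheduling is an operational matter, so treat this as illustrative only:

```python
from datetime import date, timedelta

def visit_schedule(enrolment: date) -> list:
    """Sketch of the repeat-testing schedule: an enrolment swab, weekly
    swabs for the first five weeks, then monthly visits (approximated
    here as every 28 days) up to 12 months after enrolment."""
    visits = [enrolment]
    # Weekly visits for the first five weeks after enrolment.
    visits += [enrolment + timedelta(weeks=w) for w in range(1, 6)]
    # Monthly visits thereafter, until roughly 12 months after enrolment.
    nxt = visits[-1] + timedelta(days=28)
    end = enrolment + timedelta(days=365)
    while nxt <= end:
        visits.append(nxt)
        nxt += timedelta(days=28)
    return visits

schedule = visit_schedule(date(2020, 5, 1))
```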
We take swabs to detect the coronavirus (COVID-19) so we can measure the number of people who are infected. To do this, laboratories use real-time reverse transcriptase polymerase chain reaction (RT-PCR) (PDF, 731KB).
Swabs are tested for three genes present in the coronavirus: N protein, S protein and ORF1ab. Each swab can have one, two or all three genes detected. In the laboratories used in the survey, RT-PCR for the three SARS-CoV-2 genes uses the Thermo Fisher TaqPath RT-PCR COVID-19 Kit, analysed using UgenTec FastFinder 3.300.5, with an assay-specific algorithm and decision mechanism that allows conversion of amplification assay raw data from the ABI 7500 Fast into test results with minimal manual intervention. Samples are called positive if the N-gene and/or ORF1ab is detected (although S-gene cycle threshold (Ct) values are determined, S-gene detection alone is not considered sufficient to call a sample positive). We estimate a single Ct value as the arithmetic mean of the Ct values for the genes detected (Spearman correlation >0.98 between each pair of Ct values). More information on how the swabs are analysed can be found in the study protocol.
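The calling rule and the single Ct summary can be expressed as a short function. This is an illustrative sketch only; the gene names and the dictionary representation are our own, not the laboratory pipeline's actual interface:

```python
from statistics import mean

def call_sample(ct_values: dict):
    """Apply the calling rule described above: a sample is positive if
    the N-gene and/or ORF1ab is detected; S-gene detection alone is not
    sufficient. The reported Ct is the arithmetic mean of the Ct values
    of the genes detected. ct_values maps gene name ('N', 'ORF1ab', 'S')
    to the Ct value, for detected genes only."""
    positive = ('N' in ct_values) or ('ORF1ab' in ct_values)
    overall_ct = mean(ct_values.values()) if positive else None
    return positive, overall_ct

# S-gene alone (the pattern relevant to the B.1.1.7 variant) is not called positive.
s_only = call_sample({'S': 27.0})
# N and ORF1ab detected: positive, with the mean Ct reported.
both = call_sample({'N': 24.0, 'ORF1ab': 26.0})
```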
In mid-November 2020, a new variant of the coronavirus (COVID-19) was identified in the UK. This COVID-19 variant (also called "B.1.1.7") has changes in one of the three genes which coronavirus swab tests detect, known as the S-gene. This means in cases compatible with this variant, the S-gene is no longer detected by the current test. When there is a high viral load (for example, when a person is most infectious), absence of the S-gene in combination with the presence of the other two genes (ORF1ab- and N-genes) is a reliable indicator of this variant. However, as the viral load decreases (for example, if someone is near the end of their recovery from the infection), the absence of the S-gene is a less reliable indicator of this variant. We have reported on the percentage of people testing positive compatible with this variant since 24 December 2020. Further information on the current percentage of people testing positive compatible with this variant can be found in our latest bulletin.
RT-PCR from nose and throat swabs may be falsely negative because of swab quality or the timing of collection. Virus levels in nose and throat secretions peak in the first week of symptoms but may decline below the limit of detection in patients who present beyond this time frame. For people who have been infected and then recovered, the RT-PCR technique provides no information about prior exposure or immunity. To address this, we also collect blood samples to test for antibodies (see following section).
To capture data about people who have had COVID-19 but have since recovered, we aim to ask adults aged 16 years or older from 20% of enrolled households to also give a sample of blood, using one of two methods: either 5 millilitres of venous blood drawn by a healthcare professional, or 0.5 millilitres collected using a capillary finger-prick method undertaken by the participant and demonstrated by a specially trained fieldworker. The blood samples are taken at enrolment and then every month.
Blood samples are tested for antibodies using an assay for IgG immunoglobulins against the spike (S) protein, which are produced to fight the virus, irrespective of symptoms. More information on the methods around this antibody assay can be found in a study comparing its performance with four other assays. From March 2021, we will also be testing samples for IgG immunoglobulins against the nucleocapsid (N) protein.
Where an individual in a household has symptoms compatible with COVID-19 infection, or is currently self-isolating or shielding, a blood sample will be taken using the finger-prick method administered by the participant.
We collect information from each participant, including those under 16 years of age, about their socio-demographic characteristics, any symptoms that they are experiencing, whether they are self-isolating, their occupation, how often they work from home, and whether the participant has come into contact with a suspected carrier of COVID-19. In recent months, new questions have been added to the participant questionnaire to gather additional information on participants' experiences of the pandemic. New questions cover a range of topics, such as long COVID, whether participants have been vaccinated, travelling to work, how easy it is to maintain social distancing, and whether participants smoke. See the Coronavirus Infection survey questionnaire.
Notes for Study design: data we collect
Konrad R, Eberle U, Dangel A, and others. Rapid establishment of laboratory diagnostics for the novel coronavirus SARS-CoV-2 in Bavaria, Germany, February 2020. Euro Surveill 2020; 25(9).
To KK, Tsang OT, Leung WS, and others. Temporal profiles of viral load in posterior oropharyngeal saliva samples and serum antibody responses during infection by SARS-CoV-2: an observational cohort study. Lancet Infect Dis 2020.
Li Z, Yi Y, Luo X, and others. Development and clinical application of a rapid IgM-IgG combined antibody test for SARS-CoV-2 infection diagnosis. J Med Virol 2020.
Zhou P, Yang XL, Wang XG, and others. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 2020; 579(7798): 270-3
National COVID Testing Scientific Advisory Panel. Antibody testing for SARS-CoV-2 using ELISA and lateral flow immunoassays. medRxiv 2020.
The National SARS-CoV-2 Serology Assay Evaluation Group. Head-to-head benchmark evaluation of the sensitivity and specificity of five immunoassays for SARS-CoV-2 serology on more than 1,500 samples.
The nose and throat swabs are sent to the Lighthouse laboratories at Glasgow and the National Biosample Centre at Milton Keynes. Here, they are tested for SARS-CoV-2 using reverse transcriptase polymerase chain reaction (RT-PCR). This is an accredited test that is part of the national testing programme. Swabs are discarded after testing. The virus genetic material from positive samples is sent for whole genome sequencing in Oxford, through COG-UK, to find out more about the different types of virus circulating in the UK.
Blood tubes are kept in a cool bag during the day, and then sent to the University of Oxford overnight. Blood is tested for antibodies using an ELISA for IgG immunoglobulins, based on tagged and purified recombinant SARS-CoV-2 trimeric spike protein. From March 2021, we will also be testing samples for IgG immunoglobulins against the nucleocapsid (N) protein. Residual blood samples will be stored by the University of Oxford after testing where consent is given for this.
More information about swab and blood sample procedure and analysis can be found in the study protocol.
Understanding false-positive and false-negative results
The estimates provided in the Coronavirus (COVID-19) Infection Survey bulletin are for the percentage of the private-residential population testing positive for the coronavirus (COVID-19), otherwise known as the positivity rate. We do not report the prevalence rate. To calculate the prevalence rate, we would need an accurate understanding of the swab test's sensitivity (true-positive rate) and specificity (true-negative rate).
Our data and related studies provide an indication of what these are likely to be. To understand the potential impact, we have estimated what prevalence would be in two scenarios using different possible test sensitivity and specificity rates.
Test sensitivity measures how often the test correctly identifies those who have the virus, so a test with high sensitivity will not have many false-negative results. Studies suggest that sensitivity may be somewhere between 85% and 98%. A recent study considering tests in the Lighthouse labs estimated that this is most likely to be around 95%.
Our study involves participants self-swabbing under the supervision of a study healthcare worker. It is possible that some participants may take the swab incorrectly, which could lead to more false-negative results. However, research suggests that self-swabbing under supervision is likely to be as accurate as swabs collected directly by healthcare workers.
Test specificity measures how often the test correctly identifies those who do not have the virus, so a test with high specificity will not have many false-positive results.
We know the specificity of our test must be very close to 100% as the low number of positive tests in our study over the summer of 2020 means that specificity would be very high even if all positives were false. For example, in the six-week period from 31 July to 10 September 2020, 159 of the 208,730 total samples tested positive. Even if all these positives were false, specificity would still be 99.92%.
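This worst-case bound is simple arithmetic: even if every positive were false, specificity could be no lower than one minus the observed positive rate. A quick check using the figures above:

```python
# Worst-case specificity: treat every one of the 159 positives among
# 208,730 samples (31 July to 10 September 2020) as a false positive.
positives = 159
total = 208_730

false_positive_rate = positives / total        # upper bound on the false-positive rate
specificity_lower_bound = 1 - false_positive_rate

print(f"{specificity_lower_bound:.2%}")        # 99.92%
```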
We know that the virus was still circulating at this time, so it is extremely unlikely that all these positives are false. However, it is important to consider whether many of the small number of positive tests we do have might be false. There are two main reasons we do not think that is the case.
Symptoms are an indication that someone has the virus, but are reported by only a minority of participants at each visit. We might expect that false-positives would report no symptoms, or fewer symptoms (because the positive is false). Of the positives we find, we would therefore expect most of the false-positives to occur among those not reporting symptoms. If that were the case, risk factors would be more strongly associated with symptomatic infections than with infections without reported symptoms. However, in our data the risk factors for testing positive are equally strong for both symptomatic and asymptomatic infections.
Assuming that false-positives do not report symptoms but occur at a roughly constant rate over time, and that among true-positives the ratio of those with and without symptoms is approximately constant, a high rate of false-positives would mean that the percentage of individuals not reporting symptoms among those testing positive would increase when true prevalence is declining. This is because total prevalence would be the sum of a constant rate of false-positives (all without symptoms) and a declining rate of true-positives (with a constant proportion with and without symptoms).
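A small numeric illustration of this argument, using purely hypothetical rates (not survey estimates): with a constant false-positive rate and a constant symptomatic fraction among true positives, the asymptomatic share of all positives must rise as true prevalence falls.

```python
def asymptomatic_share(true_prev, fp_rate, symptomatic_frac):
    """Share of all test-positives who report no symptoms, under the
    assumptions above: false-positives occur at a constant rate and
    never report symptoms; true-positives have a constant symptomatic
    fraction. All numbers are hypothetical."""
    asymptomatic = fp_rate + (1 - symptomatic_frac) * true_prev
    return asymptomatic / (fp_rate + true_prev)

fp = 0.0002   # hypothetical constant false-positive rate
s = 0.4       # hypothetical symptomatic fraction among true positives

high_prev = asymptomatic_share(0.010, fp, s)   # true prevalence at 1%
low_prev = asymptomatic_share(0.001, fp, s)    # true prevalence falls to 0.1%
# As true prevalence falls, the asymptomatic share of positives rises.
```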
More information on sensitivity and specificity is included in Community prevalence of SARS-CoV-2 in England: Results from the ONS Coronavirus Infection Survey Pilot by the Office for National Statistics' academic partners. You can find additional information on cycle thresholds in a paper written by our academic partners at the University of Oxford.
The impact on our estimates
We have used Bayesian analysis to calculate what prevalence would be in two different scenarios, one with medium sensitivity and the other with low sensitivity. Table 1 shows these results alongside the weighted estimate of the percentage testing positive in the period from 6 September to 19 September 2020.
Scenario 1 (medium sensitivity, high specificity)
Table 1: The effects of test sensitivity on estimates (reference period: 6 to 19 September 2020; 95% credible intervals in brackets)

Estimated average percentage of the population who had COVID-19 (weighted): 0.22% (0.18% to 0.26%)
Prevalence rate in Scenario 1 (medium sensitivity, high specificity): 0.22% (0.17% to 0.29%)
Prevalence rate in Scenario 2 (low sensitivity, high specificity): 0.34% (0.24% to 0.49%)
Based on similar studies, the sensitivity of the test used is plausibly between 85% and 95% (with around 95% probability) and, as noted earlier, the specificity of the test is above 99.9%.
Scenario 2 (low sensitivity, high specificity)
To allow for the fact that individuals are self-swabbing, Scenario 2 assumes a lower overall sensitivity rate of 60% on average (or between 45% and 75% with 95% probability), incorporating the performance of both the test itself and the self-swabbing. This is lower than we expect the true value to be for overall performance but provides a lower bound.
The results show that when these estimated sensitivity and specificity rates are taken into account, the prevalence rate would be slightly higher but still very close to the main estimate presented in Section 2 of the Coronavirus (COVID-19) Infection Survey bulletin. This is the case even in Scenario 2, where we use a sensitivity estimate that is lower than we expect the true value to be. For scenario 2, prevalence is higher because this scenario is based on an unlikely assumption that the test misses 40% of positive results. For this reason, we do not produce prevalence estimates for every analysis, but we will continue to monitor the impacts of sensitivity and specificity in future.
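The survey's scenarios are produced with a Bayesian analysis; the direction of the effect can be illustrated with the simpler frequentist Rogan-Gladen correction. This is a sketch only, using point values of sensitivity and specificity rather than the distributions used for Table 1, so it will not reproduce the table exactly:

```python
def rogan_gladen(apparent, sensitivity, specificity):
    """Classical correction of an apparent (test-based) positivity rate
    for imperfect sensitivity and specificity:
    true = (apparent + specificity - 1) / (sensitivity + specificity - 1)."""
    return (apparent + specificity - 1) / (sensitivity + specificity - 1)

observed = 0.0022  # 0.22% weighted percentage testing positive

# High sensitivity, near-perfect specificity: the adjustment is small.
scenario_1 = rogan_gladen(observed, sensitivity=0.95, specificity=1.0)
# Low (60%) sensitivity: prevalence sits noticeably above the observed rate.
scenario_2 = rogan_gladen(observed, sensitivity=0.60, specificity=1.0)
```

With 95% sensitivity the adjusted rate barely moves, while 60% sensitivity pushes it noticeably higher, mirroring the pattern across the two scenarios in Table 1.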
Evaluation of the test sensitivity and specificity of five immunoassays for SARS-CoV-2 serology, including the ELISA assay used in our study, has shown that this assay has sensitivity and specificity (95% confidence interval) of 99.1% (97.8 to 99.7%) and 99.0% (98.1 to 99.5%) respectively; compared with 98.1% (96.6 to 99.1%) and 99.9% (99.4 to 100%) respectively for the best performing commercial assay.
As in any survey, some data can be incorrect or missing. For example, participants and interviewers sometimes misinterpret questions or skip them by accident. It is important to run a pilot before running a full survey, so that the survey instrument can be improved. To minimise the impact of incorrect or missing data, we clean the data by editing or removing data that are clearly incorrect.
The primary objective of the study is to estimate the number of people in the population who test positive for coronavirus (COVID-19) on nose and throat swabs, with and without symptoms.
The analysis of the data is a collaboration between the Office for National Statistics (ONS) and researchers from the University of Oxford, the University of Manchester, Public Health England and Wellcome Trust. Our academic collaborators aim to publish an extended account of the modelling methodology outside the ONS bulletin publication in peer-reviewed articles: examples include an article on community prevalence of SARS-CoV-2 in England, an article on cycle threshold (Ct) values and positivity, and a paper on the new UK variant identified in mid-November 2020.
All estimates presented in our bulletins are provisional results. As swabs are not necessarily analysed in date order by the laboratory, we have not yet received test results for all swabs taken on the dates included in this analysis. Estimates may therefore be revised as more test results are included.
We use a number of different modelling techniques to estimate the number of people testing positive for SARS-CoV-2, the virus that causes the coronavirus (COVID-19) disease, broken down by different characteristics (age, region and so on). Further information on our modelling techniques is provided in this section.
Bayesian multi-level regression poststratification (MRP) model
A Bayesian multi-level regression post-stratification (MRP) model is used to produce our headline estimates of positivity on nose and throat swabs for each UK country as well as our breakdowns of positivity by region and age group in England. This produces estimated daily rates of people testing positive for COVID-19 controlling for a number of factors described in this section. This technique is also used by organisations such as the Centers for Disease Control and Prevention (CDC) to provide prevalence of diseases at both a national and subnational level in the United States. Details about the methodology are also provided in the peer-reviewed paper from our academic collaborators published in the Lancet Public Health.
As the percentage of people testing positive (known as the positivity rate) is unlikely to follow a linear trend, time measured in days is included in the model using a non-linear function (a thin-plate spline). Time trends are allowed to vary between regions by including an interaction between region and time. Given the low number of positive cases, the effect of time is not allowed to vary by other factors.
The models for the positivity rate for each country use all available swab data from participants from their respective country within time periods to estimate the number of people who currently have SARS-CoV-2. A Bayesian multi-level generalised additive model with a complementary log-log link was used.
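The complementary log-log link maps probabilities in (0, 1) onto the whole real line and is asymmetric, which makes it a common choice for rare outcomes such as low positivity rates. A minimal illustration of the link and its inverse (the survey's actual models are fitted in a Bayesian framework, which is not shown here):

```python
import math

def cloglog(p):
    """Complementary log-log link: maps a probability in (0, 1) to the real line."""
    return math.log(-math.log(1 - p))

def inv_cloglog(x):
    """Inverse link: maps a linear predictor back to a probability in (0, 1)."""
    return 1 - math.exp(-math.exp(x))

p = 0.003                  # a small positivity rate, similar in scale to survey estimates
x = cloglog(p)             # the linear-predictor scale the model works on
recovered = inv_cloglog(x) # round-trips back to the original probability
```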
The COVID-19 Infection Survey is based on a nationally representative survey sample; however, some individuals in the original Office for National Statistics (ONS) survey samples will have dropped out and others will not have responded to the survey. To address this and reduce potential bias, the regression models initially adjusted the survey results to be more representative of the overall population in terms of age, sex and region (region was only adjusted for in the England model). After a further assessment of the study sample, we updated the headline positivity models for England, Wales and Scotland to be more representative in terms of ethnicity. The Northern Ireland headline positivity model did not need to be adjusted, as the Northern Ireland sample is the most representative in terms of ethnicity. The regression models do not adjust for household tenure or household size.
The data that are modelled are drawn from a sample, and so there is uncertainty around the estimates that the model produces. Because a Bayesian regression model was used, we present estimates along with credible intervals. These 95% credible intervals can be interpreted as there being a 95% probability that the true value being estimated lies within the interval; a wider interval indicates more uncertainty in the estimate.
Sub-regional estimates for England were first presented on 20 November 2020 and for Wales, Northern Ireland and Scotland on 19 February 2021. As sample sizes varied between local authorities, we pooled local authorities together to create COVID-19 Infection Survey sub-regions. Sub-regional estimates are obtained from a spatio-temporal MRP model. This is on a similar basis to the dynamic Bayesian MRP model used for national and regional trend analysis, producing estimated daily rates of people testing positive for COVID-19 controlling for age and sex within sub-regions. Spatio-temporal in this context means the model borrows strength geographically and over time: the model implicitly expects rates to be more similar in neighbouring areas, and within an area over time. For our sub-regional analysis, we run two models: one for Great Britain and the other for Northern Ireland. Our academic partners from the University of Oxford have published this spatio-temporal MRP methodology in a peer-reviewed article.
Initially for England, sub-regional estimates were produced using three-day groupings aggregated to a six-day period. However, because of falling numbers of positive cases and smaller sample sizes in some sub-regions, we have changed to seven-day groupings to provide more accurate estimates for all countries of the UK; these were presented for the first time on 12 February 2021.
Age analysis by category for England
"age 2 years to school Year 6" includes those children in primary school and below
"school Year 7 to school Year 11" includes those children in secondary school
"school Year 12 to age 24 years" includes those young adults who may be in further or higher education
age 25 to age 34 years
age 35 to age 49 years
age 50 to age 69 years
age 70 years and above
Our current age categories separate children and young people by school age. This means that 11- to 12-year-olds have been split between the youngest age categories depending on whether they are in school Year 6 or 7 (birthday before or after 1 September). Similarly, 16- to 17-year-olds are split depending on whether they are in school Years 11 or 12 (birthday before or after 1 September). Splitting by school year rather than age at last birthday reflects a young person's peers and therefore more accurately reflects their activities both in and out of school.
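The splitting rule can be made concrete. The sketch below assumes England's convention that the academic year starts on 1 September and that a child is in Year 1 in the academic year in which they were aged 5 on the preceding 31 August; the function and its details are our own illustration, not the survey's code:

```python
from datetime import date, timedelta

def school_year_group(dob: date, on: date) -> int:
    """Approximate school year group (1 = Year 1) on a given reference
    date, assuming the academic year starts on 1 September and that a
    child aged 5 on the preceding 31 August is in Year 1. A simplified
    sketch of the splitting rule, not the survey's code."""
    # Start of the academic year containing the reference date.
    academic_start = date(on.year if on.month >= 9 else on.year - 1, 9, 1)
    aug31 = academic_start - timedelta(days=1)  # the preceding 31 August
    # Age in whole years on that 31 August.
    age_on_aug31 = aug31.year - dob.year - ((aug31.month, aug31.day) < (dob.month, dob.day))
    return age_on_aug31 - 4  # aged 5 on 31 August -> Year 1

ref = date(2020, 10, 1)
# Two 11-year-olds on the same date fall either side of the Year 6/7 split:
year_a = school_year_group(date(2009, 6, 15), ref)   # birthday before 1 September
year_b = school_year_group(date(2009, 9, 15), ref)   # birthday after 1 September
```

Both children are 11 on the reference date, yet the first falls in Year 7 and the second in Year 6, which is exactly the split described above.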
The model used to produce our daily estimates by age category for England adjusts for different positivity rates in different regions, but presents the estimated level using the East Midlands as a reference region. There may be some variation in the percentage testing positive between regions, but the East Midlands is representative for the positivity rates across regions of England. The age model differs from the national positivity models mentioned previously, as it does not post-stratify estimates, therefore results are not adjusted to reflect the underlying population sizes. Furthermore, our age category model does not include the same interaction terms with time as our national headline positivity model, and therefore results are not comparable.
Methodology used to produce single year age over time estimates by UK country
To assess swab positivity over time by single year of age, we used generalised additive models (GAMs) with a complementary log-log link and tensor product smooths. The latter allow us to incorporate smooth functions of age and time, where the effect of time is allowed to differ by age. Tensor product smooths generally perform better than isotropic smooths when the covariates of a smooth are on different scales, for example, age in years and time in days.
The Unbiased Risk Estimator (UBRE) criterion was used to optimise the smoothness of the curve given the observed data. The analyses are based on the most recent eight weeks of data on swab positivity among individuals aged 2 to 80 years. The effects of age and time are allowed to vary by region, but marginal probabilities and their confidence intervals are obtained for the whole of England. Separate models are run for England, Wales, Scotland and Northern Ireland.
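The construction of a tensor product smooth can be illustrated as a row-wise Kronecker product of two marginal spline bases. The sketch below is illustrative only, not the code used for the survey: the truncated-power basis, the knot positions and the simulated ages and days are all assumptions (GAM software typically uses penalised B-spline bases).

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic truncated-power spline basis: 1, x, x^2, x^3, (x - k)^3_+ per knot.
    A simple stand-in for the B-spline bases used by GAM software."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def tensor_product_basis(Ba, Bt):
    """Row-wise Kronecker product of two marginal bases: for each observation,
    the outer product of its age-basis row and time-basis row, flattened.
    This lets the effect of time vary smoothly with age."""
    n = Ba.shape[0]
    return (Ba[:, :, None] * Bt[:, None, :]).reshape(n, -1)

rng = np.random.default_rng(0)
age = rng.uniform(2, 80, size=500)   # ages 2 to 80 years, as in the survey
day = rng.uniform(0, 56, size=500)   # eight weeks of data, in days

Ba = spline_basis(age, knots=[20, 40, 60])   # 7 columns
Bt = spline_basis(day, knots=[14, 28, 42])   # 7 columns
X = tensor_product_basis(Ba, Bt)
print(X.shape)  # (500, 49)
```

Fitting would then proceed by penalised regression of the positivity indicator on `X` with a complementary log-log link; that step is omitted here.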
Method for producing incidence between 13 July 2020 and 28 November 2020
The incidence of new infections (the number of new infections in a set period of time) helps us understand the rate at which infections are growing within the population and supports our main measure of positivity (how many people test positive at any time, related to prevalence) to provide a fuller understanding of the coronavirus (COVID-19) pandemic. The incidence rate is different to the R number, which is the average number of secondary infections produced by one infected person and is produced by the Scientific Pandemic Influenza Group on Modelling (SPI-M), a sub-group of the Scientific Advisory Group for Emergencies (SAGE).
Estimates of incidence from 13 July 2020 to 28 November 2020 considered every day that each participant was in the study, from the date of their first negative test to the earlier of their latest negative test or the assumed start of infection (the later of seven days before the first positive test and the mid-point between the last negative and first positive test); these are called days at risk (for a new positive test in the study). Each new positive was considered to represent an infection starting at the mid-point between the day of the test and the previous negative swab, or at seven days before the day of the test, whichever was closer to the first positive test. This is because we do not know the exact point when the infection occurred, and infections only last a limited time. We excluded everyone whose first swab test in the study was positive, so this method considered only new positives found during the study. Each week, our incidence model used a Bayesian multilevel regression and post-stratification (MRP) model (log link) with thin plate splines to produce a smooth estimate of incidence over the preceding eight weeks of data. The model censored follow-up at the start of the reference week. The most recent official estimate of incidence was defined as the estimate on the last day included in the model. The following week, the next estimate was produced using a week of entirely new data, augmented data for the preceding week (as additional test results were received for the previous reference week), and the same data back to seven weeks before (eight weeks of data in total). Official estimates were not revised using estimates from later models.
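The rule for the assumed infection start date (the mid-point between the last negative and first positive swab, or seven days before the positive swab, whichever is closer to the positive test) can be sketched as a small Python function; the dates used are hypothetical examples, not survey data.

```python
from datetime import date, timedelta

def assumed_infection_start(last_negative, first_positive):
    """Assumed start of infection for a new positive test: the mid-point
    between the last negative and first positive swab, or seven days before
    the positive swab, whichever is later (closer to the positive test).
    The mid-point rounds down to whole days."""
    midpoint = last_negative + (first_positive - last_negative) / 2
    seven_days_before = first_positive - timedelta(days=7)
    return max(midpoint, seven_days_before)

# 20-day gap: the mid-point would be 10 days back, so the seven-day rule applies
print(assumed_infection_start(date(2020, 9, 1), date(2020, 9, 21)))  # 2020-09-14
# 6-day gap: the mid-point (3 days back) is closer to the positive test
print(assumed_infection_start(date(2020, 9, 1), date(2020, 9, 7)))   # 2020-09-04
```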
We started recruiting participants on 26 April 2020 and started repeating tests on 1 May 2020. Therefore, only data from 11 May 2020 onwards were included in the incidence model, since at least two repeated swab test visits are required for inclusion in the incidence analysis.
Why was a new method needed?
When enrolled on the survey, participants are swabbed weekly for five weeks and then move to monthly swabbing. Until mid-November 2020, the majority of visits (around 60% to 75%) were from participants being swabbed weekly, providing us with regular and timely updates on the number of new positive tests and the "time at risk". However, because we recruited a large number of people between August and September 2020, the proportion swabbed monthly increased during November 2020. As a consequence, the assumption that new positives in the study represented (almost) all new infections was no longer sustainable, because of the longer gap between visits. Further, our estimates of "days at risk" in the final four weeks of the model increasingly under-estimated time at risk in this period because of the increasing numbers on monthly visits who had not yet had their next visit. As a result, the series became inconsistent, and the method of estimation needed changing to account for the pattern of monthly tests.
The new method needed to account for the fact that the majority of survey respondents are on monthly visits and the period of time between tests is long enough to miss a significant proportion of infections.
New method and what has changed
Our new method estimates the incidence of PCR-positive cases (related to the incidence of infection) from the MRP model of positivity, using further detail from our sample. Because we test participants from a random sample of households every day, our estimate of positivity is unbiased providing we correct for potential non-representativeness due to non-participation by post-stratifying for age, sex and region.
We use information from people who ever test positive in our survey (from 1 September 2020) to estimate how long people test positive for. We apply information from this group to the whole of the sample and produce an estimate for incidence for the whole of the household population. We estimate the time between the first positive test and the last time a participant would have tested positive (the "clearance" time) using a statistical model. We do this accounting for different times between visits.
With these clearance time estimates we can then model backwards, deducing when new positive cases must have occurred in order to generate the observed positivity estimates. This method uses a deconvolution approach developed by Joshua Blake, Paul Birrell and Daniela De Angelis at the MRC Biostatistics Unit and Thomas House at the University of Manchester. Posterior samples from the MRP model over the last 100 days are used in this method.
Clearance time is the length of time that an individual remains positive. We use an estimate of the distribution of clearance times derived from the COVID-19 Infection Survey, which varies by the date a participant first tests positive. The distribution is estimated by modelling the time from an individual's first positive test in the survey. Only first positive tests from 1 September 2020 onwards are included in this model, given the very low rates of positivity observed over the summer of 2020 and the small numbers before this time.
Clearance time considers the sequence of positive and negative test results of an individual:
the clearance time for individuals testing negative, following a positive test, is modelled as occurring at some point between their last positive and first negative test
intermittent negatives (consisting of three or fewer consecutive negative tests) between positive tests within 90 days of their previous positive test are ignored as this is considered a single episode of infection in that period, following World Health Organization guidance
new positives that occur more than 90 days after an individual's previous positive test are treated as a new episode of infection providing the participant has one or more immediately preceding negative tests, as are new positives that occur after four consecutive negatives
participants who are last seen positive are censored at their last positive test
The estimated distribution of clearance times is modelled using flexible parametric interval censored survival models, choosing the amount of flexibility in the model based on the Bayesian Information Criterion. We allow the distribution of clearance times to change according to the date a participant first tests positive.
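To give a flavour of interval censored survival estimation, the sketch below fits a simple exponential clearance model by maximum likelihood to hypothetical test sequences. The survey's actual models are flexible parametric survival models with more parameters; the episode data, the exponential form and the optimisation bounds here are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical episodes (days): clearance is known only to lie between the
# last positive test (L) and the first subsequent negative test (U);
# U = inf marks a participant last seen positive (right-censored).
L = np.array([10.0, 14.0, 7.0, 21.0, 12.0, 9.0, 30.0])
U = np.array([17.0, 21.0, 14.0, 28.0, np.inf, 16.0, np.inf])

def neg_log_lik(rate):
    # Exponential survival S(t) = exp(-rate * t). An interval-censored
    # episode contributes S(L) - S(U); since exp(-inf) = 0, right-censored
    # episodes contribute S(L) automatically.
    return -np.sum(np.log(np.exp(-rate * L) - np.exp(-rate * U)))

fit = minimize_scalar(neg_log_lik, bounds=(1e-4, 1.0), method="bounded")
mean_clearance = 1.0 / fit.x
print(round(mean_clearance, 1))  # estimated mean clearance time, in days
```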
There is a bias in estimating the clearance distribution because the analysis used to estimate how long a person stays positive only starts from their first positive test. Since (most) people will have become positive on an earlier day, this will bias the clearance curves downwards (making the estimates too short). However, there is another bias due to missing short positive episodes entirely -- meaning that our dataset has fewer short positive episodes than the population as a whole, and that the sample used to run the survival analysis is biased towards people with longer positive episodes. This will bias the clearance curves upwards (making the estimates too long). To assess this, we included as explanatory variables whether or not a participant's first positive in the survey was their first test in the study and, if not, how many days earlier their last negative test was. There was no evidence that either of these variables was associated with clearance time, and we have therefore used the overall estimate.
The estimate of the incidence of PCR-positive cases (relating to the incidence of infection) is produced by combining a posterior sample from the Bayesian MRP positivity model with the estimated distribution of the clearance times, allowing for the fact that some people will remain positive for shorter or longer times than others. Once the distribution of clearance is known, a deterministic transformation (known as deconvolution) of the posterior of the positivity is computed. The resulting sample gives the posterior distribution of the incidence of PCR-positive cases.
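The relationship between incidence, clearance and positivity can be sketched numerically. In this simplified, noise-free illustration (the real method works with posterior samples from the Bayesian MRP model and an estimated clearance distribution), positivity is the convolution of daily incidence with the probability of still testing positive, and deconvolution amounts to solving the resulting triangular linear system. The exponential clearance curve and the incidence series are assumptions for illustration only.

```python
import numpy as np

days = 60

# Assumed clearance survival function S(d): probability of still testing
# positive d days after becoming PCR-positive (exponential, mean 14 days).
S = np.exp(-np.arange(days) / 14.0)

# Assumed daily incidence of new PCR-positive cases (share of population).
true_incidence = 0.001 * np.exp(np.linspace(0.0, 1.5, days))

# Forward model (convolution): positivity on day t sums earlier incidence
# that is still testing positive.
A = np.zeros((days, days))
for t in range(days):
    for s in range(t + 1):
        A[t, s] = S[t - s]
positivity = A @ true_incidence

# Deconvolution: A is lower triangular with ones on the diagonal, so solving
# the linear system recovers the incidence exactly in this noise-free sketch.
recovered = np.linalg.solve(A, positivity)
print(bool(np.allclose(recovered, true_incidence)))  # True
```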
Incidence estimates based on the MRP positivity model are calculated for the entire period of data in the MRP positivity model but presented excluding the first two weeks. This is to avoid boundary effects (at the start of the positivity model, infections will have happened at various points in the past).
The official estimate of incidence is the estimate from this model at the reference date. The reference date used for our official estimates of incidence is 10 days prior to the end of the positivity reference week. This is necessary as estimates later than this date are more subject to change as we receive additional data.
This new method of estimating incidence enables us to estimate incidence for Wales, Scotland and Northern Ireland, as well as for England, as we can assume the same clearance distribution across all countries.
Figure 1 compares the previously published official estimates of incidence for England between 4 September 2020 and 28 November 2020 (points with credible intervals in chart) to an indicative incidence estimate based on the new method (solid line in chart).
Figure 1: Comparison of official estimates of incidence based on the old model compared with estimates based on the new model
Estimated numbers of new PCR-positive COVID-19 cases in England, based on nose and throat swabs with modelled estimates from 4 September 2020 to 28 November 2020
Weighted antibodies estimate by country
From 23 October 2020 we have presented weighted monthly estimates of the number of people testing positive for antibodies to SARS-CoV-2, and from 3 February 2021 we started presenting these for rolling 28-day periods in our fortnightly antibody article. The rolling 28-day estimates of the number of people with detectable antibodies were based on weighted data to ensure the estimates are representative of the target population in England, Wales, Northern Ireland and Scotland.
The study is based on a nationally representative survey sample; however, some individuals in the original Office for National Statistics (ONS) survey samples will have dropped out and others will not have responded to the study. For England and Wales, to address this and reduce potential bias, we apply weighting to ensure the responding sample is representative of the population in terms of age (grouped), sex, region, and ethnicity. For Northern Ireland and Scotland, we adjust for age (grouped), sex and region. This is because ethnicity is already well represented in the survey for these devolved administrations.
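The principle of cell weighting can be sketched as follows: each respondent receives a weight equal to the number of people in the population that they represent in their weighting cell. The population and sample counts below are invented for illustration and use only one dimension (age group); the survey's weighting crosses age, sex and region (and ethnicity for England and Wales).

```python
# Invented population and responding-sample counts for a single weighting
# dimension (age group); real survey weighting crosses several dimensions.
population = {"2-11": 7_000_000, "12-24": 9_000_000, "25-49": 20_000_000,
              "50-69": 15_000_000, "70+": 8_000_000}
sample = {"2-11": 400, "12-24": 700, "25-49": 2_500,
          "50-69": 2_900, "70+": 1_500}

# Each respondent in a cell gets the same weight: the number of people in
# the population that they represent.
weights = {g: population[g] / sample[g] for g in population}

# Weighted totals reproduce the population distribution by construction.
weighted_total = sum(weights[g] * sample[g] for g in sample)
print(int(weighted_total))  # 59000000
```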
Why was a new method needed?
The current estimate represents a single point over a 28-day period. With the speed of the vaccination roll-out, antibody positivity is increasing within each 28-day period. This change in underlying antibody positivity means we would continuously underestimate antibody positivity for each country, region and age group.
New method and what has changed
From 30 March 2021 we will present estimates of antibodies and, for the first time, estimates of vaccine uptake using a new model, which will allow us to investigate antibody positivity and vaccine uptake in the population. Antibody positivity is measured by antibodies to the spike (S) protein. Vaccine uptake is tracked over all visits over time. For vaccine uptake we merge our self-reported vaccination data from the sample with data from the National Immunisation Management Service (NIMS). NIMS is the System of Record for the NHS COVID-19 vaccination programme in England.
Modelled antibody estimates use a spatial-temporal integrated nested Laplace approximation (INLA) model with post-stratification. Post-stratification is a method to ensure the sample is representative of the population that can be used with modelled estimates to achieve the same objective as weighting. This model is similar to the dynamic Bayesian multi-level regression and post-stratification (MRP) model used for national and regional swab positivity analysis and is the same model approach as used to produce the subregional estimates for swab positivity. Spatial-temporal in this context means the model borrows strength geographically and over time. For both antibody and vaccine estimates, we run two separate models: one for Great Britain and the other for Northern Ireland. All models are run on surveillance weeks (a standardised method of counting weeks from the first Monday of each calendar year to allow for the comparison of data year after year and across different data sources for epidemiological data).
The antibodies model for Great Britain is currently run at a regional level and includes ethnicity, vaccine priority age groups, and sex. The antibody model for Northern Ireland is a temporal model (no spatial component) due to lower sample size, and accounts for sex and age in wider groups (16 to 24, 25 to 34, 35 to 49, 50 to 69, 70 years and over).
The vaccines model for Great Britain is currently run at a subregional level and includes ethnicity, vaccine priority age groups, and sex. The vaccine model for Northern Ireland is also run at a subregional level due to a higher number of participants with information about vaccine uptake. The model accounts for sex and age in wider groups (16 to 24, 25 to 34, 35 to 49, 50 to 69, 70 years and over).
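Surveillance weeks, as defined above (counting weeks from the first Monday of the calendar year), can be sketched in Python. Note that this follows the definition given in this article only, not, for example, ISO week numbering, and the week-0 handling for dates before the first Monday is an assumption of the sketch.

```python
from datetime import date, timedelta

def surveillance_week(d):
    """Week number counted from the first Monday of the calendar year, with
    week 1 starting on that Monday; dates earlier in the year return 0."""
    first_monday = date(d.year, 1, 1)
    while first_monday.weekday() != 0:  # 0 = Monday
        first_monday += timedelta(days=1)
    if d < first_monday:
        return 0
    return (d - first_monday).days // 7 + 1

print(surveillance_week(date(2021, 1, 4)))   # 1 (first Monday of 2021)
print(surveillance_week(date(2021, 3, 30)))  # 13
```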
The 14-day estimates of the number of people who have the coronavirus (COVID-19) are based on weighted data to ensure the estimates are representative of the target population in England, Wales, Northern Ireland and Scotland. The study is based on a nationally representative survey sample; however, some individuals in the original Office for National Statistics (ONS) survey samples will have dropped out and others will not have responded to the study.
To address this and reduce potential bias, we apply weighting to ensure the responding sample is representative of the population in terms of age (grouped), sex and region. This is different from the modelled estimates, which use a different method to adjust for potential non-representativeness of the sample through multi-level regression post-stratification (described in Section 9: Modelling).
Every fortnight, we present our estimates of the number of people testing positive for antibodies across the UK in our antibody characteristics publication. The rolling 28-day estimates of the number of people with detectable antibodies are based on weighted data to ensure the estimates are representative of the target population in England, Wales, Northern Ireland and Scotland.
The study is based on a nationally representative survey sample; however, some individuals in the original Office for National Statistics (ONS) survey samples will have dropped out and others will not have responded to the study. For England and Wales, to address this and reduce potential bias, we apply weighting to ensure the responding sample is representative of the population in terms of age (grouped), sex, region, and ethnicity. For Northern Ireland and Scotland, we adjust for age (grouped), sex and region. This is because ethnicity is already well represented in the survey for these devolved administrations.
Confidence intervals for estimates
The statistics are based on a sample, and so there is uncertainty around the estimate. Confidence intervals are calculated so that if we were to repeat the survey many times on the same occasion and in the same conditions, in 95% of these surveys the true population value would be contained within the 95% confidence intervals. Narrower intervals suggest greater certainty in the estimate, whereas wider intervals suggest greater uncertainty.
Confidence intervals for weighted estimates are calculated using the Korn-Graubard method to take into account the expected small number of positive cases and the complex survey design. For unweighted estimates, we use the Clopper-Pearson method as the Korn-Graubard method is not appropriate for unweighted analysis.
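The Clopper-Pearson interval can be computed from beta distribution quantiles. This sketch uses SciPy; the counts are hypothetical (5 positives among 1,000 unweighted tests), and the Korn-Graubard interval for weighted estimates additionally requires the survey design, so it is not shown.

```python
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion,
    via beta distribution quantiles; appropriate when positives are few."""
    alpha = 1 - conf
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# Hypothetical: 5 positive swabs out of 1,000 tests
lo, hi = clopper_pearson(5, 1000)
print(round(lo, 4), round(hi, 4))
```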
Simple explanations of confidence and credible intervals have been provided in previous sections; nevertheless, the difference between these two types of interval deserves some explanation. Whether we use credible or confidence intervals depends upon the type of analysis that is conducted.
Earlier in the article, we mentioned that the positivity model is a dynamic Bayesian multi-level regression post-stratification model. This type of analysis produces credible intervals to show uncertainty in parameter estimates, because it directly estimates probabilities. For the 14-day positivity estimates, confidence intervals are provided instead, because this is a different type of analysis using what are called frequentist methods. The use of confidence and credible intervals is a direct consequence of the type of statistics used to make sense of the data: frequentist statistics or Bayesian statistics respectively.
The difference between credible intervals and confidence intervals lies in their statistical underpinnings: credible intervals are associated with Bayesian statistics, whereas confidence intervals are associated with frequentist (classical) statistics. Both intervals relate to the uncertainty of the parameter estimate; however, they differ in their interpretations.
With confidence intervals, the probability that the population value lies between the upper and lower limits of the interval is based upon hypothetical repeats of the study. For instance, in 95 out of 100 studies, we would expect the true population value to lie within the 95% confidence interval, while the intervals in the remaining five studies would miss it. Here we assume the population value is fixed and any variation is due to differences between the samples in each study. Credible intervals, by contrast, aim to estimate the population parameter from the data we have directly observed, instead of an infinite number of hypothetical samples. Credible intervals give the most likely values of the parameter of interest, given the evidence provided by our data. Here we assume the parameter estimate can vary based upon the knowledge and information we have at that moment. Essentially, given the data we have observed, there is a 95% probability that the population parameter falls within the interval. The difference between the two concepts is therefore subtle: the confidence interval assumes the population parameter is fixed and the interval is uncertain, whereas the credible interval assumes the population parameter is uncertain and the interval is fixed.
Where we have done analysis of the characteristics of people who have ever tested positive for the coronavirus (COVID-19), we have used pairwise statistical testing to determine whether there was a significant difference in infection rates between pairs of groups for each characteristic.
The test produces p-values, which give the probability of observing a difference at least as extreme as the one estimated from the sample if there truly were no difference between the groups in the population. We used the conventional threshold of 0.05 to indicate evidence of genuine differences not compatible with chance, although a p-value just below 0.05 provides only marginal evidence. P-values of less than 0.001 and less than 0.01 are considered to provide relatively strong and moderate evidence, respectively, of a genuine difference between the groups being compared.
Any estimate based on a random sample rather than an entire population contains some uncertainty. Given this, it is inevitable that sample-based estimates will occasionally suggest some evidence of difference when there is in fact no systematic difference between the corresponding values in the population as a whole. Such findings are known as "false discoveries". If we were able to repeatedly draw different samples from the population, then, for a single comparison, we would expect 5% of findings with a p-value below a threshold of 0.05 to be false discoveries. However, if multiple comparisons are conducted, as is the case in the analysis conducted within the Infection Survey, then the probability of making at least one false discovery will be greater than 5%.
Multiplicity can occur at different levels. For example, in the Infection Survey we have:
two primary outcomes of interest -- positivity for current infection based on a swab test and positivity for previous infection based on a blood test
several different exposures of interest (for example, age and sex)
several exposures with multiple different categories (for example, working location)
repeated analysis over calendar time
Consequently, the p-values used in our analysis have not been adjusted to control either the familywise error rate (FWER, the probability of making at least one false discovery) or the false discovery rate (FDR, the expected proportion of discoveries that are false) at a particular level. Instead, we focus on presenting the data and interpreting results in the light of the strength of evidence that supports them.
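The growth of the familywise error rate with the number of comparisons can be illustrated directly: for m independent comparisons each tested at the 0.05 level, the probability of at least one false discovery is 1 - 0.95^m.

```python
# Probability of at least one false discovery across m independent
# comparisons, each tested at the 0.05 level.
for m in (1, 5, 10, 20):
    print(m, round(1 - 0.95 ** m, 3))  # rises from 0.05 to 0.642
```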
We have presented modelled estimates for the most recent week of data at the sub-national level since 20 November 2020 for England, and since 19 February 2021 for Wales, Northern Ireland and Scotland. To balance granularity with statistical power, we have grouped local authorities together into COVID-19 Infection Survey (CIS) sub-regions. The geographies are a rule-based composition of local authorities; local authorities with a population over 200,000 have been retained as single sub-regions where possible. For our Northern Ireland sub-regional estimates, the CIS sub-regions are NHS Health Trusts rather than groups of local authorities. The boundaries for these COVID-19 Infection Survey sub-regions can be found on the Open Geography Portal.
The statistics produced by analysis of this survey contribute to modelling, which predicts the reproduction number (R) of the virus.
R is the average number of secondary infections produced by one infected person. The Scientific Pandemic Influenza Group on Modelling (SPI-M), a sub-group of the Scientific Advisory Group for Emergencies (SAGE), has built a consensus on the value of R based on expert scientific advice from multiple academic groups.
The estimates presented in this bulletin contain uncertainty. There are many sources of uncertainty, but the main sources in the information presented include each of the following.
Uncertainty in the test (false-positives, false-negatives and timing of the infection)
These results are directly from the test, and no test is perfect. There will be false-positives and false-negatives from the test, and false-negatives could also come from the fact that participants in this study are self-swabbing. More information about the potential impact of false-positives and false-negatives is provided in "Sensitivity and Specificity analysis".
The data are based on a sample of people, so there is some uncertainty in the estimates
Any estimate based on a random sample contains some uncertainty. If we were to repeat the whole process many times, we would expect the true value to lie in the 95% confidence interval on 95% of occasions. A wider interval indicates more uncertainty in the estimate.
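The repeated-sampling interpretation above can be demonstrated with a small simulation, using an invented true proportion and a simple normal-approximation (Wald) interval rather than the survey's actual methods; across many simulated surveys, roughly 95% of the intervals cover the true value.

```python
import numpy as np

rng = np.random.default_rng(42)
true_p, n, n_studies = 0.3, 500, 2000  # invented values for illustration

covered = 0
for _ in range(n_studies):
    # Draw one hypothetical survey sample and build a 95% (Wald) interval.
    k = rng.binomial(n, true_p)
    p_hat = k / n
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - half <= true_p <= p_hat + half:
        covered += 1

print(covered / n_studies)  # close to 0.95
```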
Quality of data collected in the questionnaire
As in any survey, some data can be incorrect or missing. For example, participants and interviewers sometimes misinterpret questions or skip them by accident. To minimise the impact of this, we clean the data, editing or removing things that are clearly incorrect. In these initial data, we identified some specific quality issues with the healthcare and social care worker question responses and have therefore applied some data editing (cleaning) to improve the quality. Cleaning will continue to take place to further improve the quality of the data on healthcare and social care workers, which may lead to small revisions in future releases.
Contact details for this Methodology
Telephone: +44 (0)1633 65 1689