1. Introduction

This technical report accompanies Understanding well-being inequalities: Who has the poorest personal well-being?, an exploration of factors associated with the lowest reports of personal well-being. Using three years of data from the Annual Population Survey (APS) (January 2014 to December 2016), the characteristics and circumstances of people with the poorest personal well-being were compared with those of people who reported higher personal well-being. More in-depth analyses then explore the nature of the associations between these factors and personal well-being.

The research took an iterative approach, involving descriptive analysis followed by logistic regression and latent class analysis (LCA). The logistic regression isolates single factors that are associated with the odds of reporting the lowest personal well-being levels. The LCA identifies combinations of factors that frequently occur together among those with the poorest personal well-being. The two techniques are complementary: while single factors affect personal well-being, in practice influential factors tend to occur in combination. This methodology paper describes how these techniques were applied.

2. The Annual Population Survey, 2014 to 2016

The three-year Annual Population Survey (APS) dataset has a sample size of 543,298 respondents, of whom 284,456 were aged 16 years and over and eligible to be asked the personal well-being questions. Of these, 280,003 (over 98%) answered all four personal well-being questions and were included in the analysis.

Neither logistic regression nor latent class analysis (LCA) can be applied to cases with missing data. The more variables a model includes, the greater the likelihood that a case contains missing data and so is excluded from analysis. We found that the LCA produced better results when applied to fewer variables than were used in the logistic regression. As a result, 192,567 cases were included in the logistic regression model and 227,139 in the LCA.
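The effect of the number of variables on the number of usable cases can be illustrated with a minimal Python sketch of complete-case (listwise) deletion. The records, variable names and the helper complete_cases are invented for illustration; they are not APS data or the code used in the analysis.

```python
# Hypothetical records; None marks a missing answer. Not APS data.
records = [
    {"age": "16-24", "health": "good", "education": "degree"},
    {"age": "25-34", "health": None, "education": "degree"},
    {"age": "35-44", "health": "bad", "education": None},
    {"age": "45-54", "health": "fair", "education": None},
]


def complete_cases(rows, variables):
    """Keep only rows with a non-missing value for every listed variable."""
    return [row for row in rows if all(row.get(v) is not None for v in variables)]


# A model that uses fewer variables retains more cases.
few = complete_cases(records, ["age", "health"])
many = complete_cases(records, ["age", "health", "education"])
print(len(few), len(many))
```

In this toy example, adding the education variable removes two further cases, which mirrors why the logistic regression (more variables) retained fewer cases than the LCA.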

There are four personal well-being questions:

  1. Overall, how satisfied are you with your life nowadays?

  2. Overall, to what extent do you feel the things you do in your life are worthwhile?

  3. Overall, how happy did you feel yesterday?

  4. Overall, how anxious did you feel yesterday?

The responses to all four personal well-being questions are measured on a 0 to 10 scale, where 0 is “not at all” and 10 is “completely”. For the three positively framed questions (questions 1 to 3 above), a score of 4 or less is deemed to be “poor”, and for the anxiety question (question 4 above), a score of 6 or more is defined as “poor” (as it indicates higher anxiety). In this research, individuals defined as having poorest well-being are those who reported life satisfaction, worthwhile and happiness scores of 4 or less, in addition to an anxiety score of 6 or more.

Of the 280,003 respondents who answered all four personal well-being questions, 3,135 reported poorest personal well-being – approximately 1% of the sample. Similarly, with survey weighting taken into account, this represents about 1% of the UK population. A binary variable was derived to flag respondents with or without poorest personal well-being, allowing for the characteristics of those with poorest personal well-being to be compared with those who reported higher personal well-being.
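The derivation of the binary flag can be sketched as follows. This is an illustrative Python function based only on the thresholds described above; the function and argument names are assumptions and do not come from the APS dataset.

```python
def has_poorest_wellbeing(life_satisfaction, worthwhile, happiness, anxiety):
    """Return True only if all four scores fall in the "poor" range.

    Scores are on the 0 to 10 scale. The three positively framed
    questions are "poor" at 4 or less; anxiety is "poor" at 6 or more.
    """
    for score in (life_satisfaction, worthwhile, happiness, anxiety):
        if not 0 <= score <= 10:
            raise ValueError("scores must be between 0 and 10")
    return (
        life_satisfaction <= 4
        and worthwhile <= 4
        and happiness <= 4
        and anxiety >= 6
    )


# Poor on all four measures, so flagged as poorest personal well-being.
print(has_poorest_wellbeing(3, 4, 2, 7))
# Anxiety of 5 is below the threshold of 6, so not flagged.
print(has_poorest_wellbeing(3, 4, 2, 5))
```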

Missing data and bias

As noted, cases with missing data for variables included in the logistic regression and LCA models were excluded from analysis. Missing data can produce biased estimates and invalid conclusions, particularly if data are not “missing at random” or, in other words, if there is some (unknown) patterning to that “missingness” (Graham, 2009).

People with certain characteristics, for example, may be less likely to answer particular questions. The three variables with the largest proportion of missing data were: education (17.0%), sexual orientation (9.9%) and disability status (7.0%).

3. Logistic regression

Logistic regression analysis allows for the relationship between an explanatory variable and the outcome variable to be examined, while at the same time accounting for other explanatory variables that influence the outcome. It is used when looking at categorical outcomes. While it is possible to conduct multinomial logistic regression with multiple categorical outcomes, logistic regression with binary outcomes was chosen to increase ease of understanding (with the predicted outcomes either “poorest personal well-being” or “higher personal well-being”) and for consistency with the latent class analysis (LCA) which can only be applied to categorical data.


This analysis was carried out using R. The package used for the logistic regression was mlogit. After removing those cases where there were missing data in the predictor variables, 192,567 cases were included. Variables were then added one-by-one to build the logistic regression model.

Goodness of fit

Goodness of fit describes how well a model fits the data from which it is generated. After the addition of each variable to the model, goodness of fit and change in the coefficients were assessed. The variables tested included sex, age, marital status, self-reported health, self-reported disability, socio-economic activity, education, housing tenure, ethnicity, sexual identity and religion.


Regression analysis can identify relationships between factors; however, it cannot tell us about causality. For some factors, the direction of causality is fairly clear from prior knowledge: for example, poorest personal well-being does not cause someone to become widowed, but becoming widowed can cause poorest personal well-being. For others, the relationship between cause and effect is more blurred: for example, having very bad or bad health can cause poorest personal well-being, but poorest personal well-being can also negatively affect health. Therefore, where prior knowledge does not make the direction of causality clear, it is important to note that causality can operate in either direction (or both).


Weights were included in the logistic regression models to compensate for unequal selection probabilities and differential non-response. For more information about how the Annual Population Survey (APS) datasets are weighted to reflect the size and composition of the general population, please see Personal well-being in the UK Quality and Methodology Information.
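How weights enter a logistic regression can be illustrated with a minimal Python sketch, in which each case's contribution to the log-likelihood is multiplied by its weight. This is not the R code used for the analysis: the function fit_weighted_logit, the data and the weights are all invented for illustration, the model has a single binary predictor, and only point estimates are produced (no standard errors).

```python
import math


def fit_weighted_logit(x, y, w, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on the weighted log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            residual = wi * (yi - p)       # weighted gradient contribution
            variance = wi * p * (1.0 - p)  # weighted information contribution
            g0 += residual
            g1 += residual * xi
            h00 += variance
            h01 += variance * xi
            h11 += variance * xi * xi
        # Solve the 2x2 system (information matrix) * step = gradient.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented data: x flags a hypothetical characteristic, y flags the outcome.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
w = [1.0, 2.0, 1.5, 1.0, 0.5, 1.0, 2.0, 1.0]

b0, b1 = fit_weighted_logit(x, y, w)
print(math.exp(b1))  # odds ratio for x = 1 relative to x = 0
```

With a single binary predictor, the fitted coefficient equals the log of the odds ratio computed from the weighted counts, which gives a simple check on the sketch.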

Interpretation of the results

Odds are the probability of an event occurring divided by the probability of the event not occurring. The odds ratio, which is the ratio between two sets of odds, is the usual output from logistic regression. The odds ratio for each variable in the model is obtained by exponentiating its coefficient estimate. For this analysis, the odds ratio represents the odds of reporting poorest personal well-being for a given category of a predictor variable relative to the reference category, while holding all other variables constant. This reveals how personal characteristics and circumstances relate to the odds of reporting poorest personal well-being.
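These definitions can be expressed as a short Python sketch. The function names and the probabilities in the example are hypothetical and are not results from the analysis.

```python
import math


def odds(probability):
    """Probability of an event divided by the probability of no event."""
    return probability / (1.0 - probability)


def odds_ratio(p_group, p_reference):
    """Ratio of the odds in a group to the odds in the reference category."""
    return odds(p_group) / odds(p_reference)


def odds_ratio_from_estimate(estimate):
    """Exponentiate a logistic regression coefficient to get an odds ratio."""
    return math.exp(estimate)


# Hypothetical example: 2% of a group and 1% of the reference category
# report the outcome.
ratio = odds_ratio(0.02, 0.01)
print(ratio)
# Taking the log and exponentiating recovers the same odds ratio.
print(odds_ratio_from_estimate(math.log(ratio)))
```

An odds ratio above 1 indicates higher odds than the reference category, and an odds ratio below 1 indicates lower odds.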

4. Latent class analysis

Latent class analysis (LCA) is a technique used to identify sub-groups within a population. It classifies individuals into mutually exclusive groups or “types” based on patterns of characteristics represented as categorical variables. LCA was used to group individuals with similar characteristics including:

  • age

  • self-reported health

  • self-reported disability (as defined by the Equality Act 2010)

  • housing tenure

  • economic activity

  • socio-economic activity

Table 2 presents the social characteristics of each class. In Class 8, for example, 82.4% were found to self-report a disability whereas 5.8% in Class 7 were found to self-report a disability.

Logistic regression was used to calculate the odds of reporting poorest personal well-being for members of each group (Table 3).

The model shows that individuals are at significantly different risk of reporting poorest personal well-being, depending on which latent class they belong to. As 1% of the UK population have poorest personal well-being, an individual selected at random has a 1 in 100 chance of reporting poorest personal well-being before any characteristics are taken into account. With individual characteristics taken into account, those at greatest risk of having the poorest personal well-being are in Class 4 (1 in 41 chance), Class 5 (1 in 32 chance) and Class 8 (1 in 71 chance). In the article, Understanding well-being inequalities: Who has the poorest personal well-being?, only these classes are reported, as they represent the main focus of the analysis: Class 4 are “Employed renters with self-reported health problems or disability”, Class 5 are “Unemployed or inactive renters with self-reported health problems or disability” and Class 8 are “Retired homeowners with self-reported health problems or disability”.
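The “1 in N” chances above can be converted to probabilities and compared with the 1 in 100 baseline. The Python sketch below uses only the rounded figures quoted in the text, so the results are approximate; the function and variable names are illustrative.

```python
def chance_to_probability(one_in_n):
    """Convert a "1 in N" chance to a probability."""
    return 1.0 / one_in_n


baseline = chance_to_probability(100)  # whole population: 1 in 100

# "1 in N" chances quoted in the text for the three highest-risk classes.
chances = {"Class 4": 41, "Class 5": 32, "Class 8": 71}

relative_to_baseline = {}
for name, n in chances.items():
    p = chance_to_probability(n)
    relative_to_baseline[name] = p / baseline
    print(f"{name}: probability {p:.3f}, {p / baseline:.2f} times the baseline")
```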

Optimal number of classes

LCA reduces complexity by splitting a dataset up into meaningful sub-groups based on the specified characteristics. The process involves running the algorithm on the same data with different numbers of classes specified. The analyst first specifies one group, then two groups, then three and so on. To check that the results were consistent, this process was first applied to random sub-samples of the dataset with 60,000, 80,000 and 100,000 cases. The final specification was then applied to the full sample.

With each run, a goodness of fit statistic, the Bayesian Information Criterion (BIC), is produced. All other things being equal, a lower BIC value suggests a better fitting model (Lin and Dayton, 1997).
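As an illustration of how the BIC penalises additional classes, the Python sketch below uses the standard formula BIC = -2 ln(L) + k ln(n), where L is the maximised likelihood, k the number of free parameters and n the number of cases. The parameter count assumes an unrestricted latent class model. The category counts and log-likelihood values are invented for illustration and are not results from this analysis; only the number of cases (227,139) is taken from the text.

```python
import math


def lca_parameter_count(n_classes, categories_per_variable):
    """Free parameters in an unrestricted latent class model."""
    class_sizes = n_classes - 1
    response_probs = n_classes * sum(k - 1 for k in categories_per_variable)
    return class_sizes + response_probs


def bic(log_likelihood, n_parameters, n_cases):
    """Bayesian Information Criterion; lower values suggest better fit."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_cases)


n_cases = 227139                 # cases included in the LCA
categories = [5, 5, 2, 4, 4, 6]  # hypothetical number of categories per variable
# Hypothetical maximised log-likelihoods for one, two and three classes.
log_likelihoods = {1: -1950000.0, 2: -1900000.0, 3: -1880000.0}

bic_by_classes = {
    c: bic(ll, lca_parameter_count(c, categories), n_cases)
    for c, ll in log_likelihoods.items()
}
best = min(bic_by_classes, key=bic_by_classes.get)
print(bic_by_classes, best)
```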

Although a lower BIC value may suggest a better fitting model, separation into greater numbers of classes has disadvantages. Doing so can increase complexity, making interpretation and communication of findings more difficult, while splitting the dataset into more groups can mean fewer respondents fall into each class, thereby potentially reducing the statistical power of the model.

The BIC value fell continuously as the number of classes specified increased. However, beyond eight classes the composition of characteristics associated with poorest personal well-being changed little. As such, the eight-class model was selected as the most useful for this release.

Contact details for this Methodology

Edward Pyle
Telephone: +44 (0)1633 582486