1. Overview
Classification error in linked administrative datasets can be a difficult problem to overcome. For example, when you have three sources, and one says person A is unemployed and the other two say person A is employed, which is correct? Rule-based methods can be used to select an answer, but their estimates generally do not account for uncertainty and can be biased. An alternative class of approaches includes automated methods, which have the potential to take account of misclassification in predictions.
One such method is multiple imputation latent class modelling (MILC), as outlined by Boeschoten, Oberski and de Waal in their 2017 article Estimating Classification Errors Under Edit Restrictions in Composite Survey-Register Data Using MILC. MILC combines bootstrap sampling, latent class modelling, and multiple imputation. It aims to produce estimates of interest by classification (for example, the proportion of the population in each ethnic group) that incorporate multiple kinds of uncertainty, including error in the original data sources, and parameter uncertainty associated with the latent class models. It also provides estimates of the amount of classification error in a combined dataset with categorical variables.
This method is new to the Office for National Statistics (ONS) and relatively new outside it. There has been limited application of MILC to large administrative datasets, but the existing research highlights the potential utility of the method for improving administrative-based estimates for categorical variables that may appear on multiple sources, such as employment and ethnicity.
In this methodology, we test the application of MILC to administrative data in the ONS for the first time, to assess how well it works and whether there is value in pursuing further research. The results of this research will help us to refine the methodology and code for future application. It will help to highlight potential pitfalls and snagging points, shed light on the overall feasibility of this method within a large, administrative data context, and pave the way for future research and development.
If you have any questions, comments or would like to collaborate with the Methodological Research Hub, please contact Methods.Research@ons.gov.uk.
2. Background
When combining multiple sources into a linked dataset, classification error can be exposed. Classification error is where a value of a variable in one dataset differs erroneously from (what should be) the same value in another dataset. For example, one source might say that person A is unemployed while another says that person A is employed. It is hard to determine the true classification when there is classification error, and there is a growing demand for more automated processes that quantify and resolve this type of error in an administrative data context.
Model-based methods are gathering increasing interest across academia and within National Statistical Institutes for their potential to estimate the "correct" classifications across variables and data sources. One such pioneering model-based method for dealing with classification error is multiple imputation latent class modelling (MILC), as outlined by Boeschoten, Oberski and de Waal in their 2017 article Estimating Classification Errors Under Edit Restrictions in Composite Survey-Register Data Using MILC. A primary benefit of this method is that its latent class element removes the need for "gold standard" data containing a true error-free variable. Latent class modelling (LCM) uses multiple sources to estimate a "latent" variable. This "latent" variable is assumed to represent the true error-free variable (although be aware that your results will still depend on the specified model and its assumptions, so could still include bias). The central idea of MILC is that there is no perfect data source or method: there will always be some level of error in every dataset and associated with every method and model. MILC factors this uncertainty into final estimates.
The outcomes of MILC include:
final estimates of the “true” proportion or count in each class of the categorical variable of interest, accounting for uncertainty caused by error in the individual data sources and conflicting values across the sources (that is, classification error), and uncertainty associated with the latent class models
variance and confidence intervals around those final estimates
estimates of the amount of classification error in a combined dataset, per source, per category
an imputed dataset for further analysis
Current methods used with administrative data in the Office for National Statistics (ONS) for ethnic group do not estimate the amount of classification error in the sources, nor do they produce uncertainty measures around ethnic group classification estimates. Deterministic methods (such as rule-based methods) also require strong assumptions around what is correct so are at high risk of bias. MILC is an alternative method that could potentially fill this gap. It has been recommended, used, and developed by academics at Utrecht University, Tilburg University, and Statistics Netherlands. It has also been applied to Dutch register-based datasets. For example, it has been used to successfully estimate the number of serious road injuries from police and hospital register data in the Netherlands, as described by Boeschoten, de Waal, and Vermunt in their 2019 article Estimating the number of serious road injuries per vehicle type in the Netherlands by using multiple imputation of latent classes. It is currently being used by Statistics Netherlands for employment statistics, and it is also being explored by the Italian National Institute of Statistics.
The Netherlands' register-based datasets typically have greater coverage, less linkage error, and less problematic missingness than UK administrative datasets (when using them for statistical purposes). For this reason, it is currently unclear how feasible it is to apply this method to UK-based administrative data.
3. Current aims and application
The research presented within this paper is a feasibility project to see whether we could apply the multiple imputation latent class modelling (MILC) method to administrative data at the Office for National Statistics (ONS).
As this was a trial, the aim was to estimate the population in one local authority in England by ethnic group classification using administrative data. Additionally, the software we had available at the time of the trial was poLCA, which is an R package as described in Linzer and Lewis' 2011 article poLCA: An R Package for Polytomous Variable Latent Class Analysis. This software restricted the size of the data we could apply the method to within our secure administrative data storage platform, reinforcing our decision to apply the method to one local authority.
We also wanted to estimate the classification error of the individual sources. As this was the first time that we had applied MILC in the ONS, we adopted the most parsimonious application of the model: one variable of interest (ethnicity) with binary classifications using three administrative datasets.
In keeping with other elementary applications of models using latent classes, the classification was binary; in our application, ethnic group classifications consisted of White and Non-White. We hope to explore more ethnic group classifications in future iterations as we develop this research further.
Our aims were to explore:
what the outcomes of applying MILC are, how it works, and what it tells us about our data
how it performs when applying it to administrative data in the UK, whether it is easy to run, and if there is anything further we need to adapt or add to make this applicable to larger datasets
how we can apply this in the future, what needs to be in place to make this happen, and how feasible it is
4. Methodology
Data
We used a linked dataset that had the 2016 Statistical Population Dataset (SPD) version 3.0 (V3.0) for England (formerly known as the admin-based population estimates (ABPE)), as the population base. This had ethnicity data linked on at person level from Hospital Episode Statistics (HES; 2009 to 2016), English School Census (ESC; 2011 to 2016) and Improving Access to Psychological Therapies (IAPT; 2012 to 2016). Individuals could appear multiple times within each dataset, potentially with different ethnicities recorded on different records (for example, person A might appear in one source three times, with three different ethnicities). In the creation of the linked dataset, prior to the application of the MILC method, a rule-based approach was applied to select one ethnicity per person for each data source that they were in. Any error from the rule-based approach is not accounted for when we apply the MILC method. MILC currently only resolves between-source conflict, so further work would be required to explore how this could be incorporated in future. Not every individual had an ethnicity recorded in every dataset, and not every individual was present in every dataset. For more detail on this, please see previous Office for National Statistics (ONS) work on Producing admin-based ethnicity statistics for England: methods, data and quality. This meant that for the purposes of this analysis, there was quite a lot of missingness across the linked sources.
Method
To obtain the indicators of classification error, we ran one overall latent class model (LCM) on the whole local authority from the data specified in the previous sections.
To obtain estimates of the proportion of people in each class, we applied the general steps of MILC. These steps were outlined by Boeschoten, Oberski and de Waal in their 2017 article Estimating Classification Errors Under Edit Restrictions in Composite Survey-Register Data Using MILC. In short, we drew five bootstrap samples from the linked dataset (each the same size as the original linked dataset), estimated an LCM on each, imputed into a new variable for each LCM, derived the estimate of interest from each new variable, and then pooled these estimates into a single, "final" estimate. The models specified contained three indicators: the ethnicity variable from each of the three sources. Cases with missing values were kept in during model estimation because of the large amount we would be excluding if we were to remove them. For more technical detail about these steps, please contact us at methods.research@ons.gov.uk.
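The steps above can be sketched in code. The following is a minimal illustration only, not the code used in this research (which relied on the poLCA package in R). It assumes a two-class model with binary indicators fitted by a simple expectation-maximisation routine, treats missing values as missing at random by skipping them in the likelihood, and runs on simulated data. All function names, the simulated error rates, and the simulated missingness rate are hypothetical choices made for the example.

```python
import math
import random


def posterior(row, pi, p):
    """P(latent class = 1 | observed indicators); None marks a missing value."""
    log_w = [math.log(1 - pi), math.log(pi)]
    for j, y in enumerate(row):
        if y is None:
            continue  # missing indicators are skipped (missing-at-random assumption)
        for c in (0, 1):
            log_w[c] += math.log(p[c][j] if y == 1 else 1 - p[c][j])
    top = max(log_w)
    w0, w1 = math.exp(log_w[0] - top), math.exp(log_w[1] - top)
    return w1 / (w0 + w1)


def clip(v):
    """Keep probabilities away from 0 and 1 so logarithms stay finite."""
    return min(max(v, 1e-6), 1 - 1e-6)


def fit_lcm(data, n_iter=50):
    """Fit a two-class latent class model by EM; returns (pi, p)."""
    k = len(data[0])
    pi = 0.5                       # P(latent class = 1)
    p = [[0.2] * k, [0.8] * k]     # p[c][j] = P(indicator j = 1 | class c)
    for _ in range(n_iter):
        post = [posterior(row, pi, p) for row in data]      # E-step
        pi = clip(sum(post) / len(data))                    # M-step
        for j in range(k):
            for c in (0, 1):
                num = den = 0.0
                for row, w1 in zip(data, post):
                    if row[j] is None:
                        continue
                    w = w1 if c == 1 else 1 - w1
                    den += w
                    num += w * row[j]
                if den > 0:
                    p[c][j] = clip(num / den)
    return pi, p


def milc(data, n_boot=5, seed=0):
    """Bootstrap, fit an LCM per sample, impute on the original data, estimate."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        boot = [data[rng.randrange(n)] for _ in range(n)]   # same size as original
        pi, p = fit_lcm(boot)
        imputed = [1 if rng.random() < posterior(row, pi, p) else 0 for row in data]
        estimates.append(sum(imputed) / n)                  # proportion in class 1
    return estimates


def simulate(n=2000, seed=1):
    """Simulated data: one latent class, three error-prone sources, some missing."""
    rng = random.Random(seed)
    error_rates = (0.05, 0.10, 0.08)   # hypothetical per-source error rates
    truth, data = [], []
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0
        row = []
        for e in error_rates:
            y = 1 - x if rng.random() < e else x
            row.append(None if rng.random() < 0.2 else y)
        truth.append(x)
        data.append(row)
    return truth, data


truth, data = simulate()
estimates = milc(data)
print("per-sample estimates:", estimates)
print("pooled point estimate:", sum(estimates) / len(estimates))
```

The pooled point estimate here is simply the mean of the five per-sample estimates. The fixed starting values in `fit_lcm` are a convenience to keep the class labels stable across bootstrap samples in this toy setting; a real application would need a more careful treatment of starting values and label switching.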
5. Results
Assumptions
The core assumptions of multiple imputation latent class modelling (MILC) for this application were that the missingness mechanism was "missing at random" and that the probability of classification error on one source was independent of the probability of classification error on the other sources. We did not test for violations of these assumptions specifically. Based upon what we know about administrative data in general, it is possible that the "missing at random" assumption could be violated. If this assumption is violated, it can affect MILC's estimates and potentially lead to biased results. Future work should explore whether we can account for this in our models, potentially using different software that allows us to build more complex models. Furthermore, when applying MILC, substantive knowledge regarding the data sources with respect to the assumptions is ideal, to assess the quality of the outcomes. For this reason, it is recommended that future applications are paired with qualitative research to develop in-depth knowledge of the data sources used.
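As a sketch of what the independence assumption means, a latent class model with $C$ classes and $L$ indicators is commonly written as follows. The notation is introduced here for illustration and does not come from the original analysis: $X$ is the latent "true" class and $Y_l$ is the value recorded on source $l$.

```latex
P(Y_1 = y_1, \dots, Y_L = y_L)
  = \sum_{x=1}^{C} P(X = x) \prod_{l=1}^{L} P(Y_l = y_l \mid X = x)
```

In this application, $C = 2$ (White and Non-White) and $L = 3$ (one ethnicity variable from each of the three sources). The product form is the independence assumption: given the latent class, an error on one source says nothing about errors on the others. Under the "missing at random" assumption, records with missing indicators contribute only the terms for the sources on which they are observed.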
Latent class model performance
The entropy R² results of the latent class models (LCMs) are presented in Table 1.
Model | Entropy R² |
---|---|
1 | 0.672 |
2 | 0.657 |
3 | 0.893 |
4 | 0.648 |
5 | 0.894 |
Table 1: Entropy R² of latent class models
The entropy R² indicates how well the model can correctly classify based on the data, with a value of 1 meaning perfect prediction. Generally, a value of at least 0.7 is recommended for meaningful classification. Three of the five bootstrapped models had an entropy R² a little lower than this. While this makes the imputations from the models less reliable, the point of this research was to test how the method works and the feasibility of applying it within the Office for National Statistics (ONS), not to refine and perfect the model. To improve the model, future research could look at including informative covariates.
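To make the measure concrete, the sketch below computes an entropy R² from a set of posterior class probabilities. It assumes one common definition, the proportional reduction in entropy when moving from the estimated class proportions to each record's posterior probabilities. This article does not state which formula was used, so the definition and the function names are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability vector, ignoring zero entries."""
    return -sum(q * math.log(q) for q in probs if q > 0)


def entropy_r2(class_proportions, posteriors):
    """Proportional reduction in entropy from prior to posterior.

    class_proportions: estimated share of each latent class.
    posteriors: one probability vector per record, P(class | observed data).
    """
    prior = entropy(class_proportions)
    post = sum(entropy(row) for row in posteriors) / len(posteriors)
    return (prior - post) / prior


# Two limiting cases: fully certain posteriors and fully uninformative ones.
certain = entropy_r2([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
uninformative = entropy_r2([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])
print(certain, uninformative)
```

Under this definition, a value near 1 means that the sources almost always point clearly to one class, while a value near 0 means that the posterior probabilities are no more informative than the overall class proportions.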
Method outputs
Table 2 displays the outputs produced from this application.
Latent class | Proportion | Variance | Upper confidence interval | Lower confidence interval |
---|---|---|---|---|
White | 0.879 | 0.00000148 | 0.882 | 0.877 |
Non-White | 0.121 | 0.00000148 | 0.123 | 0.118 |
Table 2: Proportions, variance, and confidence intervals from MILC output
The method estimated that 87.9% (95% confidence interval: 87.7 to 88.2%) of people in the dataset belonged in the aggregated “White” latent class, and 12.1% (95% confidence interval: 11.8 to 12.3%) belonged in the aggregated “Non-White” latent class.
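For readers unfamiliar with how a single estimate, variance, and confidence interval can be obtained from several imputed datasets, the sketch below applies Rubin's rules, the standard pooling approach in multiple imputation. It is an illustration under assumptions rather than the code used here: the input values are hypothetical, the function name is ours, and the interval uses a normal quantile for simplicity, whereas Rubin's rules refer to a t distribution with the degrees of freedom that the function also returns.

```python
import math
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, within_variances, level=0.95):
    """Pool per-imputation estimates and variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)              # pooled point estimate
    u_bar = mean(within_variances)       # average within-imputation variance
    b = variance(estimates)              # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b      # total variance
    # Degrees of freedom for the t reference distribution (infinite if b is 0).
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2 if b > 0 else math.inf
    z = NormalDist().inv_cdf(0.5 + level / 2)   # normal approximation (assumption)
    half_width = z * math.sqrt(total)
    return {
        "estimate": q_bar,
        "variance": total,
        "df": df,
        "lower": q_bar - half_width,
        "upper": q_bar + half_width,
    }


# Hypothetical proportions from five imputed datasets of a hypothetical size n.
n = 100_000
proportions = [0.878, 0.880, 0.879, 0.881, 0.877]
within = [p * (1 - p) / n for p in proportions]   # binomial variance of a proportion
pooled = pool_rubin(proportions, within)
print(pooled)
```

The total variance combines the sampling variance within each imputed dataset with the variation between them, which is how uncertainty from the latent class models feeds into the final interval.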
The indicators of classification error for each class in each source are presented in Table 3.
Latent class | English School Census (ESC) | Improving Access to Psychological Therapies (IAPT) | Hospital Episode Statistics (HES) |
---|---|---|---|
White | 13% | 21% | 12% |
Non-White | 3% | 0% | 0% |
Table 3: Classification-error indicators from MILC for each source
For the people who the latent class model classified as being White, 21% were not classified as White in Improving Access to Psychological Therapies (IAPT), 13% in English School Census (ESC), and 12% in Hospital Episode Statistics (HES). For the people who the latent class model classified as being Non-White, 3% were not classified as Non-White in ESC, and 0% in IAPT and HES.
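One way to read these indicators is as the off-diagonal entries of a table of conditional probabilities: the probability that a source records a given class, conditional on the latent class. The sketch below illustrates this reading for ESC, using the rounded percentages from Table 3 and their complements. The data structure and function name are hypothetical, and the example assumes that the indicators were derived from the model's conditional response probabilities, which this article does not spell out.

```python
def classification_error(cond_probs):
    """Return P(source records a different class | latent class) per latent class.

    cond_probs[latent][observed] = P(observed class on the source | latent class).
    """
    return {latent: 1.0 - probs[latent] for latent, probs in cond_probs.items()}


# ESC values implied by Table 3 (rounded): 13% and 3% off-diagonal.
esc = {
    "White": {"White": 0.87, "Non-White": 0.13},
    "Non-White": {"White": 0.03, "Non-White": 0.97},
}
esc_error = classification_error(esc)
print(esc_error)
```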
Further analysis of how these results compare with the individual sources is not presented in this methodology article because of the differences in coverage and missingness patterns of the sources, and the fact that this was a proof-of-concept project. As a result, further work is needed to optimise the model and method before such comparisons can be made.
In terms of MILC's performance, the method did not run into any model identification or convergence issues despite the presence of item missingness (that is, missing ethnicity information). MILC can estimate a person's ethnicity as long as they are present on at least one of the sources, even if they do not have an ethnicity recorded.
6. Future developments
The purpose of this research was to explore what multiple imputation latent class modelling (MILC) produces, consider the utility of these outputs, and understand how it can be applied to administrative data in the Office for National Statistics (ONS). The point was to pave the way for future developments, not to produce official estimates of ethnic groups or classification error.
Based upon this initial application, we recommend that MILC receives further research in the ONS, as limited methods exist for estimating and correcting for classification error across administrative data sources. MILC has the potential to meet requirements such as improving the quality of methods, data, and analysis, producing estimates of ethnic group classifications from administrative data sources, and contributing to understanding the quality of these administrative data sources.
Recommendations
More development is needed to apply MILC in more complex ways within the ONS (for example, addition of covariates, larger number of classes, and multivariate analysis); further research would need to assess its robustness to violations of model assumptions, performance, disclosiveness, and ethics when applied to more classifications.
We will also be exploring further how we can apply latent class modelling (LCM) in an administrative data context; LCM can be used separately from the MILC method to estimate classification error in categorical administrative data, and it has not been widely applied to large administrative datasets in this way within the ONS to date.
Continue exploring how best to optimise the model with our administrative data.
Explore using alternative software, such as LatentGOLD, for LCM estimation, as it allows for more complex models and has better functionality than current R packages.
Scope of MILC and separate issues
The method does not address coverage issues; these would need to be addressed separately beforehand.
MILC indirectly deals with item missingness, but it was not created for this purpose; generally, the more complete values in the source variables, the better.
One assumption is that missing values are missing at random, and if this is violated, it may lead to biased estimates; this assumption is often violated in administrative data in general, and it is not currently known exactly how this affects the outputs of MILC, but this should be explored further in future research.
As with other methods, MILC cannot distinguish between conflicting values caused by real classification error in the sources and conflicting values introduced by linkage error; this could be a useful feature if you are interested in getting an indication of both classification and linkage error together, but if you are interested only in classification error, it is advisable to apply this method after resolving as much linkage error as possible.
It is best to run MILC once coverage issues have been resolved (or on datasets where there are minimal coverage issues) as MILC itself does not deal with coverage; one important extension would be to explore combining MILC with a method such as capture-recapture or multiple system estimation, which do aim to resolve coverage issues.
7. Acknowledgements
We would like to thank Laura Boeschoten for her input and support throughout this research. We are also grateful to our colleagues within the Office for National Statistics for their support and topic knowledge.
8. References
Boeschoten L, de Waal T, and Vermunt J K (2019), ‘Estimating the number of serious road injuries per vehicle type in the Netherlands by using multiple imputation of latent classes’, Journal of the Royal Statistical Society, Volume 182, Issue 4, pages 1463 to 1486
Boeschoten L, Oberski D L, and de Waal T (2017), ‘Estimating classification errors under edit restrictions in composite survey-register data using multiple imputation latent class modelling (MILC)’, Journal of Official Statistics, Volume 33, pages 921 to 962
Linzer D A and Lewis J B (2011), ‘poLCA: an R package for polytomous variable latent class analysis’, Journal of Statistical Software, Volume 42, Issue 10, pages 1 to 29
Office for National Statistics (ONS), released 21 June 2019, ONS website, article, Developing our approach for producing admin-based population estimates, England and Wales: 2011 and 2016
Office for National Statistics (ONS), released 6 August 2021, ONS website, article, Producing admin-based ethnicity statistics for England: methods, data and quality
Vermunt J K and Magidson J (2015), ‘Upgrade manual for Latent GOLD 5.1’, Belmont: Statistical Innovations
9. Cite this methodology
Office for National Statistics (ONS), released 27 January 2023, ONS website, methodology, Exploring the use of multiple imputation latent class modelling (MILC) for administrative data