Missing_Data_Imputation.ipynb

https://github.com/pymc-devs/pymc-examples/blob/main/examples/howto/Missing_Data_Imputation.ipynb

(Bayesian Missing Data Imputation)=

Bayesian Missing Data Imputation

:::{post} February, 2023
:tags: missing data, bayesian imputation, hierarchical
:category: advanced
:author: Nathaniel Forde
:::

Bayesian Imputation and Degrees of Missing-ness

The analysis of data with missing values is a gateway into the study of causal inference.

One of the key features of any analysis plagued by missing data is the assumption which governs the nature of the missingness, i.e. what is the reason for the gaps in our data? Can we ignore them? Should we worry about why? In this notebook we'll see an example of how to handle missing data using maximum likelihood estimation and Bayesian imputation techniques. This will open up questions about the assumptions governing inference in the presence of missing data, and inference in counterfactual cases.

We will make the discussion concrete by considering an example analysis of an employee satisfaction survey and how different work conditions contribute to the responses and non-responses we see in the data.

Missing Data Taxonomy

Rubin's famous taxonomy breaks out the question into a choice of three fundamental options:

Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing Not at Random (MNAR)

Each of these paradigms can be reduced to an explicit definition in terms of the conditional probability of the pattern of missing data. The first pattern is the least concerning. The MCAR assumption states that the data are missing in a manner that is unrelated to both the observed and unobserved parts of the realised data. The data are missing due to the haphazard circumstances of the world $\phi$.

$P(M =1 | Y_{obs}, Y_{miss}, \phi) = P(M =1 | \phi)$

The second pattern (MAR) allows that the reasons for missingness can be a function of the observed data and the circumstances of the world. Sometimes this is called a case of ignorable missingness because estimation can proceed in good faith on the basis of the observed data. There may be a loss of precision, but the inference should be sound.

$P(M =1 | Y_{obs}, Y_{miss}, \phi) = P(M =1 | Y_{obs}, \phi)$

The most nefarious sort of missing data is when the missingness is a function of something outside the observed data, and the equation cannot be reduced further. Efforts at imputation, and estimation more generally, may become more difficult in this final case because of the risk of confounding. This is a case of non-ignorable missingness.

$P(M =1 | Y_{obs}, Y_{miss}, \phi)$
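To make the three definitions concrete, here is a minimal simulation sketch. The variables `x` and `y`, the coefficients, and the logistic missingness rules are all hypothetical choices for illustration and are not taken from the employee data. The indicator `m` plays the role of $M$: under MCAR it depends on nothing in the data, under MAR only on the always-observed `x`, and under MNAR on the value of `y` that goes missing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical variables: x is always observed, y is subject to missingness.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)  # true mean of y is 0, true slope is 0.8


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# m == True marks a missing value of y.
m_mcar = rng.random(n) < 0.3                       # unrelated to x and y
m_mar = rng.random(n) < sigmoid(-1.0 + 1.5 * x)    # depends only on observed x
m_mnar = rng.random(n) < sigmoid(-1.0 + 1.5 * y)   # depends on the unobserved y itself

summary = {}
for name, m in [("MCAR", m_mcar), ("MAR", m_mar), ("MNAR", m_mnar)]:
    observed_mean = y[~m].mean()
    # Slope of y on x fitted to the complete cases only.
    slope = np.polyfit(x[~m], y[~m], 1)[0]
    summary[name] = (observed_mean, slope)
    print(f"{name}: observed mean of y = {observed_mean:.3f}, slope = {slope:.3f}")
```

In this sketch the complete-case mean is only trustworthy under MCAR. Under MAR the relationship between `y` and `x` can still be recovered from the observed rows, which is the sense in which the missingness is ignorable once we condition on the observed data. Under MNAR neither quantity can be taken at face value.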

These assumptions are made before any analysis begins. They are inherently unverifiable. Your analysis will stand or fall depending on how plausible each assumption is in the context where you seek to apply it. For example, another type of missing data results from systematic censoring, as discussed in {ref}GLM-truncated-censored-regression. In such cases the reason for censoring governs the missingness pattern.

Employee Satisfaction Surveys

We'll follow the presentation of Craig Enders' Applied Missing Data Analysis {cite:t}enders2022 and work with the employee satisfaction data set. The data set comprises a few composite measures reporting employee working conditions and satisfaction. Of particular note are empowerment (empower), work satisfaction (worksat) and two composite survey scores recording the employee's leadership climate (climate) and the relationship quality with their supervisor (lmx).

The key question is what assumptions govern our patterns of missing data.

We see here the histograms of the employee metrics. It is the gaps in the data that we wish to impute to better understand the relationships between the variables and how gaps in one may be driven by values in the others.

FIML: Full Information Maximum Likelihood

This method of handling missing data is not an imputation method. It uses maximum likelihood estimation to estimate the parameters of the multivariate normal distribution that could best be said to generate our observed data. It's a little trickier than straightforward MLE approaches in that it respects the fact that we have missing data in our original data set, but fundamentally it's the same idea. We want to optimize the parameters of our multivariate normal distribution to best fit the observed data.

The procedure works by partitioning the data into their patterns of "missing-ness" and treating each partition as contributing to the ultimate log-likelihood term that we want to maximise. We combine their contributions to estimate a fit for the multivariate normal distribution.
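The partitioning idea can be sketched as follows. This is a minimal illustration on simulated data, not the notebook's implementation. The helper names `unpack` and `fiml_negloglik`, the Cholesky parameterisation, and the simulated parameters are assumptions made for the example. Each distinct pattern of observed columns contributes the log-density of the corresponding marginal multivariate normal.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal


def unpack(params, k):
    """Split a flat parameter vector into a mean and a covariance matrix."""
    mu = params[:k]
    chol = np.zeros((k, k))
    chol[np.tril_indices(k)] = params[k:]
    # L @ L.T is positive semi-definite; the jitter keeps it invertible.
    cov = chol @ chol.T + 1e-6 * np.eye(k)
    return mu, cov


def fiml_negloglik(params, data):
    """Negative log-likelihood summed over the missingness patterns."""
    k = data.shape[1]
    mu, cov = unpack(params, k)
    observed = ~np.isnan(data)
    total = 0.0
    for pattern in np.unique(observed, axis=0):
        if not pattern.any():
            continue  # rows with nothing observed carry no information
        rows = (observed == pattern).all(axis=1)
        block = data[rows][:, pattern]
        total += np.sum(
            multivariate_normal.logpdf(
                block, mean=mu[pattern], cov=cov[np.ix_(pattern, pattern)]
            )
        )
    return -total


# Simulated data with hypothetical parameters and 20% MCAR gaps per cell.
rng = np.random.default_rng(1)
true_mu = np.array([1.0, -0.5, 2.0])
true_cov = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])
data = rng.multivariate_normal(true_mu, true_cov, size=1000)
data[rng.random(data.shape) < 0.2] = np.nan

k = data.shape[1]
start_chol = np.diag(np.nanstd(data, axis=0))
start = np.concatenate([np.nanmean(data, axis=0), start_chol[np.tril_indices(k)]])
result = minimize(fiml_negloglik, start, args=(data,), method="L-BFGS-B")
mu_hat, cov_hat = unpack(result.x, k)
print(mu_hat.round(2))
print(cov_hat.round(2))
```

Parameterising the covariance through a lower-triangular factor is one convenient way to keep the optimiser inside the space of valid covariance matrices; other parameterisations would work equally well.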

Sampling from the Implied Distribution

We can then sample from the implied distribution to estimate other features of interest and test against the observed data.

This allows us to compare the implied distributions against the observed data.

The Correlation Between the Imputed Metrics Data

We can also calculate other features of interest from our sample. For instance, we might want to know about the correlations between variables in question.
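As a rough sketch of these two steps, suppose we already have a fitted mean vector and covariance matrix. The values of `mu_hat` and `cov_hat` below are hypothetical placeholders, not estimates from the employee data. We can draw from the implied multivariate normal and compute the sample correlations, which should sit close to the correlations implied analytically by the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical fitted parameters standing in for the maximum likelihood estimates.
mu_hat = np.array([10.0, 20.0, 28.0])
cov_hat = np.array([[9.0, 5.0, 4.0], [5.0, 16.0, 6.0], [4.0, 6.0, 20.0]])

# Sample from the implied distribution.
draws = rng.multivariate_normal(mu_hat, cov_hat, size=10_000)

# Correlations estimated from the sample (columns are variables).
sample_corr = np.corrcoef(draws, rowvar=False)

# Correlations implied directly by the covariance matrix, for comparison.
sd = np.sqrt(np.diag(cov_hat))
implied_corr = cov_hat / np.outer(sd, sd)

print(sample_corr.round(2))
print(implied_corr.round(2))
```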

Bootstrapping Sensitivity Analysis

We may also want to validate the estimated parameters against bootstrapped samples under different specifications of missing-ness.

Here we plot the maximum likelihood parameter estimates against various missing data regimes. This approach can be applied for any imputation methodology.

These plots show how under (MCAR) the parameter estimates of our multivariate normal distribution are quite robust to varying degrees of missing data. It's an instructive exercise to attempt a similar simulation exercise under alternative missing data regimes.
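A minimal sketch of this kind of sensitivity check is given below. To keep it short it makes two loudly stated simplifications: the data are simulated with hypothetical parameters, and the available-case mean stands in for the full maximum likelihood fit. The function name `bootstrap_means` is likewise an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical complete data set with two variables.
full = rng.multivariate_normal([20.0, 10.0], [[9.0, 3.0], [3.0, 4.0]], size=400)


def bootstrap_means(data, frac_missing, n_boot=300):
    """Resample rows, knock out cells completely at random, re-estimate the means."""
    n, k = data.shape
    estimates = np.empty((n_boot, k))
    for b in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]       # bootstrap resample (a copy)
        sample[rng.random((n, k)) < frac_missing] = np.nan  # impose MCAR gaps
        estimates[b] = np.nanmean(sample, axis=0)       # stand-in estimator
    return estimates


results = {frac: bootstrap_means(full, frac) for frac in (0.1, 0.3, 0.5)}
for frac, est in results.items():
    print(frac, est.mean(axis=0).round(2), est.std(axis=0).round(3))
```

Under MCAR the centre of the bootstrap distribution should stay near the complete-data estimate while its spread widens as more cells go missing. Swapping the masking line for a MAR or MNAR rule is one way to run the alternative exercise suggested above.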

Bayesian Imputation

Next we'll apply Bayesian methods to the same problem. But here we'll see direct imputation of the missing values using the posterior predictive distribution. The Bayesian approach to imputation is of a different flavour than we saw above. We're not just learning the parameters of the data generating distribution (although we are doing that too). The Bayesian process directly imputes values for the specific missing entries through the process of MCMC sampling.

These results agree with the FIML approach above and the results reported in Enders' Applied Missing Data Analysis.
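To give some intuition for what imputing a specific entry means under a multivariate normal model, here is a NumPy sketch of a single imputation step, conditional on one draw of the mean and covariance. This is not the PyMC sampler itself, and the function name `impute_row` and the parameter values are assumptions for illustration. It uses the standard conditional normal formulas: the missing block is drawn from a normal whose mean is shifted by the observed block and whose covariance is reduced accordingly.

```python
import numpy as np


def impute_row(row, mu, cov, rng):
    """Fill NaNs in one row by sampling from the conditional normal given the rest."""
    miss = np.isnan(row)
    out = row.copy()
    if not miss.any():
        return out
    if miss.all():
        out[:] = rng.multivariate_normal(mu, cov)
        return out
    obs = ~miss
    s_oo = cov[np.ix_(obs, obs)]
    s_mo = cov[np.ix_(miss, obs)]
    s_mm = cov[np.ix_(miss, miss)]
    # weights = S_mo @ inv(S_oo), computed with a linear solve
    weights = np.linalg.solve(s_oo, s_mo.T).T
    cond_mu = mu[miss] + weights @ (row[obs] - mu[obs])
    cond_cov = s_mm - weights @ s_mo.T
    out[miss] = rng.multivariate_normal(cond_mu, cond_cov)
    return out


rng = np.random.default_rng(4)
mu = np.array([0.0, 0.0])               # one hypothetical draw of the mean
cov = np.array([[1.0, 0.8], [0.8, 1.0]])  # one hypothetical draw of the covariance
row = np.array([2.0, np.nan])

imputations = np.array([impute_row(row, mu, cov, rng) for _ in range(5000)])
print(imputations[:, 1].mean().round(2), imputations[:, 1].var().round(2))
```

Repeating this step across many posterior draws of the parameters is what yields a whole distribution of plausible values for each missing cell, rather than a single filled-in number.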

Bayesian Imputation by Chained Equations

So far we've seen multivariate approaches to imputation which treat each of the variables in our dataset as a collection drawn from the same distribution. However, there is a more flexible approach which is often useful when there is a particular focal relationship that we're interested in analysing.

Sticking with the employee data set we'll examine here the relationship between lmx, climate, male and empower, where our focus is on what drives empowerment. Recall that our gender variable male is fully specified and does not need to be imputed. So we have a joint distribution that can be decomposed:

$f(emp, lmx, climate, male) = f(emp | lmx, climate, male) \cdot f(lmx | climate, male) \cdot f(climate | male) \cdot f(male)^{*}$

which can be split out into individual regression equations or more generally component models for each required conditional model.

$empower = \alpha_{2} + \beta_{3}male + \beta_{4}climate + \beta_{5}lmx$

$lmx = \alpha_{1} + \beta_{1}climate + \beta_{2}male$

$climate = \alpha_{0} + \beta_{0}male$

We can impute each of these equations in turn, saving the imputed data set and feeding it forward into the next modelling exercise. This adds a little complexity because some of the variables will occur twice: once as a predictor in our focal regression and once as a likelihood term in their own component model.
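The feed-forward structure can be sketched with ordinary least squares in place of the Bayesian component models. Everything in this example is an assumption made for illustration: the simulated data, the coefficients, the 15% missingness rate, and the helper `regress_and_impute`, which fills each gap with a fitted value plus residual noise.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Hypothetical data following the three component equations.
male = rng.integers(0, 2, size=n).astype(float)  # fully observed
climate = 20 + 1.0 * male + rng.normal(scale=3, size=n)
lmx = 5 + 0.3 * climate + 0.5 * male + rng.normal(scale=2, size=n)
empower = 10 + 1.0 * male + 0.2 * climate + 0.6 * lmx + rng.normal(scale=3, size=n)

raw = {"climate": climate, "lmx": lmx, "empower": empower}
for values in raw.values():
    values[rng.random(n) < 0.15] = np.nan  # knock out 15% of each variable


def regress_and_impute(target, predictors, rng):
    """Fit OLS on rows where target is observed, then fill gaps with fit + noise."""
    X = np.column_stack([np.ones(len(target))] + predictors)
    obs = ~np.isnan(target)
    beta, *_ = np.linalg.lstsq(X[obs], target[obs], rcond=None)
    resid_sd = np.std(target[obs] - X[obs] @ beta, ddof=X.shape[1])
    filled = target.copy()
    n_missing = int((~obs).sum())
    filled[~obs] = X[~obs] @ beta + rng.normal(scale=resid_sd, size=n_missing)
    return filled


# Impute in the order of the factorisation, feeding each result forward.
imputed = {"male": male}
imputed["climate"] = regress_and_impute(raw["climate"], [imputed["male"]], rng)
imputed["lmx"] = regress_and_impute(
    raw["lmx"], [imputed["climate"], imputed["male"]], rng
)
imputed["empower"] = regress_and_impute(
    raw["empower"], [imputed["male"], imputed["climate"], imputed["lmx"]], rng
)
print({name: int(np.isnan(col).sum()) for name, col in imputed.items()})
```

This is a single forward pass. Full chained-equation schemes typically cycle through the equations several times, and a Bayesian version carries the uncertainty in the regression coefficients through to the imputed values rather than fixing them at point estimates.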

PyMC Imputation

As we saw above we can use PyMC to impute the values of missing data by using a particular sampling distribution. In the case of chained equations this becomes a little trickier because we might want to use both the data for lmx as a regressor in one equation and observed data in our likelihood in another.

It also matters how we specify the sampling distribution that will be used to impute our missing data. We'll show an example here where we use, in turn, a uniform and a normal sampling distribution for imputing the predictor terms in our focal regression.

Model Fits

Next we'll inspect the parameter fits for our regression models and observe how they're dependent on the prior specification in the imputation scheme.

We can see how the choice of sampling distribution has induced different parameter estimates on the beta coefficients across our two models. The two imputations broadly agree at the level of the parameters, but they meaningfully differ in their implications.

This difference has downstream effects on the posterior predictive distribution. We can see here how the sampling distribution for the predictor terms influences the posterior predictive fits on our focal regression equation.

Posterior Predictive Distributions

Process the Posterior Predictive Distribution

Above we estimated a number of likelihood terms in a single PyMC model context. These likelihoods constrained the hyper-parameters which determined the imputation values of the missing terms in the variables used as predictors in our focal regression equation for empower. But we could also perform a more manual sequential imputation, where we model each of the subordinate regression equations separately and extract the imputed values for each variable in turn and then run a simple regression on the imputed values for the focal regression equation.

We show here how to extract the imputed values for each of the regression equations and augment the observed data.

We used the mean here to impute the expected value for each missing cell, but you could perform a kind of sensitivity analysis using the many plausible values in the posterior predictive distribution.
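Both options can be sketched in a few lines. The array `draws` below is a hypothetical stand-in for posterior predictive draws of the missing cells, with shape `(n_draws, n_missing)`, and the column `observed` is toy data rather than a column from the survey.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy column with two missing cells.
observed = np.array([3.1, np.nan, 4.2, np.nan, 5.0])
missing_idx = np.flatnonzero(np.isnan(observed))

# Hypothetical posterior predictive draws for the two missing cells.
draws = rng.normal(loc=[3.5, 4.5], scale=0.5, size=(2000, len(missing_idx)))

# Option 1: fill each missing cell with its posterior mean.
mean_imputed = observed.copy()
mean_imputed[missing_idx] = draws.mean(axis=0)

# Option 2: build several completed data sets from individual draws,
# so that downstream analysis can reflect the imputation uncertainty.
picks = rng.choice(draws.shape[0], size=5, replace=False)
multiple_imputed = np.tile(observed, (5, 1))
multiple_imputed[:, missing_idx] = draws[picks]

print(mean_imputed.round(2))
print(multiple_imputed.round(2))
```

Re-running the focal regression on each row of `multiple_imputed` and comparing the fitted coefficients is one simple form the sensitivity analysis could take.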

Plotting the Imputed Datasets

Now we'll plot the imputed values against their observed values to show how the different sampling distributions impact the pattern of imputation.

Ultimately our choice of sampling distribution leads to differently plausible imputations. The choice of which model to go with will be driven by the assumptions which govern the reasons for missingness in our data.

Hierarchical Structures and Data Imputation

Our employee dataset has more fine-grained structure than we've examined so far. In particular there are 100 or so teams which make up our employee pool, and we might wonder to what degree the propensity for satisfaction or incomplete survey scores is due to the local team environment. Could this be a factor in our patterns of missing data? We'll examine the reported empowerment scores by team and plot the regression lines, as predicted within each team by the reported lmx score.

There is enough spread in the regression lines to at least suggest that there is a heterogeneous relationship between empowerment and the work environment as we look across different teams, but limited observations in each team. This is a perfect use case for a hierarchical Bayesian model.

Some Convergence Checks

Inspecting the Model Fit

Heterogeneous Patterns of Imputation

Just as when we consider questions of causal inference and we attend to the confounding influence of local factors, this is also required when we do imputation. We show here a selection of team specific intercept terms which suggest that belonging to a particular team can shift your empowerment above or below the grand mean of the company level intercept term. These local effects of environment are what we seek to account for when imputing missing values.

The ability to capture this local variation impacts the pattern of imputed values too.

It's clear from the hierarchical model that the team specific information has allowed us to impute a wider range of empowerment values with a broader spread as a function of lmx and male. This is much more persuasive since all politics is local, and this latter model is informed by the conditions of work for each employee. As such, our hierarchical model is able to ascribe a more nuanced view of the probable empowerment values for the missing reports. The hierarchical imputation model "borrows information" in two ways (i) the individual team estimates are pulled toward the global estimates and (ii) the missing values are imputed with respect to our measures of the team dynamics.
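The first kind of borrowing, where team estimates are pulled toward the global estimate, can be illustrated with the textbook normal-normal shrinkage formula. This sketch assumes the within-team standard deviation `sigma` and the between-team standard deviation `tau` are known, whereas a hierarchical model estimates them. The function name `partial_pool` and the numbers are hypothetical.

```python
import numpy as np


def partial_pool(team_means, team_sizes, grand_mean, sigma, tau):
    """Precision-weighted compromise between each team mean and the grand mean."""
    precision_data = team_sizes / sigma**2   # grows with the number of responses
    precision_prior = 1.0 / tau**2           # strength of the pull to the grand mean
    return (precision_data * team_means + precision_prior * grand_mean) / (
        precision_data + precision_prior
    )


# Hypothetical team-level summaries of empowerment scores.
team_means = np.array([30.0, 22.0, 27.0])
team_sizes = np.array([2, 20, 6])
grand_mean = 25.0

pooled = partial_pool(team_means, team_sizes, grand_mean, sigma=4.0, tau=2.0)
print(pooled.round(2))
```

Teams with few responses are pulled strongly toward the grand mean, while well-observed teams largely keep their own average. The same logic governs how much a team's own data informs the imputed values for its missing reports.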

Conclusion

We've now seen multiple approaches to the imputation of missing data. We have focused on an example where the reason for the missing data is not immediately obvious, given that different employees might very well have different reasons for under-specifying their relationship with management. However, the techniques applied here are quite general.

The multivariate normal approaches to imputation work surprisingly well in many cases, but the more cutting edge approach is the sequential specification of chained equations. The Bayesian approach here is state of the art because we are quite free to use more than simple regression models as the component models for our imputation equations. For each equation we can be liberal in our choice of likelihood terms and the priors we allow over the sampling distributions. We can also add hierarchical structure to respect natural clusters in our data in so far as they constrain the patterns of missing data.

This general point is important: the flexibility of the Bayesian approach can be tailored to the appropriate complexity of our theory about why our data is missing. Similar considerations apply to the estimation procedures involved in counterfactual inference. The more developed our theory for why the data is missing (why the world is as it is, and not another way), the more we need a flexible modelling framework to capture the subtleties of the theory. Bayesian modelling is a superb tool for this loop of theory construction and evaluation.