(GLM-missing-values-in-covariates.ipynb)=

GLM-missing-values-in-covariates

:::{post} Nov 09, 2024
:tags: missing covariate values, missing values, auto-imputation, linear regression, bayesian workflow
:category: intermediate, reference
:author: Jonathan Sedar
:::

Minimal Reproducible Example: Workflow to handle missing data in multiple covariates (numeric predictor features)

Automatic Imputation of Missing Values In Covariates: Worked Example with Bayesian Workflow

Here we demonstrate automatic imputation of missing values within multiple covariates (numeric predictor features).

Disclaimer:

  • This notebook is a worked example only; it's not intended to be an academic reference
  • The theory and math may be incorrect, incorrectly notated, or incorrectly used
  • The code may contain errors, inefficiencies, legacy methods, and the text may have typos
  • Use at your own risk!

Contents



Discussion

Problem Statement

We often encounter real-world situations and datasets where a predictor feature is numeric and where observations contain missing values in that feature.

Missing values break the model inference, because the log-likelihood for those observations can't be computed.

We have a few options to mitigate the situation:

  • Firstly, we should always try to learn why the values are missing. If they are not Missing Completely At Random (MCAR) {cite:p}enders2022 and contain latent information about a biased or lossy data-generating process, then we might choose to remove the observations with missing values or ignore the features that contain missing values
  • If we believe the values are Missing Completely At Random (MCAR), then we might choose to auto-impute the missing values so that we can make use of all the available observations. This is particularly important when we have few observations and/or a high degree of missingness.

Here we demonstrate method(s) to auto-impute missing values as part of the pymc posterior sampling process. We get inference and prediction as usual, and also obtain posterior estimates of the missing values. Pretty neat!

Data & Models Demonstrated

Our problem statement is that when faced with data with missing values, we want to:

  1. Infer the missing values for the in-sample dataset and sample full posterior parameters
  2. Predict the endogenous feature and the missing values for an out-of-sample dataset

This notebook takes the opportunity to:

  • Demonstrate a general method using auto-imputation, which is often mentioned in pymc folklore but rarely shown in full. A helpful primer and related discussion is this PyMCon2020 talk: {cite:p}junpenglao2020
  • Demonstrate a reasonably complete Bayesian workflow {cite:p}gelman2020bayesian including data creation

This notebook is a partner to another pymc-examples notebook Missing_Data_Imputation.ipynb, which goes into more detail on missing-data taxonomies and works through a much more complicated dataset in a tutorial style.



Setup

:::{include} ../extra_installs.md :::



0. Curate Dataset

IMPLEMENTATION NOTE

  • The auto-imputation and full workflow will be simplest to demonstrate if we can compare to true values for the missing data, which means we have to synthesize the dataset
  • We will create at least 2 features that contain missing values, in order to demonstrate the flattening effect of pymc's auto-imputation routine
  • We create missing values that are Missing At Random (MAR) {cite:p}enders2022
  • We also take the opportunity to make the missingness a real problem, with a large proportion of values missing (40%)
  • For simplicity, we will assume the original features are each Normally distributed: to make this example harder, the reader could experiment with other distributions
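
A minimal sketch of this kind of synthesis (the sizes, coefficients and seed below are illustrative assumptions, not the notebook's exact values):

```python
import numpy as np
import pandas as pd

RNG = np.random.default_rng(42)
N = 40  # assumed number of observations

# "true" (unobserved) dataset dfraw: 4 Normally-distributed features and a noisy linear response y
dfraw = pd.DataFrame(
    {
        "a": RNG.normal(0.0, 1.0, N),
        "b": RNG.normal(1.0, 2.0, N),
        "c": RNG.normal(-1.0, 1.5, N),
        "d": RNG.normal(2.0, 1.0, N),
    },
    index=pd.Index([f"o{i:02d}" for i in range(N)], name="oid"),
)
dfraw["y"] = (
    0.5 * dfraw["a"] - 1.0 * dfraw["b"] + 1.5 * dfraw["c"] + 0.7 * dfraw["d"]
    + RNG.normal(0.0, 0.3, N)
)

# "observed" dataset df: force 40% of the values in features c and d to be missing
# (for brevity this sketch injects the missingness completely at random)
df = dfraw.copy()
for ft in ["c", "d"]:
    idx_missing = RNG.choice(df.index, size=int(0.4 * N), replace=False)
    df.loc[idx_missing, ft] = np.nan
```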

0.1 Create Synthetic Dataset

0.1.0 Create "true" (unobserved) dataset (without missing values)

0.1.1 Force 2 features to contain 40% unobserved missing values

Observe:

  • Features a and b are full, complete, without missing values
  • Features c and d contain missing values, where observations can even contain missing values for both features

NOTE: we don't need any further steps to prepare the dataset (e.g. clean observations & features, force datatypes, set indexes, etc), so we will move straight to EDA and transformations for model input

0.2 Very limited quick EDA

0.2.1 Univariate: target y

Observe:

  • y looks smooth, reasonably central, can probably model with a Normal likelihood

0.2.2 Univariate: predictors a, b, c, d

Observe:

  • a, b, c, d vary across the range, but all reasonably central, can probably model with Normal distributions
  • c, d each contain 16 NaN values

0.2.3 Bivariate: target vs predictors

Observe:

  • Each of features a, b, c, d has a correlation to y, some stronger, some weaker
  • This looks fairly realistic

0.3 Transform dataset to dfx for model input

IMPORTANT NOTE

  • Reminder that Bayesian inferential methods do not need a test dataset (nor k-fold cross validation) to fit parameters.
  • We also do not need a holdout (out-of-sample) dataset (that contains target y values) to evaluate model performance, because we can use in-sample PPC, LOO-PIT and ELPD evaluations
  • Depending on the real-world model implementation we might:
    • Create a separate holdout set (even though we don't need it) to simply eyeball the behaviour of predictive outputs
    • Create a forecast set (which does not have target y values) to demonstrate how we would use the model and its predictive outputs on future new data in Production
  • Here we do take a holdout set, in order to eyeball the predictive outputs, and also to eyeball the auto-imputed missing values compared to the true synthetic data. This is only possible because we synthesized the true data above.
  • Per the following terminology, df created above is our "working" dataset, which we will partition into "train" and "holdout"

Dataset terminology / partitioning / purpose:


0.3.1 Separate df_train and df_holdout

NOTE

  • We have full control over how many observations we create in df, so the ratio of this split doesn't really matter, and we'll arrange to have 10 observations in the holdout set
  • Eyeball the count of non-nulls in the below tables to ensure we have missing values in both features c, d in both datasets train and holdout

0.3.2 Create dfx_train

Transform (zscore and scale) numerics

0.3.3 Create dfx_holdout
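
A minimal sketch of the transform, assuming df_train and df_holdout from the split above; FTS_NUM, MNS and SDEVS are assumed names. Note that the NaNs simply pass through the zscoring untouched:

```python
FTS_NUM = ["a", "b", "c", "d"]

# means / stdevs computed on the training set only (pandas skips NaNs), then
# reused to transform the holdout set so both live on the same scale
MNS = df_train[FTS_NUM].mean()
SDEVS = df_train[FTS_NUM].std()

dfx_train = df_train.copy()
dfx_train[FTS_NUM] = (df_train[FTS_NUM] - MNS) / SDEVS

dfx_holdout = df_holdout.copy()
dfx_holdout[FTS_NUM] = (df_holdout[FTS_NUM] - MNS) / SDEVS
```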



1. Model0: Baseline without Missing Values

This section might seem unusual or unnecessary, but will hopefully provide a useful comparison for general behaviour and help to further explain the model architecture used in ModelA.

We will create Model0 using the same general linear model, operating on the dfrawx_train dataset without any missing values:

$$
\begin{align}
\sigma_{\beta} &\sim \text{InverseGamma}(11, 10) \\
\beta_{j} &\sim \text{Normal}(0, \sigma_{\beta}, \text{shape}=j) \\
\\
\epsilon &\sim \text{InverseGamma}(11, 10) \\
\hat{y_{i}} &\sim \text{Normal}(\mu=\beta_{j}^{T}\mathbb{x}_{ij}, \sigma=\epsilon) \\
\end{align}
$$

where:

  • Observations $i$ (observation_id aka oid) contain numeric features $j$ that have complete values (this is all features a, b, c, d)
  • Our target is $\hat{y_{i}}$, here the observed y, with a linear sub-model $\beta_{j}^{T}\mathbb{x}_{ij}$ to regress onto those features
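
A minimal PyMC sketch of this specification (model, data and dim names such as mdl0, xj, xj_nm, oid are assumptions for illustration, not necessarily the notebook's exact code):

```python
import pymc as pm

FTS_XJ = ["a", "b", "c", "d"]  # all features are complete in Model0
COORDS = {"oid": dfrawx_train.index.values, "xj_nm": FTS_XJ}

with pm.Model(coords=COORDS) as mdl0:
    xj = pm.Data("xj", dfrawx_train[FTS_XJ].values, dims=("oid", "xj_nm"))
    y = pm.Data("y", dfrawx_train["y"].values, dims="oid")

    # priors, per the specification above
    beta_sigma = pm.InverseGamma("beta_sigma", alpha=11, beta=10)
    beta_j = pm.Normal("beta_j", mu=0.0, sigma=beta_sigma, dims="xj_nm")
    epsilon = pm.InverseGamma("epsilon", alpha=11, beta=10)

    # linear sub-model and likelihood
    yhat = pm.Normal(
        "yhat", mu=pm.math.dot(xj, beta_j), sigma=epsilon, observed=y, dims="oid"
    )
```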

1.0 Quickly prepare non-missing datasets based on dfraw

This is a lightly simplified copy of the same logic / workflow in $\S0.3$ above. We won't take up any more space here with EDA; the only difference is that c and d are now complete

Partition dfraw into dfraw_train and dfraw_holdout, use same indexes as df_train and df_holdout

Create dfrawx_train: Transform (zscore and scale) numerics

Create dfrawx_holdout

Note the inevitable (but slight) difference in MNS vs MNS_RAW and SDEVS vs SDEVS_RAW

1.1 Build Model Object

Verify the built model structure matches our intent, and validate the parameterization

Observe:

  • This is a very straightforward model

1.2 Sample Prior Predictive, View Diagnostics

1.2.1 In-Sample Prior PPC (Retrodictive Check)

Observe:

  • The prior PPC is wrong as expected, because we've set relatively uninformative priors
  • However the general range and scale is reasonable and the sampler should be able to find the highest likelihood latent space easily
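
A sketch of the kind of calls used for this step, continuing the assumed names mdl0 and yhat from the sketch above:

```python
import arviz as az
import pymc as pm

with mdl0:
    id0 = pm.sample_prior_predictive(draws=500, random_seed=42)

# in-sample prior retrodictive check: prior-predictive yhat vs observed y
az.plot_ppc(id0, group="prior", var_names=["yhat"]);
```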

1.2.2 Quick look at selected priors

Coefficients etc

Observe:

  • Model priors beta_sigma, beta_j: (levels), epsilon all have reasonable prior ranges as specified

1.3 Sample Posterior, View Diagnostics

1.3.1 Sample Posterior and PPC
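
A sketch of the sampling step, again under the assumed names above; the log-likelihood is stored so that we can compute LOO-PIT and ELPD comparisons later:

```python
with mdl0:
    # NUTS sampling, keeping pointwise log-likelihood for model comparison
    id0.extend(pm.sample(random_seed=42, idata_kwargs={"log_likelihood": True}))
    # in-sample posterior predictive for the retrodictive check
    id0.extend(pm.sample_posterior_predictive(id0, random_seed=42))
```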

1.3.2 View Traces

Observe:

  • Samples well-mixed and well-behaved: ess_bulk is good, r_hat is good
  • Marginal energy | energy transition looks a little mismatched, but E-BFMI >> 0.3 so is apparently reasonable

1.3.3 In-Sample Posterior PPC (Retrodictive Check)

Observe:

  • In-sample PPC yhat tracks the observed y very closely

1.3.4 In-Sample PPC LOO-PIT

Observe:

  • LOO-PIT looks good, very slightly overdispersed but more than acceptable for use
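
For reference, the LOO-PIT plot can be produced with something like the following, assuming yhat is the likelihood name and the log-likelihood group was stored during sampling as sketched above:

```python
az.plot_loo_pit(id0, y="yhat");
az.plot_loo_pit(id0, y="yhat", ecdf=True);
```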

1.4 Evaluate Posterior Parameters

1.4.1 Coefficients etc

Observe:

  • Posteriors for model coeffs beta_sigma, beta_j: (levels), epsilon all smooth and central as specified

For interest's sake forestplot the beta_j levels to compare relative effects

Observe:

  • Very tight and distinct posterior distributions
  • The levels broadly correspond to what we would expect to see, given the synthetic data creation

1.5 Create PPC Forecast on dfrawx_holdout set

1.5.1 Replace dataset with dfrawx_holdout

1.5.2 Sample PPC for yhat
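
A sketch of this step (assumed names as above); pm.set_data works here precisely because dfrawx_holdout contains no NaNs:

```python
with mdl0:
    # swap the in-sample data for the holdout data, updating the oid coordinate
    pm.set_data(
        {"xj": dfrawx_holdout[FTS_XJ].values, "y": dfrawx_holdout["y"].values},
        coords={"oid": dfrawx_holdout.index.values},
    )
    # forecast yhat for the holdout observations
    id0_h = pm.sample_posterior_predictive(
        id0, var_names=["yhat"], predictions=True, random_seed=42
    )
```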

1.5.3 Out-of-sample: Compare forecasted yhat to known true value y

Extract yhat from PPC idata, and attach real values (only available because it's a holdout set)
Plot posterior yhat vs known true values y (only available because it's a holdout set)

Observe:

  • The predictions yhat look very close to the true value y, usually well within the $HDI_{94}$ and $HDI_{50}$
  • As we would expect, the distributions of yhat are useful too: quantifying the uncertainty in prediction and letting us make better decisions accordingly.


2. ModelA: Auto-impute Missing Values

Now we progress to handling missing values!

ModelA extends Model0 with an additional linear sub-model for the features $k$ that contain missing values, placing a hierarchical prior on those data:

$$
\begin{align}
\sigma_{\beta} &\sim \text{InverseGamma}(11, 10) \\
\beta_{j} &\sim \text{Normal}(0, \sigma_{\beta}, \text{shape}=j) \\
\beta_{k} &\sim \text{Normal}(0, \sigma_{\beta}, \text{shape}=k) \\
\\
\mu_{k} &\sim \text{Normal}(0, 1, \text{shape}=k) \\
\mathbb{x}_{ik} &\sim \text{Normal}(\mu_{k}, 1, \text{shape}=(i,k)) \\
\\
\epsilon &\sim \text{InverseGamma}(11, 10) \\
\hat{y_{i}} &\sim \text{Normal}(\mu=\beta_{j}^{T}\mathbb{x}_{ij} + \beta_{k}^{T}\mathbb{x}_{ik}, \sigma=\epsilon) \\
\end{align}
$$

where:

  • Observations $i$ contain numeric features $j$ that have complete values, and numeric features $k$ that contain missing values
  • For the purposes of this example, we assume that in future, features $j$ will always be complete, and missing values can occur in $k$, and design the model accordingly
  • This is a big assumption, because missing values could theoretically occur in any feature; here we simply require that features $j$ are always complete
  • We treat data $\mathbb{x}_{ik}$ as a random variable, and "observe" the non-missing values.
  • We will assume the missing data values to have a Normal distribution with mean $\mu_{k}$ - this is reasonable because we zscored the dfx data and can expect a degree of centrality around 0. Of course, the actual distributions of data in $\mathbb{x}_{ik}$ could be highly skewed, so a Normal is not necessarily the best choice, but a good place to start
  • Our target is $\hat{y_{i}}$, here the observed y, with linear sub-models $\beta_{j}^{T}\mathbb{x}_{ij}$ and $\beta_{k}^{T}\mathbb{x}_{ik}$ to regress onto those features
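
A minimal PyMC sketch of ModelA. The key line is the xk likelihood, where we pass the dataframe slice containing NaNs directly as observed and let pymc auto-impute; names such as mdla, xj, xj_nm, xk_nm are illustrative assumptions:

```python
import pymc as pm

FTS_XJ = ["a", "b"]  # complete features j
FTS_XK = ["c", "d"]  # features k containing missing values
COORDS = {"oid": dfx_train.index.values, "xj_nm": FTS_XJ, "xk_nm": FTS_XK}

with pm.Model(coords=COORDS) as mdla:
    xj = pm.Data("xj", dfx_train[FTS_XJ].values, dims=("oid", "xj_nm"))
    y = pm.Data("y", dfx_train["y"].values, dims="oid")

    # coefficient priors
    beta_sigma = pm.InverseGamma("beta_sigma", alpha=11, beta=10)
    beta_j = pm.Normal("beta_j", mu=0.0, sigma=beta_sigma, dims="xj_nm")
    beta_k = pm.Normal("beta_k", mu=0.0, sigma=beta_sigma, dims="xk_nm")

    # hierarchical prior on the (zscored) data for features k, "observed" with NaNs:
    # pymc auto-creates xk_observed and a flattened xk_unobserved free RV for us
    xk_mu = pm.Normal("xk_mu", mu=0.0, sigma=1.0, dims="xk_nm")
    xk = pm.Normal(
        "xk", mu=xk_mu, sigma=1.0, observed=dfx_train[FTS_XK], dims=("oid", "xk_nm")
    )

    # likelihood
    epsilon = pm.InverseGamma("epsilon", alpha=11, beta=10)
    yhat = pm.Normal(
        "yhat",
        mu=pm.math.dot(xj, beta_j) + pm.math.dot(xk, beta_k),
        sigma=epsilon,
        observed=y,
        dims="oid",
    )
```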

IMPLEMENTATION NOTE

There are a few ways to handle missing values in pymc. Here in ModelA we make use of pymc "auto-imputation", where we instruct the model to "observe" a slice of the dataset dfx_train[FTS_XK] for the features that contain missing values, and the pymc model build process will very kindly handle the missing values and provide auto-imputation at element level.

This is convenient and clean in the model specification, but does require further manipulation of the model and idata to create out-of-sample predictions, see $\S 2.5$ for detail

In particular:

  • The auto-imputation routine will create flattened xk_observed and xk_unobserved RVs for us, based on a (single) specified distribution of xk, and a final Deterministic for xk with the correct dimensions

  • However, an important limitation is that currently, pm.Data cannot contain NaNs (nor a masked_array) so we can't use the usual workflow mdl.set_data() to replace the in-sample dataset with an out-of-sample dataset to make predictions!

  • For example, neither of these constructs is possible (both are sketched in the code block after this list):

    1. Cannot insert NaNs into pm.Data
    2. Cannot insert a masked_array into pm.Data
  • Also see further discussion in pymc issue #6626, and proposed new functionality #7204

  • Finally, note that some earlier examples of auto-imputation involve creating a np.ma.masked_array and passing that into the observed RV. As at Dec 2024 this appears to no longer be necessary, and we can pass the dataframe (containing NaNs) directly; both styles are sketched after this list
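
For illustration, hedged fragments of the constructs referred to above. These are not runnable standalone: they assume the names and model context from the ModelA sketch earlier in this section, and the two xk statements are alternatives (only one would appear in a real model):

```python
import numpy as np
import pymc as pm

# 1 + 2: neither of these is currently possible -- pm.Data raises on NaNs and
# on a masked_array alike
# xk_data = pm.Data("xk_data", dfx_train[FTS_XK].values)                        # NaNs -> error
# xk_data = pm.Data("xk_data", np.ma.masked_invalid(dfx_train[FTS_XK].values))  # masked -> error

# older style: wrap the observed values in a masked_array ...
xk = pm.Normal(
    "xk",
    mu=xk_mu,
    sigma=1.0,
    observed=np.ma.masked_invalid(dfx_train[FTS_XK].values),
    dims=("oid", "xk_nm"),
)

# ... which can now be stated by passing the dataframe (with NaNs) directly
xk = pm.Normal(
    "xk", mu=xk_mu, sigma=1.0, observed=dfx_train[FTS_XK], dims=("oid", "xk_nm")
)
```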

2.1 Build Model Object

Verify the built model structure matches our intent, and validate the parameterization

Observe:

  • We have two new auto-created nodes xk_observed (observed data) and xk_unobserved (unobserved free RV), which each have the same distribution as we specified for xk
  • The original xk is now represented by a new Deterministic node with a function of the two (concatenation and reshaping)
  • In particular note xk_unobserved has a new, flattened shape with length equal to the count of NaNs in the relevant features $k$, and it loses the dims that we assigned to xk. This is an unhelpful current limitation and means we have to do some indexing to recover the PPC values later

2.2 Sample Prior Predictive, View Diagnostics

2.2.1 In-Sample Prior PPC (Retrodictive Check)

Observe:

  • Values are wrong as expected, but range is reasonable

2.2.2 Quick look at selected priors

Coefficients etc

Observe:

  • Model priors beta_sigma, beta_j: (levels), beta_k: (levels), epsilon all have reasonable prior ranges as specified

Hierarchical values $\mu_{k}$ for missing data in $\mathbb{x}_{k}$

Observe:

  • Data imputation hierarchical priors xk_mu (levels) have reasonable prior ranges as specified

The values of missing data in $x_{k}$ (xk_unobserved)

Observe:

  • Prior values for the auto-imputed data xk_unobserved are of course all the same, and in reasonable ranges as specified
  • Note again that this is a flattened RV with length equal to the count of NaNs in features c and d

2.3 Sample Posterior, View Diagnostics

2.3.1 Sample Posterior and PPC

2.3.2 View Traces

Observe:

  • Samples well-mixed and well-behaved: ess_bulk is good, r_hat is good
  • Marginal energy | energy transition looks a little mismatched: whilst E-BFMI >> 0.3 (and is apparently reasonable), note the values are lower than for Model0. This is an effect of the missing data.

2.3.3 In-Sample Posterior PPC (Retrodictive Check)

Observe:

  • In-sample PPC yhat tracks the observed y moderately well: slightly overdispersed, perhaps a likelihood with fatter tails would be more appropriate (e.g. StudentT)

2.3.4 In-Sample PPC LOO-PIT

Observe:

  • LOO-PIT looks good, again slightly overdispersed but acceptable for use

2.3.5 Compare Log-Likelihood vs Other Models

Observe:

  • Very interesting: our auto-imputing ModelA does of course suffer in comparison to Model0 which has the benefit of the complete dataset (no missing values), but it's not that much worse, and (of course) we've been able to handle missing values in the in-sample data!
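
A sketch of the comparison, where id0 and ida are the assumed in-sample idata objects for Model0 and ModelA, each with a stored log_likelihood group:

```python
import arviz as az

comp = az.compare({"mdl0": id0, "mdla": ida}, ic="loo")
az.plot_compare(comp);
```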

2.4 Evaluate Posterior Parameters

2.4.1 Coefficients etc

Observe:

  • Posteriors for model coeffs beta_sigma, beta_j: (levels), beta_k: (levels), epsilon are all smooth and central, with reasonable ranges

For interest's sake forestplot the beta_j and beta_k levels to compare relative effects

Observe:

  • Very tight and distinct posterior distributions
  • Loosely compare this to the same plot for Model0 in $\S1.4$ above:

2.4.2 Hierarchical values for missing data in $\mathbb{x}_{k}$

Observe:

  • Data imputation hierarchical priors haven't moved far from 0, which is reasonable because the missingness is at random
  • However, they do show slight differences, which is encouraging: they appear to be picking up the inherent differences in the primary raw data due to REFVALS_X_MU

2.4.3 View auto-imputed values of missing data in $x_{k}$ (xk_unobserved)

Observe:

  • We have used our model to auto-impute missing values in xk_unobserved, with quantified uncertainty
  • With the appropriate post-model indexing, we can use these as posterior predictions of the true values of the missing data
  • We'll show that indexing and a special comparison to the synthetic known values next

2.4.4 In-sample: Compare auto-imputed values $x_{k}$ xk_unobserved to known true values

NOTE

  • We can only compare because it's a synthetic dataset where we created those complete (full) values in dfraw above

Create index to extract appropriate obs from xk, because xk_unobserved doesn't have the right dims indexes

Extract from PPC idata, isolate via index, transform back into data domain
Attach real values (only available because synthetic)
Force dtypes for plotting
Plot posterior xk_unobserved vs known true values (only possible in this synthetic example)
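
A hedged sketch of the indexing described above, assuming the Deterministic xk carries dims ("oid", "xk_nm") in ida.posterior, and reusing the assumed MNS / SDEVS names from the transform step:

```python
import xarray as xr

# boolean mask of which (oid, feature) cells were missing in-sample
mask = xr.DataArray(
    dfx_train[FTS_XK].isnull().to_numpy(),
    dims=("oid", "xk_nm"),
    coords={"oid": dfx_train.index.values, "xk_nm": FTS_XK},
)

# train-set means / stdevs for the k features, as labelled arrays for safe broadcasting
mns_k = xr.DataArray(MNS[FTS_XK].to_numpy(), dims="xk_nm", coords={"xk_nm": FTS_XK})
sdevs_k = xr.DataArray(SDEVS[FTS_XK].to_numpy(), dims="xk_nm", coords={"xk_nm": FTS_XK})

# keep only the imputed (originally-NaN) cells and back-transform to the data domain
xk_imputed = ida.posterior["xk"].where(mask) * sdevs_k + mns_k

# long-format dataframe of draws, ready to boxplot against the known true values
df_imputed = xk_imputed.rename("xk_hat").to_dataframe().dropna().reset_index()
```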

Observe:

  • Here's our auto-imputed posterior distributions (boxplots) for the missing data on the in-sample dataset dfx_train
  • These are a (very helpful!) side-effect of our model construction and let us fill-in the real-world missing values for c, and d in df_train
  • Some observations (e.g. o03) have missing values in both c and d, others (e.g. o04) have only one
  • We also overplot the known true values from the synthetic dataset, and the match is close for all: usually well within the HDI94
  • Where observations have more than one missing value (e.g. o00, o08, o10 are good examples), we see the possibility of a lack of identifiability: this is an interesting and not easily avoided side-effect of the data and model architecture, and in the real-world we might seek to mitigate it by removing observations or features.

2.5 Create PPC Forecast on dfx_holdout set

IMPLEMENTATION NOTE

The following process is a bit of a hack:

  1. Firstly: We need to re-specify the model entirely using dfx_holdout, because (as noted above $\S 2.1$ Build Model Object ) we can't put NaNs (or a masked_array) into a pm.Data. If we could do that then we could simply use a pm.set_data(), but we can't so we don't.
  2. Secondly: Sample_ppc the xk_unobserved, because this is a precursor to computing yhat, and we can't specify a conditional order in sample_posterior_predictive
  3. Thirdly: Use those predictions to sample_ppc the yhat

REALITIES

  • This process is suboptimal for a real-world scenario wherein we want to forecast new incoming data, because we have to keep re-specifying the model in Step 1 (which opens opportunities for simple human error), and Steps 2 & 3 involve manipulations of idata objects, which is a faff
  • It should still be suitable for a relatively slow, (potentially batched) forecasting process on the order of seconds, not sub-second response times
  • In any case, if this were to be deployed to handle a stream of sub-second inputs, a much simpler way to rectify the situation would be to ensure proper data validation / hygiene upstream and require no missing data!

2.5.1 Firstly: Rebuild model entirely, using dfx_holdout

2.5.2 Secondly: sample PPC for missing values xk_unobserved in out-of-sample dataset

NOTE

  • Avoid changing ida; instead take a deepcopy ida_h, remove unnecessary groups, and we'll use that
  • We won't create a bare az.InferenceData then add groups, because we have to add all sorts of additional subtle info to the object. Easier to copy and remove groups
  • The xarray indexing in posterior will be wrong (set according to dfx_train, rather than dfx_holdout), specifically dimension oid and coordinate oid.
  • Changing things like this inside an xarray.Dataset is a total nightmare so we won't attempt to change them. It won't matter in this case anyway because we won't refer to the posterior.
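
A sketch of this step under the names used above (mdla_h rebuilt on dfx_holdout, ida the in-sample idata for ModelA):

```python
import copy

import pymc as pm

ida_h = copy.deepcopy(ida)

with mdla_h:
    # xk_unobserved is resampled conditional only on the posterior of its parent xk_mu,
    # now with the flattened shape implied by the NaNs in dfx_holdout
    ida_h.extend(
        pm.sample_posterior_predictive(
            ida, var_names=["xk_unobserved"], predictions=True, random_seed=42
        )
    )
```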

2.5.3 Thirdly: sub the predictions for xk_unobserved into idata and sample PPC for yhat

Observe

  • Finally in ida_h we have Datasets for predictions and predictions_constant_data as if we'd done only one step
  • mdla_h can be safely removed

2.5.4 Out-of-sample: Compare auto-imputed values $x_{k}$ xk_unobserved to known true values

NOTE

  • Recall we can only do this because it's a synthetic dataset where we created those complete (full) values in dfraw above
  • We will show a plot that might surprise the reader...

Create index to extract appropriate obs from xk, because xk_unobserved doesn't have the right dims indexes
Extract xk_unobserved from PPC idata, isolate via index, transform back into data domain
Attach real values (only possible in this synthetic example)
Force dtypes for plotting
Plot posterior xk_unobserved vs known true values (only possible in this synthetic example)

Observe:

  • Excellent: this looks like we have a terrible prediction - but the model is working as expected, and this is helpful to illustrate
  • The posterior values of xk_unobserved have been pulled from the hierarchical xk_mu distributions and are not conditional on anything else, so we get pretty much the same predicted value for every missing cell
  • This should drive home the understanding that while technically this model can handle new missing values, and does auto-impute values for missing data in an out-of-sample dataset (here dfx_holdout), these auto-imputed values for xk_unobserved can't be any more informative than the posterior distribution of the hierarchical prior xk_mu.

2.5.5 Out-of-sample: Compare forecasted $\hat{y}$ yhat to known true values

Extract yhat from PPC idata, and attach real values (only available because it's a holdout set)
Plot posterior yhat vs known true values y (only available because it's a holdout set)

Observe:

  • The predictions yhat look pretty close to the true value y, usually well-within the $HDI_{94}$
  • As we would expect, the distributions of yhat are useful too: quantifying the uncertainty in prediction and letting us make better decisions accordingly.


Errata

Authors

Reference

:::{bibliography} :filter: docname in docnames :::

Watermark

:::{include} ../page_footer.md :::