(GLM-missing-values-in-covariates.ipynb)=
:::{post} Nov 09, 2024
:tags: missing covariate values, missing values, auto-imputation, linear regression, bayesian workflow
:category: intermediate, reference
:author: Jonathan Sedar
:::
# Minimal Reproducible Example: Workflow to handle missing data in multiple covariates (numeric predictor features)
Here we demonstrate automatic imputation of missing values within multiple covariates (numeric predictor features).
Disclaimer:
We often encounter real-world situations and datasets where a predictor feature is numeric and where observations contain missing values in that feature.
Missing values break the model inference, because the log-likelihood for those observations can't be computed.
We have a few options to mitigate the situation:
+ If the missing values are not missing-at-random {cite:p}`enders2022` and contain latent information about a biased or lossy data-generating process, then we might choose to remove the observations with missing values or ignore the features that contain missing values
+ If the missing values are missing-at-random, then we can try to auto-impute them

Here we demonstrate method(s) to auto-impute missing values as part of the `pymc` posterior sampling process. We get inference and prediction as usual, but also auto-imputed values for the missing values. Pretty neat!
Our problem statement is that when faced with data with missing values, we want to:

+ Infer model parameters as usual
+ Auto-impute the missing values
+ Create posterior predictions, including for new data that also contains missing values
This notebook takes the opportunity to:
+ Demonstrate a method that is part of `pymc` folklore but rarely shown in full. A helpful primer and related discussion is this PyMCon2020 talk: {cite:p}`junpenglao2020`
+ Demonstrate a reasonably complete Bayesian workflow {cite:p}`gelman2020bayesian`, including data creation

This notebook is a partner to another pymc-examples notebook `Missing_Data_Imputation.ipynb`, which goes into more detail of taxonomies, with a much more complicated dataset and tutorial-style worked example.
:::{include} ../extra_installs.md
:::
**IMPLEMENTATION NOTE**

`pymc`'s auto-imputation routine assumes that the missing values are missing-at-random {cite:p}`enders2022`. Here we create a small synthetic regression dataset and insert missing values at random into two of the four features (approx 40% of observations); a minimal sketch of one way to generate such data follows the notes below.

Observe:
+ `a` and `b` are full, complete, without missing values
+ `c` and `d` contain missing values, where observations can even contain missing values for both features

NOTE: we don't need any further steps to prepare the dataset (e.g. clean observations & features, force datatypes, set indexes, etc), so we will move straight to EDA and transformations for model input
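The notebook's actual data-creation code cells are not shown here; the following is a minimal sketch of one way such a dataset could be generated. The observation count, coefficients and seed are assumptions, chosen so that `c` and `d` each end up with 16 NaNs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N = 40  # assumed number of observations

# Four numeric predictor features
df = pd.DataFrame(
    {k: rng.normal(loc=rng.uniform(-1, 1), scale=1.0, size=N) for k in ["a", "b", "c", "d"]},
    index=pd.Index([f"o{i:02d}" for i in range(N)], name="oid"),
)

# Linear data-generating process with noise (coefficients are arbitrary here)
beta_true = dict(a=0.5, b=-1.0, c=2.0, d=0.3)
df["y"] = sum(beta_true[k] * df[k] for k in beta_true) + rng.normal(0, 0.5, size=N)

# Insert missing values completely at random into c and d (approx 40% of rows each)
for k in ["c", "d"]:
    idx = rng.choice(N, size=int(0.4 * N), replace=False)
    df.loc[df.index[idx], k] = np.nan
```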
**Target `y`**

Observe:

+ `y` looks smooth, reasonably central, can probably model with a Normal likelihood

**Features `a`, `b`, `c`, `d`**

Observe:

+ `a`, `b`, `c`, `d` vary across the range, but all reasonably central, can probably model with Normal distributions
+ `c`, `d` each contain 16 NaN values

**Target vs features**

Observe:

+ Each of `a`, `b`, `c`, `d` has a correlation to `y`, some stronger, some weaker

**Create `dfx` for model input**

**IMPORTANT NOTE**
+ We don't need a separate test dataset (nor k-fold cross validation) to fit parameters.
+ We don't need a holdout (out-of-sample) dataset (that contains target `y` values) to evaluate model performance, because we can use in-sample PPC, LOO-PIT and ELPD evaluations
+ We might still create a holdout set (even though we don't need it) to simply eyeball the behaviour of predictive outputs
+ We might also create a forecast set (which does not have target `y` values) to demonstrate how we would use the model and its predictive outputs on future new data in Production
+ Here we create a holdout set, in order to eyeball the predictive outputs, and also to eyeball the auto-imputed missing values compared to the true synthetic data. This is only possible because we synthesized the true data above.

The dataset `df` created above is our "working" dataset, which we will partition into "train" and "holdout".

Dataset terminology / partitioning / purpose:

| Dataset | Purpose |
| --- | --- |
| `df` | "working" dataset, contains missing values in `c` and `d` |
| `df_train`, `df_holdout` | partitions of `df`, used to fit the model and to eyeball predictions |
| `dfx_train`, `dfx_holdout` | transformed (zscored & scaled) copies of the partitions, used as model input |
| `dfraw*` | equivalent datasets without missing values, used only for Model0 and comparisons |
**Create `df_train` and `df_holdout`**

NOTE:

+ This is a synthetic dataset fully under our control via `df`, so the ratio of this split doesn't really matter, and we'll arrange to have 10 observations in the holdout set
+ Check the count of non-nulls in the below tables to ensure we have missing values in both features `c`, `d` in both datasets `train` and `holdout`

**Create `dfx_train`: Transform (zscore and scale) numerics**
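The transformation code cells are not shown here; a minimal sketch, assuming `df_train` is the train partition created above and using the constant names `MNS` and `SDEVS` (referenced later when comparing against `MNS_RAW` and `SDEVS_RAW`). The exact scaling used in the notebook may differ.

```python
# Zscore the numeric features using statistics computed on the train partition,
# so that train and holdout share the same scaling
FTS_NUM = ["a", "b", "c", "d"]

MNS = df_train[FTS_NUM].mean()    # per-feature means (NaNs are skipped by default)
SDEVS = df_train[FTS_NUM].std()   # per-feature standard deviations

dfx_train = (df_train[FTS_NUM] - MNS) / SDEVS
dfx_train["y"] = df_train["y"]    # target handling kept simple here (an assumption)
```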
**Create `dfx_holdout`**

## 1. Model0: Baseline without Missing Values

This section might seem unusual or unnecessary, but will hopefully provide a useful comparison for general behaviour, and help to further explain the model architecture used in ModelA.
We will create Model0 using the same general linear model, operating on the `dfrawx_train` dataset without any missing values, where:

+ Observations $i$ (indexed by `observation_id` aka `oid`) contain numeric features $j$ that have complete values (this is all features `a`, `b`, `c`, `d`)
+ Our target is `y`, with linear sub-model $\beta_{j}^{T}\mathbb{x}_{ij}$ to regress onto those features
to regress onto those featuresdfrawThis is a lightly simplified copy of the same logic / workflow in $\S0.3$ above. We won't take up any more space here
with EDA, the only difference is c and d are now complete
+ Partition `dfraw` into `dfraw_train` and `dfraw_holdout`, use same indexes as `df_train` and `df_holdout`
+ Create `dfrawx_train`: transform (zscore and scale) numerics
+ Create `dfrawx_holdout`
+ Note the inevitable (but slight) difference in `MNS` vs `MNS_RAW` and `SDEVS` vs `SDEVS_RAW`
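The model-build code cells are not shown here; the following is a minimal sketch of how a Model0-style build might look in `pymc`. The prior choices and the names `mdl0`, `FTS_XJ` are assumptions for illustration, not necessarily the notebook's exact code.

```python
import pymc as pm

# For Model0, all four features are complete (they are all "features j")
FTS_XJ = ["a", "b", "c", "d"]
COORDS = {"oid": dfrawx_train.index.values, "xj_nm": FTS_XJ}

with pm.Model(coords=COORDS) as mdl0:
    xj = pm.Data("xj", dfrawx_train[FTS_XJ].values, dims=("oid", "xj_nm"))
    y = pm.Data("y", dfrawx_train["y"].values, dims="oid")

    beta_sigma = pm.Gamma("beta_sigma", alpha=10, beta=10)              # scale hyperprior (assumed)
    beta_j = pm.Normal("beta_j", mu=0, sigma=beta_sigma, dims="xj_nm")  # one coefficient per feature
    epsilon = pm.Gamma("epsilon", alpha=10, beta=10)                    # observation noise (assumed)

    _ = pm.Normal("yhat", mu=pm.math.dot(xj, beta_j), sigma=epsilon, observed=y, dims="oid")
```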
Observe:
+ `beta_sigma`, `beta_j: (levels)`, `epsilon` all have reasonable prior ranges as specified
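The sampling code itself isn't shown; a minimal sketch of the kind of calls behind the prior check above and the diagnostics below, assuming the model object `mdl0` from the sketch above and an idata named `id0`:

```python
import arviz as az

with mdl0:
    id0 = pm.sample_prior_predictive()
    # store pointwise log-likelihood now, for the LOO-PIT / ELPD evaluations used later
    id0.extend(pm.sample(idata_kwargs={"log_likelihood": True}))

# Convergence diagnostics: ess_bulk and r_hat per parameter, plus energy / E-BFMI
az.summary(id0, var_names=["beta_sigma", "beta_j", "epsilon"])
az.plot_energy(id0);
```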
Observe:

+ `ess_bulk` is good, `r_hat` is good
+ `E-BFMI` >> 0.3, so is apparently reasonable
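Similarly, a sketch of the posterior predictive and LOO-PIT checks referenced below:

```python
with mdl0:
    id0.extend(pm.sample_posterior_predictive(id0))

# In-sample posterior predictive check and LOO-PIT
az.plot_ppc(id0, var_names=["yhat"]);
az.plot_loo_pit(id0, y="yhat");
```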
Observe:

+ `yhat` tracks the observed `y` very closely

Observe:

+ The LOO-PIT looks good, very slightly overdispersed but more than acceptable for use
Observe:

+ `beta_sigma`, `beta_j: (levels)`, `epsilon` are all smooth and central as specified

**Forestplot of `beta_j` levels to compare relative effects**
**Create PPC forecast on `dfrawx_holdout` set**

+ Replace the dataset with `dfrawx_holdout`, and sample the PPC for `yhat`
+ Compare forecasted `yhat` to the known true value `y`
+ Extract `yhat` from the PPC idata, and attach the real values (only available because it's a holdout set)
+ Plot posterior `yhat` vs known true values `y` (only available because it's a holdout set)

Observe:
+ The forecasted `yhat` look very close to the true values `y`, usually well within the $HDI_{94}$ and $HDI_{50}$
+ The full posterior distributions of `yhat` are useful too: quantifying the uncertainty in prediction and letting us make better decisions accordingly.

## 2. ModelA: Auto-impute Missing Values

Now we progress to handling missing values!
ModelA is an extension of Model0, adding a simple linear submodel and a hierarchical prior on the data for features $k$ that have missing values, where:

+ The data $\mathbb{x}_{ik}$ for features $k$ is given a hierarchical Normal prior: this is reasonable because we have zscored and scaled the `dfx` data and can expect a degree of centrality around 0. Of course, the actual distributions of data in $\mathbb{x}_{ik}$ could be highly skewed, so a Normal is not necessarily the best choice, but a good place to start
+ Our target is `y`, with linear sub-models $\beta_{j}^{T}\mathbb{x}_{ij}$ and $\beta_{k}^{T}\mathbb{x}_{ik}$ to regress onto those features

**IMPLEMENTATION NOTE**
There are a few ways to handle missing values in `pymc`. Here in ModelA we make use of `pymc`'s "auto-imputation": we instruct the model to "observe" a slice of the dataset `dfx_train[FTS_XK]` for the features that contain missing values, and the `pymc` build process will very kindly handle the missing values and provide auto-imputation at element level.

This is convenient and clean in the model specification, but does require further manipulation of the model and idata to create out-of-sample predictions; see $\S 2.5$ for detail.
In particular:

+ The auto-imputation routine will create flattened `xk_observed` and `xk_unobserved` RVs for us, based on a (single) specified distribution for `xk`, and a final `Deterministic` for `xk` with the correct dimensions
+ However, an important limitation is that currently, `pm.Data` cannot contain NaNs (nor a `masked_array`), so we can't use the usual workflow `mdl.set_data()` to replace the in-sample dataset with an out-of-sample dataset to make predictions!
For example, neither of these constructs is possible:

+ passing NaNs into `pm.Data`
+ passing a `masked_array` into `pm.Data`
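Schematically (kept commented out, since neither construct is supported; names here match the build sketch further below):

```python
# with pm.Model() as mdl:
#     # (a) NaNs into pm.Data -- not supported
#     xk_data = pm.Data("xk_data", dfx_train[FTS_XK].values)  # array contains np.nan
#
#     # (b) masked_array into pm.Data -- also not supported
#     xk_data = pm.Data("xk_data", np.ma.masked_invalid(dfx_train[FTS_XK].values))
```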
Also see further discussion in pymc issue #6626, and proposed new functionality #7204
Finally, note that some earlier examples of auto-imputation involve creating a `np.ma.masked_array` and passing that into the `observed` RV. As at Dec 2024 this appears to no longer be necessary, and we can directly pass in the dataframe. (The code cells below are minimal sketches using assumed names — `xk_mu`, `FTS_XK`, `dfx_train` — consistent with the fuller build sketch that follows.)
For example, this:
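```python
import numpy as np

# Sketch (assumed names): the older style wraps the data in a masked_array
# before passing it as `observed`
xk_ma = np.ma.masked_array(
    data=dfx_train[FTS_XK].values,
    mask=np.isnan(dfx_train[FTS_XK].values),
)
xk = pm.Normal("xk", mu=xk_mu, sigma=1.0, observed=xk_ma, dims=("oid", "xk_nm"))
```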
can now be simply stated as:
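```python
# Sketch (assumed names): pass the dataframe slice (still containing NaNs) directly as `observed`
xk = pm.Normal("xk", mu=xk_mu, sigma=1.0, observed=dfx_train[FTS_XK], dims=("oid", "xk_nm"))
```

Putting this together, a minimal sketch of a full ModelA-style build might look as follows. The prior choices and the names `mdla`, `FTS_XJ`, `FTS_XK` are assumptions for illustration, not necessarily the notebook's exact code.

```python
import pymc as pm

FTS_XJ, FTS_XK = ["a", "b"], ["c", "d"]  # complete features vs features containing NaNs
COORDS = {"oid": dfx_train.index.values, "xj_nm": FTS_XJ, "xk_nm": FTS_XK}

with pm.Model(coords=COORDS) as mdla:
    xj = pm.Data("xj", dfx_train[FTS_XJ].values, dims=("oid", "xj_nm"))
    y = pm.Data("y", dfx_train["y"].values, dims="oid")

    # Hierarchical prior on the data for features k: "observing" data that contains
    # NaNs triggers pymc's auto-imputation of the missing elements
    xk_mu = pm.Normal("xk_mu", mu=0.0, sigma=1.0, dims="xk_nm")
    xk = pm.Normal("xk", mu=xk_mu, sigma=1.0, observed=dfx_train[FTS_XK], dims=("oid", "xk_nm"))

    # Linear sub-models and likelihood
    beta_sigma = pm.Gamma("beta_sigma", alpha=10, beta=10)
    beta_j = pm.Normal("beta_j", mu=0, sigma=beta_sigma, dims="xj_nm")
    beta_k = pm.Normal("beta_k", mu=0, sigma=beta_sigma, dims="xk_nm")
    epsilon = pm.Gamma("epsilon", alpha=10, beta=10)

    _ = pm.Normal(
        "yhat",
        mu=pm.math.dot(xj, beta_j) + pm.math.dot(xk, beta_k),
        sigma=epsilon,
        observed=y,
        dims="oid",
    )
```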
Observe:
+ The auto-imputation routine has created `xk_observed` (observed data) and `xk_unobserved` (unobserved free RV), which each have the same distribution as we specified for `xk`
+ `xk` is now represented by a new `Deterministic` node, a function of the two (concatenation and reshaping)
+ `xk_unobserved` has a new, flattened shape with length equal to the count of NaNs in the relevant features $k$, and it loses the dims that we assigned to `xk`. This is an unhelpful current limitation and means we have to do some indexing to recover the PPC values later
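A quick way to see this structure on the built model object (assuming it's named `mdla` as in the sketch above):

```python
# RVs created by the auto-imputation routine
print(mdla.free_RVs)        # includes the flattened xk_unobserved (plus xk_mu, beta_*, epsilon)
print(mdla.observed_RVs)    # includes xk_observed and yhat
print(mdla.deterministics)  # includes xk, reassembled with dims ("oid", "xk_nm")
pm.model_to_graphviz(mdla)
```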
Observe:
+ `beta_sigma`, `beta_j: (levels)`, `beta_k: (levels)`, `epsilon` all have reasonable prior ranges as specified

Observe:

+ `xk_mu` (levels) have reasonable prior ranges as specified

**Prior expectation of the auto-imputed values (`xk_unobserved`)**

Observe:

+ The priors for `xk_unobserved` are of course all the same, and in reasonable ranges as specified

**Sample posterior, auto-imputing the missing values in `c` and `d`**

Observe:
+ `ess_bulk` is good, `r_hat` is good
+ `E-BFMI` >> 0.3 (and is apparently reasonable); note the values are lower than for Model0. This is an effect of the missing data.

Observe:
+ `yhat` tracks the observed `y` moderately well: slightly overdispersed, perhaps a likelihood with fatter tails would be more appropriate (e.g. a StudentT)

Observe:

+ The LOO-PIT looks good, again slightly overdispersed but acceptable for use

Observe:
+ ModelA does of course suffer in comparison to Model0, which has the benefit of the complete dataset (no missing values), but it's not that much worse, and (of course) we've been able to handle missing values in the in-sample data!
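The comparison plot isn't reproduced here; a sketch of the kind of call involved, assuming both idatas contain a `log_likelihood` group and that the ModelA idata is named `ida` as in $\S 2.5$ below:

```python
# Compare in-sample ELPD (LOO) of the two models
comp = az.compare({"mdl0 (complete data)": id0, "mdla (auto-imputed)": ida})
az.plot_compare(comp);
```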
Observe:

+ `beta_sigma`, `beta_j: (levels)`, `beta_k: (levels)`, `epsilon` all have reasonable ranges as specified

**Forestplot of `beta_j` and `beta_k` levels to compare relative effects**

Observe:

+ The relative effects are comparable to those for Model0 in $\S1.4$ above

**View posterior of the hierarchical data prior `xk_mu` vs the reference values `REFVALS_X_MU`**

Observe:

+ The posteriors for `xk_mu` sit close to 0, which is reasonable because the missingness is at random

**View posterior of the auto-imputed values (`xk_unobserved`)**

Observe:
+ We get a full posterior distribution for each missing value in `xk_unobserved`, with quantified uncertainty

**Compare in-sample auto-imputed `xk_unobserved` to known true values**

NOTE:

+ This is only possible because we have the true values in `dfraw` above
+ We have to index into `xk`, because `xk_unobserved` doesn't have the right dims indexes

Plot posterior `xk_unobserved` vs known true values (only possible in this synthetic example)

Observe:
+ The posteriors for `xk_unobserved` broadly recover the known true values in `dfx_train`
+ Recall the missing values are in features `c` and `d` in `df_train`
+ Some observations (e.g. `o03`) have missing values in both `c` and `d`, others (e.g. `o04`) have only one
+ Where an observation has missing values in both features (`o00`, `o8`, `o10` are good examples), we see the possibility of a lack of identifiability: this is an interesting and not easily avoided side-effect of the data and model architecture, and in the real-world we might seek to mitigate through removing observations or features.

**Create PPC forecast on `dfx_holdout` set**

**IMPLEMENTATION NOTE**
The following process is a bit of a hack, forced by the realities noted above:

+ We have to build a new model object using the `dfx_holdout` dataset, because (as noted above in $\S 2.1$ Build Model Object) we can't put NaNs (or a `masked_array`) into a `pm.Data`. If we could do that then we could simply use a `pm.set_data()`, but we can't so we don't.
+ We have to sample the PPC for `xk_unobserved` first, because this is a precursor to computing `yhat`, and we can't specify a conditional order in `sample_posterior_predictive`
+ Only then can we sample the PPC for `yhat`
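In outline, the first of these steps might look like the following sketch (names are assumed; the notebook's actual idata manipulation is more involved, per the notes below):

```python
from copy import deepcopy

# Rebuild the same model specification on dfx_holdout, then sample the
# auto-imputed xk_unobserved conditional on the in-sample posterior
coords_h = {"oid": dfx_holdout.index.values, "xj_nm": FTS_XJ, "xk_nm": FTS_XK}

with pm.Model(coords=coords_h) as mdla_h:
    xk_mu = pm.Normal("xk_mu", mu=0.0, sigma=1.0, dims="xk_nm")
    xk = pm.Normal("xk", mu=xk_mu, sigma=1.0, observed=dfx_holdout[FTS_XK], dims=("oid", "xk_nm"))
    # ... remainder of the specification identical to mdla above (beta_*, epsilon, yhat) ...

ida_h = deepcopy(ida)  # work on a copy of the in-sample idata (the notebook also prunes groups)

with mdla_h:
    # Step 1: sample the auto-imputed xk_unobserved for the holdout rows
    ppc_xk = pm.sample_posterior_predictive(ida_h, var_names=["xk_unobserved"])
# Steps 2 & 3: insert these draws into ida_h, then sample the PPC for yhat (see below)
```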
**Rebuild the model object using the `dfx_holdout` dataset**

**Sample the PPC for the auto-imputed `xk_unobserved` in the out-of-sample dataset**

NOTE:

+ We won't modify the existing idata `ida`; instead we take a deepcopy `ida_h`, remove unnecessary groups, and use that
+ This is easier than creating a new empty `az.InferenceData` and then adding groups, because we would have to add all sorts of additional subtle info to the object. Easier to copy and remove groups
+ The `xarray` indexing in `posterior` will be wrong (set according to `dfx_train`, rather than `dfx_holdout`), specifically dimension `oid` and coordinate `oid`. Changing coords and dims on an `xarray.Dataset` is a total nightmare, so we won't attempt to change them. It won't matter in this case anyway, because we won't refer to the posterior.

**Insert the sampled `xk_unobserved` into the idata and sample the PPC for `yhat`**

Observe:
+ In `ida_h` we now have Datasets for `predictions` and `predictions_constant_data`, as if we'd done only one step
+ The interim model `mdla_h` can be safely removed

**Compare out-of-sample auto-imputed `xk_unobserved` to known true values**

NOTE:

+ This is only possible because we have the true values in `dfraw` above
+ We have to index into `xk`, because `xk_unobserved` doesn't have the right dims indexes

Extract `xk_unobserved` from the PPC idata, isolate via index, transform back into the data domain

Plot posterior `xk_unobserved` vs known true values (only possible in this synthetic example)

Observe:
+ The out-of-sample auto-imputed values for `xk_unobserved` have been pulled from the hierarchical `xk_mu` distributions and are not conditional on anything else, so we get pretty much all the same predicted value
+ Because there is no new information in the out-of-sample dataset (`dfx_holdout`), these auto-imputed values for `xk_unobserved` can't be any more informative than the posterior distribution of the hierarchical prior `xk_mu`.

**Compare forecasted `yhat` to known true values**

+ Extract `yhat` from the PPC idata, and attach the real values (only available because it's a holdout set)
+ Plot posterior `yhat` vs known true values `y` (only available because it's a holdout set)

Observe:
+ The forecasted `yhat` look pretty close to the true values `y`, usually well within the $HDI_{94}$
+ The full posterior distributions of `yhat` are useful too: quantifying the uncertainty in prediction and letting us make better decisions accordingly.

:::{bibliography}
:filter: docname in docnames
:::
:::{include} ../page_footer.md
:::