(GLM-truncated-censored-regression)=
:::{post} September, 2022
:tags: censored, generalized linear model, regression, truncated
:category: beginner
:author: Benjamin T. Vincent
:::
This notebook provides an example of how to conduct linear regression when your outcome variable is either censored or truncated.
Truncation and censoring are examples of missing data problems. It can sometimes be easy to muddle up truncation and censoring, so let's look at some definitions.

- Truncation occurs when observations whose outcome values fall outside a certain set of bounds never make it into the dataset at all; we are simply unaware of them.
- Censoring occurs when a measurement process has limited sensitivity: rather than being discarded, observations outside the bounds are recorded at the bound which they exceeded.
Let's further explore this with some code and plots. First we will generate some true (x, y) scatter data, where y is our outcome measure and x is some predictor variable.
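To make this concrete, here is a minimal sketch of a data generating process; the true slope of 2, intercept of 0, noise level σ of 2, and sample size are all illustrative assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(12345)


def linear_regression(x, slope=2, intercept=0, σ=2):
    """Generate noisy outcomes from a true linear model."""
    return rng.normal(loc=slope * x + intercept, scale=σ)


x = rng.uniform(low=-5, high=5, size=200)  # predictor variable
y = linear_regression(x)  # latent outcome, before truncation or censoring
```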
For this example of (x, y) scatter data, we can describe the truncation process as simply filtering out any data for which our outcome variable y falls outside of a set of bounds.
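As a sketch, the truncation process could be implemented as a simple filter (the helper name `truncate_y` is an illustrative choice):

```python
def truncate_y(x, y, bounds):
    """Discard any (x, y) pair whose y value falls outside the bounds."""
    keep = (y >= bounds[0]) & (y <= bounds[1])
    return (x[keep], y[keep])
```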
With censoring, however, we set the y values equal to the bounds that they exceed.
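A corresponding sketch of the censoring process, using `np.clip` to clamp values at the bounds:

```python
def censor_y(x, y, bounds):
    """Clamp y to the bounds; every x is retained."""
    return (x, np.clip(y, bounds[0], bounds[1]))
```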
Based on our generated (x, y) data (which an experimenter would never see in real life), we can generate our actual observed datasets for truncated data (xt, yt) and censored data (xc, yc).
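For example, assuming illustrative bounds of -5 and 5 and the helper functions sketched above:

```python
bounds = [-5, 5]  # assumed bounds, for illustration
xt, yt = truncate_y(x, y, bounds)
xc, yc = censor_y(x, y, bounds)
```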
We can visualise the latent data (in grey) alongside the observed truncated or censored data (in black) below.
If we were to run regular linear regression on either the truncated or censored data, it should be fairly intuitive to see that we will likely underestimate the slope. Truncated regression and censored regression (also known as Tobit regression) were designed to address these missing data problems, and hopefully result in regression slopes which are free from the bias introduced by truncation or censoring.
In this section we will run Bayesian linear regression on these datasets to see the extent of the problem. We start by writing a function which builds a PyMC model, conducts MCMC sampling, and returns the model and the MCMC sampling data.
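A minimal sketch of such a function might look like the following; the priors and the `pm.sample` defaults are illustrative assumptions rather than the only sensible choices.

```python
import pymc as pm


def linear_regression_model(x, y):
    """Plain linear regression: build the model and sample from it."""
    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0, sigma=1)
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        σ = pm.HalfNormal("σ", sigma=1)
        pm.Normal("obs", mu=slope * x + intercept, sigma=σ, observed=y)
        idata = pm.sample()
    return model, idata
```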
So we can run this on our truncated and our censored data, separately.
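For example, using the datasets and the `linear_regression_model` sketch from above:

```python
trunc_linear_model, trunc_linear_idata = linear_regression_model(xt, yt)
cens_linear_model, cens_linear_idata = linear_regression_model(xc, yc)
```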
By plotting the posterior distribution over the slope parameters we can see that the estimates for the slope are pretty far off, so we are indeed underestimating the regression slope.
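Such a plot could be produced along these lines with ArviZ, marking the true slope (2 in the generating sketch above) as a reference value:

```python
import arviz as az

# compare posterior slope estimates against the true slope of 2
az.plot_posterior(trunc_linear_idata, var_names=["slope"], ref_val=2)
az.plot_posterior(cens_linear_idata, var_names=["slope"], ref_val=2)
```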
To appreciate the extent of the problem (for this dataset) we can visualise the posterior predictive fits alongside the data.
By looking at these plots we can intuitively predict what factors will influence the extent of the underestimation bias. Firstly, if the truncation or censoring bounds are very broad, such that they only affect a few data points, then the underestimation bias will be smaller. Secondly, if the measurement error σ is low, we might expect the underestimation bias to decrease. In the limit of zero measurement noise, it should be possible to fully recover the true slope for truncated data, but there will always be some bias in the censored case. Regardless, it would be prudent to use truncated or censored regression models unless the measurement error is near zero, or the bounds are so broad as to be practically irrelevant.
We have now seen the problem with conducting regression on truncated or censored data: the regression slopes are underestimated. This is what truncated and censored regression models were designed to solve. The general approach taken by both is to encode our prior knowledge of the truncation or censoring steps in the data generating process. This is done by modifying the likelihood function in various ways.
Truncated regression models are quite simple to implement. The likelihood is centered on the regression line as before, but now we specify a normal distribution which is truncated at the bounds.
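A sketch of this model, using `pm.TruncatedNormal` and the same illustrative priors as before:

```python
def truncated_regression_model(x, y, bounds):
    """Linear regression with a truncated normal likelihood."""
    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0, sigma=1)
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        σ = pm.HalfNormal("σ", sigma=1)
        pm.TruncatedNormal(
            "obs",
            mu=slope * x + intercept,
            sigma=σ,
            lower=bounds[0],
            upper=bounds[1],
            observed=y,
        )
        idata = pm.sample()
    return model, idata
```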
Truncated regression solves the bias problem by updating the likelihood to reflect our knowledge about the process generating the observations. Namely, we have zero chance of observing any data outside of the truncation bounds, and so the likelihood should reflect this. We can visualise this in the plot below, where, compared to a normal distribution, the probability density of a truncated normal is zero outside of the truncation bounds ($y < -1$ in this case).
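This comparison can be sketched with `scipy.stats`, assuming a standard normal and a lower truncation bound at $-1$:

```python
import matplotlib.pyplot as plt
from scipy.stats import norm, truncnorm

loc, scale, lower = 0, 1, -1  # illustrative parameters
a = (lower - loc) / scale  # truncnorm expects bounds in standardised units
grid = np.linspace(-4, 4, 500)

plt.plot(grid, norm.pdf(grid, loc, scale), label="normal")
plt.plot(grid, truncnorm.pdf(grid, a, np.inf, loc, scale), label="truncated normal")
plt.legend()
```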
Thanks to the new {class}`~pymc.Censored` distribution it is really straightforward to write models with censored data. The only thing to remember is that the latent distribution being censored must be created with the `.dist` method, as in `pm.Normal.dist` in the sketch below.
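A sketch of such a censored regression model, again with illustrative priors:

```python
def censored_regression_model(x, y, bounds):
    """Linear regression with a censored normal likelihood."""
    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0, sigma=1)
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        σ = pm.HalfNormal("σ", sigma=1)
        # the latent distribution must be built with .dist, not as a model variable
        y_latent = pm.Normal.dist(mu=slope * x + intercept, sigma=σ)
        pm.Censored("obs", y_latent, lower=bounds[0], upper=bounds[1], observed=y)
        idata = pm.sample()
    return model, idata
```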
Behind the scenes, `pm.Censored` adjusts the likelihood function to take into account that:

- the probability of observing a value exactly at the lower bound equals the probability that the latent variable fell at or below it, i.e. the CDF of the latent distribution evaluated at the lower bound, and
- the probability of observing a value exactly at the upper bound equals the probability that the latent variable fell at or above it, i.e. 1 minus the CDF evaluated at the upper bound.

This is demonstrated visually in the plot below. Technically, the probability *density* at the bounds is infinite, because a finite probability mass is concentrated at a single point of zero width.
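The point masses at the bounds can also be checked numerically; here is a sketch with `scipy.stats`, assuming a standard normal latent distribution and illustrative censoring bounds at ±1:

```python
from scipy.stats import norm

lower, upper = -1, 1  # assumed censoring bounds, for illustration
latent = norm(0, 1)

# probability mass concentrated at each bound under censoring
p_at_lower = latent.cdf(lower)  # all of P(y <= lower) is recorded as y = lower
p_at_upper = 1 - latent.cdf(upper)  # all of P(y >= upper) is recorded as y = upper
print(p_at_lower, p_at_upper)
```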
Now we can conduct our parameter estimation with the truncated regression model on the truncated data...
and with the censored regression model on the censored data.
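Using the hypothetical helper functions sketched above, this amounts to:

```python
truncated_model, truncated_idata = truncated_regression_model(xt, yt, bounds)
censored_model, censored_idata = censored_regression_model(xc, yc, bounds)
```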
We can do the same as before and visualise our posterior estimates on the slope.
These are much better estimates. Interestingly, we can see that the estimate for censored regression is more precise than for truncated data. This will not necessarily always be the case, but the intuition here is that with truncation both the x and y data are discarded entirely, whereas with censoring the x data is retained and only the y values become partially known.
We might speculate, then, that if an experimenter had the choice of truncating or censoring data, it would be better to opt for censoring over truncation.
Correspondingly, we can confirm the models are good through visual inspection of the posterior predictive plots.
This brings an end to our guide on truncated and censored data and truncated and censored regression models in PyMC. While the extent of the regression slope estimation bias will vary with a number of factors discussed above, hopefully these examples have convinced you of the importance of encoding your knowledge of the data generating process into regression analyses.
It is also possible to treat the bounds as unknown latent parameters. If the bounds are not known exactly and it is possible to formulate a prior over them, then it would be possible to infer what the bounds are. This could be argued as overkill, however: depending on your data analysis context, it may be entirely sufficient to extract 'good enough' point estimates of the bounds in order to get reasonable regression estimates.
The censored regression model presented above takes one particular approach, and there are others. For example, it did not attempt to infer posterior beliefs over the true latent y values of the censored data. It is possible to build censored regression models which do impute these censored y values, but we did not address that here as the topic of imputation deserves its own focused treatment. The PyMC {ref}`censored_data` example also covers this topic, with a particular {ref}`example model to impute censored data <censored_data/model1>`.
When looking into this topic, I found that most of the material out there focuses on maximum likelihood estimation, with an emphasis on mathematical derivation rather than practical implementation. One good, concise mathematical booklet of around 80 pages by {cite:t}`breen1996regression` covers truncated and censored data as well as other missing data scenarios. That said, a few pages are given over to this topic in Bayesian Data Analysis by {cite:t}`gelman2013bayesian`, and in {cite:t}`gelman2020regression`.
:::{bibliography}
:filter: docname in docnames
:::
:::{include} ../page_footer.md
:::