(GLM-truncated-censored-regression)=
:::{post} September, 2022
:tags: censored, generalized linear model, regression, truncated
:category: beginner
:author: Benjamin T. Vincent
:::
This notebook provides an example of how to conduct linear regression when your outcome variable is either censored or truncated.
Truncation and censoring are examples of missing data problems. It can sometimes be easy to muddle up truncation and censoring, so let's look at some definitions.

- Truncation occurs when observations whose outcome values fall outside a certain set of bounds never make it into the dataset at all; we are simply unaware of them.
- Censoring occurs when a measurement process has limited sensitivity: rather than being discarded, observations outside the bounds are recorded at the bound which they exceeded.
Let's further explore this with some code and plots. First we will generate some true (x, y) scatter data, where y is our outcome measure and x is some predictor variable.
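To make this concrete, here is a minimal sketch of a data generating process; the true slope of 2, intercept of 0, noise level σ of 2, and sample size are all illustrative assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(12345)


def linear_regression(x, slope=2, intercept=0, σ=2):
    """Generate noisy outcomes from a true linear model."""
    return rng.normal(loc=slope * x + intercept, scale=σ)


x = rng.uniform(low=-5, high=5, size=200)  # predictor variable
y = linear_regression(x)  # latent outcome, before truncation or censoring
```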
For this example of (x, y) scatter data, we can describe the truncation process as simply filtering out any data for which our outcome variable y falls outside of a set of bounds.
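As a sketch, the truncation process could be implemented as a simple filter (the helper name `truncate_y` is an illustrative choice):

```python
def truncate_y(x, y, bounds):
    """Discard any (x, y) pair whose y value falls outside the bounds."""
    keep = (y >= bounds[0]) & (y <= bounds[1])
    return (x[keep], y[keep])
```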
With censoring, however, we set the y values equal to the bounds that they exceed.
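A corresponding sketch of the censoring process, using `np.clip` to clamp values at the bounds:

```python
def censor_y(x, y, bounds):
    """Clamp y to the bounds; every x is retained."""
    return (x, np.clip(y, bounds[0], bounds[1]))
```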
Based on our generated (x, y) data (which an experimenter would never see in real life), we can generate our actual observed datasets for truncated data (xt, yt) and censored data (xc, yc).
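For example, assuming illustrative bounds of -5 and 5 and the helper functions sketched above:

```python
bounds = [-5, 5]  # assumed bounds, for illustration
xt, yt = truncate_y(x, y, bounds)
xc, yc = censor_y(x, y, bounds)
```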
We can visualise the latent data (in grey) alongside the observed truncated or censored data (in black) below.
If we were to run regular linear regression on either the truncated or censored data, it should be fairly intuitive to see that we will likely underestimate the slope. Truncated regression and censored regression (also known as Tobit regression) were designed to address these missing data problems, and hopefully result in regression slopes which are free from the bias introduced by truncation or censoring.
In this section we will run Bayesian linear regression on these datasets to see the extent of the problem. We start by writing a function which builds a PyMC model, conducts MCMC sampling, and returns the model and the MCMC sampling data.
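A minimal sketch of such a function might look like the following; the priors and the `pm.sample` defaults are illustrative assumptions rather than the only sensible choices.

```python
import pymc as pm


def linear_regression_model(x, y):
    """Plain linear regression: build the model and sample from it."""
    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0, sigma=1)
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        σ = pm.HalfNormal("σ", sigma=1)
        pm.Normal("obs", mu=slope * x + intercept, sigma=σ, observed=y)
        idata = pm.sample()
    return model, idata
```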
So we can run this on our truncated and our censored data, separately.
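For example, using the datasets and the `linear_regression_model` sketch from above:

```python
trunc_linear_model, trunc_linear_idata = linear_regression_model(xt, yt)
cens_linear_model, cens_linear_idata = linear_regression_model(xc, yc)
```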
By plotting the posterior distribution over the slope parameters we can see that the estimates for the slope are pretty far off, so we are indeed underestimating the regression slope.
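Such a plot could be produced along these lines with ArviZ, marking the true slope (2 in the generating sketch above) as a reference value:

```python
import arviz as az

# compare posterior slope estimates against the true slope of 2
az.plot_posterior(trunc_linear_idata, var_names=["slope"], ref_val=2)
az.plot_posterior(cens_linear_idata, var_names=["slope"], ref_val=2)
```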
To appreciate the extent of the problem (for this dataset) we can visualise the posterior predictive fits alongside the data.
By looking at these plots we can intuitively predict what factors will influence the extent of the underestimation bias. Firstly, if the truncation or censoring bounds are very broad, such that they only affect a few data points, then the underestimation bias will be smaller. Secondly, if the measurement error σ is low, we might expect the underestimation bias to decrease. In the limit of zero measurement noise, it should be possible to fully recover the true slope for truncated data, but there will always be some bias in the censored case. Regardless, it would be prudent to use truncated or censored regression models unless the measurement error is near zero, or the bounds are so broad as to be practically irrelevant.
We have now seen the problem with conducting regression on truncated or censored data: the regression slopes are underestimated. This is what truncated and censored regression models were designed to solve. The general approach taken by both is to encode our prior knowledge of the truncation or censoring steps in the data generating process. This is done by modifying the likelihood function in various ways.
Truncated regression models are quite simple to implement. The likelihood is centered on the regression line as before, but now we specify a normal distribution which is truncated at the bounds.
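A sketch of this model, using `pm.TruncatedNormal` and the same illustrative priors as before:

```python
def truncated_regression_model(x, y, bounds):
    """Linear regression with a truncated normal likelihood."""
    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0, sigma=1)
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        σ = pm.HalfNormal("σ", sigma=1)
        pm.TruncatedNormal(
            "obs",
            mu=slope * x + intercept,
            sigma=σ,
            lower=bounds[0],
            upper=bounds[1],
            observed=y,
        )
        idata = pm.sample()
    return model, idata
```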
Truncated regression solves the bias problem by updating the likelihood to reflect our knowledge about the process generating the observations. Namely, we have zero chance of observing any data outside of the truncation bounds, and so the likelihood should reflect this. We can visualise this in the plot below, where, compared to a normal distribution, the probability density of a truncated normal is zero outside of the truncation bounds ($y < -1$ in this case).
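This comparison can be sketched with `scipy.stats`, assuming a standard normal and a lower truncation bound at $-1$:

```python
import matplotlib.pyplot as plt
from scipy.stats import norm, truncnorm

loc, scale, lower = 0, 1, -1  # illustrative parameters
a = (lower - loc) / scale  # truncnorm expects bounds in standardised units
grid = np.linspace(-4, 4, 500)

plt.plot(grid, norm.pdf(grid, loc, scale), label="normal")
plt.plot(grid, truncnorm.pdf(grid, a, np.inf, loc, scale), label="truncated normal")
plt.legend()
```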
Thanks to the new {class}`~pymc.Censored` distribution it is really straightforward to write models with censored data. The only thing to remember is that the latent distribution being censored must be created with the `.dist` method, as in `pm.Normal.dist` in the sketch below.
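A sketch of such a censored regression model, again with illustrative priors:

```python
def censored_regression_model(x, y, bounds):
    """Linear regression with a censored normal likelihood."""
    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0, sigma=1)
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        σ = pm.HalfNormal("σ", sigma=1)
        # the latent distribution must be built with .dist, not as a model variable
        y_latent = pm.Normal.dist(mu=slope * x + intercept, sigma=σ)
        pm.Censored("obs", y_latent, lower=bounds[0], upper=bounds[1], observed=y)
        idata = pm.sample()
    return model, idata
```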
Behind the scenes, `pm.Censored` adjusts the likelihood function to take into account that:

- the probability of observing a value exactly at the lower bound equals the probability that the latent variable fell at or below it, i.e. the CDF of the latent distribution evaluated at the lower bound, and
- the probability of observing a value exactly at the upper bound equals the probability that the latent variable fell at or above it, i.e. 1 minus the CDF evaluated at the upper bound.

This is demonstrated visually in the plot below. Technically, the probability *density* at the bounds is infinite, because a finite probability mass is concentrated at a single point of zero width.
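The point masses at the bounds can also be checked numerically; here is a sketch with `scipy.stats`, assuming a standard normal latent distribution and illustrative censoring bounds at ±1:

```python
from scipy.stats import norm

lower, upper = -1, 1  # assumed censoring bounds, for illustration
latent = norm(0, 1)

# probability mass concentrated at each bound under censoring
p_at_lower = latent.cdf(lower)  # all of P(y <= lower) is recorded as y = lower
p_at_upper = 1 - latent.cdf(upper)  # all of P(y >= upper) is recorded as y = upper
print(p_at_lower, p_at_upper)
```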
Now we can conduct our parameter estimation with the truncated regression model on the truncated data...
and with the censored regression model on the censored data.
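Using the hypothetical helper functions sketched above, this amounts to:

```python
truncated_model, truncated_idata = truncated_regression_model(xt, yt, bounds)
censored_model, censored_idata = censored_regression_model(xc, yc, bounds)
```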
We can do the same as before and visualise our posterior estimates on the slope.
These are much better estimates. Interestingly, we can see that the estimate for censored regression is more precise than for truncated data. This will not necessarily always be the case, but the intuition here is that with truncation both the x and y data are discarded entirely, whereas with censoring the x data is retained and only the y values become partially known.
We might speculate, then, that if an experimenter had the choice of truncating or censoring data, it would be better to opt for censoring over truncation.
Correspondingly, we can confirm the models are good through visual inspection of the posterior predictive plots.
This brings an end to our guide on truncated and censored data and truncated and censored regression models in PyMC. While the extent of the regression slope estimation bias will vary with a number of factors discussed above, hopefully these examples have convinced you of the importance of encoding your knowledge of the data generating process into regression analyses.
It is also possible to treat the bounds as unknown latent parameters. If the bounds are not known exactly and it is possible to formulate a prior over them, then it would be possible to infer what the bounds are. This could be argued as overkill, however: depending on your data analysis context, it may be entirely sufficient to extract 'good enough' point estimates of the bounds in order to get reasonable regression estimates.
The censored regression model presented above takes one particular approach, and there are others. For example, it did not attempt to infer posterior beliefs over the true latent y values of the censored data. It is possible to build censored regression models which do impute these censored y values, but we did not address that here as the topic of imputation deserves its own focused treatment. The PyMC {ref}`censored_data` example also covers this topic, with a particular {ref}`example model to impute censored data <censored_data/model1>`.
When looking into this topic, I found that most of the material out there focuses on maximum likelihood estimation, with an emphasis on mathematical derivation rather than practical implementation. One good, concise mathematical booklet of around 80 pages by {cite:t}`breen1996regression` covers truncated and censored data as well as other missing data scenarios. That said, a few pages are given over to this topic in Bayesian Data Analysis by {cite:t}`gelman2013bayesian`, and in {cite:t}`gelman2020regression`.
:::{bibliography}
:filter: docname in docnames
:::
:::{include} ../page_footer.md
:::