(censored_data)=
:::{post} May, 2022
:tags: censored, survival analysis
:category: intermediate, how-to
:author: Luis Mario Domenzain
:::
This example notebook on Bayesian survival analysis touches on the point of censored data. Censoring is a form of missing-data problem, in which observations greater than a certain threshold are clipped down to that threshold, or observations less than a certain threshold are clipped up to that threshold, or both. These are called right, left and interval censoring, respectively. In this example notebook we consider interval censoring.
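The three kinds of censoring described above can be illustrated on a handful of numbers. This is a minimal sketch: the array `readings` and the bounds `low` and `high` are hypothetical values chosen for illustration, not data from this notebook.

```python
import numpy as np

# Hypothetical measurements and censoring bounds, for illustration only.
readings = np.array([1.0, 5.0, 12.0, 18.0, 25.0])
low, high = 3.0, 16.0

# Right censoring: values above the upper threshold are clipped down to it.
right_censored = np.minimum(readings, high)

# Left censoring: values below the lower threshold are clipped up to it.
left_censored = np.maximum(readings, low)

# Interval censoring (in the sense used here): both bounds apply at once.
interval_censored = np.clip(readings, low, high)

print(interval_censored.tolist())  # [3.0, 5.0, 12.0, 16.0, 16.0]
```

Note that after clipping, a recorded value of `16.0` only tells us the true value was at least `16.0`.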
Censored data arises in many modelling problems. Two common examples are:
- *Survival analysis:* when studying the effect of a certain medical treatment on survival times, it is impossible to prolong the study until all subjects have died. At the end of the study, the only data collected for many patients is that they were still alive for a time period $T$ after the treatment was administered: in reality, their true survival times are greater than $T$.
- *Sensor saturation:* a sensor might have a limited range and the upper and lower limits would simply be the highest and lowest values a sensor can report. For instance, many mercury thermometers only report a very narrow range of temperatures.
This example notebook presents two different ways of dealing with censored data in PyMC:

1. An imputed censored model, which represents each censored observation as a parameter and imputes a plausible value for it. As a result, this model can generate plausible data sets of the values that were censored. Each censored element introduces a random variable.
2. An unimputed censored model, where the censored data are integrated out and accounted for only through the log-likelihood. This method copes better with large amounts of censored data and converges more quickly.
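For reference, the log-likelihood used by the unimputed approach can be written out explicitly. The expression below is a sketch that assumes a normal likelihood with mean $\mu$ and standard deviation $\sigma$; the symbols $l$ and $u$ (lower and upper censoring bounds), $n_l$ and $n_u$ (numbers of left- and right-censored observations) and $\Phi$ (the standard normal CDF) are introduced here for illustration.

$$
\log p(y \mid \mu, \sigma) =
\sum_{i:\, l < y_i < u} \log \mathcal{N}(y_i \mid \mu, \sigma)
+ n_l \log \Phi\!\left(\frac{l - \mu}{\sigma}\right)
+ n_u \log\!\left[1 - \Phi\!\left(\frac{u - \mu}{\sigma}\right)\right]
$$

Each censored observation contributes only the probability mass beyond its bound, so no extra parameters are needed, however many observations are censored.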
To establish a baseline we compare to an uncensored model of the uncensored data.
We expect that running the uncensored model on the uncensored data will give reasonable estimates of the mean and variance.
And that is exactly what we find.
The problem, however, is that with censored data we do not have access to the true values. If we were to use the same uncensored model on the censored data, we would expect our parameter estimates to be biased. If we calculate point estimates of the mean and standard deviation, we can see that we are likely to underestimate both for this particular dataset and these censoring bounds.
The figure above confirms this.
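The direction of this bias can also be checked numerically without fitting any model. The sketch below uses simulated data; the true mean, standard deviation, bounds and sample size are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical ground truth and censoring bounds, for illustration only.
rng = np.random.default_rng(1234)
low, high = 3.0, 16.0
true_values = rng.normal(loc=13.0, scale=5.0, size=1500)

# Censor the sample by clipping it to [low, high].
censored = np.clip(true_values, low, high)

print(f"uncensored: mean={true_values.mean():.2f}, std={true_values.std():.2f}")
print(f"censored:   mean={censored.mean():.2f}, std={censored.std():.2f}")
```

With these settings the upper bound lies much closer to the mean than the lower bound, so far more mass is clipped from the right tail and the naive mean is pulled downwards. Clipping also always shrinks the spread.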
The models below show two approaches to dealing with censored data. First, we need to do a bit of data pre-processing to count the number of observations that are left or right censored. We also need to extract just the non-censored data that we observe.
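The pre-processing step can be sketched as follows. The helper `split_censored` is a hypothetical name, and the sketch assumes that any value sitting exactly on a bound is a censored one.

```python
import numpy as np


def split_censored(data, low, high):
    """Count censored values and return the observations strictly inside the bounds."""
    data = np.asarray(data, dtype=float)
    n_left = int(np.sum(data <= low))    # clipped up to the lower bound
    n_right = int(np.sum(data >= high))  # clipped down to the upper bound
    uncensored = data[(data > low) & (data < high)]
    return n_left, n_right, uncensored


# Small hypothetical censored sample.
censored = np.array([3.0, 5.0, 12.0, 16.0, 16.0])
n_left_censored, n_right_censored, uncensored = split_censored(censored, low=3.0, high=16.0)

print(n_left_censored, n_right_censored, uncensored.tolist())  # 1 2 [5.0, 12.0]
```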
(censored_data/model1)=
In this model, we impute the censored values from the same distribution as the uncensored data. Sampling from the posterior generates possible uncensored data sets.
We can see that the bias in the estimates of the mean and variance (present in the uncensored model) has been largely removed.
Here we can make use of `pm.Censored`.
Sampling
Again, the bias in the estimates of the mean and variance (present in the uncensored model) has been largely removed.
As we can see, both censored models appear to capture the mean and variance of the underlying distribution as well as the uncensored model does! In addition, the imputed censored model is capable of generating data sets of censored values (sample from the posteriors of `left_censored` and `right_censored` to generate them), while the unimputed censored model scales much better with more censored data and converges faster.
:::{include} ../page_footer.md
:::