(excess_deaths)=
:::{post} July, 2022 :tags: counterfactuals, causal inference, time series, case study, Bayesian workflow, forecasting, causal impact, regression, posterior predictive :category: intermediate :author: Benjamin T. Vincent :::
Causal reasoning and counterfactual thinking are really interesting but complex topics! Nevertheless, we can make headway into understanding the ideas through relatively simple examples. This notebook focuses on the concepts and the practical implementation of Bayesian causal reasoning using PyMC.
We do this using the sobering but important example of calculating excess deaths due to COVID-19. As such, the ideas in this notebook strongly overlap with Google's CausalImpact (see {cite:t}google_causal_impact2015). Practically, we will try to estimate the number of 'excess deaths' since the onset of COVID-19, using data from England and Wales. Excess deaths are defined as:
Making a claim about excess deaths requires causal/counterfactual reasoning. While the reported number of deaths is nothing but a (maybe noisy and/or lagged) measure of a real observable fact in the world, expected deaths is unmeasurable because these are never realised in our timeline. That is, the expected deaths is a counterfactual thought experiment where we can ask "What would/will happen if?"
How do we go about this, practically? We will follow this strategy:
pre and post covid datasets. This is an important step. We want to come up with a model based upon what we know before COVID-19 so that we can construct our counterfactual predictions based on data before COVID-19 had any impact.pre dataset.We could take many different approaches to the modelling. Because we are dealing with time series data, then it would be very sensible to use a time series modelling approach. For example, Google's CausalImpact uses a Bayesian structural time-series model, but there are many alternative time series models we could choose.
But because the focus of this case study is on the counterfactual reasoning rather than the specifics of time-series modelling, I chose the simpler approach of linear regression for time-series model (see {cite:t}martin2021bayesian for more on this).
Readers should be aware that there are of course limits to the causal claims we can make here. If we were dealing with a marketing example where we ran a promotion for a period of time and wanted to make inferences about excess sales, then we could only make strong causal claims if we had done our due diligence in accounting for other factors which may have also taken place during our promotion period.
Similarly, there are many other things that changed in the UK since January 2020 (the well documented time of the first COVID-19 cases) in England and Wales. So if we wanted to be rock solid then we should account for other feasibly relevant factors.
Finally, we are not claiming that $x$ people died directly from the COVID-19 virus. The beauty of the concept of excess deaths is that it captures deaths from all causes that are in excess of what we would expect. As such, it covers not only those who died directly from the COVID-19 virus, but also from all downstream effects of the virus and availability of care, for example.
Now let's define some helper functions
For our purposes we will obtain number of deaths (per month) reported in England and Wales. This data is available from the Office of National Statistics dataset Deaths registered monthly in England and Wales. I manually downloaded this data for the years 2006-2022 and aggregated it into a single .csv file. I also added the average UK monthly temperature data as a predictor, obtained from the average UK temperature from the Met Office dataset.
Plotting the time series shows that there is clear seasonality in the number of deaths, and we can also take a guess that there may be an increase in the average number of deaths per year.
Let's take a closer look at the seasonal pattern (just of the pre-covid data) by plotting deaths as a function of month, and we will color code the year. This confirms our suspicion of a seasonal trend in numbers of deaths with there being more deaths in the winter season than the summer. We can also see a large number of deaths in January, followed by a slight dip in February which bounces back in March. This could be due to a combination of:
push-back of deaths that actually occurred in December being registered in Januarypull-forward where many of the vulnerable people who would have died in February ended up dying in January, potentially due to the cold conditions.The colour coding supports our suspicion that there is a positive main effect of year - that the baseline number of deaths per year is increasing.
Let's look at that more closely by plotting the total deaths over time, pre COVID-19. While there is some variability here, it seems like adding a linear trend as a predictor will capture some of the variance in reported deaths, and therefore make for a better model of reported deaths.
Looking at the pre data alone, there is a clear negative relationship between monthly average temperature and the number of deaths. Over a wider range of temperatures it is clear that this deaths will have a U-shaped relationship with temperature. But the climate in England and Wales, we only see the lower side of this curve. Despite that, the relationship could plausibly be approximately quadratic, but for our purposes a linear relationship seems like a reasonable place to start.
Let's examine the slope of this relationship, which will be useful in defining a prior for a temperature coefficient in our model.
Based on this, if we focus only on the relationship between temperature and deaths, we expect there to be 764 fewer deaths for every $1^\circ C$ increase in average monthly temperature. So we can use this figure when it comes to defining a prior over the coefficient for the temperature effect.
We are going to estimate reported deaths over time with an intercept, a linear trend, seasonal deflections (for each month), and average monthly temperature. So this is a pretty straightforward linear model. The only thing of note is that we transform the normally distributed monthly deflections to have a mean of zero in order to reduce the degrees of freedom of the model by one, which should help with parameter identifiability.
As part of the Bayesian workflow, we will plot our prior predictions to see what outcomes the model finds before having observed any data.
This seems reasonable:
We can look at this in more detail with the Arviz prior predictive check (ppc) plot. Again we see that the distribution of the observations is centered on the actual observations but has more spread. This is useful as we know the priors are not too restrictive and are unlikely to systematically influence our posterior predictions upwards or downwards.
Draw samples for the posterior distribution, and remember we are doing this for the pre COVID-19 data only.
Let's also look at the posterior estimates of the monthly deflections, in a different way to focus on the seasonal effect.
Another important aspect of the Bayesian workflow is to plot the model's posterior predictions, allowing us to see how well the model can retrodict the already observed data. It is at this point that we can decide whether the model is too simple (then we'd build more complexity into the model) or if it's fine.
Let's do another check now, but focussing on the seasonal effect. We will replicate the plot that we had above of deaths as a function of month of the year. And in order to keep the plot from being a complete mess, we will just plot the posterior mean. As such this is not a posterior predictive check, but a check of the posterior.
The model is doing a pretty good job of capturing the properties of the data. On the right, we can clearly see the main effects of month and year. However, we can see that there is something interesting happening in the data (left) in January which the model is not capturing. This might be able to be captured in the model by adding an interaction between month and year, but this is left as an exercise for the reader.
This step is not strictly necessary, but we can apply the excess deaths formula to the models' retrodictions for the pre period. This is useful because we can examine how good the model is.
We can see that we have a few spikes here where the number of excess deaths is plausibly greater than zero. Such occasions are above and beyond what we could expect from: a) seasonal effects, b) the linearly increasing trend, c) the effect of cold winters.
If we were interested, then we could start to generate hypotheses about what additional predictors may account for this. Some ideas could include the prevalence of the common cold, or minimum monthly temperatures which could add extra predictive information not captured by the mean.
We can also see that there is some additional temporal trend that the model is not quite capturing. There is some systematic low-frequency drift from the posterior mean from zero. That is, there is additional variance in the data that our predictors are not quite capturing which could potentially be caused by changes in the size of vulnerable cohorts over time.
But we are close to our objective of calculating excess deaths during the COVID-19 period, so we will move on as the primary purpose here is on counterfactual thinking, not on building the most comprehensive model of reported deaths ever.
Now we will use our model to predict the reported deaths in the 'what if?' scenario of business as usual.
So we update the model with the month and time (t) and temp data from the post dataframe and run posterior predictive sampling to predict the number of reported deaths we would observe in this counterfactual scenario. We could also call this 'forecasting'.
We now have the ingredients needed to calculate excess deaths. Namely, the reported number of deaths, and the Bayesian counterfactual prediction of how many would have died if nothing had changed from the pre to post COVID-19 era.
Now we'll use the predicted number of deaths under the counterfactual scenario and compare that to the reported number of deaths to come up with our counterfactual estimate of excess deaths.
And we can easily compute the cumulative excess deaths
And there we have it - we've done some Bayesian counterfactual inference in PyMC! In just a few steps we've:
The bad news of course, is that as of the last data point (May 2022) the number of excess deaths in England and Wales has started to rise again.
:::{bibliography} :filter: docname in docnames :::
:::{include} ../page_footer.md :::