(GLM-hierarchical)=
(multilevel_modeling)=

:::{post} 24 October, 2022
:tags: hierarchical model, case study, generalized linear model
:category: intermediate
:author: Chris Fonnesbeck, Colin Carroll, Alex Andorra, Oriol Abril, Farhan Reynaldo
:::
Hierarchical or multilevel modeling is a generalization of regression modeling.
Multilevel models are regression models in which the constituent model parameters are given probability models. This implies that model parameters are allowed to vary by group.
Observational units are often naturally clustered. Clustering induces dependence between observations, despite random sampling of clusters and random sampling within clusters.
A hierarchical model is a particular multilevel model where parameters are nested within one another.
Some multilevel structures are not hierarchical.
We will motivate this topic using an environmental epidemiology example.
As described by {cite:t}`gelman2006data`, radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.

The EPA did a study of radon levels in 80,000 houses. There are two important predictors:

- measurement in the basement or on the first floor (radon is higher in basements)
- county uranium level (positively correlated with radon levels)
We will focus on modeling radon levels in Minnesota.
The hierarchy in this example is households within county.
First, we import the data from a local file, and extract Minnesota's data.
The original data exists as several independent datasets, which we will import, merge, and process here. First is the data on measurements from individual homes from across the United States. We will extract just the subset from Minnesota.
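A minimal sketch of this step (the file name `srrs2.dat` and its `state` and `activity` columns follow the version of the data shipped with the published notebook; adjust paths to your local copy):

```python
import numpy as np
import pandas as pd

# Home-level radon measurements for the whole US
srrs2 = pd.read_csv("srrs2.dat")
srrs2.columns = srrs2.columns.map(str.strip)

# Keep only Minnesota, and work with radon on the log scale
# (0.1 is added to avoid log(0) for homes with no measurable radon)
srrs_mn = srrs2[srrs2.state == "MN"].copy()
srrs_mn["log_radon"] = np.log(srrs_mn.activity + 0.1)
```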
Next, obtain the county-level predictor, uranium, by combining two variables.
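The two variables in question are the state and county FIPS codes, which together identify a county uniquely. A sketch, assuming the county-level file `cty.dat` with a `Uppm` (uranium concentration) column:

```python
# County-level data, including the soil uranium measurement
cty = pd.read_csv("cty.dat")
cty_mn = cty[cty.st == "MN"].copy()

# Combine state and county FIPS codes into a single county identifier
srrs_mn["fips"] = srrs_mn.stfips * 1000 + srrs_mn.cntyfips
cty_mn["fips"] = cty_mn.stfips * 1000 + cty_mn.ctfips
```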
Use the merge method to combine home- and county-level information in a single DataFrame.
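For example (column names as above; `idnum` is assumed to uniquely identify a home):

```python
# Attach the county-level uranium reading to each home measurement
srrs_mn = srrs_mn.merge(cty_mn[["fips", "Uppm"]], on="fips")
srrs_mn = srrs_mn.drop_duplicates(subset="idnum")
```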
Let's encode the county names and make local copies of the variables we will use.
We also need a lookup table (dict) for each unique county, for indexing.
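Along these lines (the names `county`, `floor_measure`, `log_radon`, and `county_lookup` are our choices, and the model sketches below assume them):

```python
# Strip whitespace from county names and build an integer encoding
mn_counties = srrs_mn.county.str.strip().unique()
county_lookup = dict(zip(mn_counties, range(len(mn_counties))))

# Local copies of the variables used throughout the models below
county = srrs_mn["county_code"] = srrs_mn.county.str.strip().map(county_lookup).values
floor_measure = srrs_mn.floor.values
log_radon = srrs_mn.log_radon.values
```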
Distribution of radon levels in MN (log scale):
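For instance:

```python
import matplotlib.pyplot as plt

srrs_mn.log_radon.hist(bins=25)
plt.xlabel("log radon")
plt.ylabel("frequency")
```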
The two conventional alternatives to modeling radon exposure represent the two extremes of the bias-variance tradeoff:
**Complete pooling:**

Treat all counties the same, and estimate a single radon level:

$$y_i = \alpha + \beta x_i + \epsilon_i$$

**No pooling:**

Model radon in each county independently:

$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$

where $j = 1,\ldots,85$ indexes the counties.
The errors $\epsilon_i$ may represent measurement error, temporal within-house variation, or variation among houses.
Here are the point estimates of the slope and intercept for the complete pooling model:
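A sketch of the pooled model, using `pm.Data` (called `pm.MutableData` in some PyMC versions); the weakly informative priors here are illustrative choices, not prescriptions:

```python
import pymc as pm

coords = {"obs_id": np.arange(len(log_radon))}

with pm.Model(coords=coords) as pooled_model:
    # Shared, mutable container for the floor indicator (see note below)
    floor_ind = pm.Data("floor_ind", floor_measure, dims="obs_id")

    # Common intercept (basement level) and floor effect for all of Minnesota
    alpha = pm.Normal("alpha", 0.0, sigma=10.0)
    beta = pm.Normal("beta", 0.0, sigma=10.0)
    sigma = pm.Exponential("sigma", 5.0)

    theta = alpha + beta * floor_ind
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")
```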
You may be wondering why we are using the pm.Data container above, even though the variable floor_ind is neither an observed variable nor a parameter of the model. As you'll see, this will make our lives much easier when we plot and diagnose our model. ArviZ will include floor_ind as a variable in the constant_data group of the resulting {ref}`InferenceData <xarray_for_arviz>` object. Moreover, including floor_ind in the InferenceData object makes sharing and reproducing the analysis much easier: all the data needed to analyze or rerun the model is stored there.
Before running the model let's do some prior predictive checks.
Indeed, having sensible priors is not only a way to incorporate scientific knowledge into the model, it can also make the MCMC machinery faster -- here we are dealing with a simple linear regression, so no link function comes and distorts the outcome space; but one day this will happen to you, and you'll need to think hard about your priors to help your MCMC sampler. So, better to train ourselves while it's quite easy than to have to learn when it's very hard.
There is a convenient function for prior predictive sampling in PyMC:
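For example:

```python
with pooled_model:
    prior_checks = pm.sample_prior_predictive(random_seed=42)
```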
ArviZ InferenceData uses xarray.Datasets under the hood, which gives access to several common plotting functions via .plot. In this case, we want a scatter plot of the mean log radon level (which is stored in the variable a) for each of the two levels we are considering. If our desired plot is supported by xarray's plotting capabilities, we can take advantage of xarray to automatically generate both the plot and its labels for us. Notice how everything is directly plotted and annotated; the only change we need to make is renaming the y-axis label from a to Mean log radon level.
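A sketch of that workflow (`a` is simply the name we give to the concatenated array of prior mean levels):

```python
import xarray as xr

prior = prior_checks.prior.squeeze(drop=True)

# Stack the prior mean log radon levels for basements (alpha) and
# floors (alpha + beta) into one array named "a", then let xarray plot it
a = (
    xr.concat((prior["alpha"], prior["alpha"] + prior["beta"]), dim="location")
    .rename("a")
    .assign_coords(location=["basement", "floor"])
)
a.to_dataset().plot.scatter(x="location", y="a", color="k", alpha=0.2)
plt.ylabel("Mean log radon level")
```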
I'm no radon expert, but before seeing the data, these priors seem to allow for quite a wide range of the mean log radon level, whether measured in a basement or on the first floor. But don't worry, we can always change these priors if sampling gives us hints that they might not be appropriate -- after all, priors are assumptions, not oaths; and as with most assumptions, they can be tested.
However, we can already think of an improvement: Remember that we stated radon levels tend to be higher in basements, so we could incorporate this prior scientific knowledge into our model by forcing the floor effect (beta) to be negative. For now, we will leave the model as is, and trust that the information in the data will be sufficient.
Speaking of sampling, let's fire up the Bayesian machinery!
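For example:

```python
import arviz as az

with pooled_model:
    pooled_trace = pm.sample(random_seed=42)

az.summary(pooled_trace, round_to=2)
```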
No divergences, and sampling took only seconds! The chains look very good (good R-hat, good effective sample size, small sd). The model also estimated a negative floor effect, as we expected.
Let's plot the expected radon levels in basements (alpha) and on floors (alpha + beta) in relation to the data used to fit the model:
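A sketch of such a plot, reusing the posterior we just sampled:

```python
# Posterior-mean regression line over the raw observations
post_mean = pooled_trace.posterior.mean(dim=("chain", "draw"))

plt.scatter(srrs_mn.floor, srrs_mn.log_radon, alpha=0.4)
xvals = np.linspace(-0.2, 1.2)
plt.plot(xvals, post_mean["alpha"].values + post_mean["beta"].values * xvals, "r--")
plt.xticks([0, 1], ["basement", "floor"])
plt.ylabel("log radon")
```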
This looks reasonable, though notice that there is a great deal of residual variability in the data.
Let's now turn our attention to the unpooled model, and see how it fares in comparison.
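One way to specify it is with a separate mean per county and measurement level, so that nothing is shared across counties (a sketch, with illustrative priors):

```python
coords = {
    "county": mn_counties,
    "level": ["basement", "floor"],
    "obs_id": np.arange(len(log_radon)),
}

with pm.Model(coords=coords) as unpooled_model:
    # One independent mean per (county, level) combination
    alpha = pm.Normal("alpha", 0.0, sigma=10.0, dims=("county", "level"))
    sigma = pm.Exponential("sigma", 1.0)

    y = pm.Normal(
        "y",
        alpha[county, floor_measure],
        sigma=sigma,
        observed=log_radon,
        dims="obs_id",
    )

    unpooled_trace = pm.sample(random_seed=42)
```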
The sampling was clean here too; let's look at the expected values for both basement (dimension 0) and floor (dimension 1) in each county:
To identify counties with high radon levels, we can plot the ordered mean estimates, as well as their 94% HPD:
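A sketch for the basement level (the same works for floors; variable names assume the unpooled model above):

```python
# Posterior means and 94% HDIs of the basement level, ordered by mean
means = unpooled_trace.posterior["alpha"].sel(level="basement").mean(("chain", "draw"))
hdi = az.hdi(unpooled_trace.posterior["alpha"].sel(level="basement"))["alpha"]

order = means.sortby(means).county.values
_, ax = plt.subplots(figsize=(12, 4))
ax.scatter(np.arange(len(order)), means.sel(county=order))
ax.vlines(
    np.arange(len(order)),
    hdi.sel(county=order, hdi="lower"),
    hdi.sel(county=order, hdi="higher"),
    alpha=0.5,
)
ax.set(xlabel="ordered counties", ylabel="log radon (basement)")
```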
Now that we have fit both conventional (i.e. non-hierarchical) models, let's see how their inferences differ. Here are visual comparisons between the pooled and unpooled estimates for a subset of counties representing a range of sample sizes.
Neither of these models is satisfactory:

- If we are trying to identify high-radon counties, pooling is useless, because it ignores any variation in average radon levels between counties.
- We do not trust extreme unpooled estimates produced by models using few observations.
This issue is acute for small sample sizes, as seen above: in counties where we have few floor measurements, if radon levels are higher for those data points than for basement ones (Aitkin, Koochiching, Ramsey), the model will estimate that radon levels are higher on floors than in basements for these counties. But we shouldn't trust this conclusion, because both scientific knowledge and the situation in other counties tell us that it is usually the reverse (basement radon > floor radon). So unless we have a lot of observations telling us otherwise for a given county, we should be skeptical and shrink our county estimates toward the state estimates -- in other words, we should balance between cluster-level and population-level information, and the amount of shrinkage will depend on how extreme and how numerous the data in each cluster are.
Here is where hierarchical models come into play.
When we pool our data, we imply that they are sampled from the same model. This ignores any variation among sampling units (other than sampling variance) -- we assume that counties are all the same:
*Figure: diagram of the pooled model.*
When we analyze data unpooled, we imply that they are sampled independently from separate models. At the opposite extreme from the pooled case, this approach claims that differences between sampling units are too large to combine them -- we assume that counties have no similarity whatsoever:
*Figure: diagram of the unpooled model.*
In a hierarchical model, parameters are viewed as a sample from a population distribution of parameters. Thus, we view them as being neither entirely different nor exactly the same. This is partial pooling:
*Figure: diagram of the hierarchical (partial pooling) model.*
We can use PyMC to easily specify multilevel models, and fit them using Markov chain Monte Carlo.
The simplest partial pooling model for the household radon dataset is one which simply estimates radon levels, without any predictors at any level. A partial pooling model represents a compromise between the pooled and unpooled extremes, approximately a weighted average (based on sample size) of the unpooled county estimates and the pooled estimates.
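Concretely, the multilevel estimate of a county mean is approximately a precision-weighted average of that county's sample mean and the state-wide mean, with $n_j$ the county sample size:

$$\hat{\alpha}_j \approx \frac{\frac{n_j}{\sigma_y^2}\,\bar{y}_j + \frac{1}{\sigma_{\alpha}^2}\,\bar{y}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_{\alpha}^2}}$$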
Estimates for counties with smaller sample sizes will shrink towards the state-wide average, while those for counties with larger sample sizes will be closer to the unpooled county estimates.
Let's start with the simplest model, which ignores the effect of floor vs. basement measurement.
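A sketch of this model (priors again illustrative):

```python
coords = {"county": mn_counties, "obs_id": np.arange(len(log_radon))}

with pm.Model(coords=coords) as partial_pooling:
    county_idx = pm.Data("county_idx", county, dims="obs_id")

    # Hyperpriors: state-wide mean and between-county spread
    mu_a = pm.Normal("mu_a", 0.0, sigma=10.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)

    # County means drawn from a common distribution: this is the partial pooling
    alpha = pm.Normal("alpha", mu_a, sigma=sigma_a, dims="county")

    sigma = pm.Exponential("sigma", 1.0)
    y = pm.Normal("y", alpha[county_idx], sigma=sigma, observed=log_radon, dims="obs_id")

    partial_pooling_trace = pm.sample(random_seed=42)
```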
Notice the difference between the unpooled and partially-pooled estimates, particularly at smaller sample sizes: As expected, the former are both more extreme and more imprecise. Indeed, in the partially-pooled model, estimates in small-sample-size counties are informed by the population parameters -- hence more precise estimates. Moreover, the smaller the sample size, the more regression towards the overall mean (the dashed gray line) -- hence less extreme estimates. In other words, the model is skeptical of extreme deviations from the population mean in counties where data is sparse. This is known as shrinkage.
Now let's go back and integrate the floor predictor, but allowing the intercept to vary by county.
This model allows intercepts to vary across counties, according to a random effect:

$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$

where

$$\epsilon_i \sim N(0, \sigma_y^2)$$

and the intercept random effect:

$$\alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2)$$
As with the "no-pooling" model, we set a separate intercept for each county, but rather than fitting separate least squares regression models for each county, multilevel modeling shares strength among counties, allowing for more reasonable inference in counties with little data.
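In PyMC, this might look as follows:

```python
with pm.Model(coords=coords) as varying_intercept:
    floor_ind = pm.Data("floor_ind", floor_measure, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")

    # Hyperpriors for the county intercepts
    mu_a = pm.Normal("mu_a", 0.0, sigma=10.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)

    # Varying intercepts, common slope
    a = pm.Normal("a", mu_a, sigma=sigma_a, dims="county")
    b = pm.Normal("b", 0.0, sigma=10.0)

    sigma = pm.Exponential("sigma", 1.0)
    theta = a[county_idx] + b * floor_ind
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

    varying_intercept_trace = pm.sample(random_seed=42)
```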
The estimate for the floor coefficient is approximately -0.66, which can be interpreted as houses without basements having about half ($\exp(-0.66) \approx 0.52$) the radon levels of those with basements, after accounting for county.
It is easy to show that the partial pooling model provides more objectively reasonable estimates than either the pooled or unpooled models, at least for counties with small sample sizes.
The most general model allows both the intercept and slope to vary by county:
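A centered specification of this model (we will revisit this parameterization shortly):

```python
with pm.Model(coords=coords) as varying_intercept_slope:
    floor_ind = pm.Data("floor_ind", floor_measure, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")

    # Hyperpriors for intercepts and slopes
    mu_a = pm.Normal("mu_a", 0.0, sigma=5.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)
    mu_b = pm.Normal("mu_b", 0.0, sigma=1.0)
    sigma_b = pm.Exponential("sigma_b", 0.5)

    # Centered parameterization: county effects drawn directly around their means
    a = pm.Normal("a", mu_a, sigma=sigma_a, dims="county")
    b = pm.Normal("b", mu_b, sigma=sigma_b, dims="county")

    sigma = pm.Exponential("sigma", 1.0)
    theta = a[county_idx] + b[county_idx] * floor_ind
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

    varying_intercept_slope_trace = pm.sample(random_seed=42)
```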
Notice that the trace of this model includes divergences, which can be problematic depending on where and how frequently they occur. Divergences can occur in some hierarchical models, and they can often be avoided by using the non-centered parametrization.
The partial pooling model specified above uses a centered parameterization of the slope random effect. That is, the individual county effects are distributed around a county mean, with a spread controlled by the hierarchical standard deviation parameter. As the preceding plot reveals, this constraint serves to shrink county estimates toward the overall mean, to a degree proportional to the county sample size. This is exactly what we want, and the model appears to fit well -- the Gelman-Rubin statistics are essentially 1.
But, on closer inspection, there are signs of trouble. Specifically, let's look at the trace of the random effects, and their corresponding standard deviation:
Notice that when the chain reaches the lower end of the parameter space for $\sigma_b$, it appears to get "stuck" and the entire sampler, including the random slopes b, mixes poorly.
Jointly plotting the random effect variance and one of the individual random slopes demonstrates what is going on.
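For example (the particular county picked here is arbitrary; small-sample counties show the pattern most clearly):

```python
# Scatter joint posterior draws of one county's slope against sigma_b
post = varying_intercept_slope_trace.posterior.stack(sample=("chain", "draw"))

plt.scatter(post["b"].sel(county=mn_counties[75]), post["sigma_b"], alpha=0.1)
plt.xlabel(f"slope of {mn_counties[75]}")
plt.ylabel("sigma_b")
```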
When the group variance is small, this implies that the individual random slopes are themselves close to the group mean. This results in a funnel-shaped relationship between the samples of group variance and any of the slopes (particularly those with a smaller sample size).
In itself, this is not a problem, since this is the behavior we expect. However, if the sampler is tuned for the wider (unconstrained) part of the parameter space, it has trouble in the areas of higher curvature. The consequence of this is that the neighborhood close to the lower bound of $\sigma_b$ is sampled poorly; indeed, in our chain it is not sampled at all below 0.1. The result of this will be biased inference.
Now that we've spotted the problem, what can we do about it? The best way to deal with this issue is to reparameterize our model. Notice the random slopes in this version:
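A sketch of the reparameterized model; the only change from the centered version is how `b` is constructed:

```python
with pm.Model(coords=coords) as varying_intercept_slope_noncentered:
    floor_ind = pm.Data("floor_ind", floor_measure, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")

    mu_a = pm.Normal("mu_a", 0.0, sigma=5.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)
    mu_b = pm.Normal("mu_b", 0.0, sigma=1.0)
    sigma_b = pm.Exponential("sigma_b", 0.5)

    a = pm.Normal("a", mu_a, sigma=sigma_a, dims="county")

    # Non-centered random slopes: independent standard normals,
    # scaled by sigma_b and shifted by mu_b
    upsilon = pm.Normal("upsilon", 0.0, sigma=1.0, dims="county")
    b = pm.Deterministic("b", mu_b + upsilon * sigma_b, dims="county")

    sigma = pm.Exponential("sigma", 1.0)
    theta = a[county_idx] + b[county_idx] * floor_ind
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

    noncentered_trace = pm.sample(target_accept=0.95, random_seed=42)
```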
This is a non-centered parameterization. By this, we mean that the random deviates are no longer explicitly modeled as being centered on $\mu_b$. Instead, they are independent standard normals $\upsilon$, which are then scaled by the appropriate value of $\sigma_b$, before being location-transformed by the mean.
This model samples much better.
Notice that the bottlenecks in the traces are gone.
And correspondingly, the low end of the posterior distribution of the slope random effect variance can now be sampled efficiently.
As a result, we are now fully exploring the support of the posterior. This results in less bias in these parameters.
Notice that sigma_b now has a lot of density near zero, which would indicate that counties don't vary that much in their answer to the floor "treatment".
This was the problem with the original parameterization: the sampler has trouble with the geometry of the posterior distribution, because the plausible values of the slope random effects when the standard deviation is very close to zero are very different from their plausible values when it is comfortably positive. However, even with the non-centered model the sampler is not entirely comfortable with sigma_b: in fact, if you look at the estimates with az.summary, you'll see that the number of effective samples is quite low for sigma_b.

Also note that sigma_a is not that big either -- i.e. counties do differ in their baseline radon levels, but not by a lot. However, sampling from this distribution is much less of a problem, because it is much narrower than sigma_b and doesn't get dangerously close to 0.
To wrap up this model, let's plot the relationship between radon and floor for each county:
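A compact sketch for a handful of counties (posterior-mean lines only):

```python
post = noncentered_trace.posterior.mean(dim=("chain", "draw"))
xvals = np.array([0, 1])

_, axes = plt.subplots(2, 4, figsize=(12, 5), sharex=True, sharey=True)
for ax, c in zip(axes.ravel(), mn_counties[:8]):
    ax.plot(xvals, post["a"].sel(county=c) + post["b"].sel(county=c) * xvals)
    ax.set_title(c)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["basement", "floor"])
```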
Thus, while both the intercept and the slope vary by county, there is far less variation in the slope.
But wait, there is more! We can (and maybe should) take into account the covariation between intercepts and slopes: when baseline radon is low in a given county, maybe that means the difference between floor and basement measurements will decrease, because there isn't that much radon anyway. That would translate into a positive correlation between alpha and beta, and adding that to our model would make even more efficient use of the available data.
To model this correlation, we'll use a multivariate Normal distribution instead of two separate Normals for alpha and beta. This simply means that each county's parameters come from a common distribution with mean mu_alpha for intercepts and mu_beta for slopes, and that slopes and intercepts co-vary according to the covariance matrix $\Sigma$. In mathematical form:
$$\begin{bmatrix} \alpha_{\text{county}} \\ \beta_{\text{county}} \end{bmatrix} \sim \text{MvNormal}\left(\begin{bmatrix} \mu_{\alpha} \\ \mu_{\beta} \end{bmatrix}, \Sigma\right)$$

$$\Sigma = \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \, P \, \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix}$$
where $\alpha$ and $\beta$ are the mean intercept and slope respectively, $\sigma_{\alpha}$ and $\sigma_{\beta}$ represent the variation in intercepts and slopes respectively, and $P$ is the correlation matrix of intercepts and slopes. In this case, as there is only one slope, $P$ contains only one relevant figure: the correlation between $\alpha$ and $\beta$.
This translates quite easily into PyMC:
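A sketch (the names `mu`, `z`, `ab_county`, and `sigma_ab` are our choices; `pm.LKJCholeskyCov` returns the Cholesky factor, the correlation matrix, and the standard deviations when `compute_corr=True`, its default):

```python
import pytensor.tensor as pt

coords["param"] = ["alpha", "beta"]

with pm.Model(coords=coords) as covariation_intercept_slope:
    floor_ind = pm.Data("floor_ind", floor_measure, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")

    # Hyperprior on the mean intercept and slope
    mu = pm.Normal("mu", 0.0, sigma=5.0, dims="param")

    # LKJ prior on the Cholesky factor of the 2x2 covariance matrix
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0)
    )
    sigma_ab = pm.Deterministic("sigma_ab", stds, dims="param")

    # Non-centered county effects, correlated through the Cholesky factor
    z = pm.Normal("z", 0.0, sigma=1.0, dims=("param", "county"))
    ab_county = pm.Deterministic(
        "ab_county", mu + pt.dot(chol, z).T, dims=("county", "param")
    )

    sigma = pm.Exponential("sigma", 1.0)
    theta = ab_county[county_idx, 0] + ab_county[county_idx, 1] * floor_ind
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

    covariation_trace = pm.sample(target_accept=0.95, random_seed=42)
```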
This is by far the most complex model we've built so far, so the model code is correspondingly complex. The main complication is the use of the LKJCholeskyCov distribution for the covariance matrix: it parameterizes the covariance matrix through its Cholesky decomposition, which makes sampling much easier.
As you may expect, we also want to non-center the random effects here. This again results in a Deterministic operation, which here multiplies the Cholesky factor of the covariance matrix by independent standard normals.
So the correlation between slopes and intercepts seems to be negative: when the county intercept increases, the county slope tends to decrease. In other words, when basement radon in a county gets bigger, the difference from floor radon tends to get bigger too (because floor readings get smaller while basement readings get bigger). But the uncertainty is wide enough that it's possible the correlation goes the other way around, or is simply close to zero.
And how much variation is there across counties? It's not easy to read sigma_ab above, so let's do a forest plot and compare the estimates with the model that doesn't include the covariation between slopes and intercepts:
The estimates are very close to each other, both for the means and the standard deviations. But remember, the information given by the correlation is only seen at the county level: in theory it uses even more information from the data to get an even more informed pooling of information for all county parameters. So let's visually compare estimates of both models at the county level:
The negative correlation is somewhat clear here: when the intercept increases, the slope decreases. So we understand why the model put most of the posterior weight into negative territory for the correlation term. Nevertheless, the model gives a non-trivial posterior probability to the possibility that the correlation could in fact be zero or positive.
Interestingly, the differences between both models occur at extreme slope and intercept values. This is because the second model used the slightly negative correlation between intercepts and slopes to adjust their estimates: when intercepts are larger (smaller) than average, the model pushes down (up) the associated slopes.
Globally, there is a lot of agreement here: modeling the correlation didn’t change inference that much. We already saw that radon levels tended to be lower in floors than basements, and when we checked the posterior distributions of the average effects (alpha and beta) and standard deviations, we noticed that they were almost identical. But on average the model with covariation will be more accurate -- because it squeezes additional information from the data, to shrink estimates in both dimensions.
A primary strength of multilevel models is the ability to handle predictors on multiple levels simultaneously. If we consider the varying-intercepts model above:

$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$

we may, instead of a simple random effect to describe variation in the expected radon value, specify another regression model with a county-level covariate. Here, we use the county uranium reading $u_j$, which is thought to be related to radon levels:

$$\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j$$

$$\zeta_j \sim N(0, \sigma_{\alpha}^2)$$
Thus, we are now incorporating a house-level predictor (floor or basement) as well as a county-level predictor (uranium).
Note that the model has both indicator variables for each county, plus a county-level covariate. In classical regression, this would result in collinearity. In a multilevel model, the partial pooling of the intercepts towards the expected value of the group-level linear model avoids this.
Group-level predictors also serve to reduce group-level variation, $\sigma_{\alpha}$ (here, the variation across counties, sigma_a). An important implication of this is that the group-level covariate induces stronger pooling -- by definition, a smaller $\sigma_{\alpha}$ means stronger shrinkage of county parameters towards the overall state mean.
This is fairly straightforward to implement in PyMC -- we just add another level:
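A sketch (the county-level log-uranium `u` is built here from the merged data; the non-centered trick for the intercepts is the same as before):

```python
# County-level uranium: one (log) value per county
u = np.log(srrs_mn.groupby("county_code")["Uppm"].mean()).values

with pm.Model(coords=coords) as hierarchical_intercept:
    floor_ind = pm.Data("floor_ind", floor_measure, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")
    uranium = pm.Data("uranium", u, dims="county")

    # The new level: a county-level regression for the intercepts
    gamma = pm.Normal("gamma", 0.0, sigma=10.0, shape=2)
    mu_a = pm.Deterministic("mu_a", gamma[0] + gamma[1] * uranium, dims="county")

    # Non-centered county intercepts around their regression mean
    sigma_a = pm.Exponential("sigma_a", 1.0)
    epsilon_a = pm.Normal("epsilon_a", 0.0, sigma=1.0, dims="county")
    a_county = pm.Deterministic("a_county", mu_a + sigma_a * epsilon_a, dims="county")

    b = pm.Normal("b", 0.0, sigma=1.0)
    sigma = pm.Exponential("sigma", 1.0)
    theta = a_county[county_idx] + b * floor_ind
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

    hierarchical_intercept_trace = pm.sample(target_accept=0.95, random_seed=42)
```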
Do you see the new level, with sigma_a and gamma, which is two-dimensional because it contains the linear model for a_county?
Uranium is indeed strongly associated with baseline radon levels in each county. The graph above shows the average relationship and its uncertainty: the baseline radon level in an average county as a function of uranium, as well as the 94% HPD of this radon level (dashed line and envelope). The blue points and orange bars represent the relationship between baseline radon and uranium for each county. As you can see, the uncertainty is bigger now, because it comes on top of the uncertainty about the average relationship -- each county has its idiosyncrasies after all.
Let's compare the county intercepts of this model with those of the partial-pooling model without a county-level covariate:
We see that the compatibility intervals are narrower for the model including the county-level covariate. This is expected, as the effect of a covariate is to reduce the variation in the outcome variable -- provided the covariate is of predictive value. More importantly, with this model we were able to squeeze even more information out of the data.
In some instances, having predictors at multiple levels can reveal correlation between individual-level variables and group residuals. We can account for this by including the average of the individual predictors as a covariate in the model for the group intercept.
These are broadly referred to as contextual effects.
To add these effects to our model, let's create a new variable containing the mean of floor in each county and add that to our previous model:
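Along these lines (`avg_floor` is our name for the county-level mean of floor):

```python
# Proportion of non-basement measurements in each county
avg_floor = srrs_mn.groupby("county_code")["floor"].mean().values

with pm.Model(coords=coords) as contextual_effect:
    floor_ind = pm.Data("floor_ind", floor_measure, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")
    uranium = pm.Data("uranium", u, dims="county")
    avg_floor_data = pm.Data("avg_floor_data", avg_floor, dims="county")

    # County-level model now includes uranium AND the mean floor indicator
    gamma = pm.Normal("gamma", 0.0, sigma=10.0, shape=3)
    mu_a = pm.Deterministic(
        "mu_a", gamma[0] + gamma[1] * uranium + gamma[2] * avg_floor_data, dims="county"
    )

    sigma_a = pm.Exponential("sigma_a", 1.0)
    epsilon_a = pm.Normal("epsilon_a", 0.0, sigma=1.0, dims="county")
    a_county = pm.Deterministic("a_county", mu_a + sigma_a * epsilon_a, dims="county")

    b = pm.Normal("b", 0.0, sigma=1.0)
    sigma = pm.Exponential("sigma", 1.0)
    theta = a_county[county_idx] + b * floor_ind
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

    contextual_trace = pm.sample(target_accept=0.95, random_seed=42)
```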
So we might infer from this that counties with higher proportions of houses without basements tend to have higher baseline levels of radon. This may seem new, as up to this point we saw that floor was negatively associated with radon levels. But remember, that was at the household level: radon tends to be higher in houses with basements. At the county level, however, it seems that the fewer basements there are on average, the more radon -- so it's not that contradictory. What's more, the estimate for $\gamma_2$ is quite uncertain and overlaps with zero, so it's possible that the relationship is not that strong. And finally, note that $\gamma_2$ estimates something other than uranium's effect, since that is already taken into account by $\gamma_1$ -- it answers the question "once we know the uranium level in the county, is there any value in learning about the proportion of houses without basements?".
All of this is to say that we shouldn't interpret this causally: there is no credible mechanism by which a basement (or the absence thereof) causes radon emissions. More probably, our causal graph is missing something: a confounding variable, one that influences both basement construction and radon levels, is lurking somewhere in the dark... Perhaps it is the type of soil, which might influence both what type of structures are built and the level of radon? Maybe adding this to our model would help with causal inference.
{cite:t}`gelman2006multilevel` used cross-validation tests to compare the root mean squared prediction errors of the unpooled, pooled, and partially-pooled models, and found that the partially-pooled model predicted best.
There are two types of prediction that can be made in a multilevel model:

1. a new individual within an existing group
2. a new individual within a new group
For example, if we wanted to make a prediction for a new house with no basement in St. Louis and Kanabec counties, we just need to sample from the radon model with the appropriate intercept.
That is,

$$\tilde{y} \sim N(\alpha_{69} + \beta, \sigma_y^2)$$
Because we judiciously set the county index and floor values as shared variables earlier, we can modify them directly to the desired values (69 and 1 respectively) and sample corresponding posterior predictions, without having to redefine and recompile our model. Using the model just above:
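A sketch, reusing the contextual-effects model and trace from above (St. Louis county is index 69 in this encoding; Kanabec would work the same way with its own index):

```python
with contextual_effect:
    # Point the shared data containers at the new house:
    # county 69 (St. Louis), measurement not in a basement (floor_ind = 1)
    pm.set_data({"county_idx": np.array([69]), "floor_ind": np.array([1])})
    ppc = pm.sample_posterior_predictive(
        contextual_trace, predictions=True, random_seed=42
    )

az.plot_posterior(ppc, group="predictions")
```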
Multilevel models offer several benefits:

- Accounting for the natural hierarchical structure of observational data.
- Estimation of coefficients for (under-represented) groups.
- Incorporating individual- and group-level information when estimating group-level coefficients.
- Allowing for variation among individual-level coefficients across groups.
As an alternative approach to hierarchical modeling for this problem, check out a geospatial approach to modeling radon levels.
:::{bibliography}
:filter: docname in docnames

mcelreath2018statistical
:::
:::{include} ../page_footer.md
:::