(lecture_12)=

Multilevel Models

:::{post} Jan 7, 2024 :tags: statistical rethinking, bayesian inference, multilevel models :category: intermediate :author: Dustin Stansbury :::

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 12 - Multilevel Models# Lecture 12 - Multilevel Models

Repeat observations

At the end of Lecture 11 on Ordered Categories [blocked], McElreath alluded to leveraging repeat observations of stories and participants in the Trolley dataset to improve estimation

Using reapeat observations can lead to better estimators

Variability in Story Responses

Story repeats

12 stories

Variatbility Participant Responses

331 individuals

Ways of modeling variability across observations

Complete Pooling

\begin{align} R_i &\sim \text{OrderedLogit}(\phi_i, \alpha) \\ \phi_i &= \beta \end{align}

global $\beta$ parameter
treat all unique categories (e.g. story and/or participant) as the same
underfits data because model isn't flexible enough

No Pooling

\begin{align} R_i &\sim \text{OrderedLogit}(\phi_i, \alpha) \\ \phi_i &= \beta_{S[i]} \end{align}

treat all unique categories as independent; each category gets its own $\beta_{S}$
model has "anderograde amnesia"
- doesn't share information across consecutive observations
- inefficient
overfits data because model too flexible, and fits to individual noise

Partial Pooling (Multi-level Models)

\begin{align} R_i &\sim \text{OrderedLogit}(\phi_i, \alpha) \\ \phi_i &= \beta_{S[i]} \\ \beta_i &\sim \text{Prior}_\beta(\eta) \\ \eta &\sim \text{PopulationPrior}(\theta) \end{align}

parameters are drawn from a global distribution that is shared across the population
allows flexibility without overfitting
shares information across observations
- has "memory"
- more efficient
  - learns faster
- get adaptive regularization for free
  - resists overfitting
- improves estimates for imbalanced sampling

Case Study: Reed Frog Survival

48 group; "tanks"
Treatment: density, size, and predation level
Outcome: survival rate

$T$: tank ID
$D$: tank density - the number of tadpoles in each tank - counts
$G$: tank size - categorical (large/small)
$P$: presence/absence of predators - categorical
$S$: survival, the number of tadpoles that survived - counts
propsurv: survival rate $\frac{S}{D}$

Plot average survival rates for all tanks

Let's build a (multi-level) model

\begin{align*} S_i &\sim \text{Binomial}(D_i, p_i) \\ \text{logit}(p_i) &= \alpha_{T[i]} \\ \alpha_j &\sim \text{Normal}(\bar \alpha, \sigma_?) \\ \bar \alpha &\sim \text{Normal}(0, 1.5) \\ \end{align*}

What about the prior variance $\sigma$?

For now, let's try setting $\sigma$ manually via using cross-validation to see the effect on the multi-level model (we'll estiamate it later)

Identifying the optimal $\sigma$ via Cross-validation

$\sigma=0.1$ underfits the data, the model isn't flexible enough to capture the variability of the different tanks
$\sigma=5.0$ overfits the data, the model is too flexible and the posterior centers around each datapoint
$\sigma_{optimal}=?$ We can compare the LOOCV scores for models fit with each $\sigma$ value in the parameter grid, and identify the model with the lowest score.

Model with the optimal $\sigma$ identified through cross-validation

The optimal model

is regularized, trading off bias and variance
demonstrates shrinkage:
- posterior means do not hover over datapoints, but are "pulled" toward the global mean.
- the amount of "pull" is stronger for datapoints further away from the mean

Building a Multilievel (Hierarchical) Model

Automatic regularization

In principle, we can estimate optimal hyperparameter values using cross-validation like we did above, but we don't need to. We can simply learn the optimal $\sigma$ by using a hierarhcical model structure, putting a prior distribution on that variance parameter.

Baby's first the multi-level model

\begin{align*} S_i &\sim \text{Binomial}(D_i, p_i)\\ \text{logit}(p_i) &= \alpha_{T[i]} \\ \alpha_j &\sim \text{Normal}(\bar \alpha, \sigma_?) \\ \bar \alpha &\sim \text{Normal}(0, 1.5) \\ \sigma &\sim \text{Exponential}(1) \end{align*}

Note that this prior is shared amongst all groups, allowing it to perform partial pooling.

Plot the priors for baby's first multi-level model

Note the prior distribution for $a_j$ (far right plot above) is a mixture of Normal distributions with different means (sampled from the $\bar a$ prior distribution) and variances (sampled from the $\sigma$ prior distribution). Therefore it isn't a Normal, but a thicker-tailed distribution, more akin to a student-t

Fit the multi-level model

Summarize the multi-level model posterior

The multi-level Mode's Posterior HDI overlaps the optimal value found via cross-validation

multi-level models learn population variation automatically, efficiently
get regularization for free

Comparing multi-level and fixed-sigma model

Fixed sigma model

\begin{align*} S_i &\sim \text{Binomial}(D_i, p_i)\\ \text{logit}(p_i) &= \alpha_{T[i]} \\ \alpha_j &\sim \text{Normal}(\bar \alpha, 1) \\ \bar \alpha &\sim \text{Normal}(0, 1.5) \\ \end{align*}

multi-level model performs better in terms of LOO score (no surprise)
multi-level model has fewer effective parameters despite having more programmed paramters
- this is because the model is more efficient, and shares information across parameters

Demonstrating Posterior Uncertainty due to sample size

Small tanks (on the far left of the above figure) have fewer observations, and thus
- wider posterior variances
- larger amount of shrinkage toward the global mean
Large tanks (on the far right of the above figure) have more observations, and thus
- tighter posterior variances
- less shrinkage toward the global mean

Including the presence of predators

Highlighting the tanks with predators shows that the presence of predators (red) decreases survival rates

Multilevel model with predator effects

\begin{align*} S_i &\sim \text{Binomial}(D_i, p_i)\\ \text{logit}(p_i) &= \alpha_{T[i]} + \beta_P P_i\\ \beta_P &\sim \text{Normal}(0, .5) \\ \alpha_j &\sim \text{Normal}(\bar \alpha, \sigma) \\ \bar \alpha &\sim \text{Normal}(0, 1.5) \\ \sigma &\sim \text{Exponential}(1) \end{align*}

Predictions vs Inference

Model without predators can predict survival just as well as the model with predators
This is because the multi-level model can still capture the tank-level variablity through the individual $\alpha_T$
However, adding predators "explains away" a lot more of the tank-level variance. This is demonstrated by the $\sigma$ values for tank-level variability being smaller in the predator model. Thus the predator model has to consider less tank-based variablity in order to capture the variability in survival.

Varying Effect Superstitions

~~Partial pooling requires random sampling from the population ❌~~
~~Number of categories / units must be large ❌~~
~~Variation must be Gaussian ❌~~
- Gaussian priors can learn non-Gaussian distributions

Practical Difficulties

Using multiple clusters simultaneously (e.g. participants AND stories)
Sampling efficiency -- recoding (e.g. centered/non-centered priors)
Partial pooling on other parameters (e.g. slopes) or unobserved confounds?

These difficulties will be addressed in upcoming lectures on Multi-level models. 😅

BONUS: Fixed Effects, Multilevel Models, & Mundlak Machines

Random Confounds

Estimand: Influence of $X$ on $Y$

Outcome $Y$ (e.g. tadpole survival)
Individual-level traits $X$
Group-level traits $Z$
Unobserved tank effects $G$
- effects both $X$ and $Y$
- e.g. tank temperature
- creates a backdoor path from $X$ to $Y$ (via fork)
- We can't directly measure $G$.
- However, if we have repeat observations, there's some tricks up our sleeve that we can use.

Estimator?

Fixed Effects model
Multi-level model
Muldlak Machine

Simulate some data

Show below that groups aren't randomly sampled

Naive Model

\begin{align*} Y &\sim \text{Bernoulli}(p) \\ \text{logit}(p_i) &= \alpha + \beta_{X}X + \beta_{Z} Z_{G[i]} \\ \alpha &\sim \text{Normal}(0, 10) \\ \beta_* &\sim \text{Normal}(0, 1) \end{align*} $$*

Fixed effect Model

\begin{align*} Y &\sim \text{Bernoulli}(p) \\ \text{logit}(p_i) &= \alpha_{G[i]} + \beta_{X}X_i + \beta_{Z} Z_{G[i]} \\ \alpha_j &\sim \text{Normal}(0, 10) \\ \beta_* &\sim \text{Normal}(0, 1) \end{align*} $$*

Estimate a different average rate for each group without pooling
$\beta_Z$ and $\beta_X$ are global parameters
Accounts for group-level confounding effect of $G$ (via offsets for each group)
Problems:
- Cannot identify any group-level effects $Z$
  - $Z$ is unidentifiable because it's added globally
    - there are an infinite number of equivalent combinations of $\alpha_{G[i]} + \beta_Z Z_{G[i]}$
    - can't separate contribution of $\beta_Z$ from $\alpha$
  - Can't include group-level predictors with Mixed-effect model
- Inefficient