Repeat observations
At the end of Lecture 11 on Ordered Categories, McElreath alluded to leveraging repeat observations of stories and participants in the Trolley dataset to improve estimation.
Using repeat observations can lead to better estimators.
Variability in Story Responses
Story repeats
12 stories
Variability in Participant Responses
331 individuals
Ways of modeling variability across observations
Complete Pooling
\begin{align*}
R_i &\sim \text{OrderedLogit}(\phi_i, \alpha) \\
\phi_i &= \beta
\end{align*}
- global $\beta$ parameter
- treat all unique categories (e.g. story and/or participant) as the same
- underfits data because model isn't flexible enough
No Pooling
\begin{align*}
R_i &\sim \text{OrderedLogit}(\phi_i, \alpha) \\
\phi_i &= \beta_{S[i]}
\end{align*}
- treat all unique categories as independent; each category gets its own $\beta_{S}$
- model has "anterograde amnesia"
- doesn't share information across consecutive observations
- inefficient
- overfits data because model too flexible, and fits to individual noise
Partial Pooling (Multi-level Models)
\begin{align*}
R_i &\sim \text{OrderedLogit}(\phi_i, \alpha) \\
\phi_i &= \beta_{S[i]} \\
\beta_i &\sim \text{Prior}_\beta(\eta) \\
\eta &\sim \text{PopulationPrior}(\theta)
\end{align*}
- parameters are drawn from a global distribution that is shared across the population
- allows flexibility without overfitting
- shares information across observations
- has "memory"
- more efficient
- get adaptive regularization for free
- improves estimates for imbalanced sampling
Case Study: Reed Frog Survival
- 48 groups ("tanks")
- Treatment: density, size, and predation level
- Outcome: survival rate
- $T$: tank ID
- $D$: tank density - the number of tadpoles in each tank - counts
- $G$: tank size - categorical (large/small)
- $P$: presence/absence of predators - categorical
- $S$: survival, the number of tadpoles that survived - counts
- propsurv: survival rate $\frac{S}{D}$
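The variable layout can be sketched with simulated stand-ins (all values and survival probabilities here are hypothetical choices for illustration, not the actual reedfrogs data):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_tanks = 48
# Hypothetical stand-ins for the reedfrogs variables
density = rng.choice([10, 25, 35], size=n_tanks)   # D: tadpoles per tank
predators = rng.choice([0, 1], size=n_tanks)       # P: predator presence
# Assume predators lower the per-tadpole survival probability
p_survive = np.where(predators == 1, 0.55, 0.85)
survival = rng.binomial(density, p_survive)        # S: survivors per tank

propsurv = survival / density                      # survival rate S / D
```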
Plot average survival rates for all tanks
Let's build a (multi-level) model
\begin{align*}
S_i &\sim \text{Binomial}(D_i, p_i) \\
\text{logit}(p_i) &= \alpha_{T[i]} \\
\alpha_j &\sim \text{Normal}(\bar\alpha, \sigma?) \\
\bar\alpha &\sim \text{Normal}(0, 1.5)
\end{align*}

What about the prior variance $\sigma$?
For now, let's set $\sigma$ manually and use cross-validation to see its effect on the multi-level model (we'll estimate it later).
Identifying the optimal $\sigma$ via Cross-validation
- $\sigma=0.1$ underfits the data, the model isn't flexible enough to capture the variability of the different tanks
- $\sigma=5.0$ overfits the data, the model is too flexible and the posterior centers around each datapoint
- $\sigma_{optimal}=?$ We can compare LOOCV scores for models fit with each $\sigma$ value in the parameter grid, and select the model with the best score.
Model with the optimal $\sigma$ identified through cross-validation
The optimal model
- is regularized, trading off bias and variance
- demonstrates shrinkage:
- posterior means do not hover over datapoints, but are "pulled" toward the global mean.
- the amount of "pull" is stronger for datapoints further away from the mean
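The shrinkage behavior can be illustrated with a Normal-Normal analogy (an assumption for illustration; the actual model above is Binomial): the posterior mean for a group is a precision-weighted average of the group mean and the global mean, so the pull grows with distance from the global mean.

```python
def partial_pooling_mean(ybar_j, n_j, sigma_within, global_mean, sigma_between):
    """Posterior mean for one group under a Normal-Normal model:
    a precision-weighted average of the group mean and the global mean."""
    w = sigma_between**2 / (sigma_between**2 + sigma_within**2 / n_j)
    return w * ybar_j + (1 - w) * global_mean

# Two groups with the same sample size, one far from the global mean of 0
near = partial_pooling_mean(ybar_j=0.5, n_j=10, sigma_within=1.0,
                            global_mean=0.0, sigma_between=0.5)
far = partial_pooling_mean(ybar_j=3.0, n_j=10, sigma_within=1.0,
                           global_mean=0.0, sigma_between=0.5)

pull_near = abs(0.5 - near)  # how far the estimate is pulled toward the mean
pull_far = abs(3.0 - far)
assert pull_far > pull_near  # farther datapoints are pulled harder
```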
Building a Multilevel (Hierarchical) Model
Automatic regularization
In principle, we can estimate optimal hyperparameter values using cross-validation like we did above, but we don't need to: we can learn the optimal $\sigma$ directly by using a hierarchical model structure, putting a prior distribution on that variance parameter.
Baby's first multi-level model
\begin{align*}
S_i &\sim \text{Binomial}(D_i, p_i) \\
\text{logit}(p_i) &= \alpha_{T[i]} \\
\alpha_j &\sim \text{Normal}(\bar\alpha, \sigma) \\
\bar\alpha &\sim \text{Normal}(0, 1.5) \\
\sigma &\sim \text{Exponential}(1)
\end{align*}

Note that this prior is shared amongst all groups, allowing the model to perform partial pooling.
Plot the priors for baby's first multi-level model
Note that the prior distribution for $a_j$ (far right plot above) is a mixture of Normal distributions with different means (sampled from the $\bar a$ prior) and standard deviations (sampled from the $\sigma$ prior). It therefore isn't Normal, but a thicker-tailed distribution, more akin to a Student-t.
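A quick prior-sampling sketch (plain numpy here, rather than the notebook's PyMC code) confirms the heavier-than-Normal tails of the implied $a_j$ prior:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 200_000

# Sample the hyper-priors, then the implied prior for a_j
abar = rng.normal(0.0, 1.5, size=n)    # a_bar ~ Normal(0, 1.5)
sigma = rng.exponential(1.0, size=n)   # sigma ~ Exponential(1)
a_j = rng.normal(abar, sigma)          # a_j | a_bar, sigma ~ Normal(a_bar, sigma)

# Kurtosis above 3 indicates heavier-than-Normal tails
z = (a_j - a_j.mean()) / a_j.std()
kurtosis = np.mean(z**4)
assert kurtosis > 3.5  # a Normal distribution would give ~3
```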
Fit the multi-level model
Summarize the multi-level model posterior
The multi-level model's posterior HDI overlaps the optimal value found via cross-validation
- multi-level models learn population variation automatically, efficiently
- get regularization for free
Comparing multi-level and fixed-sigma model
Fixed sigma model
\begin{align*}
S_i &\sim \text{Binomial}(D_i, p_i) \\
\text{logit}(p_i) &= \alpha_{T[i]} \\
\alpha_j &\sim \text{Normal}(\bar\alpha, 1) \\
\bar\alpha &\sim \text{Normal}(0, 1.5)
\end{align*}
- multi-level model performs better in terms of LOO score (no surprise)
- multi-level model has fewer effective parameters despite having more programmed parameters
- this is because the model is more efficient, and shares information across parameters
Demonstrating Posterior Uncertainty due to sample size
- Small tanks (on the far left of the above figure) have fewer observations, and thus
- wider posterior variances
- larger amount of shrinkage toward the global mean
- Large tanks (on the far right of the above figure) have more observations, and thus
- tighter posterior variances
- less shrinkage toward the global mean
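This sample-size effect can be sketched with a conjugate Beta-Binomial analogy (an assumption for illustration; the actual model is the multi-level Binomial above): at the same observed survival rate, a smaller tank yields a wider posterior.

```python
import math

def beta_binomial_posterior_std(survived, density, a0=1.0, b0=1.0):
    """Posterior std of the survival probability under a Beta(a0, b0) prior."""
    a = a0 + survived
    b = b0 + (density - survived)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return math.sqrt(var)

# Same observed survival *rate* (80%), different tank sizes
small_tank_std = beta_binomial_posterior_std(survived=8, density=10)
large_tank_std = beta_binomial_posterior_std(survived=28, density=35)
assert small_tank_std > large_tank_std  # fewer observations -> wider posterior
```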
Including the presence of predators
Highlighting the tanks with predators shows that the presence of predators (red) decreases survival rates
Multilevel model with predator effects
\begin{align*}
S_i &\sim \text{Binomial}(D_i, p_i) \\
\text{logit}(p_i) &= \alpha_{T[i]} + \beta_P P_i \\
\beta_P &\sim \text{Normal}(0, 0.5) \\
\alpha_j &\sim \text{Normal}(\bar\alpha, \sigma) \\
\bar\alpha &\sim \text{Normal}(0, 1.5) \\
\sigma &\sim \text{Exponential}(1)
\end{align*}
- The model without predators can predict survival just as well as the model with predators
- This is because the multi-level model can still capture the tank-level variability through the individual $\alpha_T$ parameters
- However, adding predators "explains away" much of the tank-level variance: the tank-level $\sigma$ is smaller in the predator model, so it needs less tank-level variability to capture the variability in survival
Varying Effect Superstitions
Partial pooling requires random sampling from the population ❌
Number of categories / units must be large ❌
Variation must be Gaussian ❌
- Gaussian priors can learn non-Gaussian distributions
Practical Difficulties
- Using multiple clusters simultaneously (e.g. participants AND stories)
- Sampling efficiency -- reparameterization (e.g. centered/non-centered priors)
- Partial pooling on other parameters (e.g. slopes) or unobserved confounds?
These difficulties will be addressed in upcoming lectures on Multi-level models. 😅
BONUS: Fixed Effects, Multilevel Models, & Mundlak Machines
Estimand: Influence of $X$ on $Y$
- Outcome $Y$ (e.g. tadpole survival)
- Individual-level traits $X$
- Group-level traits $Z$
- Unobserved tank effects $G$
- affects both $X$ and $Y$
- e.g. tank temperature
- creates a backdoor path from $X$ to $Y$ (via fork)
- We can't directly measure $G$.
- However, if we have repeat observations, there are some tricks up our sleeves that we can use.
Estimator?
- Fixed Effects model
- Multi-level model
- Mundlak Machine
Simulate some data
Show below that groups aren't randomly sampled
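A minimal numpy sketch of such a simulation (the names, effect sizes, and unbalanced group sizes here are all hypothetical choices, not necessarily the lecture's exact simulation): an unobserved group confound $U$ shifts both the treatment $X$ and the outcome $Y$.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n_groups = 30
group_sizes = rng.choice([5, 25], size=n_groups)  # unbalanced group sizes
G = np.repeat(np.arange(n_groups), group_sizes)   # group index per individual
N = len(G)

U = rng.normal(0, 1, size=n_groups)  # unobserved group confound
Z = rng.normal(0, 1, size=n_groups)  # observed group-level trait

b_X, b_Z, b_U = 1.0, -0.5, 1.5       # hypothetical true effects
X = rng.normal(U[G], 1.0)            # the confound shifts the treatment
logit_p = b_X * X + b_Z * Z[G] + b_U * U[G]
p = 1 / (1 + np.exp(-logit_p))
Y = rng.binomial(1, p)               # Bernoulli outcome
```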
Naive Model
\begin{align*}
Y_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \alpha + \beta_{X}X_i + \beta_{Z} Z_{G[i]} \\
\alpha &\sim \text{Normal}(0, 10) \\
\beta_* &\sim \text{Normal}(0, 1)
\end{align*}
Fixed Effects Model
\begin{align*}
Y_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \alpha_{G[i]} + \beta_{X}X_i + \beta_{Z} Z_{G[i]} \\
\alpha_j &\sim \text{Normal}(0, 10) \\
\beta_* &\sim \text{Normal}(0, 1)
\end{align*}
- Estimate a different average rate for each group without pooling
- $\beta_Z$ and $\beta_X$ are global parameters
- Accounts for group-level confounding effect of $G$ (via offsets for each group)
- Problems:
- Cannot identify any group-level effects $Z$
- $Z$ is unidentifiable because it's added globally
- there are an infinite number of equivalent combinations of $\alpha_{G[i]} + \beta_Z Z_{G[i]}$
- can't separate contribution of $\beta_Z$ from $\alpha$
- Can't include group-level predictors in a fixed-effects model
- Inefficient
Multilevel Model
\begin{align*}
Y_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \alpha_{G[i]} + \beta_{X}X_i + \beta_{Z} Z_{G[i]} \\
\beta_* &\sim \text{Normal}(0, 1) \\
\alpha_j &\sim \text{Normal}(\bar a, \tau) \\
\bar a &\sim \text{Normal}(0, 1) \\
\tau &\sim \text{Exponential}(10)
\end{align*}
- Estimate a different average rate for each group with partial pooling
- better estimates for $G$ effects
- at the expense of estimating $X$ effects
- compromises estimating the confound so that it can get better estimates for the group
- CAN identify $Z$ effects
- CAN incorporate group-level predictors
Fixed effects models are better (though not great in this simulation) than the multi-level and naive models at handling the individual-level group confound, but can't identify any group-level effects.
Multi-level models can identify the group-level effects, though they can't fully account for the confound.
- If interested completely in prediction, multi-level models are better
- If interested in inference (i.e. via do-calculus or counterfactuals), you might want to use fixed effects (though group-level prediction accuracy may be poor)
Mundlak Machine
\begin{align*}
Y_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \alpha_{G[i]} + \beta_{X}X_i + \beta_{Z} Z_{G[i]} + \beta_{\bar X} \bar X_{G[i]}\\
\beta_* &\sim \text{Normal}(0, 1) \\
\alpha_j &\sim \text{Normal}(\bar a, \tau) \\
\bar a &\sim \text{Normal}(0, 1) \\
\tau &\sim \text{Exponential}(10)
\end{align*}
- Estimate a different average rate for each group with partial pooling
- Takes advantage of the idea that conditioning on a descendant in a graph can (at least partially) deconfound the parent/ancestor variable.
- Uses the group-level mean $\bar X_G$, a child of the confounding variable, to reduce its confounding effect.
- Problems:
- somewhat inefficient
- because groups have different sizes, we also need to consider the uncertainty in the estimation of the group-level means, which the Mundlak machine ignores
Mundlak machines are able to capture both Treatment and Group-level effects
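The Mundlak covariate itself is just the within-group average of the treatment; a small numpy sketch with hypothetical toy values:

```python
import numpy as np

def group_means(X, G, n_groups):
    """Average of X within each group: the Mundlak 'X-bar' covariate."""
    sums = np.bincount(G, weights=X, minlength=n_groups)
    counts = np.bincount(G, minlength=n_groups)
    return sums / counts

# Toy example: two groups with three and two individuals
X = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
G = np.array([0, 0, 0, 1, 1])
Xbar = group_means(X, G, n_groups=2)
# Each individual then gets Xbar[G] as an extra predictor in the model
assert np.allclose(Xbar, [2.0, 15.0])
```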
Latent Mundlak Machine (aka "Full Luxury Bayes")
- We model not only the observed outcome $Y$ (as a function of $X, G, Z$), but also the treatment $X$ (as a function of the confound G).
- Unlike the Mundlak machine, which collapses $\bar X$ into a point estimate, ignoring the uncertainty in that mean, the latent Mundlak machine captures this uncertainty by including the submodel on $X$.
- akin to a measurement error model
So, for the original graph, the treatment submodel is:

\begin{align*}
X_i &\sim \text{Normal}(\mu_{X,i}, \sigma_X) \\
\mu_{X,i} &= \alpha_X + \beta_{GX} U_{G[i]} \\
U_{G} &\sim \text{Normal}(0, 1) \\
\alpha_X &\sim \text{Normal}(0, 1) \\
\sigma_X &\sim \text{Exponential}(1)
\end{align*}

and the outcome submodel is:

\begin{align*}
Y_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \alpha_{Y,G[i]} + \beta_{XY}X_i + \beta_{GY} U_{G[i]}\\
U_G &\sim \text{Normal}(0, 1) \\
\beta_* &\sim \text{Normal}(0, 1) \\
\alpha_{GY} &\sim \text{Normal}(\bar a, \tau) \\
\bar a &\sim \text{Normal}(0, 1) \\
\tau &\sim \text{Exponential}(1)
\end{align*}
Random Confounds: Summary
- Should you use fixed effects? sometimes, but generally Full-luxury is the way to go
- Should you use Mundlak Machine / Average X?
- Sometimes: it does simplify numerical computation, but at the expense of losing uncertainty estimation.
- We usually have the compute power anyway, so why not just use FLB?
- Use full generative model
- no single solution, so just be explicit about model
Authors
- Ported to PyMC by Dustin Stansbury (2024)
- Based on Statistical Rethinking (2023) lectures by Richard McElreath
:::{include} ../page_footer.md
:::