(lecture_14)=

Correlated Features

:::{post} Jan 7, 2024 :tags: statistical rethinking, bayesian inference, correlated features :category: intermediate :author: Dustin Stansbury :::

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 14 - Correlated Features

Revisiting Bangladesh Fertility

  • Estimand 1: Contraceptive use $C$ in each District $D$
  • Estimand 2: Effect of "urbanity" $U$
  • Estimand 3: Effect of # of kids $K$ and age $A$

Districts are CLUSTERS

Rural / Urban breakdown

Below we show the contraceptive use for rural groups (including only a district-level intercept $\alpha_{D[i]}$) vs urban groups (which include an additional intercept $\beta_{D[i]}$ for urban districts). These plots were generated in Lecture 13 - Multilevel Adventures.

The plot shows:

  • Rural populations are, on average, less likely to use contraception than urban populations
    • the dashed line in the top plot is lower than that in the lower plot
  • There are fewer urban samples, so there is a larger range of uncertainty for urban areas
    • the blue error bars associated with urban populations are larger than those for rural areas, where more samples were collected

Urbanism is a Correlated Feature

Below we plot the probability of contraceptive use for urban $p(C | U=1)$ vs rural $p(C|U=0)$ areas. These plots were generated in Lecture 13 - Multilevel Adventures.

The plot shows that contraceptive use between urban and rural observations is correlated (correlation coefficient > 0.7)

🤔

  • If we know about the contraceptive use in rural areas, we can make a better guess about urban contraceptive use because of this correlation
  • We are leaving information on the table by not letting the Golem--i.e. our statistical model--leverage this feature correlation

This lecture focuses on building statistical models that can capture correlation information amongst features.

  • makes inference more efficient by using partial pooling across features

Estimating Correlation and Partial Pooling Demo

McElreath goes on to show a demo of Bayesian updating in a model with correlated features. It's great, and I recommend going over the demo a few times, but I'm too lazy to implement it (nor clever enough to do so without an animation, which I'm trying to avoid). It'll go on the TODO list.

Adding Correlated Features

  • One prior distribution for each cluster/subgroup $\rightarrow a_j \sim \text{Normal}(\bar a , \sigma)$; allows partial pooling
  • If one feature, then one-dimensional prior
    • $a_j \sim \text{Normal}(\bar a, \sigma)$
  • N-dimensional distribution for N features
    • rather than independent distributions (as we've been doing in previous lectures), we use joint distributions for features
    • $a_{j, 1..N} \sim \text{MVNormal}(A, \Sigma)$
    • e.g. $[\alpha_j, \beta_j] \sim \text{MVNormal}([\bar \alpha, \bar \beta], \Sigma_{\alpha, \beta})$
  • Hard part: learning associations/correlations amongst features
    • "Estimating means is easy"
    • "Estimating standard deviations is hard"
    • "Estimating correlations is very hard"
  • That said, it's usually a good idea to try to learn correlations:
      • if you don't have enough samples / signal, you'll fall back on the prior

Previous model -- Using uncorrelated urban features

This is the uncentered implementation, more-or-less copy-pasted from the previous Lecture
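
Since the original code cell isn't reproduced here, below is a minimal PyMC sketch of what such an uncorrelated, non-centered model looks like. The data arrays `D` (district index), `U` (urban indicator), `C` (contraceptive use), and `n_districts` are hypothetical placeholders, not the actual notebook variables.

```python
import numpy as np
import pymc as pm

# Hypothetical stand-ins for the Bangladesh fertility data
rng = np.random.default_rng(0)
n_districts = 60
D = rng.integers(0, n_districts, size=500)  # district index per observation
U = rng.integers(0, 2, size=500)            # urban (1) / rural (0) indicator
C = rng.integers(0, 2, size=500)            # contraceptive-use outcome

with pm.Model() as uncorrelated_model:
    # District-level intercepts (rural baseline), non-centered
    a_bar = pm.Normal("a_bar", 0.0, 1.0)
    sigma = pm.Exponential("sigma", 1.0)
    z_alpha = pm.Normal("z_alpha", 0.0, 1.0, shape=n_districts)
    alpha = pm.Deterministic("alpha", a_bar + z_alpha * sigma)

    # District-level urban offsets, independent of the intercepts
    b_bar = pm.Normal("b_bar", 0.0, 1.0)
    tau = pm.Exponential("tau", 1.0)
    z_beta = pm.Normal("z_beta", 0.0, 1.0, shape=n_districts)
    beta = pm.Deterministic("beta", b_bar + z_beta * tau)

    p = pm.math.invlogit(alpha[D] + beta[D] * U)
    pm.Bernoulli("C", p=p, observed=C)
```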

Demonstrate that feature priors are independent in uncorrelated model

Below we sample from the prior, and show that the $\alpha$ and $\beta$ parameters for all districts are aligned with the cardinal axes, indicating no correlation
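
Continuing with the hypothetical `uncorrelated_model` sketch above, one way to check this is to draw from the prior and plot (or correlate) a district's $\alpha$ and $\beta$ draws:

```python
import matplotlib.pyplot as plt

with uncorrelated_model:
    prior = pm.sample_prior_predictive(1000, random_seed=1)

# Prior draws of alpha and beta for one (arbitrary) district
a0 = prior.prior["alpha"].values[..., 0].ravel()
b0 = prior.prior["beta"].values[..., 0].ravel()

plt.scatter(a0, b0, alpha=0.2)
plt.xlabel(r"$\alpha_0$")
plt.ylabel(r"$\beta_0$")

# Correlation is ~0: the priors are independent
print(np.corrcoef(a0, b0)[0, 1])
```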

Above we can see the priors that the uncorrelated model provides. We can tell they are uncorrelated because:

  • the primary axes of the distribution lie orthogonally along the axes of the parameter values
    • i.e. there's no "tilt" of the PDF contour

Multivariate Normal Prior

$$
\begin{align*}
[\alpha_j, \beta_j] &= \text{MVNormal}([\bar \alpha, \bar \beta], R, [\sigma, \tau]) \\
[\alpha_j, \beta_j] &:= \text{features for district } j \\
R &:= \text{correlation matrix amongst features} \\
[\sigma, \tau] &:= \text{(independent) standard deviations for each parameter}
\end{align*}
$$

Samples from multivariate prior
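
Since the plotting cell isn't reproduced here, below is a small NumPy sketch of drawing from such a joint prior. The means, standard deviations, and the illustrative correlation of -0.5 are arbitrary choices for demonstration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

a_bar, b_bar = 0.0, 0.0    # feature means
sigma, tau = 1.0, 0.5      # feature standard deviations (arbitrary)
rho = -0.5                 # illustrative correlation between alpha and beta

# Covariance = diag(sds) @ R @ diag(sds)
R = np.array([[1.0, rho],
              [rho, 1.0]])
S = np.diag([sigma, tau])
Sigma = S @ R @ S

# One [alpha_j, beta_j] pair per (hypothetical) district
ab = rng.multivariate_normal([a_bar, b_bar], Sigma, size=1000)
print(np.corrcoef(ab.T))   # off-diagonal entry ~ rho
```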

Model with correlated urban features

$$
\begin{align*}
C_i &\sim \text{Bernoulli}(p_i) \\
p_i &= \alpha_{D[i]} + \beta_{D[i]} U_i \\
[\alpha_j, \beta_j] &\sim \text{MVNormal}([\bar \alpha, \bar \beta], R, [\sigma, \tau]) \\
\sigma &\sim \text{Exponential}(1) \\
\tau &\sim \text{Exponential}(1) \\
R &\sim \text{LKJCorr}(4)
\end{align*}
$$

On the LKJCorr prior

  • Named after Lewandowski-Kurowicka-Joe, the authors of the original paper on the distribution
  • Prior for correlation matrices
  • Takes a shape parameter $\eta>0$ that changes the distribution over correlation matrices
    • $\eta=1$ results in a uniform distribution over correlation matrices
    • as $\eta \rightarrow \infty$, $R$ approaches the identity matrix

Correlation Matrices Sampled from LKJCorr

We can see that as $\eta$ becomes larger, the sampled correlation matrices become more aligned with the axes of the parameter space, i.e. they approach diagonal (identity-like) matrices with little off-diagonal correlation.
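
As a quick numerical illustration of the same point, here is a sketch assuming PyMC's `pm.LKJCorr`, which for a 2x2 matrix returns the single off-diagonal correlation; we can draw correlations for a few values of $\eta$ and watch them concentrate around zero:

```python
import numpy as np
import pymc as pm

# Draw the off-diagonal correlation of a 2x2 matrix for increasing eta
for eta in [1, 4, 10, 50]:
    r = pm.draw(pm.LKJCorr.dist(n=2, eta=eta), draws=2000)
    print(f"eta={eta:>3}: mean |rho| = {np.abs(r).mean():.2f}")
```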

Model with correlated features

A couple of notes ⚠️

  • We jump straight to the implementation with uncentered priors here, skipping the centered version. For details on the uncentered implementation, see the PyMC docs.
  • We use LKJCholeskyCov instead of LKJCorr for improved numerical stability and performance. For details, see https://discourse.pymc.io/t/uses-of-lkjcholeskycov-and-lkjcorr/3629/3
  • We don't need to implement the Chol_to_Corr function from the lecture, as we can pull the correlations directly out of the LKJCholeskyCov distribution
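
Putting those notes together, here is a sketch of how such a correlated-features model might be written in PyMC, reusing the placeholder arrays `D`, `U`, `C`, and `n_districts` from above. The actual notebook cell may differ in details such as priors, dims, and variable names.

```python
with pm.Model() as correlated_model:
    # Cholesky factor of the covariance amongst [alpha, beta];
    # sd_dist supplies the Exponential(1) priors on sigma and tau
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol_cov", n=2, eta=4, sd_dist=pm.Exponential.dist(1.0)
    )

    # Non-centered parameterization: z-scores transformed by the Cholesky factor
    z = pm.Normal("z", 0.0, 1.0, shape=(2, n_districts))
    v = pm.math.dot(chol, z).T  # (n_districts, 2) correlated offsets

    a_bar = pm.Normal("a_bar", 0.0, 1.0)
    b_bar = pm.Normal("b_bar", 0.0, 1.0)
    alpha = pm.Deterministic("alpha", a_bar + v[:, 0])
    beta = pm.Deterministic("beta", b_bar + v[:, 1])

    # The alpha-beta correlation comes for free from LKJCholeskyCov
    pm.Deterministic("R_alpha_beta", corr[0, 1])

    p = pm.math.invlogit(alpha[D] + beta[D] * U)
    pm.Bernoulli("C", p=p, observed=C)
```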

Analyzing feature correlation

Compare to prior

Comparing to the model that does not model feature correlations: the uncorrelated-features model exhibits a much weaker negative correlation between $\alpha$ and $\beta$ (i.e. the red ellipse on the left is less "tilted" downward).

Comparing models on the outcome scale

  • Modeling feature correlation allows the model to capture urban variance differently. Though the effect is subtle, the correlated-feature model (top) exhibits a smaller inner-percentile range (IPR) (0.45-0.65) than the uncorrelated-feature model (0.41-0.65).
  • ⚠️ This is actually the opposite of what was reported in the lecture. My intuition would be that there would be less posterior variability, because we're sharing information across features and thus reducing our uncertainty. But I could easily be doing or interpreting something incorrectly here.

Looking at outcome space

Including feature correlations

  • Points move because the transfer of information across features provides better estimates of $p(C)$
  • There's a negative correlation between rural contraceptive use and the difference between urban and rural contraceptive use within a district
    • When a district's rural areas have high contraceptive use, the difference will be smaller (i.e. urban areas will also have high contraceptive use)
    • When rural contraceptive use is low, urban areas will differ more (i.e. urban areas will still tend to have higher contraceptive use)

Review: Correlated Varying Effects

Priors that learn correlations amongst features provide the following benefits:

  1. partial pooling across features -- memory, shared information improves efficiency
  2. learns correlations -- this is the focus of research questions

💡 Varying effects can be correlated and can still be learned without priors that incorporate correlation structure. However, incorporating correlations explicitly allows partial pooling, and is thus more efficient and provides more explicit information about those correlations.

BONUS: Non-centered (aka Transformed) Priors

Inconvenient Posteriors

  • Inefficient MCMC caused by steep curvature in the neg-log-likelihood
  • Hamiltonian Monte Carlo has trouble exploring steep surface
    • leading to "divergent" transitions
    • "punching through wall of skate park"
    • detected when the Hamiltonian changes value dramatically between start/end of proposal
  • Transforming the prior to be a "smoother skate park" helps

Example: Devil's Funnel prior

$$
\begin{align*}
v &\sim \text{Normal}(0, \sigma_v) \\
x &\sim \text{Normal}(0, \exp(v))
\end{align*}
$$

As $\sigma_v$ increases, a nasty trough forms in the prior probability surface that is difficult for Hamiltonian dynamics to sample--i.e. the skatepark is too steep and narrow.

I'm too lazy to code up the fancy HMC animation that McElreath uses to visualize each divergent path. However, we can still verify that the number of divergences increases as we increase $\sigma_v$ in the devil's funnel prior.

Centered-prior models
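
Below is a minimal sketch of the kind of experiment meant here; the sampler settings and the $\sigma_v$ values are illustrative guesses, not the notebook's.

```python
import pymc as pm

def count_divergences_centered(sigma_v, seed=12):
    """Sample the centered devil's funnel and count divergent transitions."""
    with pm.Model():
        v = pm.Normal("v", 0.0, sigma_v)
        x = pm.Normal("x", 0.0, pm.math.exp(v))
        # Sampler settings here are illustrative
        idata = pm.sample(1000, tune=1000, chains=2,
                          random_seed=seed, progressbar=False)
    return int(idata.sample_stats["diverging"].sum())

# Divergences tend to increase with the width of the prior on v
for sigma_v in [0.5, 1.0, 3.0]:
    print(sigma_v, count_divergences_centered(sigma_v))
```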

What to do?

  • Smaller step size: handles steepness better, but takes longer to explore the posterior
  • Re-parameterize (transform) to make the surface smoother

Non-centered prior

We add an auxiliary variable $z$ that has a smooth probability surface. We then sample that auxiliary variable and transform it to obtain the target variable's distribution. For the case of the devil's funnel prior:

$$
\begin{align*}
v &\sim \text{Normal}(0, \sigma_v) \\
z &\sim \text{Normal}(0, 1) \\
x &= z \exp(v)
\end{align*}
$$
  • By reparameterizing, we get to sample multi-dimensional Normal distributions, which are smoother parabolas in the log space.

  • We can see that the number of divergences is consistently reduced for all values of the prior variance (a sketch of this comparison follows below).
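
For comparison with the centered sketch above, here is the non-centered version of the same funnel, with `x` defined as a deterministic transform of the smooth auxiliary variable `z`:

```python
def count_divergences_noncentered(sigma_v, seed=12):
    """Sample the non-centered devil's funnel and count divergent transitions."""
    with pm.Model():
        v = pm.Normal("v", 0.0, sigma_v)
        z = pm.Normal("z", 0.0, 1.0)
        x = pm.Deterministic("x", z * pm.math.exp(v))
        # Sampler settings here are illustrative
        idata = pm.sample(1000, tune=1000, chains=2,
                          random_seed=seed, progressbar=False)
    return int(idata.sample_stats["diverging"].sum())

# Divergence counts stay low even for wide priors on v
for sigma_v in [0.5, 1.0, 3.0]:
    print(sigma_v, count_divergences_noncentered(sigma_v))
```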

Check HMC Diagnostics

Diagnostics look good 👍

  • Rhats = 1
  • $v$ and $z$ chains exhibit "fuzzy caterpillar" behavior

Cholesky Factors & Correlation Matrices

  • The Cholesky factor $\bf L$ provides an efficient means of encoding a correlation matrix $\Omega$ (it requires fewer floating point numbers than the full correlation matrix)
  • $\Omega = \bf LL^T$
  • $\bf L$ is a lower-triangular matrix
  • we can sample data with correlation $\Omega$, and standard deviations $\sigma_i$, using the Cholesky factorization $\bf L$ of $\Omega$ as follows:
$$
X_\Omega \sim \text{diag}(\sigma_i)\mathbf{L}\mathbf{Z},
$$

where $\bf Z$ is a matrix of z-scores sampled from a standard normal distribution

Demonstrate reconstructing correlation matrix from Cholesky factor
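
A quick NumPy sketch of this; the 0.7 correlation is an arbitrary choice for illustration:

```python
import numpy as np

# An arbitrary 2x2 correlation matrix
Omega = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

# Lower-triangular Cholesky factor of Omega
L = np.linalg.cholesky(Omega)
print(L)

# Reconstruct the correlation matrix: Omega = L @ L.T
print(np.allclose(L @ L.T, Omega))  # True
```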

Demonstrate sampling random correlation matrix from z-scores and the Cholesky factor
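
And a sketch of generating correlated samples from standard-normal z-scores, using the same (arbitrary) `Omega`, its Cholesky factor, and arbitrary standard deviations:

```python
import numpy as np

Omega = np.array([[1.0, 0.7],
                  [0.7, 1.0]])
L = np.linalg.cholesky(Omega)

rng = np.random.default_rng(3)
sigmas = np.array([1.0, 2.0])       # arbitrary standard deviations
Z = rng.normal(size=(2, 100_000))   # matrix of standard-normal z-scores

# X = diag(sigma) @ L @ Z has correlation Omega and the requested stds
X = np.diag(sigmas) @ L @ Z
print(np.corrcoef(X))               # off-diagonal ~ 0.7
print(X.std(axis=1))                # ~ [1.0, 2.0]
```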

When to use Transformed Priors

  • depends on context
  • Centered:
    • Lots of data in each cluster/subgroup
    • likelihood-dominant scenario
  • Non-centered:
    • Many clusters/subgroups, with sparse data in some of the clusters
    • prior-dominant scenario

Authors

  • Ported to PyMC by Dustin Stansbury (2024)
  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

:::{include} ../page_footer.md :::