(lecture_08)=

Markov Chain Monte Carlo

:::{post} Jan 7, 2024
:tags: statistical rethinking, bayesian inference, mcmc, hamiltonian monte carlo
:category: intermediate
:author: Dustin Stansbury
:::

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 08 - Markov Chain Monte Carlo

Lecture 08 - Markov Chain Monte Carlo

Real-world research

The real world is complicated, and requires more complex causal models. Numerical methods that leverage random number generators currently provide the most practical way to estimate the posterior distributions for such models.

Complexities due to measurement and noise

Example: marriage & divorce rate dataset

The graph we analyzed in Lecture 07

However, the real-world graph is likely much more complex

  • We don't actually observe Marriage $M$, Divorce $D$, and Age $A$ directly, but rather noisy measurements $M^*, D^*, A^*$ associated with each.
  • Besides measurement noise, there's also varying reliability due to the underlying Population $P$ that can affect the values our estimates $M^*, D^*, A^*$ take (e.g. smaller states have fewer observations, and thus less reliable estimates).

Complexities due to latent, unobserved causes

Example Testing Scores

  • Test scores $S$ are a function of
    • observed student features $X$
    • unobserved knowledge state $K$
    • test quality/difficulty $Q$

Social Networks

  • relationship from person or group $A$ to $B$, $R_{AB}$, may result in some outcome involving both, $Y_{AB}$
    • e.g. $A$ could be an admirer of $B$
  • similarly, but not necessarily symmetrically, $R_{BA}$ may result in some outcome involving both, $Y_{AB}$
    • e.g. $B$ may not be an admirer of $A$
  • There are also features of the people or groups $X_*$ that affect both the relationships and the outcomes
    • e.g. the wealth status of $A$ and/or $B$
  • Presumably, the interactions $Y_*$ do not involve some other person or group $C$
  • All of these non-outcome variables are potentially unobserved
  • Outcomes arise from some underlying social interaction structure; given that we can only measure some of the outcomes, we must infer the network structure

Problems and Solutions

Additional Complexities in Real-world Research

The following can make calculating posteriors difficult

  • many unknowns and unobserved
  • nested relationships -- so far models have been more-or-less additive, this is not always the case
  • bounded outcomes -- parameters that should have limited support
  • difficult posterior calculation
    • analytical approach -- rarely can we rely on closed-form mathematical solutions (e.g. conjugate prior scenarios)
    • grid approximation -- only works for small problems (curse of dimensionality)
    • quadratic approximation -- only works for subset of problems (e.g. multivariate Gaussian posteriors)
    • MCMC -- general, but computationally expensive -- it has become more accessible as computing power has increased.

AnAlYzE tHe DaTa

Here we dig into the black box used to estimate posteriors during the estimation phase of the workflow.

Computing the posterior

  1. Analytical Approach -- often impossible (outside of conjugate priors)
  2. Grid Approximation -- only pragmatic for small problems with few dimensions
  3. Quadratic Approximation -- only works for posteriors that can be approximated by multivariate Gaussians (no discrete variables)
  4. Markov Chain Monte Carlo -- intensive, but not beyond modern computing capabilities

King Markov visits the Metropolis Archipelago (aka Metropolis algorithm)

  1. flip a coin to choose to go to left or right island
  2. find population of proposal island $p_{proposal}$
  3. find population of current island $p_{current}$
  4. move to the proposal island with probability $\min\left(1, \frac{p_{proposal}}{p_{current}}\right)$
  5. repeat 1-4
  • In the long run, this process will visit each island in proportion to its population
  • If we use island population as a metaphor for probability density, then we can use the same algorithm to sample from any probability distribution (of any number of dimensions) using a random number generator
  • For Bayesian inference, we apply the Metropolis algorithm assuming:
    • islands $\rightarrow$ parameter values
    • population size $\rightarrow$ posterior probability
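The island-hopping loop above translates almost line-for-line into code. Below is a minimal sketch using numpy -- the `populations` array is an illustrative assumption, not lecture data -- verifying that the long-run visit frequencies match the normalized populations:

```python
import numpy as np

rng = np.random.default_rng(42)

# Island populations (metaphor for unnormalized posterior probability);
# these values are illustrative, not from the lecture.
populations = np.array([1, 2, 3, 4, 5, 4, 3, 2, 1], dtype=float)
n_islands = len(populations)

n_steps = 100_000
positions = np.empty(n_steps, dtype=int)
current = 4  # start on the center island

for step in range(n_steps):
    positions[step] = current
    # 1. flip a coin to propose the island to the left or right (with wraparound)
    proposal = (current + rng.choice([-1, 1])) % n_islands
    # 2-4. move with probability min(1, p_proposal / p_current)
    if rng.random() < populations[proposal] / populations[current]:
        current = proposal

# Long-run visit frequencies approximate the normalized populations
visit_freq = np.bincount(positions, minlength=n_islands) / n_steps
print(np.round(visit_freq, 3))
print(np.round(populations / populations.sum(), 3))
```

Note that the normalizing constant of the target distribution never appears -- only the ratio of populations matters, which is exactly why Metropolis sampling works for posteriors we can only evaluate up to a constant.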

Markov Chain Monte Carlo

  • Chain: sequence of draws from a proposal distribution
  • Markov Chain: history doesn't matter, only the current state of the parameter
  • Monte Carlo: a reference to randomization à la gambling in the town of Monte Carlo

Verify Algorithm approximates Gaussian distribution centered on center island

Verify on Poisson distribution

Verify for more exotic Bimodal distribution

Shoutout to Arianna Rosenbluth

  • performed the actual implementation of the original MCMC algorithm in code
  • implemented on MANIAC computer in the 1950s
  • We should probably call it the Rosenbluth algorithm

Basic Rosenbluth Algorithm

Effect of step size

Metropolis algorithm is great in that it

  • is quite simple
  • is general -- it can be applied to any distribution

However, it comes with some major downsides

  • As shown above, there is a tradeoff between exploring the shape of the distribution and accepting samples.
    • this only gets worse in models with higher-dimensional parameter spaces.
  • A large rejection rate can make the Metropolis algorithm prohibitive to use in practice.

Hamiltonian aka "Hybrid" Monte Carlo (HMC) 🛹

HMC is a more efficient sampling algorithm that takes advantage of gradient information about the posterior in order to make sampling more efficient -- exploring more of the space, while also accepting more proposed samples. It does so by associating each location in parameter space with a position in a physical simulation space. We then run a physics simulation using Hamiltonian dynamics, which is formulated in terms of position and momentum of a particle traveling on a surface. At any point on the surface the particle will have a potential energy U and kinetic energy K that are traded back and forth due to conservation of energy. HMC uses this energy conservation to simulate the dynamics.

HMC requires a few components:

  • A function U(position) that returns the "potential energy" from the physical dynamics perspective. This is equal to the negative log posterior given the data and the current parameter values.
  • The gradient of U(position), dU(position), which provides the partial derivatives of the potential energy with respect to each parameter, given the data and the current parameter state.
  • A function K(momentum) that returns the kinetic energy given the current momentum of the simulated particle.

For more details, see my (now ancient) blog post on MCMC: Hamiltonian Monte Carlo

Each step of HMC involves "flicking" a particle, inducing a particular momentum (velocity and direction), then allowing the particle to "roll" around the surface of the energy landscape. The direction the particle takes at any point in time is given by the gradient function dU(). To simulate nonlinear motion along the landscape, we take multiple consecutive linear steps (aka "leapfrog" steps). After a set number of steps, we then calculate the total energy of the system, which is the sum of the potential energy, given by U(), and the kinetic energy, which can be calculated directly from the current momentum. We then accept the proposed parameter state with probability based on the change in total energy between the start and end of the simulation; if energy were conserved exactly, every proposal would be accepted.
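The flick-leapfrog-accept cycle can be sketched for a single parameter. Here I assume a standard Gaussian target, so U(q) = q²/2 and dU(q) = q (the lecture's examples use a multivariate Gaussian; any differentiable log posterior works the same way):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target: standard Gaussian, so the negative log density (up to a
# constant) gives the potential energy and its gradient below.
def U(q):
    return 0.5 * q ** 2

def dU(q):
    return q

def K(p):
    return 0.5 * p ** 2  # kinetic energy of the momentum

def hmc_step(q, step_size=0.1, n_leapfrog=20):
    p = rng.normal()  # "flick": draw a random momentum
    q_new, p_new = q, p
    # leapfrog integration: half-step momentum, alternate full steps, half-step
    p_new -= 0.5 * step_size * dU(q_new)
    for i in range(n_leapfrog):
        q_new += step_size * p_new
        if i != n_leapfrog - 1:
            p_new -= step_size * dU(q_new)
    p_new -= 0.5 * step_size * dU(q_new)
    # accept based on the change in total energy H = U + K
    if rng.random() < np.exp(U(q) + K(p) - U(q_new) - K(p_new)):
        return q_new
    return q

q, samples = 0.0, []
for _ in range(3000):
    q = hmc_step(q)
    samples.append(q)
samples = np.array(samples)
print(samples.mean(), samples.std())  # should be near 0 and 1
```

Because the leapfrog integrator nearly conserves total energy, the acceptance rate stays high even for long trajectories -- the key advantage over random-walk Metropolis.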

Calculus is a Superpower

Autodiff

For the HMC examples above, our (log) posterior / potential energy had a specific functional form (namely a multi-dimensional Gaussian), and we had to write gradient code specific to that posterior. This custom approach to inference is generally prohibitive in that

  • it takes a lot of work to calculate the gradient with respect to each parameter
  • you have to recode (and debug) the algorithm for each model

Autodiff allows us to use any posterior function, giving us the gradients for free. Basically the function is broken down into a computation chain of the most basic mathematical operations. Each basic operation has its own simple derivative, so we can chain these derivatives together to provide the gradient of the entire function with respect to any parameter automatically.
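The chaining of simple derivatives can be illustrated with a toy forward-mode autodiff built on dual numbers. Real systems (e.g. the PyTensor backend used by PyMC) are far more sophisticated, but the principle is the same: each value carries its derivative, and every basic operation applies the chain rule.

```python
# Toy forward-mode autodiff: a Dual number carries (value, derivative),
# and arithmetic operations propagate derivatives via the chain rule.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (fg)' = f'g + fg'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def grad(f, x):
    # Seed the input's derivative with 1.0 and propagate it through f
    return f(Dual(x, 1.0)).deriv

# d/dx (x^2 + 3x) = 2x + 3, so at x = 2 the gradient is 7
print(grad(lambda x: x * x + 3 * x, 2.0))  # → 7.0
```

Nothing here is specific to the function being differentiated -- `grad` works for any composition of the supported operations, which is exactly what makes general-purpose PPLs possible.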

Stan / PyMC

McElreath uses Stan (accessed via R) to access autodiff and run general Bayesian inference. We'll use PyMC, an analogous probabilistic programming language (PPL).

2012 New Jersey Wine Judgement

  • 20 wines
    • 10 French, 10 NJ
  • 9 French & American Judges
  • 180 total scores
    • wines
      • 100 NJ
      • 80 FR
    • judges
      • 108 NJ
      • 72 FR

Estimand

  • Association between Wine Quality and Wine Origin
  • Stratify by judge (not required) for efficiency

Simplest Model

Make sure to start simple

$$
\begin{align*}
S_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= Q_{W[i]} \\
Q_{W[i]} &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Exponential}(1)
\end{align*}
$$

Note that Quality $Q_{W[i]}$ is an unobserved, latent variable.

Fit the simple, wine-specific model

Posterior Summary

Drawing the Markov Owl 🦉

Convergence Evaluation

  1. Trace Plots
  2. Trace Rank Plots
  3. R-hat Convergence Measure
  4. n_effective Samples
  5. Divergent Transitions

1. Trace Plots

Look for

  • "fuzzy caterpillar"
    • "wandering" paths indicate too small step size (large number of acceptance, but exploring only local space)
    • long periods of same value indicate too large step size (large number of rejections, not exploring space)
  • Each param should have similar density across chains

Looking at all parameters separately

so many fuzzy caterpillars!

If HMC is sampling correctly, you'll only need a few hundred samples to get a good idea of the shape of the posterior. This is because, under the correct initial conditions and model parameterization, HMC is very efficient.

That's a lot of information! Try Compact Form

2. Trace Rank "Trank" Plots

  • Don't want any one chain having largest or smallest rank for extended period of time.
  • "jumbled up" is good

Having the Trank plots for each chain hover around the average (dotted lines) is indicative of a set of well-mixed, efficient Markov chains.

3. R-hat

  • Chain-convergence metric
  • The variance ratio comparing within-chain variance to across-chain variance
  • Similar (opposite) to metrics used to train k-means
  • Idea is that if chains have converged, and are exploring the same space, then their within-chain variance should be more-or-less the same as the across-chain variance, providing an R-hat near 1.0
  • No guarantees. This IS NOT A TEST.
    • no magical values; but be wary of values over 1.1 or so 🤷‍♂️
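The within- vs. across-chain variance comparison can be sketched directly. This is a simplified Gelman-Rubin-style statistic; ArviZ reports a more robust rank-normalized version:

```python
import numpy as np

def r_hat(chains):
    """Simplified split-R-hat-style statistic; chains has shape (n_chains, n_draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    # between-chain variance: how much the per-chain means disagree
    B = n * chain_means.var(ddof=1)
    # within-chain variance: average variance inside each chain
    W = chains.var(axis=1, ddof=1).mean()
    # pooled variance estimate, then compare to within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(3)
mixed = rng.normal(size=(4, 1000))                       # 4 chains, same target
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])   # one chain stuck elsewhere

print(round(r_hat(mixed), 3))  # near 1.0: chains agree
print(round(r_hat(stuck), 3))  # well above 1.1: chains have not converged
```

When all chains explore the same region, B is small relative to W and the ratio approaches 1; a stuck or unmixed chain inflates B and pushes R-hat upward.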

4. Effective Sample Size (ESS)

  • reported as ess_bulk and ess_tail, which are different methods for estimating the number of effective samples
  • ess_bulk measures sampling efficiency in the bulk (center) of the distribution; ess_tail measures efficiency out in the distribution's tails
  • idea is to estimate the number of available samples that are close to being independent of one another
  • tends to be smaller than the number of actual samples drawn (because MCMC samples aren't truly independent in the real world)
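The idea of counting "close to independent" samples can be sketched with a crude autocorrelation-based estimator. ArviZ's `ess()` implements the proper bulk/tail versions reported by `az.summary()`; this sketch just illustrates the principle:

```python
import numpy as np

def effective_sample_size(x, max_lag=100):
    """Crude ESS estimate: n / (1 + 2 * sum of positive autocorrelations)."""
    n = len(x)
    x = x - x.mean()
    acf_sum = 0.0
    for lag in range(1, max_lag):
        rho = np.dot(x[:-lag], x[lag:]) / np.dot(x, x)
        if rho < 0.05:  # truncate once autocorrelation dies out
            break
        acf_sum += rho
    return n / (1 + 2 * acf_sum)

rng = np.random.default_rng(7)
independent = rng.normal(size=2000)

# AR(1) chain: consecutive draws are correlated, mimicking MCMC samples
correlated = np.empty(2000)
correlated[0] = 0.0
for t in range(1, 2000):
    correlated[t] = 0.9 * correlated[t - 1] + rng.normal()

print(int(effective_sample_size(independent)))  # close to 2000
print(int(effective_sample_size(correlated)))   # far fewer effective samples
```

The correlated chain has the same number of draws but carries much less information, which is exactly what a small ESS is flagging.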

Show diagnostics

ESS Plots

Local ESS Plot
  • Higher local ESS is better, indicating more efficient Markov chains
  • Efficient Markov chains should show relative ESS values that stay roughly constant across quantiles
ESS Evolution Plot
  • For Markov chains that are converging, ESS should increase consistently with the number of draws from the chain
  • ESS vs draws should be roughly linear for both bulk and tail

More complete model, stratify by Wine Origin, $O_{X[i]}$

$$
\begin{align*}
S_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= Q_{W[i]} + O_{X[i]} \\
Q_{W[i]} &\sim \text{Normal}(0, 1) \\
O_{X[i]} &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Exponential}(1)
\end{align*}
$$

Fit the wine origin model

MCMC Diagnostics

Posterior Summary

Including Judge Effects

  • Looking at the DAG (above), we see that judge discrimination is a competing cause of the measured score.
  • We include judge effects using an Item Response-type model for the mean score: $\mu_i = (Q_{W[i]} + O_{X[i]} - H_{J[i]})D_{J[i]}$
  • where the mean $\mu_i$ of the score $S_i$ is modeled as a linear combination of
    • wine quality $Q_{W[i]}$
    • wine origin $O_{X[i]}$
    • judge's baseline scoring harshness $H_{J[i]}$
      • a value $< 0$ indicates a judge is less likely than average to judge a wine badly
    • all of this is modulated (multiplied) by the judge's discrimination ability $D_{J[i]}$
      • a value $> 1$ indicates a judge gives larger score discrepancies for differences in wine
      • a value $< 1$ indicates a judge gives smaller score discrepancies for differences in wine
      • a value of $0$ indicates a judge gives all wines an average score

Statistical model

$$
\begin{align*}
S_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= (Q_{W[i]} + O_{X[i]} - H_{J[i]})D_{J[i]} \\
Q_{W[i]} &\sim \text{Normal}(0, 1) \\
O_{X[i]} &\sim \text{Normal}(0, 1) \\
H_{J[i]} &\sim \text{Normal}(0, 1) \\
D_{J[i]} &\sim \text{Exponential}(1) \\
\sigma &\sim \text{Exponential}(1)
\end{align*}
$$

Fit the judge model

Markov Chain Diagnostics

Posterior summary

Does Wine Origin Matter?

Not really, at least when comparing the posterior variables for wine origin, $O$.

Calculating the contrast distribution

Always be contrasting

The posterior contrast supports the same results. There is little difference--if any--between the quality of NJ and FR wine.
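Computing a contrast is just subtracting the two effects draw-by-draw across posterior samples, rather than comparing marginal intervals. A sketch with hypothetical posterior draws (in the notebook these would come from the fitted model's InferenceData, and the means/scales below are made up for illustration):

```python
import numpy as np

# Hypothetical posterior draws for the two origin effects; in the notebook
# these would be pulled from idata.posterior, one entry per origin (NJ, FR).
rng = np.random.default_rng(11)
O_nj = rng.normal(0.05, 0.1, size=4000)
O_fr = rng.normal(0.10, 0.1, size=4000)

# Always be contrasting: take the difference on each posterior draw
contrast = O_nj - O_fr
print(round(contrast.mean(), 3))        # posterior mean difference
print(round((contrast > 0).mean(), 3))  # posterior probability NJ scores higher
```

Because the contrast is computed per-draw, it automatically accounts for any posterior correlation between the two parameters, which comparing two marginal intervals does not.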

Words of Wisdom from Gelman

When you have computational problems, often there's a problem with your model.

Pay attention to

  • the scientific assumptions of your model
  • priors
  • variable preprocessing
  • sampler settings
  • Be wary of Divergences
    • These are HMC simulations that violate the Hamiltonian conditions; the resulting proposals are unreliable
    • Divergences are usually a symptom of a model that is designed incorrectly or that can be re-parameterized to improve sampling efficiency (e.g. non-centered priors)

Shout out to Spiegelhalter! 🐞🐛🪲

  • BUGS package started the desktop MCMC revolution

Authors

  • Ported to PyMC by Dustin Stansbury (2024)
  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

:::{include} ../page_footer.md
:::