(lecture_08)=
:::{post} Jan 7, 2024
:tags: statistical rethinking, bayesian inference, mcmc, hamiltonian monte carlo
:category: intermediate
:author: Dustin Stansbury
:::
This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.
Video - Lecture 08 - Markov Chain Monte Carlo

# Lecture 08 - Markov Chain Monte Carlo
The real world is complicated and requires more complex causal models. Numerical methods that leverage random number generators currently provide the most practical way to estimate posterior distributions for these more complex models.
The following can make calculating posteriors difficult:
Here we dig into the black box used to estimate posteriors during the estimation phase of the workflow.
HMC is a sampling algorithm that takes advantage of gradient information about the posterior to make sampling more efficient -- exploring more of the parameter space while also accepting more of its proposals. It does so by associating each location in parameter space with a position in a physical simulation space. We then run a physics simulation using Hamiltonian dynamics, which is formulated in terms of the position and momentum of a particle traveling on a surface. At any point on the surface the particle has a potential energy U and a kinetic energy K that are traded back and forth due to conservation of energy. HMC uses this energy conservation to simulate the dynamics.
Implementing HMC requires a few functions:

- `U(position)`, which returns the "potential energy" from the physical dynamics perspective. This is equal to the negative log posterior given the data and the current parameter values.
- `dU(position)`, which provides the partial derivatives of the potential energy with respect to the parameters, given the data and the current parameter state.
- `K(momentum)`, which returns the kinetic energy given the current momentum of the simulated particle.

For more details, see my (now ancient) blog post on MCMC: Hamiltonian Monte Carlo
Each step of HMC involves "flicking" the particle, inducing it with a particular momentum (velocity and direction), then allowing the particle to "roll" around the surface of the energy landscape. The direction the particle takes at any point in time is given by the function dU(). To simulate nonlinear motion along the landscape, we take multiple consecutive linear steps (aka "leapfrog" steps). After a set number of steps, we calculate the total energy of the system, which is the sum of the potential energy, given by U(), and the kinetic energy, which can be calculated directly from the current momentum. We then compare the total energy of the proposed state to that of the initial state (before the dynamics): proposals that lower the total energy are always accepted, while proposals that raise it are accepted randomly, with probability that shrinks in proportion to the energy increase.
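To make these mechanics concrete, here is a minimal sketch of HMC for a hypothetical standard multivariate Gaussian posterior. The function names (`U`, `dU`, `K`, `hmc_step`), the step size, and the number of leapfrog steps are illustrative choices, not the implementation used in the lecture or inside PyMC.

```python
import numpy as np

rng = np.random.default_rng(1)


def U(position):
    """Potential energy = negative log posterior (standard Gaussian, up to a constant)."""
    return 0.5 * np.sum(position**2)


def dU(position):
    """Partial derivatives of the potential energy w.r.t. the parameters."""
    return position


def K(momentum):
    """Kinetic energy of the simulated particle."""
    return 0.5 * np.sum(momentum**2)


def hmc_step(position, step_size=0.1, n_leapfrog=20):
    """One HMC transition: flick the particle, leapfrog, then accept/reject."""
    momentum = rng.normal(size=position.shape)  # the random "flick"
    initial_energy = U(position) + K(momentum)

    q, p = position.copy(), momentum.copy()
    p -= 0.5 * step_size * dU(q)        # half-step for momentum
    for _ in range(n_leapfrog - 1):
        q += step_size * p              # full-step for position
        p -= step_size * dU(q)          # full-step for momentum
    q += step_size * p
    p -= 0.5 * step_size * dU(q)        # closing half-step for momentum

    proposed_energy = U(q) + K(p)
    # Metropolis acceptance based on the change in total energy
    if rng.uniform() < np.exp(initial_energy - proposed_energy):
        return q
    return position


samples = [np.zeros(2)]
for _ in range(500):
    samples.append(hmc_step(samples[-1]))
samples = np.array(samples)  # approximate draws from the 2D Gaussian posterior
```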
For the HMC examples above, our (log) posterior / potential energy had a specific functional form (namely a multi-dimensional Gaussian), and we had to write gradient code tailored to that particular posterior. This custom approach to inference is generally prohibitive in that every new model requires deriving and implementing its gradients by hand.
Autodiff allows us to use any posterior function, giving us the gradients for free. Basically the function is broken down into a computation chain of the most basic mathematical operations. Each basic operation has its own simple derivative, so we can chain these derivatives together to provide the gradient of the entire function with respect to any parameter automatically.
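As a small illustration (not part of the lecture), here's what autodiff looks like in PyTensor, the computational backend PyMC uses. The toy "potential energy" below is again a standard Gaussian negative log density:

```python
import pytensor
import pytensor.tensor as pt

# Toy potential energy: negative log density of a standard Gaussian,
# expressed as a chain of basic operations (square, sum, scale)
x = pt.vector("x")
neg_logp = 0.5 * pt.sum(x**2)

# Autodiff chains the derivatives of those basic operations together,
# giving us dU/dx without deriving it by hand
dU = pytensor.grad(neg_logp, wrt=x)
grad_fn = pytensor.function([x], dU)

print(grad_fn([1.0, -2.0, 0.5]))  # [ 1.  -2.   0.5]
```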
McElreath uses Stan (accessed via R) for autodiff and general Bayesian inference. We'll use PyMC, an analogous probabilistic programming language (PPL).
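As a minimal sketch of the PyMC workflow (using made-up data, not the lecture's wine dataset), we define the model inside a `pm.Model` context and call `pm.sample`, which by default runs NUTS, an adaptive HMC variant, using autodiff gradients:

```python
import numpy as np
import pymc as pm

# Hypothetical toy data, just to show the mechanics
rng = np.random.default_rng(12345)
y_obs = rng.normal(loc=1.0, scale=2.0, size=50)

with pm.Model() as toy_model:
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.Exponential("sigma", lam=1)
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)

    # NUTS (adaptive HMC) is the default sampler for continuous parameters
    idata = pm.sample(draws=1000, tune=1000, chains=4, random_seed=12345)
```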
Make sure to start simple
Note that Quality $Q_{W[i]}$ is an unobserved, latent variable.
Look for
so many fuzzy caterpillars!
If HMC is sampling correctly, you'll only need a few hundred samples to get a good idea of the shape of the posterior. This is because, under the correct initial conditions and model parameterization, HMC is very efficient.
Having the trank plots for each chain hovering around the average (dotted lines) is indicative of a set of well-mixed, efficient Markov chains.
- `ess_*`, where `bulk` and `tail` are different methods for estimating the effective number of samples
  - `ess_bulk` measures sampling efficiency in the bulk of the posterior (e.g. for means and medians), while `ess_tail` measures it in the tails (e.g. for quantiles and interval endpoints)
  - Do `bulk` and `tail` differ much here? Not really, at least when comparing the posterior variables for wine origin, $O$
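For reference, these diagnostics can be pulled from the sampling trace with ArviZ (assuming an `InferenceData` object named `idata`, as in the toy example above):

```python
import arviz as az

# Tabulate r_hat, ess_bulk, and ess_tail for each parameter
print(az.summary(idata))

# Trace plots (the "fuzzy caterpillars") and trace-rank (trank) plots
az.plot_trace(idata)
az.plot_rank(idata)
```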
The posterior contrast supports the same conclusion: there is little difference--if any--between the quality of NJ and FR wine.
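A contrast like this is computed directly from the posterior samples. Here's a hypothetical sketch that assumes the wine model's trace is named `idata_wine` and holds an origin-level quality parameter `Q_O` with an `origin` coordinate; the actual variable names in the notebook's model may differ:

```python
import arviz as az

# Sample-by-sample difference between New Jersey and French origin effects
posterior = az.extract(idata_wine)  # flattens chains/draws into one sample dim
contrast = posterior["Q_O"].sel(origin="NJ") - posterior["Q_O"].sel(origin="FR")

# Plot the contrast; mass near zero means little evidence for a difference
az.plot_posterior(contrast, ref_val=0)
```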
When you have computational problems, often there's a problem with your model.
Pay attention to
:::{include} ../page_footer.md :::