(lecture_16)=

Gaussian Processes

:::{post} Jan 7, 2024
:tags: statistical rethinking, bayesian inference, gaussian processes
:category: intermediate
:author: Dustin Stansbury
:::

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 16 - Gaussian Processes

Returning to Oceanic Technology Dataset

Now that we have some more advanced statistical tools, we'll deal with some of the confounds we glossed over earlier.

  • Total tools in society $T$ (outcome)
  • Population $P$ (treatment)
  • Contact level $C$
  • Unobserved confounds $U$, e.g.
    • materials
    • weather

There should be spatial covariation:

  • islands closer to one another should have similar unobserved confounds
  • islands should share historical context
  • we don't have to know the exact dynamics of the contexts; we can handle them statistically at a macro scale

Review of Technological Innovation/Loss model

  • Derived from the steady state of the difference equation $\Delta T = \alpha P^\beta - \gamma T$
    • $\alpha$ is rate of innovation
    • $P$ is population
    • $\beta$ is elasticity rate (diminishing returns)
    • $\gamma$ is loss (e.g. forgetting) rate over time
  • Setting $\Delta T = 0$ gives the expected number of tools over the long run (see the short derivation after this list):
    • $\hat T = \frac{\alpha P^\beta}{\gamma}$
    • use $\lambda= \hat T$ as the expected rate parameter for a $\text{Poisson}(\lambda)$ distribution
  • How can we encode spatial covariance in this model?
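
For reference, here is the short derivation mentioned above: set the change in tools $\Delta T$ to zero at the steady state and solve for the long-run tool count $\hat T$.

$$
\begin{align*}
0 &= \alpha P^\beta - \gamma \hat T \\
\gamma \hat T &= \alpha P^\beta \\
\hat T &= \frac{\alpha P^\beta}{\gamma}
\end{align*}
$$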

Let's start off with an easier model that ignores population

$$
\begin{align*}
T_i &\sim \text{Poisson}(\lambda_i) \\
\log(\lambda_i) &= \bar \alpha + \alpha_{S[i]} \\
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{10} \end{pmatrix} &\sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \textbf{K} \right)
\end{align*}
$$
  • $\alpha_{S[i]}$ are the variable intercepts for each society
    • These will have covariation based on their spatial distance from one another
    • closer societies have similar offsets
  • $\mathbf{K}$ is the covariance amongst societies
    • models covariance as a function of distance
    • though $\mathbf{K}$ has many parameters (45 covariances), we should be able to leverage spatial regularity to estimate far fewer effective parameters. This is good because we only have 10 data points 😅

Gaussian Processes (in the abstract)

  • Uses a kernel function to generate covariance matrices

    • Requires far fewer parameters than the covariance matrix entries
    • Can generate covariance matrices of any dimension (i.e. an "infinite-dimensional generalization of the MVNormal")
    • Can generate a prediction for any point
  • Kernel function calculates the covariance of two points based on a metric of comparison, e.g.

    • spatial distance
    • difference in time
    • difference in age

Different Kernel Functions

The Quadratic (L2) Kernel Function

  • aka RBF (Radial Basis Function)
  • aka Gaussian Kernel
$$
K(x_i, x_j) = \eta^2 \exp \left(- \frac{(x_i - x_j)^2}{\sigma^2} \right)
$$

Ornstein-Uhlenbeck Kernel

$$
\begin{align*}
K(x_i, x_j) &= \eta^2 \exp \left(- \frac{|x_i - x_j|}{\sigma} \right) \\
\sigma &= \frac{1}{\rho}
\end{align*}
$$

Periodic Kernel

$$
K(x_i, x_j) = \eta^2 \exp \left(- \frac{\sin^2((x_i - x_j) / p)}{\sigma^2} \right)
$$

with additional periodicity parameter, $p$
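
As a minimal sketch of the three kernels above (the function names and default parameter values are illustrative choices, not from the lecture):

```python
import numpy as np

def quadratic_kernel(xi, xj, eta=1.0, sigma=1.0):
    """Quadratic / RBF / Gaussian kernel."""
    return eta**2 * np.exp(-((xi - xj) ** 2) / sigma**2)

def ornstein_uhlenbeck_kernel(xi, xj, eta=1.0, rho=1.0):
    """Ornstein-Uhlenbeck (L1) kernel, parameterized with rho = 1 / sigma."""
    return eta**2 * np.exp(-np.abs(xi - xj) * rho)

def periodic_kernel(xi, xj, eta=1.0, sigma=1.0, p=1.0):
    """Periodic kernel with additional period parameter p."""
    return eta**2 * np.exp(-np.sin((xi - xj) / p) ** 2 / sigma**2)
```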

Gaussian Process Prior

Using the Gaussian/L2 Kernel Function to generate our covariance $\mathbf{K}$, we can adjust a few parameters

  • Varying $\sigma$ controls the bandwidth of the kernel function

    • smaller $\sigma$ makes covariance fall off quickly with space
    • larger $\sigma$ allows covariance to extend over larger distances
  • Varying $\eta$ controls the maximum degree of covariance

1-D Examples

Below we draw functions from a 1-D Gaussian process, varying either $\sigma$ or $\eta$ to demonstrate the effect of the parameters on the spatial correlation of the samples.
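
A rough sketch of how such prior draws can be generated with NumPy (the grid, seed, and parameter values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(12)
x = np.linspace(-3, 3, 100)

def sample_gp_prior(x, eta, sigma, n_samples=5):
    # quadratic (RBF) kernel evaluated at every pair of grid points
    K = eta**2 * np.exp(-((x[:, None] - x[None, :]) ** 2) / sigma**2)
    K += np.eye(len(x)) * 1e-9  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(x)), K, size=n_samples)

# smaller sigma: covariance falls off quickly, so samples wiggle rapidly
samples_sigma_small = sample_gp_prior(x, eta=1.0, sigma=0.25)
# larger sigma: covariance extends further, so samples are smoother
samples_sigma_large = sample_gp_prior(x, eta=1.0, sigma=2.0)
# eta controls the maximum covariance, i.e. the amplitude of the sampled functions
samples_eta_small = sample_gp_prior(x, eta=0.5, sigma=1.0)
samples_eta_large = sample_gp_prior(x, eta=2.0, sigma=1.0)
```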

Varying $\sigma$

Varying $\eta$

Gaussian Process Posterior

As the model observes data, it adjusts the posterior to account for that data while also incorporating the smoothness constraints of the prior.
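
To make the prior-to-posterior update concrete, here is a minimal sketch of the standard GP conditioning formulas, using toy observations rather than the lecture's data:

```python
import numpy as np

def rbf(a, b, eta=1.0, sigma=1.0):
    return eta**2 * np.exp(-((a[:, None] - b[None, :]) ** 2) / sigma**2)

def gp_posterior(x_obs, y_obs, x_star, eta=1.0, sigma=1.0, noise=0.1):
    # posterior mean and covariance of f(x_star) given noisy observations
    K = rbf(x_obs, x_obs, eta, sigma) + noise**2 * np.eye(len(x_obs))
    K_s = rbf(x_star, x_obs, eta, sigma)
    K_ss = rbf(x_star, x_star, eta, sigma)
    K_inv = np.linalg.inv(K)
    post_mean = K_s @ K_inv @ y_obs        # data pull the mean toward the observations
    post_cov = K_ss - K_s @ K_inv @ K_s.T  # uncertainty shrinks near the observations
    return post_mean, post_cov

x_obs = np.array([-1.0, 0.0, 2.0])
y_obs = np.array([0.5, -0.2, 1.0])
x_star = np.linspace(-3, 3, 50)
mu_star, cov_star = gp_posterior(x_obs, y_obs, x_star)
```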

Distance-based model

$$
\begin{align}
T_i &\sim \text{Poisson}(\lambda_i) \\
\log(\lambda_i) &= \bar \alpha + \alpha_{S[i]} \\
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{10} \end{pmatrix} &\sim \text{MVNormal}\left( \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \textbf{K} \right) \\
k_{i,j} &= \eta^2 \exp(-\rho^2 D_{i,j}^2) \\
\eta^2 &\sim \text{Exponential}(2) \\
\rho^2 &\sim \text{Exponential}(0.5)
\end{align}
$$
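
A minimal PyMC sketch of this specification, assuming `D_islands` is the 10×10 matrix of pairwise island distances and `T` the observed tool counts (both loaded elsewhere); the $\text{Normal}(3, 0.5)$ prior on $\bar \alpha$ is an assumption, since it isn't written out above:

```python
import numpy as np
import pymc as pm

with pm.Model() as islands_distance_model:
    alpha_bar = pm.Normal("alpha_bar", 3, 0.5)  # assumed prior for the global intercept
    eta_squared = pm.Exponential("eta_squared", 2)
    rho_squared = pm.Exponential("rho_squared", 0.5)

    # quadratic-distance kernel built from the island distance matrix, plus jitter
    K = eta_squared * pm.math.exp(-rho_squared * D_islands**2) + np.eye(10) * 1e-6
    alpha_society = pm.MvNormal("alpha_society", mu=np.zeros(10), cov=K)

    lam = pm.math.exp(alpha_bar + alpha_society)
    pm.Poisson("T_obs", mu=lam, observed=T)
    distance_idata = pm.sample()
```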

Islands Distance Matrix

Check model with prior-predictive simulation

Model the data

Posterior Predictions

Compare the Prior to the Posterior

Ensure that we've learned something

Stratify by population size

$$
\begin{align}
T_i &\sim \text{Poisson}(\lambda_i) \\
\lambda_i &= \frac{\bar \alpha P^\beta}{\gamma} \exp(\alpha_{S[i]}) \\
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{10} \end{pmatrix} &\sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \textbf{K} \right) \\
k_{i,j} &= \eta^2 \exp(-\rho^2 D_{i,j}^2) \\
\bar \alpha, \beta, \gamma &\sim \text{Exponential}(1) \\
\eta^2 &\sim \text{Exponential}(2) \\
\rho^2 &\sim \text{Exponential}(0.5)
\end{align}
$$
  • We now fold our steady state equation into the intercept/baseline tools equation
  • Include an additional $\exp$ multiplier to handle society-level offsets. Idea here being that
    • if $\alpha_{S[i]}=0$, then the society is average, and we fall back to the expected value of the difference equation i.e. multiplier is one
    • if $\alpha_{S[i]}>0$ then there is a boost in tool production. This boost is shared amongst spatially-local societies
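
A minimal PyMC sketch of this population-stratified model, again assuming `D_islands`, `P` (population), and `T` are loaded elsewhere:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model() as islands_population_model:
    alpha_bar = pm.Exponential("alpha_bar", 1)
    beta = pm.Exponential("beta", 1)
    gamma = pm.Exponential("gamma", 1)
    eta_squared = pm.Exponential("eta_squared", 2)
    rho_squared = pm.Exponential("rho_squared", 0.5)

    K = eta_squared * pm.math.exp(-rho_squared * D_islands**2) + np.eye(10) * 1e-6
    alpha_society = pm.MvNormal("alpha_society", mu=np.zeros(10), cov=K)

    # steady-state expected tool count, scaled by a society-level spatial multiplier
    P_ = pt.as_tensor_variable(P)
    lam = (alpha_bar * P_**beta / gamma) * pm.math.exp(alpha_society)
    pm.Poisson("T_obs", mu=lam, observed=T)
    population_idata = pm.sample()
```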

Prior Predictive

We can see that by including population, the degree of covariance required to explain tool use is diminished, compared to the distance-only model. i.e. population has explained away a lot more of the variation.

Fit population-only model for comparison

I think this is what McElreath is doing to construct the plot that combines population size, tool use, and island covariance (next plot below). Specifically, we estimate a population-only model using the analytical solution, with no covariance amongst islands. We then overlay the covariance plot on top of that model to demonstrate that there is still some information left on the table regarding covariances, even when including population in the model.

Though Population alone accounts for a lot of the variance in tool use (blue curve), there's still quite a bit of residual island-island similarity that explains tool use (i.e. some of the black lines still have some weight).

Phylogenetic Regression

Dataset: Primates301

  • 301 Primate species
  • Life history traits
  • Body Mass $M$, Brain Size $B$, Social Group Size $G$
  • Measurement error, unobserved confounding
  • Missing data; we'll focus on the complete-case analysis in this lecture

Filter to the 151 Complete Cases

Drop observations missing any of Body Mass, Brain Size, or Group Size

There's lots of covariation amongst variables

There's no way from the sample alone to know what causes what

Causal Salad

  • Throwing all predictors into a model, then interpreting their coefficients causally
  • Using prediction methods as a stand-in for causality
  • Throwing all factors into a "salad" of regression terms
    • performing model selection based on prediction (we showed earlier in Lecture 07 - Fitting Over & Under why this is bad to do)
    • and interpreting the resulting coefficients as causal effects (we've shown why this is bad many times)
  • Controlling for phylogeny is important to do, but it's often done mindlessly (with the "salad" approach)
  • Regression with phylogeny still requires a causal model

Different Hypotheses require different causal models (DAGs)

Social Brain Hypothesis

  • Group Size $G$ affects Brain Size $B$

Body Mass $M$ is a shared cause of both $G$ and $B$

This is the DAG we'll focus on in this Lecture

Body Mass $M$ is a mediator between $G$ and $B$

Brain Size $B$ actually causes Group Size $G$; $M$ still a shared cause of both

Unobserved confounds make naive regression on traits, without controlling for history, dangerous

  • History induces a number of unobserved shared confounds that we need to try to control for.
  • phylogenetic history can work as a proxy for these shared confounds.
  • We only have current measurements of the outcome of history
    • actual phylogenies do not exist
    • different parts of a genome can have different histories

BUT, say that we do have an inferred phylogeny (as we do in this lecture): we can use this phylogeny, in conjunction with Gaussian process regression, to model the similarity amongst species as a proxy for shared history

Phylogenetic Regression

  • There's a long history of genetic evolution
  • We only get to measure the current end state of that process
  • In principle, common genetic histories should induce covariation amongst species
  • We should hopefully be able to model this covariation to average over all the possible histories (micro state) that could have led to genetic similarity (macro state)

Two conjoint problems

  1. What is the history (phylogeny)?
  2. How do we use it to model causes and control for confounds?

1. What is the history (phylogeny)?

How do we estimate or identify a phylogeny?

Difficulties
  • high degree of uncertainty
  • processes are non-stationary
    • different parts of the genome have different histories
    • crossing-over effects
  • available inferential tools
    • exploring tree space is difficult
    • repurposing software from other domains for studying biology
  • phylogenies (just like social networks) do not exist.
    • They are abstractions we use to capture regularities in data

2. How do we use it to model causes?

...say we've obtained a phylogeny, now what?

Approaches
  • there's no single standard approach
  • Gaussian Processes serve as a sensible default approach
    • use phylogeny as a proxy for "genetic distance"

Two equivalent formulations of Linear Regression

You can always re-express a linear regression with a draw from a multi-variate Normal distribution

Classical Linear Regression

The dataset $\textbf{B}$ is modeled as $D$ independent samples $B_i$, each drawn from a univariate normal.

$$
\begin{align}
B_i &\sim \text{Normal}(\mu_i, \sigma^2) \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i \\
\alpha &\sim \text{Normal}(0, 1) \\
\beta_{G, M} &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Exponential}(1)
\end{align}
$$

$\text{MVNormal}$ Formulation

The dataset is modeled as a single $D$-vector $\textbf{B}$ sampled from a $\text{MVNormal}$. This has more of a Gaussian-process flavor to it, and lets us formulate the linear regression in a way that can include covariation amongst the outputs.

$$
\begin{align}
\textbf{B} &\sim \text{MVNormal}(\mu, \textbf{K}) \\
\mathbf{K} &= \mathbf{I}\sigma^2 \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i \\
\alpha &\sim \text{Normal}(0, 1) \\
\beta_{G, M} &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Exponential}(1)
\end{align}
$$

Classic Linear Regression
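
A minimal sketch of the classical formulation, assuming standardized arrays `G`, `M`, and `B` for group size, body mass, and brain size are already defined:

```python
import pymc as pm

with pm.Model() as classical_regression:
    alpha = pm.Normal("alpha", 0, 1)
    beta_G = pm.Normal("beta_G", 0, 1)
    beta_M = pm.Normal("beta_M", 0, 1)
    sigma = pm.Exponential("sigma", 1)

    mu = alpha + beta_G * G + beta_M * M
    pm.Normal("B_obs", mu=mu, sigma=sigma, observed=B)
    classical_idata = pm.sample()
```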

$\text{MVNormal}$ Linear Regression
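
And a matching sketch of the $\text{MVNormal}$ formulation, where the only change is that the whole vector $\textbf{B}$ is modeled jointly with covariance $\mathbf{I}\sigma^2$:

```python
import numpy as np
import pymc as pm

with pm.Model() as mvnormal_regression:
    alpha = pm.Normal("alpha", 0, 1)
    beta_G = pm.Normal("beta_G", 0, 1)
    beta_M = pm.Normal("beta_M", 0, 1)
    sigma = pm.Exponential("sigma", 1)

    mu = alpha + beta_G * G + beta_M * M
    K = sigma**2 * np.eye(len(B))  # I * sigma^2: independent, identically-distributed residuals
    pm.MvNormal("B_obs", mu=mu, cov=K, observed=B)
    mvnormal_idata = pm.sample()
```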

Verify the two formulations provide the same parameter estimates

We're able to recover the same estimates as in the lecture

From Model to Kernel

We'd like to incorporate a residual $u_i$ into our linear regression that adjusts the expectation in a way that encodes the shared history of species

$$
\begin{align}
\textbf{B} &\sim \text{MVNormal}(\mu, \textbf{K}) \\
\mathbf{K} &= \mathbf{I}\sigma^2 \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i + u_i \\
\alpha &\sim \text{Normal}(0, 1) \\
\beta_{G, M} &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Exponential}(1)
\end{align}
$$

Phylogenetic distance as a proxy for species covariance

  • Covariance falls off with phylogenetic distance
    • path length between leaf nodes in tree (e.g. see dendrogram below)
  • Can use a Gaussian process in much the same way we did for island distances in the Oceanic Tools analysis

Kernel Function as proxy for evolutionary dynamics

Evolutionary model + tree structure = pattern of covariation observed at the tips

Common simple models of evolutionary dynamics:

  • Brownian motion - implies a linear covariance kernel function
  • Damped Brownian motion - implies an L1 / Ornstein-Uhlenbeck covariance kernel function

Phylogenetic "regression" model

$$
\begin{align}
\textbf{B} &\sim \text{MVNormal}(\mu, \textbf{K}) \\
\mathbf{K} &= \eta^2 \exp(-\rho D_{ij}) &\text{Ornstein-Uhlenbeck kernel} \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i \\
\alpha &\sim \text{Normal}(0, 1) \\
\beta_{G, M} &\sim \text{Normal}(0, 1) \\
\eta^2 &\sim \text{HalfNormal}(1, 0.25) &\text{Maximum covariance prior} \\
\rho &\sim \text{HalfNormal}(3, 0.25) &\text{Covariance decline rate prior}
\end{align}
$$

Prior Samples Ornstein-Uhlenbeck Kernel Model

Distance-only model

  • Get the Gaussian-process part to work, then add details
  • Set $\beta_M = \beta_G = 0$

PyMC Implementation

Below is an implementation that uses PyMC's Gaussian process module.

  • In the Oceanic Tools models above, we didn't need the Gaussian process to have a mean function because all variables were standardized, thus we can use the default mean function of $0$
  • For phylogenetic regression we no longer want a zero mean function; instead we want a mean function that depends on $M$ and $G$ but is not a function of the phylogenetic distance matrix. We can implement a custom mean function in PyMC to handle this (a rough sketch of the idea follows this list).
  • Also, because our likelihood is a MVNormal, we can take advantage of conjugacy between the GP prior and the Normal likelihood, which returns a GP posterior. This means that we should use pm.gp.Marginal instead of pm.gp.Latent; Marginal uses closed-form solutions for the posterior to speed up learning.
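
As a rough sketch of the custom mean-function idea mentioned above (the class and attribute names here are illustrative assumptions, not the notebook's actual implementation), one can subclass `pm.gp.mean.Mean` so the mean depends on $G$ and $M$ while ignoring the GP input:

```python
import pymc as pm

class GroupMassMean(pm.gp.mean.Mean):
    """Mean function that is linear in group size G and body mass M.

    The GP input X is ignored here: phylogenetic distance information
    only enters the model through the covariance function."""

    def __init__(self, alpha, beta_G, beta_M, G, M):
        super().__init__()
        self.alpha, self.beta_G, self.beta_M = alpha, beta_G, beta_M
        self.G, self.M = G, M

    def __call__(self, X):
        return self.alpha + self.beta_G * self.G + self.beta_M * self.M
```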

Below is an alternative implementation that builds the covariance function by hand, and directly models the dataset as an MVNormal with mean alpha and covariance defined by the kernel. I find that these direct MVNormal implementations track better with the lecture.
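
A minimal sketch of that by-hand version for the distance-only model, assuming `D_phylo` is the (scaled) phylogenetic distance matrix and `B` the standardized brain sizes, and using zero-truncated normals as stand-ins for the shifted half-normal priors above:

```python
import numpy as np
import pymc as pm

with pm.Model() as phylo_distance_model:
    alpha = pm.Normal("alpha", 0, 1)
    # stand-ins for the HalfNormal(1, 0.25) / HalfNormal(3, 0.25) priors above
    eta_squared = pm.TruncatedNormal("eta_squared", mu=1.0, sigma=0.25, lower=0.0)
    rho = pm.TruncatedNormal("rho", mu=3.0, sigma=0.25, lower=0.0)

    # Ornstein-Uhlenbeck kernel built directly from the phylogenetic distance matrix
    K = eta_squared * pm.math.exp(-rho * D_phylo) + np.eye(len(B)) * 1e-6

    pm.MvNormal("B_obs", mu=alpha * np.ones(len(B)), cov=K, observed=B)
    phylo_distance_idata = pm.sample()
```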

Compare the Prior and Posterior for distance-only model

Compare the Prior and Posterior kernel functions

Comparing the prior and posterior distributions over kernel functions, we can see that the model has learned from the data and attenuated our expectation of the maximum covariance.

Fit the full Phylogenetic model

This model stratifies by Group size $G$ and Body Mass $M$

Below is an implementation that uses PyMC's Gaussian process module.

Below is an alternative implementation that builds the covariance function by hand, and directly models the dataset as an MVNormal with mean given by the linear function of $G$ and $M$, and covariance defined by the kernel. I find that these MVNormal implementations track better with the lecture.
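
A minimal sketch of this full model, under the same assumptions as the distance-only sketch above (`D_phylo` and `B`, plus standardized `G` and `M`):

```python
import numpy as np
import pymc as pm

with pm.Model() as phylo_full_model:
    alpha = pm.Normal("alpha", 0, 1)
    beta_G = pm.Normal("beta_G", 0, 1)
    beta_M = pm.Normal("beta_M", 0, 1)
    eta_squared = pm.TruncatedNormal("eta_squared", mu=1.0, sigma=0.25, lower=0.0)
    rho = pm.TruncatedNormal("rho", mu=3.0, sigma=0.25, lower=0.0)

    # linear mean in G and M; phylogeny enters only through the covariance
    mu = alpha + beta_G * G + beta_M * M
    K = eta_squared * pm.math.exp(-rho * D_phylo) + np.eye(len(B)) * 1e-6

    pm.MvNormal("B_obs", mu=mu, cov=K, observed=B)
    phylo_full_idata = pm.sample()
```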

Compare covariance terms in distance-only and full phylogenetic regression models

We can see that when including regressors for body mass and group size, the amount of phylogenetic covariation used to explain brain size is greatly diminished, compared to the phylogenetic-distance-only model.

The results for the full phylogenetic model reported here make sense to me too, given that we can explain away a lot of the variability in brain size primarily with body mass, after we've controlled for phylogenetic history; this in turn makes the phylogenetic history receive less weight in the model.

Influence of Group Size on Brain Size

  • We first ignore the confounds due to genetic history (light gray)
    • i.e. we make no attempt to control for those potential confounds
  • To get the direct effect of $G$ on $B$ we'll also stratify by $M$ to block the backdoor path through the fork passing through $M$.

By trying to control for potential confounds due to genetic history, we have nearly halved the estimated effect of Group size on Brain size $\beta_G$, though it's still mostly positive.

This estimator can also give us the direct effect of Body Mass $M$ on Brain Size $B$ (by blocking the Pipe through $G$). When we look at $\beta_M$, we see a slight increase in the effect of body mass on brain size when controlling for genetic-history confounds, though the effect is large for both estimators, indicating that body mass is a large driver of brain size.

Summary: Phylogenetic Regression

Potential Issues

  • What about uncertainty in phylogeny?
    • better to perform phylogenetic inference simultaneously -- i.e. estimate a posterior on phylogenies
  • What about reciprocal causation?
    • feedback between organism and environment
    • there's no unidirectional cause
    • thus regression is likely not the best option
    • newer methods use differential equations to pose the problem as a multi-objective optimization problem

Summary: Gaussian Processes

  • Provides means for (local) partial pooling for continuous groups/categories
  • General approximation engine -- good for prediction
  • Robust to overfitting
  • Sensitive to priors: prior predictive simulation is very important

Authors

  • Ported to PyMC by Dustin Stansbury (2024)
  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

:::{include} ../page_footer.md
:::