(lecture_16)=

Gaussian Processes

:::{post} Jan 7, 2024
:tags: statistical rethinking, bayesian inference, gaussian processes
:category: intermediate
:author: Dustin Stansbury
:::

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 16 - Gaussian Processes

Returning to Oceanic Technology Dataset

Now that we have some more advanced statistical tools, we'll deal with some of the confounds we glossed over earlier.

  • Total tools in society $T$ (outcome)
  • Population $P$ (treatment)
  • Contact level $C$
  • Unobserved confounds $U$, e.g.
    • materials
    • weather

There should be spatial covariation:

  • islands closer to one another should have similar unobserved confounds
  • islands should share historical context
  • we don't have to know the exact dynamics of the contexts; we can handle them statistically at a macro scale

Review of Technological Innovation/Loss model

  • Derived from the steady state of the difference equation $\Delta T = \alpha P^\beta - \gamma T$
    • $\alpha$ is rate of innovation
    • $P$ is population
    • $\beta$ is elasticity rate (diminishing returns)
    • $\gamma$ is loss (e.g. forgetting) rate over time
  • Setting $\Delta T = 0$ gives the expected number of tools over the long run (see the short derivation after this list):
    • $\hat T = \frac{\alpha P^\beta}{\gamma}$
    • use $\lambda= \hat T$ as the expected rate parameter for a $\text{Poisson}(\lambda)$ distribution
  • How can we encode spatial covariance in this model?
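
For reference, here is the short derivation mentioned above: set the change in tools $\Delta T$ to zero at the steady state and solve for the long-run tool count $\hat T$.

$$
\begin{align*}
0 &= \alpha P^\beta - \gamma \hat T \\
\gamma \hat T &= \alpha P^\beta \\
\hat T &= \frac{\alpha P^\beta}{\gamma}
\end{align*}
$$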

Let's start off with an easier model that ignores population

$$
\begin{align*}
T_i &\sim \text{Poisson}(\lambda_i) \\
\log(\lambda_i) &= \bar \alpha + \alpha_{S[i]} \\
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{10} \end{pmatrix} &\sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \textbf{K} \right)
\end{align*}
$$
  • $\alpha_{S[i]}$ are the variable intercepts for each society
    • These will have covariation based on their spatial distance from one another
    • closer societies have similar offsets
  • $\mathbf{K}$ is the covariance amongst societies
    • models covariance as a function of distance
    • though $\mathbf{K}$ has many parameters (45 covariances), we should be able to leverage spatial regularity to estimate far fewer effective parameters. This is good because we only have 10 data points 😅

Gaussian Processes (in the abstract)

  • Uses a kernel function to generate covariance matrices

    • Requires far fewer parameters than the covariance matrix entries
    • Can generate covariance matrices of any dimension (i.e. an "infinite-dimensional generalization of the MVNormal")
    • Can generate a prediction for any point
  • Kernel function calculates the covariance of two points based on a metric of comparison, e.g.

    • spatial distance
    • difference in time
    • difference in age

Different Kernel Functions

The Quadratic (L2) Kernel Function

  • aka RBF (Radial Basis Function)
  • aka Gaussian Kernel
$$
K(x_i, x_j) = \eta^2 \exp \left(- \frac{(x_i - x_j)^2}{\sigma^2} \right)
$$

Ornstein-Uhlenbeck Kernel

$$
\begin{align*}
K(x_i, x_j) &= \eta^2 \exp \left(- \frac{|x_i - x_j|}{\sigma} \right) \\
\sigma &= \frac{1}{\rho}
\end{align*}
$$

Periodic Kernel

$$
K(x_i, x_j) = \eta^2 \exp \left(- \frac{\sin^2((x_i - x_j) / p)}{\sigma^2} \right)
$$

with additional periodicity parameter, $p$
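
As a minimal sketch of the three kernels above (the function names and default parameter values are illustrative choices, not from the lecture):

```python
import numpy as np

def quadratic_kernel(xi, xj, eta=1.0, sigma=1.0):
    """Quadratic / RBF / Gaussian kernel."""
    return eta**2 * np.exp(-((xi - xj) ** 2) / sigma**2)

def ornstein_uhlenbeck_kernel(xi, xj, eta=1.0, rho=1.0):
    """Ornstein-Uhlenbeck (L1) kernel, parameterized with rho = 1 / sigma."""
    return eta**2 * np.exp(-np.abs(xi - xj) * rho)

def periodic_kernel(xi, xj, eta=1.0, sigma=1.0, p=1.0):
    """Periodic kernel with additional period parameter p."""
    return eta**2 * np.exp(-np.sin((xi - xj) / p) ** 2 / sigma**2)
```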

Gaussian Process Prior

Using the Gaussian/L2 Kernel Function to generate our covariance $\mathbf{K}$, we can adjust a few parameters

  • Varying $\sigma$ controls the bandwidth of the kernel function

    • smaller $\sigma$ makes covariance fall off quickly with space
    • larger $\sigma$ allows covariance to extend over larger distances
  • Varying $\eta$ controls the maximum degree of covariance

1-D Examples

Below we draw functions from a 1-D Gaussian process, varying either $\sigma$ or $\eta$ to demonstrate the effect of the parameters on the spatial correlation of the samples.
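
A rough sketch of how such prior draws can be generated with NumPy (the grid, seed, and parameter values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(12)
x = np.linspace(-3, 3, 100)

def sample_gp_prior(x, eta, sigma, n_samples=5):
    # quadratic (RBF) kernel evaluated at every pair of grid points
    K = eta**2 * np.exp(-((x[:, None] - x[None, :]) ** 2) / sigma**2)
    K += np.eye(len(x)) * 1e-9  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(x)), K, size=n_samples)

# smaller sigma: covariance falls off quickly, so samples wiggle rapidly
samples_sigma_small = sample_gp_prior(x, eta=1.0, sigma=0.25)
# larger sigma: covariance extends further, so samples are smoother
samples_sigma_large = sample_gp_prior(x, eta=1.0, sigma=2.0)
# eta controls the maximum covariance, i.e. the amplitude of the sampled functions
samples_eta_small = sample_gp_prior(x, eta=0.5, sigma=1.0)
samples_eta_large = sample_gp_prior(x, eta=2.0, sigma=1.0)
```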

Varying $\sigma$

Varying $\eta$

Gaussian Process Posterior

As the model observes data, it adjusts the posterior to account for that data while also incorporating the smoothness constraints of the prior.
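
To make the prior-to-posterior update concrete, here is a minimal sketch of the standard GP conditioning formulas, using toy observations rather than the lecture's data:

```python
import numpy as np

def rbf(a, b, eta=1.0, sigma=1.0):
    return eta**2 * np.exp(-((a[:, None] - b[None, :]) ** 2) / sigma**2)

def gp_posterior(x_obs, y_obs, x_star, eta=1.0, sigma=1.0, noise=0.1):
    # posterior mean and covariance of f(x_star) given noisy observations
    K = rbf(x_obs, x_obs, eta, sigma) + noise**2 * np.eye(len(x_obs))
    K_s = rbf(x_star, x_obs, eta, sigma)
    K_ss = rbf(x_star, x_star, eta, sigma)
    K_inv = np.linalg.inv(K)
    post_mean = K_s @ K_inv @ y_obs        # data pull the mean toward the observations
    post_cov = K_ss - K_s @ K_inv @ K_s.T  # uncertainty shrinks near the observations
    return post_mean, post_cov

x_obs = np.array([-1.0, 0.0, 2.0])
y_obs = np.array([0.5, -0.2, 1.0])
x_star = np.linspace(-3, 3, 50)
mu_star, cov_star = gp_posterior(x_obs, y_obs, x_star)
```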

Distance-based model

$$
\begin{align}
T_i &\sim \text{Poisson}(\lambda_i) \\
\log(\lambda_i) &= \bar \alpha + \alpha_{S[i]} \\
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{10} \end{pmatrix} &\sim \text{MVNormal}\left( \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \textbf{K} \right) \\
k_{i,j} &= \eta^2 \exp(-\rho^2 D_{i,j}^2) \\
\eta^2 &\sim \text{Exponential}(2) \\
\rho^2 &\sim \text{Exponential}(0.5)
\end{align}
$$
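
A minimal PyMC sketch of this specification, assuming `D_islands` is the 10×10 matrix of pairwise island distances and `T` the observed tool counts (both loaded elsewhere); the $\text{Normal}(3, 0.5)$ prior on $\bar \alpha$ is an assumption, since it isn't written out above:

```python
import numpy as np
import pymc as pm

with pm.Model() as islands_distance_model:
    alpha_bar = pm.Normal("alpha_bar", 3, 0.5)  # assumed prior for the global intercept
    eta_squared = pm.Exponential("eta_squared", 2)
    rho_squared = pm.Exponential("rho_squared", 0.5)

    # quadratic-distance kernel built from the island distance matrix, plus jitter
    K = eta_squared * pm.math.exp(-rho_squared * D_islands**2) + np.eye(10) * 1e-6
    alpha_society = pm.MvNormal("alpha_society", mu=np.zeros(10), cov=K)

    lam = pm.math.exp(alpha_bar + alpha_society)
    pm.Poisson("T_obs", mu=lam, observed=T)
    distance_idata = pm.sample()
```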

Islands Distance Matrix

Check model with prior-predictive simulation

Model the data

Posterior Predictions

Compare the Prior to the Posterior

Ensure that we've learned something

Stratify by population size

$$
\begin{align}
T_i &\sim \text{Poisson}(\lambda_i) \\
\lambda_i &= \frac{\bar \alpha P^\beta}{\gamma} \exp(\alpha_{S[i]}) \\
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{10} \end{pmatrix} &\sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \textbf{K} \right) \\
k_{i,j} &= \eta^2 \exp(-\rho^2 D_{i,j}^2) \\
\bar \alpha, \beta, \gamma &\sim \text{Exponential}(1) \\
\eta^2 &\sim \text{Exponential}(2) \\
\rho^2 &\sim \text{Exponential}(0.5)
\end{align}
$$
  • We now fold our steady state equation into the intercept/baseline tools equation
  • Include an additional $\exp$ multiplier to handle society-level offsets. Idea here being that
    • if $\alpha_{S[i]}=0$, then the society is average, and we fall back to the expected value of the difference equation i.e. multiplier is one
    • if $\alpha_{S[i]}>0$ then there is a boost in tool production. This boost is shared amongst spatially-local societies
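
A minimal PyMC sketch of this population-stratified model, again assuming `D_islands`, `P` (population), and `T` are loaded elsewhere:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model() as islands_population_model:
    alpha_bar = pm.Exponential("alpha_bar", 1)
    beta = pm.Exponential("beta", 1)
    gamma = pm.Exponential("gamma", 1)
    eta_squared = pm.Exponential("eta_squared", 2)
    rho_squared = pm.Exponential("rho_squared", 0.5)

    K = eta_squared * pm.math.exp(-rho_squared * D_islands**2) + np.eye(10) * 1e-6
    alpha_society = pm.MvNormal("alpha_society", mu=np.zeros(10), cov=K)

    # steady-state expected tool count, scaled by a society-level spatial multiplier
    P_ = pt.as_tensor_variable(P)
    lam = (alpha_bar * P_**beta / gamma) * pm.math.exp(alpha_society)
    pm.Poisson("T_obs", mu=lam, observed=T)
    population_idata = pm.sample()
```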

Prior Predictive

We can see that by including population, the degree of covariance required to explain tool use is diminished, compared to the distance-only model. i.e. population has explained away a lot more of the variation.

Fit population-only model for comparison

I think this is what McElreath is doing to construct the plot that combines population size, tool use, and island covariance (next plot below). Specifically, we estimate a population-only model using the analytical solution, with no covariance amongst islands. We then overlay the covariance plot on top of that model to demonstrate that there is still some information left on the table regarding covariances, even when including population in the model.

Though Population alone accounts for a lot of the variance in tool use (blue curve), there's still quite a bit of residual island-island similarity that explains tool use (i.e. some of the black lines still have some weight).

Phylogenetic Regression

Dataset: Primates301

  • 301 Primate species
  • Life history traits
  • Body Mass $M$, Brain Size $B$, Social Group Size $G$
  • Measurement error, unobserved confounding
  • Missing data; we'll focus on the complete-case analysis in this lecture

Filter to the 151 Complete Cases

Drop observations missing any of Body Mass, Brain Size, or Group Size

There's lots of covariation amongst variables

There's no way from the sample alone to know what causes what

Causal Salad

  • Throwing all predictors into a model, then interpreting their coefficients causally
  • Using prediction methods as a stand-in for causality
  • Throwing all factors into a "salad" of regression terms
    • performing model selection based on prediction (we showed earlier in Lecture 07 - Fitting Over & Under why this is bad to do)
    • and interpreting the resulting coefficients as causal effects (we've shown why this is bad many times)
  • Controlling for phylogeny is important to do, but it's often done mindlessly (with the "salad" approach)
  • Regression with phylogeny still requires a causal model

Different Hypotheses require different causal models (DAGs)

Social Brain Hypothesis

  • Group Size $G$ affects Brain Size $B$

Body Mass $M$ is a shared cause of both $G$ and $B$

This is the DAG we'll focus on in this Lecture

Body Mass $M$ is a mediator between $G$ and $B$

Brain Size $B$ actually causes Group Size $G$; $M$ still a shared cause of both

Unobserved confounds make naive regression on traits, without controlling for history, dangerous

  • History induces a number of unobserved shared confounds that we need to try to control for.
  • phylogenetic history can work as a proxy for these shared confounds.
  • We only have current measurements of the outcome of history
    • actual phylogenies do not exist
    • different parts of a genome can have different histories

BUT, say that we do have an inferred phylogeny (as we do in this lecture): we can use this phylogeny, in conjunction with Gaussian process regression, to model the similarity amongst species as a proxy for shared history

Phylogenetic Regression

  • There's a long history of genetic evolution
  • We only get to measure the current end state of that process
  • In principle, common genetic histories should induce covariation amongst species
  • We should hopefully be able to model this covariation to average over all the possible histories (micro state) that could have led to genetic similarity (macro state)

Two conjoint problems

  1. What is the history (phylogeny)?
  2. How do we use it to model causes and control for confounds?

1. What is the history (phylogeny)?

How do we estimate or identify a phylogeny?

Difficulties
  • high degree of uncertainty
  • processes are non-stationary
    • different parts of the genome have different histories
    • crossing-over effects
  • available inferential tools
    • exploring tree space is difficult
    • repurposing software from other domains for studying biology
  • phylogenies (just like social networks) do not exist.
    • They are abstractions we use to capture regularities in data

2. How do we use it to model causes?

...say we've obtained a phylogeny, now what?

Approaches
  • there's no single standard approach
  • Gaussian Processes serve as a sensible default approach
    • use phylogeny as a proxy for "genetic distance"

Two equivalent formulations of Linear Regression

You can always re-express a linear regression with a draw from a multi-variate Normal distribution

Classical Linear Regression

The dataset $\textbf{B}$ is modeled as $D$ independent samples $B_i$, each drawn from a univariate normal.

$$
\begin{align}
B_i &\sim \text{Normal}(\mu_i, \sigma^2) \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i \\
\alpha &\sim \text{Normal}(0, 1) \\
\beta_{G, M} &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Exponential}(1)
\end{align}
$$

$\text{MVNormal}$ Formulation

The dataset is modeled as a single $D$-vector $\textbf{B}$ sampled from a $\text{MVNormal}$. This has more of a Gaussian-process flavor to it, and lets us formulate the linear regression in a way that can include covariation amongst the outputs.

$$
\begin{align}
\textbf{B} &\sim \text{MVNormal}(\mu, \textbf{K}) \\
\mathbf{K} &= \mathbf{I}\sigma^2 \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i \\
\alpha &\sim \text{Normal}(0, 1) \\
\beta_{G, M} &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Exponential}(1)
\end{align}
$$

Classic Linear Regression
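
A minimal sketch of the classical formulation, assuming standardized arrays `G`, `M`, and `B` for group size, body mass, and brain size are already defined:

```python
import pymc as pm

with pm.Model() as classical_regression:
    alpha = pm.Normal("alpha", 0, 1)
    beta_G = pm.Normal("beta_G", 0, 1)
    beta_M = pm.Normal("beta_M", 0, 1)
    sigma = pm.Exponential("sigma", 1)

    mu = alpha + beta_G * G + beta_M * M
    pm.Normal("B_obs", mu=mu, sigma=sigma, observed=B)
    classical_idata = pm.sample()
```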

$\text{MVNormal}$ Linear Regression
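
And a matching sketch of the $\text{MVNormal}$ formulation, where the only change is that the whole vector $\textbf{B}$ is modeled jointly with covariance $\mathbf{I}\sigma^2$:

```python
import numpy as np
import pymc as pm

with pm.Model() as mvnormal_regression:
    alpha = pm.Normal("alpha", 0, 1)
    beta_G = pm.Normal("beta_G", 0, 1)
    beta_M = pm.Normal("beta_M", 0, 1)
    sigma = pm.Exponential("sigma", 1)

    mu = alpha + beta_G * G + beta_M * M
    K = sigma**2 * np.eye(len(B))  # I * sigma^2: independent, identically-distributed residuals
    pm.MvNormal("B_obs", mu=mu, cov=K, observed=B)
    mvnormal_idata = pm.sample()
```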

Verify the two formulations provide the same parameter estimates

We're able to recover the same estimates as in the lecture

From Model to Kernel

We'd like to incorporate a residual $u_i$ into our linear regression that adjusts the expectation in a way that encodes the shared history of species

$$
\begin{align}
\textbf{B} &\sim \text{MVNormal}(\mu, \textbf{K}) \\
\mathbf{K} &= \mathbf{I}\sigma^2 \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i + u_i \\
\alpha &\sim \text{Normal}(0, 1) \\
\beta_{G, M} &\sim \text{Normal}(0, 1) \\
\sigma &\sim \text{Exponential}(1)
\end{align}
$$

Phylogenetic distance as a proxy for species covariance

  • Covariance falls off with phylogenetic distance
    • path length between leaf nodes in tree (e.g. see dendrogram below)
  • Can use a Gaussian process in much the same way we did for island distances in the Oceanic Tools analysis

Kernel Function as proxy for evolutionary dynamics

Evolutionary model + tree structure = pattern of covariation observed at the tips

Common simple models of evolutionary dynamics:

  • Brownian motion - implies a linear covariance kernel function
  • Damped Brownian motion - implies an L1 / Ornstein-Uhlenbeck covariance kernel function

Phylogenetic "regression" model

$$
\begin{align}
\textbf{B} &\sim \text{MVNormal}(\mu, \textbf{K}) \\
\mathbf{K} &= \eta^2 \exp(-\rho D_{ij}) &\text{Ornstein-Uhlenbeck kernel} \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i \\
\alpha &\sim \text{Normal}(0, 1) \\
\beta_{G, M} &\sim \text{Normal}(0, 1) \\
\eta^2 &\sim \text{HalfNormal}(1, 0.25) &\text{Maximum covariance prior} \\
\rho &\sim \text{HalfNormal}(3, 0.25) &\text{Covariance decline rate prior}
\end{align}
$$

Prior Samples Ornstein-Uhlenbeck Kernel Model

Distance-only model

  • Get the Gaussian-process part to work, then add details
  • Set $\beta_M = \beta_G = 0$

PyMC Implementation

Below is an implementation that uses PyMC's Gaussian process module.

  • In the Oceanic Tools models above, we didn't need the Gaussian process to have a mean function because all variables were standardized, thus we can use the default mean function of $0$
  • For phylogenetic regression we no longer want a zero mean function; instead we want a mean function that depends on $M$ and $G$ but is not a function of the phylogenetic distance matrix. We can implement a custom mean function in PyMC to handle this (a rough sketch of the idea follows this list).
  • Also, because our likelihood is a MVNormal, we can take advantage of conjugacy between the GP prior and the Normal likelihood, which returns a GP posterior. This means that we should use pm.gp.Marginal instead of pm.gp.Latent; Marginal uses closed-form solutions for the posterior to speed up learning.
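
As a rough sketch of the custom mean-function idea mentioned above (the class and attribute names here are illustrative assumptions, not the notebook's actual implementation), one can subclass `pm.gp.mean.Mean` so the mean depends on $G$ and $M$ while ignoring the GP input:

```python
import pymc as pm

class GroupMassMean(pm.gp.mean.Mean):
    """Mean function that is linear in group size G and body mass M.

    The GP input X is ignored here: phylogenetic distance information
    only enters the model through the covariance function."""

    def __init__(self, alpha, beta_G, beta_M, G, M):
        super().__init__()
        self.alpha, self.beta_G, self.beta_M = alpha, beta_G, beta_M
        self.G, self.M = G, M

    def __call__(self, X):
        return self.alpha + self.beta_G * self.G + self.beta_M * self.M
```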

Below is an alternative implementation that builds the covariance function by hand, and directly models the dataset as an MVNormal with mean alpha and covariance defined by the kernel. I find that these direct MVNormal implementations track better with the lecture.
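
A minimal sketch of that by-hand version for the distance-only model, assuming `D_phylo` is the (scaled) phylogenetic distance matrix and `B` the standardized brain sizes, and using zero-truncated normals as stand-ins for the shifted half-normal priors above:

```python
import numpy as np
import pymc as pm

with pm.Model() as phylo_distance_model:
    alpha = pm.Normal("alpha", 0, 1)
    # stand-ins for the HalfNormal(1, 0.25) / HalfNormal(3, 0.25) priors above
    eta_squared = pm.TruncatedNormal("eta_squared", mu=1.0, sigma=0.25, lower=0.0)
    rho = pm.TruncatedNormal("rho", mu=3.0, sigma=0.25, lower=0.0)

    # Ornstein-Uhlenbeck kernel built directly from the phylogenetic distance matrix
    K = eta_squared * pm.math.exp(-rho * D_phylo) + np.eye(len(B)) * 1e-6

    pm.MvNormal("B_obs", mu=alpha * np.ones(len(B)), cov=K, observed=B)
    phylo_distance_idata = pm.sample()
```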

Compare the Prior and Posterior for distance-only model

Compare the Prior and Posterior kernel functions

Comparing the prior and posterior distributions over kernel functions, we can see that the model has learned from the data and attenuated our expectation of the maximum covariance.

Fit the full Phylogenetic model

This model stratifies by Group size $G$ and Body Mass $M$

Below is an implementation that uses PyMC's Gaussian process module.

Below is an alternative implementation that builds the covariance function by hand, and directly models the dataset as an MVNormal with mean given by the linear function of $G$ and $M$, and covariance defined by the kernel. I find that these MVNormal implementations track better with the lecture.
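
A minimal sketch of this full model, under the same assumptions as the distance-only sketch above (`D_phylo` and `B`, plus standardized `G` and `M`):

```python
import numpy as np
import pymc as pm

with pm.Model() as phylo_full_model:
    alpha = pm.Normal("alpha", 0, 1)
    beta_G = pm.Normal("beta_G", 0, 1)
    beta_M = pm.Normal("beta_M", 0, 1)
    eta_squared = pm.TruncatedNormal("eta_squared", mu=1.0, sigma=0.25, lower=0.0)
    rho = pm.TruncatedNormal("rho", mu=3.0, sigma=0.25, lower=0.0)

    # linear mean in G and M; phylogeny enters only through the covariance
    mu = alpha + beta_G * G + beta_M * M
    K = eta_squared * pm.math.exp(-rho * D_phylo) + np.eye(len(B)) * 1e-6

    pm.MvNormal("B_obs", mu=mu, cov=K, observed=B)
    phylo_full_idata = pm.sample()
```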

Compare covariance terms in distance-only and full phylogenetic regression models

We can see that when including regressors for body mass and group size, the amount of phylogenetic covariation used to explain brain size is greatly diminished, compared to the phylogenetic-distance-only model.

The results for the full phylogenetic model reported here make sense to me too, given that we can explain away a lot of the variability in brain size primarily with body mass, after we've controlled for phylogenetic history; this in turn makes the phylogenetic history receive less weight in the model.

Influence of Group Size on Brain Size

  • We first ignore the confounds due to genetic history (light gray)
    • i.e. we make no attempt to control for those potential confounds
  • To get the direct effect of $G$ on $B$ we'll also stratify by $M$ to block the backdoor path through the fork passing through $M$.

By trying to control for potential confounds due to genetic history, we have nearly halved the estimated effect of Group size on Brain size $\beta_G$, though it's still mostly positive.

This estimator can also give us the direct effect of Body Mass $M$ on Brain Size $B$ (by blocking the Pipe through $G$). When we look at $\beta_M$, we see a slight increase in the effect of body mass on brain size when controlling for genetic-history confounds, though the effect is large for both estimators, indicating that body mass is a large driver of brain size.

Summary: Phylogenetic Regression

Potential Issues

  • What about uncertainty in phylogeny?
    • better to perform phylogenetic inference simultaneously -- i.e. estimate a posterior on phylogenies
  • What about reciprocal causation?
    • feedback between organism and environment
    • there's no unidirectional cause
    • thus regression is likely not the best option
    • newer methods use differential equations to pose the problem as a multi-objective optimization problem

Summary: Gaussian Processes

  • Provides means for (local) partial pooling for continuous groups/categories
  • General approximation engine -- good for prediction
  • Robust to overfitting
  • Sensitive to priors: prior predictive simulation is very important

Authors

  • Ported to PyMC by Dustin Stansbury (2024)
  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

:::{include} ../page_footer.md
:::