(lecture_10)=

Counts and Hidden Confounds

:::{post} Jan 7, 2024
:tags: statistical rethinking, bayesian inference, probability
:category: intermediate
:author: Dustin Stansbury
:::

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video: Lecture 10 - Counts and Hidden Confounds

Revisiting Generalized Linear Models

  • Expected value is some function of an additive combination of parameters
    • That function tends to be tied to the likelihood distribution -- e.g. the identity for the Normal distribution (linear regression) or the log odds for the Bernoulli/Binomial distribution (logistic regression)
  • Uniform changes in predictors are not generally associated with uniform changes in outcomes
  • Predictor variables interact -- causal interpretation of coefficients (outside of the simplest models) is fraught with misleading conclusions
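The second bullet is worth seeing numerically. A minimal sketch (standard library only) showing that equal steps on the log-odds scale map to unequal steps on the probability scale:

```python
import math

def inv_logit(x: float) -> float:
    """Inverse link for logistic regression: maps log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Equal +1 steps on the log-odds (predictor) scale...
log_odds = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = [inv_logit(z) for z in log_odds]

# ...produce unequal steps on the outcome (probability) scale:
# largest near p = 0.5, smaller out in the tails
steps = [round(b - a, 3) for a, b in zip(probs, probs[1:])]
print(steps)
```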

Confounded Admissions

  • We estimated the Direct and Total effect of Gender on Admission rates in order to identify different flavors of gender discrimination in admissions
  • However, it's implausible that there are no unobserved confounds between variables,
    • e.g. applicant ability could link Department to Admission rate
      • it affects which students apply to each department (higher ability biases department application)
      • it also affects baseline admission rates (higher ability leads to higher admission rates)

Generative Simulation
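The notebook's simulation code isn't reproduced here; below is a minimal stand-in sketch of the confounded data-generating process. All numeric choices (10% high ability, the department preferences, the acceptance table, the +0.2 ability boost) are illustrative assumptions, not the lecture's exact values:

```python
import random

random.seed(12)
N = 2000
data = []
for _ in range(N):
    u = random.random() < 0.10           # latent ability: 10% are high-ability
    g = random.randrange(2)              # gender 0 / 1
    # G=1 applies mostly to department 1; G=0 prefers department 0,
    # unless high ability pulls them toward department 1
    p_d1 = 0.8 if g == 1 else (0.5 if u else 0.2)
    d = int(random.random() < p_d1)
    # Baseline acceptance rates by (department, gender);
    # department 1 discriminates against G=0, and ability boosts acceptance
    base = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}[(d, g)]
    a = int(random.random() < base + (0.2 if u else 0.0))
    data.append((g, d, a))

accept_rate = sum(a for _, _, a in data) / N
print(round(accept_rate, 3))
```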

Total Effect Estimator

  • Estimating Total effect requires no adjustment set, model gender only in the GLM
$$
\begin{align*}
A_i &\sim \text{Bernoulli}(p=p_i) \\
\text{logit}(p_i) &= \alpha[G_i] \\
\alpha &= [\alpha_0, \alpha_1] \\
\alpha_j &\sim \text{Normal}(0, 1)
\end{align*}
$$

Fit the Total Effect Model

Summarize the Total Effect Estimates

Direct Effect Estimator (now confounded due to common ability cause)

  • Estimating Direct effect includes Department in the adjustment set
  • However, stratifying by Department (collider) opens a confounder backdoor path through unobserved ability, u
$$
\begin{align*}
A_i &\sim \text{Bernoulli}(p=p_i) \\
\text{logit}(p_i) &= \alpha_{[D_i, G_i]} \\
\alpha &= \begin{bmatrix} \alpha_{0,0} & \alpha_{0,1} \\ \alpha_{1,0} & \alpha_{1,1} \end{bmatrix} \\
\alpha_{j,k} &\sim \text{Normal}(0, 1)
\end{align*}
$$

Fit the (confounded) Direct Effect Model

Summarize the (confounded) Direct Effect Estimates

Interpreting the (confounded) Direct Effect Estimates

Interpreting the Confounded Direct Effect Model

  • In D1, G1 appears to be disadvantaged, with a lower admission rate. However, we know this isn't true. What's happening is that the higher-ability G1 applicants are being sent to D2, artificially lowering the G1 acceptance rate in D1
  • We know that there is bias in D2; however, we see little evidence of discrimination. This is because higher-ability G1 applicants offset the bias with their higher-than-average acceptance rate
  • The G1 estimates for D2 have higher variance; because only 10% of applicants have high ability, fewer G1 applicants apply to D2 overall

You guessed it: Collider Bias

  • This is due to collider bias
    • stratifying by Department--which forms a collider with Gender and ability--opens a path from ability to acceptance.
    • You CANNOT estimate the Direct effect of $G$ on $A$
    • You CAN estimate the Total effect
  • sorting can mask or accentuate bias

Analogous Example: NAS Membership & Citations

Two papers

  • same data
  • find drastically different conclusions about Gender and its effect on election to the National Academy of Sciences (NAS)
    • One found women are strongly advantaged
    • The other found women strongly disadvantaged
  • How can both conclusions be true?
  • There are likely latent Researcher quality differences
  • Stratifying by number of citations opens up collider bias with unobserved Researcher quality; Citations is a post-treatment variable
  • Citation-stratification provides misleading conclusions
  • e.g. if there is discrimination in publication/citation, one gender may get elected at a higher rate simply because they will have higher quality on average at any given citation level

No Causes in, no causes out

These papers suffer from a number of shortcomings

  • vague estimands
  • unwise adjustment sets
  • requires stronger assumptions than presented
  • collider bias could affect policy design in a bad way
  • qualitative data can be useful in these circumstances

Sensitivity Analysis: Modeling latent ability confound variable

What are the implications of things we can't measure?

Similar to Direct Effect scenario

  • Estimating the Direct effect includes Department in the adjustment set
  • However, stratifying by Department (collider) opens a confounder backdoor path through unobserved ability, u

Though we can't directly measure a potential confound, what we can do is simulate the degree of the effect of a potential confound. Specifically, we can set up a simulation where we create a random variable associated with the potential confound, then weight the amount of contribution that confound has on generating the observed data.

In this particular example, we can simulate the degree of effect of an ability random variable $U \sim \text{Normal}(0, 1)$ by adding a linearly weighted contribution of that variable to the log odds of acceptance and of selecting a department (because $u$ affects both $D$ and $A$ in our causal graph):

Department submodel

$$
\begin{align*}
D_i &\sim \text{Bernoulli}(q_i) \\
\text{logit}(q_i) &= \delta[G_i] + \gamma_{G[i]} u_i \\
\delta[G_i] &\sim \text{Normal}(0, 1) \\
u_j &\sim \text{Normal}(0, 1)
\end{align*}
$$

Acceptance submodel

$$
\begin{align*}
A_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \alpha[G_i, D_i] + \beta_{G[i]} u_i \\
\alpha[G_i, D_i] &\sim \text{Normal}(0, 1) \\
u_j &\sim \text{Normal}(0, 1)
\end{align*}
$$

where we set the values of $\beta_{G[i]}$ and $\gamma_{G[i]}$ by hand to perform the simulation.
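A rough sketch of that data-generating process follows. All numeric values, including the baseline log-odds `delta` and `alpha` and the hand-set sensitivity weights `gamma` and `beta`, are assumptions for illustration:

```python
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
N = 2000

# Sensitivity parameters set by hand: gamma[g] scales u's effect on
# department choice, beta[g] scales its effect on acceptance
gamma = [1.0, 1.0]
beta = [1.0, 1.0]
delta = [-0.5, 0.5]                   # baseline log-odds of choosing D=1, per gender
alpha = [[0.0, 0.0], [-0.5, 0.5]]     # baseline log-odds of acceptance, [g][d]

samples = []
for _ in range(N):
    g = random.randrange(2)
    u = random.gauss(0, 1)            # latent ability, never observed directly
    d = int(random.random() < inv_logit(delta[g] + gamma[g] * u))
    a = int(random.random() < inv_logit(alpha[g][d] + beta[g] * u))
    samples.append((g, d, a))
```

Varying `gamma` and `beta` over a grid is what turns this simulation into a sensitivity analysis.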

Fit the latent ability model

Summarize the latent ability estimate

Interpreting the Effect of modeling the confound

By adding sensitivity analysis that is aligned with the data-generating process, we are able to identify gender bias in department 2

Review of Sensitivity Analysis

  • Confounds exist, even if we can't measure them directly -- don't simply pretend that they don't exist
  • Address the question: what are the implications of what we don't know?
  • SA is somewhere between simulation and analysis
    • Hard-code what we don't know, and let the rest play out.
    • Vary the confound over a range (e.g. standard deviations) and show how that change affects the estimate
  • More honest than pretending that confounds do not exist.

Note on number of parameters 🤯

  • Sensitivity Analysis model has 2006 free parameters
  • Only 2000 observations
  • No biggie in Bayesian Analysis
    • The minimum sample size is 0, where we just fall back on the prior

Counts and Poisson Regression

Kline & Boyd Oceanic Technology Dataset

How is technological complexity in a society related to population size?

Estimand: Influence of population size and contact on total tools

Conceptual Ideas

  • The more innovation, the more tools
  • The more people, the more innovation
  • The more contact between cultures, the more innovation
  • Innovations (tools) are also forgotten over time, or become obsolete

Scientific Model

  • Population is treatment
  • Tools is outcome
  • Contact level moderates effect of Population (Pipe)
  • Location is unobserved confound
    • better materials
    • proximity to other cultures
    • can support larger populations
    • we'll ignore for now

Adjustment set for Direct effect of Population on Tools

  • Location, if it were observed
    • Also stratify by contact to study interactions

Modeling total tools

  • There's no upper limit on tools --> can't use Binomial
  • The Binomial distribution approaches the Poisson for large $N$ (approaching infinity) and low $p$ (approaching 0)
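We can check that convergence numerically. A small standard-library sketch comparing the Binomial($N$, $\lambda/N$) pmf to the Poisson($\lambda$) pmf as $N$ grows with $\lambda = Np$ held fixed:

```python
import math

def binom_pmf(k, n, p):
    """Binomial pmf via exact integer combinatorics."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0  # hold N * p = lambda fixed while N grows and p shrinks
tvds = []
for n in (10, 100, 1000):
    p = lam / n
    # (truncated) total variation distance between the two pmfs;
    # mass above k = 40 is negligible for lambda = 2
    tvd = 0.5 * sum(
        abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(min(n, 40) + 1)
    )
    tvds.append(tvd)
    print(n, round(tvd, 6))
```

The distance shrinks steadily as $N$ grows, which is why the Poisson is a natural likelihood for counts with no fixed maximum.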

Poisson GLM

$$
\begin{align*}
Y_i &\sim \text{Poisson}(\lambda_i) \\
\log(\lambda_i) &= \alpha + \beta x_i
\end{align*}
$$
  • link function is $\log(\lambda)$
  • inverse link function is $\exp(\alpha + \beta x_i)$
  • strictly positive $\lambda$ (due to exponential)

Poisson Priors

  • Be careful with exponential scaling; it can give shocking results! Long tails usually result
  • It's easier to shift the location (e.g. the mean of a Normal prior) and keep tight variances
  • Prior variances generally need to be quite tight, on the order of 0.1 - 0.5
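A quick sketch of why exponential scaling bites: samples from a seemingly "uninformative" Normal(0, 10) intercept prior become absurd counts after exponentiation, while a location-shifted Normal(3, 0.5) prior stays in a plausible range. The specific prior values mirror the models below, but the comparison itself is just illustrative:

```python
import math
import random
import statistics

random.seed(0)
# alpha is the intercept on the log scale; lambda = exp(alpha) is the count scale
wide = [random.gauss(0, 10) for _ in range(50_000)]    # "uninformative" prior
tight = [random.gauss(3, 0.5) for _ in range(50_000)]  # shifted location, tight scale

lam_wide = [math.exp(a) for a in wide]
lam_tight = [math.exp(a) for a in tight]

print(statistics.median(lam_wide), max(lam_wide))  # explosive long tail
print(statistics.median(lam_tight))                # plausible counts, near exp(3)
```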

Adding a Slope to the mix $\log(\lambda_i) = \alpha + \beta x_i$

Expected Count functions drawn from two different types of priors

  • Left: wide priors
  • Right: tight priors

Direct Effect Estimator

  • Estimating the Direct effect of population $P$ on number of tools $T$ includes contact level $C$ in the adjustment set, along with location $L$ (currently unobserved; we will ignore it for now)

Comparing Models

We'll estimate a couple of models in order to practice model comparison

  • Model A) A simple, global intercept Poisson GLM
  • Model B) A Poisson GLM that includes intercept and parameter for the standardized log-population, both of which are stratified by contact level

Model A - Global Intercept model

Here we model the tool count as a Poisson random variable. The Poisson rate parameter is the exponential of a linear model. In this linear model, we include only an offset for low- or high-contact populations.

$$
\begin{align*}
T_i &\sim \text{Poisson}(\lambda_i) \\
\lambda_i &= \exp(\alpha_{C[i]}) \\
\alpha &\sim \text{Normal}(3, 0.5)
\end{align*}
$$

Model B - Interaction model

Here we bracket both the intercept and the population regression coefficient by contact level

$$
\begin{align*}
T_i &\sim \text{Poisson}(\lambda_i) \\
\lambda_i &= \exp(\alpha_{C[i]} + \beta_{C[i]} \log(P_i)) \\
\alpha &\sim \text{Normal}(3, 0.5) \\
\beta &\sim \text{Normal}(0, 0.2)
\end{align*}
$$

Model Comparisons

pPSIS discussed in the lecture is analogous to p_loo in the output above

Posterior Predictions

The model above is "wack" for the following reasons:

  1. The low-contact mean intersects with the high-contact mean (around a population of 150k). This makes little sense logically
  2. The intercept for population = 0 should be very near 0. It's instead between 20 and 30 for the two groups.
  3. Tonga isn't even included in the 89% HDI for the high-contact group
  4. The error for the high-contact group is absurdly large for a majority of the population range

Can we do better?

Improving the estimator with a better scientific model

There are two immediate ways to improve the model:

  1. Use a robust regression model -- in this case a gamma-Poisson (negative-binomial) model
  2. Use a more principled scientific model

Scientific model that includes innovation and technology loss

Recall from earlier the DAG that highlights the general conceptual idea of how observed tools can arise:

Why not develop a scientific model that does just that?

Using the difference equation: $\Delta T = \alpha P^\beta - \gamma T$

  • $\Delta T$ is the change in the # of tools given the current number of tools; here $T$ can be thought of as the # of tools at the current generation
  • $\alpha$ is the innovation rate for a population of size $P$
  • $\beta$ is the elasticity, and can be thought of as a saturation rate, or "diminishing returns" factor, if we constrain $0 < \beta < 1$
  • $\gamma$ is the attrition / technology-loss rate at time $T$

Furthermore we can parameterize such an equation by the class of contact rate, $C$ as $\Delta T = \alpha_C P^{\beta_C} - \gamma T$

Now we leverage the notion of equilibrium to identify the steady-state # of tools that is eventually obtained. At this point $\Delta T = 0$, and we can solve for the resulting $\hat T$ using algebra:

$$
\begin{align*}
\Delta T &= \alpha_C P^{\beta_C} - \gamma \hat T = 0 \\
\gamma \hat T &= \alpha_C P^{\beta_C} \\
\hat T &= \frac{\alpha_C P^{\beta_C}}{\gamma}
\end{align*}
$$

Simulate the difference equation for various societies
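A minimal sketch of that simulation (the parameter values are assumptions for illustration): iterate the difference equation until it settles, and check the result against the analytic steady state $\hat T = \alpha P^\beta / \gamma$:

```python
# Assumed parameter values for illustration
alpha, beta, gamma = 0.25, 0.25, 0.25

def simulate_tools(P, n_steps=200):
    """Iterate T <- T + alpha * P**beta - gamma * T until (near) equilibrium."""
    T = 0.0
    for _ in range(n_steps):
        T += alpha * P**beta - gamma * T
    return T

for P in (1e3, 1e4, 1e5):
    T_hat = alpha * P**beta / gamma   # analytic steady state from setting dT = 0
    print(int(P), round(simulate_tools(P), 3), round(T_hat, 3))
```

Note the diminishing returns: each tenfold increase in population buys a smaller absolute increase in equilibrium tools, exactly as the elasticity $\beta < 1$ dictates.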

Innovation / Loss Statistical Model

  • Use $\hat T = \lambda$ as the expected number of tools, i.e. $T \sim \text{Poisson}(\hat T)$
  • Note: we must constrain $\lambda$ to be positive, which we can do in a couple of ways:
    1. Exponentiate variables
    2. Use appropriate priors that constrain the variables to be positive (we'll use this approach)
$$
\begin{align*}
T_i &\sim \text{Poisson}(\hat T_i) \\
\hat T_i &= \frac{\alpha_{C[i]} P^{\beta_{C[i]}}}{\gamma} \\
\alpha_j, \beta_j, \gamma &\sim \text{Exponential}(\eta)
\end{align*}
$$

Determine good prior hyperparams

We'll use an Exponential distribution as a prior on the difference equation parameters $\alpha, \beta, \gamma$. We thus need to identify a good rate hyperparameter $\eta$ for those priors.

Reasonable values for all parameters were approximately 0.25 in the simulation above. We would thus like to identify the Exponential rate parameter that covers 0.25 = 1/4.

We can see that for the $\alpha$ and $\beta$ parameters, the optimal value was around 0.23-0.28. For $\gamma$, it was a bit smaller, at around 0.09. We could potentially re-parameterize our model to have a tighter prior for the $\gamma$ variable, but meh.

Posterior predictions

Notice the following improvements over the basic interaction model

  • No weird crossover of low/high contact trends
  • zero population now associated with zero tools

Model Comparisons

We can also see that the innovation / loss model is far superior (weight=.94) in terms of LOO prediction.

Take-homes

  • Generally best to have a domain-informed scientific model
  • We still have the unobserved location confound to deal with

Review: Count GLMS

  • MaxEnt priors
    • Binomial

    • Poisson & Extensions

    • log link function; exp inverse link function

  • Robust Regression
    • Beta-Binomial
    • Gamma-Poisson

BONUS: Simpson's Pandora's Box

The reversal of some measured/estimated association when groups are either combined or separated.

  • There is nothing particularly interesting or paradoxical about Simpson's paradox
  • It is simply a statistical phenomenon
  • Any number of causal phenomena can create SP
    • Pipes and Forks can cause one flavor of SP -- namely, stratifying destroys a trend / association
    • Colliders can cause the inverse flavor -- namely, stratifying elicits a trend / association
  • You can't say one way or the other which direction of "reversal" is correct without making explicit causal claims

Classic example is UC Berkeley Admissions

  • If you do not stratify/condition on Department, you find that Females are Admitted at a lower rate
  • If you stratify/condition on Department, you find that Females are Admitted at a slightly higher rate (see above Admissions analyses)
  • Which is correct? Could be explained by either:
    • a mediator/pipe (department)
    • a collider + confound (unobserved ability)
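A toy numeric version of the reversal (counts invented for illustration; not the actual Berkeley data):

```python
# Hypothetical counts chosen to illustrate the reversal
applied = {           # (men, women) applications per department
    "A": (900, 100),
    "B": (100, 900),
}
admitted = {          # (men, women) admits per department
    "A": (540, 62),   # dept A: men 60%, women 62%
    "B": (10, 99),    # dept B: men 10%, women 11%
}

# Within each department, women are admitted at a (slightly) higher rate...
within = {
    d: (admitted[d][0] / applied[d][0], admitted[d][1] / applied[d][1])
    for d in applied
}
print(within)

# ...yet pooled over departments the association reverses, because women
# mostly applied to the more selective department
men_pool = sum(a[0] for a in admitted.values()) / sum(a[0] for a in applied.values())
women_pool = sum(a[1] for a in admitted.values()) / sum(a[1] for a in applied.values())
print(men_pool, women_pool)
```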

**For examples of how Pipes, Forks, and Colliders can "lead to" Simpson's paradox, see Lecture 05 -- Elemental Confounds**

Nonlinear Haunting

Though $Z$ is not a confound, it is a competing cause of $Y$. If the causal model is nonlinear and we stratify by $Z$ to get the direct causal effect of the treatment on the outcome, this can cause some strange outcomes akin to Simpson's paradox.

Example: Base Rate Differences

Here we simulate data where $X$ and $Z$ are independent, but $Z$ has a nonlinear causal effect on $Y$

Generative Simulation
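A rough sketch of one such simulation. The offsets in `alpha_z` and the slope `beta` are assumed values for illustration, chosen so that the Z=0 group sits near its probability ceiling:

```python
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(3)
N = 4000
# Z=0 gets a large positive base rate on the log-odds scale, so its
# outcome probability is near ceiling regardless of the treatment
alpha_z = [5.0, -1.0]
beta = 1.0

records = []
for _ in range(N):
    x = random.randrange(2)            # treatment, independent of Z
    z = random.randrange(2)            # group with a different base rate
    y = int(random.random() < inv_logit(alpha_z[z] + beta * x))
    records.append((x, z, y))

def mean_y(x, z):
    ys = [y_ for x_, z_, y_ in records if x_ == x and z_ == z]
    return sum(ys) / len(ys)

# Treatment effect on the probability scale, per group: although beta is
# identical for both groups, the effect is only visible for Z=1
diffs = [mean_y(1, z) - mean_y(0, z) for z in (0, 1)]
print([round(d, 3) for d in diffs])
```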

Unstratified Model -- $\text{logit}(p_i) = \alpha + \beta X_i$

Partially Stratified Model -- $\text{logit}(p_i) = \alpha + \beta_{Z[i]} X_i$

Unstratified model posterior predictions

Partially Stratified model posterior predictions

Plot the effect of stratification

When stratifying only the X coefficient, and thus sharing a common intercept, we can see that for Z=0 there is saturation around 0.6. This is due to the +5 added to the log odds of Y|Z=0 in the logistic regression model. Because of this saturation, it's difficult to tell whether the treatment affects the outcome for that group.

Try a fully-stratified model -- $\text{logit}(p_i) = \alpha_{Z[i]} + \beta_{Z[i]}X_i$

Include a separate intercept for each group

Fully Stratified Model posterior predictions

Here we can see that with a fully stratified model, one in which we include a group-level intercept, the predictions for Z=0 shift up even higher toward one, though the predictions remain mostly flat across all values of the treatment X

Compare Posteriors

Simpson's Paradox Summary

  • No paradox, almost anything can produce SP
  • Coefficient reversals have little interpretive value outside of causal framework
  • Don't focus on coefficients: push predictions through model to compare
  • Random note: you can't accept the NULL, you can only reject it.
    • Just because a distribution overlaps 0 doesn't mean it's zero

Authors

  • Ported to PyMC by Dustin Stansbury (2024)
  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

:::{include} ../page_footer.md :::