(lecture_09)=

Modeling Events

:::{post} Jan 7, 2024
:tags: statistical rethinking, bayesian inference, logistic regression
:category: intermediate
:author: Dustin Stansbury
:::

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 09 - Modeling Events

UC Berkeley Admissions

Dataset

  • 4,526 graduate student applications
  • Stratified by
    • Department
    • Gender

Goal is to identify gender discrimination by admissions officers

Modeling Events

  • Events: discrete, unordered outcomes
  • Observations are counts/aggregates
  • Unknowns are the probability $p$ of an event occurring, or the odds of the event, $\frac{p}{1-p}$
  • All things we stratify by interact -- generally never independent in real life
  • Often we work with the log-odds of $p$: $\text{logit}(p) = \log \left(\frac{p}{1-p} \right)$ (see the snippet below)
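For example, converting between probability, odds, and log-odds (using scipy's `logit`/`expit` helpers):

```python
import numpy as np
from scipy.special import logit, expit

p = 0.25                    # probability of the event
odds = p / (1 - p)          # 1/3
log_odds = np.log(odds)     # ~ -1.10, identical to logit(p)

print(np.isclose(log_odds, logit(p)))  # True
print(expit(log_odds))                 # back to p = 0.25
```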

Admissions: Drawing the owl 🦉

  1. Estimand(s)
  2. Scientific Model(s)
  3. Statistical Model(s)
  4. Analysis

1. Estimand(s)

Which path defines "discrimination"?

Direct Discrimination (Causal Effect)

  • aka "Institutional discrimination"
  • Referees are biased for or against a particular group
  • Usually the type we're interested in identifying if it exists
  • Requires strong causal assumptions

Here, department, D, is a mediator -- this is a common structure in the social sciences, where categorical status (e.g. gender) affects some mediating context (e.g. occupation), both of which affect a target outcome (e.g. wage). Examples:

  • wage discrimination
  • hiring
  • scientific awards

Indirect Discrimination

  • aka "Structural discrimination"
  • e.g. Gender affects a person's interests, and therefore the department they will apply to
  • Affects overall admission rates
  • Requires strong causal assumptions

Total Discrimination

  • aka "Experienced discrimination"
  • through both direct and indirect routes
  • Requires mild assumptions

Estimands & Estimators

  • Each of the different estimands requires a different estimator
  • The thing we can estimate often isn't the thing we want to estimate
  • e.g. Total discrimination may be easier to estimate, but is less actionable

Unobserved Confounds

It's always possible that unobserved confounds affect both the mediator and the outcome. We will ignore these for now.

2. Scientific Model(s):

Simulate the process

Below is a generative model of the review/admission process
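A minimal sketch of such a simulation is shown here; the department preferences and acceptance rates are illustrative stand-ins, not the notebook's exact values:

```python
import numpy as np

def simulate_admissions(n_applicants=1000, gender_bias=0.0, seed=12):
    """Simulate applications and admissions for 2 genders and 2 departments.

    gender_bias < 0 lowers gender 0's log-odds of acceptance in every
    department (direct discrimination); 0.0 gives an unbiased process.
    """
    rng = np.random.default_rng(seed)
    gender = rng.integers(0, 2, n_applicants)

    # Indirect path: gender 0 tends to apply to dept 0, gender 1 to dept 1
    p_dept_1 = np.where(gender == 0, 0.25, 0.75)
    dept = rng.binomial(1, p_dept_1)

    # Departments differ in base acceptance rates (dept 0 is more selective)
    base_logit = np.where(dept == 0, -1.5, 0.5)
    logit_p = base_logit + np.where(gender == 0, gender_bias, 0.0)
    p_accept = 1 / (1 + np.exp(-logit_p))
    admitted = rng.binomial(1, p_accept)
    return gender, dept, admitted

# Unbiased vs. biased scenarios
unbiased = simulate_admissions(gender_bias=0.0)
biased = simulate_admissions(gender_bias=-1.0)
```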

Simulated data properties

Both scenarios

  • Gender 0 tends to apply to department 0
  • Gender 1 tends to apply to department 1

Unbiased scenario:

  • due to lower acceptance rates in dept 0 and the tendency of gender 0 to apply to dept 0, gender 0 has a lower overall acceptance rate compared to gender 1
  • due to higher acceptance rates in dept 1 and the tendency of gender 1 to apply to dept 1, gender 1 has a higher overall acceptance rate compared to gender 0
  • even in the case of no (direct) gender discrimination, there is still indirect discrimination based on the tendency of the genders to apply to different departments, and the unequal likelihood that each department accepts students.

Biased scenario

  • overall acceptance rates are lower (due to baseline reduction in gender 0 acceptance rates across both departments)
  • in the scenario where there is direct gender bias, we see a similar overall pattern of discrimination as in the unbiased case, due to the indirect effect.

3. Statistical Models

Modeling Events

  • We observe counts of events
  • We estimate probabilities -- or, rather, the log-odds of events occurring

Linear Models

Expected value is linear (additive) combination of parameters

$$
\begin{align*}
Y_i &\sim \mathcal{N}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_X X_i + \dots
\end{align*}
$$

Because the Normal distribution is unbounded, so too is the expected value of the linear model.

Event Models

Discrete events either occur, taking the value 1, or they do not, taking the value 0. This puts bounds on the expected value of an event: namely, it must lie on the interval $(0, 1)$.

Generalized Linear Models & Link Functions

  • Some function $f$ of the expected value is an additive combination of the parameters:

$$f(\mathbb{E}[\theta]) = \alpha + \beta_X X_i + \dots$$

  • Generally, this function $f()$, called the link function, has a specific form that is associated with the likelihood distribution -- link functions are matched to distributions.
  • The link function has an inverse link function $f^{-1}()$ such that

$$\mathbb{E}[\theta] = f^{-1}(\alpha + \beta_X X_i + \dots)$$

  • In the linear regression case the likelihood is Gaussian, and the associated link function (and inverse link function) is just the identity, $I()$. (left plot below)

Logit Link Function and Logistic Inverse Link Function

Binary outcomes

  • Logistic regression is used to model the probability of an event occurring
  • The likelihood function is the Bernoulli
  • The associated link function is the log-odds, or logit: $f(p) = \log \left( \frac{p}{1-p} \right)$

$$
\begin{align*}
Y_i &\sim \text{Bernoulli}(p_i) \\
f(p_i) &= \alpha + \beta_X X_i + \dots
\end{align*}
$$
  • the inverse link function is the inverse logit aka logistic function (right plot above): $f^{-1}(X_i, \theta) = \frac{1}{1 + e^{-(\alpha + \beta_X X_i + ...)}}$
    • defining $\alpha + \beta_X X + ... = q$ (ignoring the data index $i$ for now), the derivation of the inverse logit is as follows
$$
\begin{align}
\log \left(\frac{p}{1-p}\right) &= \alpha + \beta_X X + \dots = q \\
\frac{p}{1-p} &= e^{q} \\
p &= (1-p) e^{q} \\
p + p e^{q} &= e^{q} \\
p(1 + e^{q}) &= e^{q} \\
p &= \frac{e^{q}}{1 + e^{q}} \\
p &= \frac{e^{q}}{1 + e^{q}} \cdot \frac{e^{-q}}{e^{-q}} \\
p &= \frac{1}{1 + e^{-q}}
\end{align}
$$

logit is a harsh transform

Interpreting the log odds can be difficult at first, but in time becomes easier

  • log-odds scale
  • $\text{logit}(p)=0, p=0.5$
  • $\text{logit}(p)=-\infty, p=0$
    • rule of thumb: $\text{logit}(p)<-4$ means event is unlikely (hardly ever)
  • $\text{logit}(p)=\infty, p=1$
    • rule of thumb: $\text{logit}(p)>4$ means event is very likely (nearly always)
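To make the rules of thumb concrete, a quick check with scipy's inverse-logit (`expit`):

```python
from scipy.special import expit  # inverse logit (logistic function)

print(expit(0.0))   # 0.5
print(expit(-4.0))  # ~0.018 -- event hardly ever occurs
print(expit(4.0))   # ~0.982 -- event nearly always occurs
```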

Bayesian Updating for Logistic Regression

For the following simulation, we'll use a custom utility function utils.simulate_2_parameter_bayesian_learning to simulate general Bayesian posterior updating. Here's the API for that function (for more details, see utils.py).
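The utility's internals aren't reproduced here, but the same idea can be sketched with a simple grid approximation: simulate Bernoulli data from a known two-parameter logistic model, then evaluate the posterior over $(\alpha, \beta)$ on a grid (all names and values below are illustrative):

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(3)

# Simulate Bernoulli outcomes from a known 2-parameter logistic model
alpha_true, beta_true = -1.0, 2.0
x = rng.normal(size=100)
y = rng.binomial(1, expit(alpha_true + beta_true * x))

# Evaluate the (unnormalized) posterior over (alpha, beta) on a grid
alpha_grid, beta_grid = np.meshgrid(np.linspace(-3, 3, 201), np.linspace(-2, 5, 201))
p = expit(alpha_grid[..., None] + beta_grid[..., None] * x)       # shape (201, 201, 100)
log_like = (y * np.log(p) + (1 - y) * np.log1p(-p)).sum(axis=-1)
log_post = log_like + norm.logpdf(alpha_grid, 0, 1) + norm.logpdf(beta_grid, 0, 1)
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()

# The grid point with the highest posterior probability should be near the truth
i, j = np.unravel_index(posterior.argmax(), posterior.shape)
print(alpha_grid[i, j], beta_grid[i, j])
```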

Priors for logistic regression

logit link function is a harsh transform

  • Logit compresses parameter distributions
  • $x > +4 \rightarrow $ event basically always occurs
  • $x < -4 \rightarrow$ event basically never occurs

Prior Predictive Simulations
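A minimal sketch of a prior predictive check for the intercept of a logistic regression; the two priors compared here (Normal(0, 1) vs. Normal(0, 10)) are illustrative, but they show why wide priors on the log-odds scale pile prior mass near 0 and 1:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit

rng = np.random.default_rng(1)

# Sample intercepts from a narrow and a wide prior on the log-odds scale,
# then push them through the inverse link to the probability scale
p_narrow = expit(rng.normal(0, 1, size=10_000))
p_wide = expit(rng.normal(0, 10, size=10_000))

fig, axs = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axs[0].hist(p_narrow, bins=50)
axs[0].set_title("alpha ~ Normal(0, 1)")
axs[1].hist(p_wide, bins=50)
axs[1].set_title("alpha ~ Normal(0, 10)")
```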

Statistical models for admissions

Again, the estimator will depend on the estimand

Total Causal Effect

Stratify by only Gender. Don't stratify by Department b.c. it's a Pipe (mediator) that we do not want to block

$$
\begin{align*}
A_i &\sim \text{Bernoulli}(p=p_i) \\
\text{logit}(p_i) &= \alpha[G_i] \\
\alpha &= [\alpha_0, \alpha_1] \\
\alpha_j &\sim \text{Normal}(0, 1)
\end{align*}
$$
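A minimal PyMC sketch of this total-effect model, using the Bernoulli form on individual applications (the arrays `G` and `admitted` are hypothetical stand-ins for the simulated data; the notebook itself fits the aggregated Binomial form discussed later):

```python
import numpy as np
import pymc as pm

# Hypothetical data: gender index and 0/1 admission outcome per applicant
G = np.array([0, 0, 1, 1, 0, 1, 1, 0])
admitted = np.array([0, 1, 1, 1, 0, 1, 0, 0])

with pm.Model() as total_effect_model:
    alpha = pm.Normal("alpha", 0, 1, shape=2)     # one intercept per gender
    p = pm.math.invlogit(alpha[G])                # inverse link: log-odds -> probability
    pm.Bernoulli("A", p=p, observed=admitted)
    total_effect_idata = pm.sample()
```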

Direct Causal Effect

Stratify by Gender and Department to block the Pipe

$$
\begin{align*}
A_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \alpha[G_i, D_i] \\
\alpha &= \begin{bmatrix} \alpha_{0,0} & \alpha_{0,1} \\ \alpha_{1,0} & \alpha_{1,1} \end{bmatrix} \\
\alpha_{j,k} &\sim \text{Normal}(0, 1)
\end{align*}
$$
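The corresponding sketch for the direct-effect model only changes the intercept structure: a $2 \times 2$ array indexed by both gender and department (again with hypothetical stand-in data):

```python
import numpy as np
import pymc as pm

# Hypothetical data: gender index, department index, and admission outcome
G = np.array([0, 0, 1, 1, 0, 1, 1, 0])
D = np.array([0, 1, 1, 0, 0, 1, 0, 1])
admitted = np.array([0, 1, 1, 1, 0, 1, 0, 0])

with pm.Model() as direct_effect_model:
    alpha = pm.Normal("alpha", 0, 1, shape=(2, 2))  # intercept per (gender, department)
    p = pm.math.invlogit(alpha[G, D])               # stratify by both to block the pipe
    pm.Bernoulli("A", p=p, observed=admitted)
    direct_effect_idata = pm.sample()
```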

Fitting Total Causal Effect Model

Fit Total Effect model on BIASED simulated admissions

Fit Total Effect model on UNBIASED simulated admissions

Fitting Direct Causal Effect Model

Fit Direct Effect model to BIASED simulated admissions data

For comparison, here's the ground truth biased admission rates, which we're able to mostly recover:

Fit Direct Effect model to UNBIASED simulated admissions data

For comparison, here's the ground truth unbiased admission rates, which we're able to recover:

4. Analyze the UC Berkeley Admissions data

Review of the counts dataset

  • we'll use these counts data to model log odds of acceptance rate for gender/department
  • note that we'll be using Binomial Regression, which is equivalent to Bernoulli regression, but operates on aggregate counts, as opposed to individual binary trials (see the sketch below)
    • The examples above were actually implemented as Binomial Regression so that we can re-use code and demonstrate general patterns of analysis
    • Either way, you'll get the same inference from both approaches
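As a sketch of the aggregated form, the Bernoulli likelihood above can be swapped for a Binomial over the per-cell counts (the counts here are hypothetical, not the UC Berkeley values):

```python
import numpy as np
import pymc as pm

# Hypothetical aggregated counts: one row per (gender, department) cell
G = np.array([0, 0, 1, 1])
D = np.array([0, 1, 0, 1])
applications = np.array([400, 100, 150, 350])
admits = np.array([90, 60, 30, 210])

with pm.Model() as binomial_model:
    alpha = pm.Normal("alpha", 0, 1, shape=(2, 2))
    p = pm.math.invlogit(alpha[G, D])
    # Binomial on cell counts gives the same inference as Bernoulli on the raw trials
    pm.Binomial("admits", n=applications, p=p, observed=admits)
    binomial_idata = pm.sample()
```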

Fit Total Effect model to UC Berkeley admissions data

Don't forget to look at diagnostics, which we'll skip here
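For example, assuming the fitted trace is stored in an InferenceData object called `total_effect_idata`, the usual ArviZ checks look like this:

```python
import arviz as az

az.summary(total_effect_idata, var_names=["alpha"])    # r_hat, effective sample sizes, etc.
az.plot_trace(total_effect_idata, var_names=["alpha"]);
```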

Total Causal Effect

Fit Direct Effect model to UC Berkeley admissions data

Direct Causal Effect

Average Direct Effect

  • The causal effect of gender averaged across departments (marginalize)
  • Depends on the distribution of applications to each department
  • This is easy to do as a simulation

Post-stratification

  • Re-weight estimates for the target population
  • Allows us to apply a model fit at one university to estimate the causal impact at another university with a different distribution of departments
  • Here, we use the empirical distribution of applications for re-weighting the estimates (see the sketch below)
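A minimal sketch of the re-weighting step, assuming posterior samples of acceptance probability arranged as `p_accept[sample, gender, department]` and a department weighting from the target population (the arrays and shapes here are illustrative stand-ins):

```python
import numpy as np

# Stand-in posterior samples: shape (n_samples, n_genders, n_departments)
rng = np.random.default_rng(0)
p_accept = rng.uniform(size=(4000, 2, 2))

# Distribution of applications across departments in the target population
dept_weights = np.array([0.6, 0.4])

# Marginalize over departments using the application weights
p_gender = (p_accept * dept_weights[None, None, :]).sum(axis=-1)   # (n_samples, 2)

# Average direct effect: gender contrast, averaged over departments
contrast = p_gender[:, 0] - p_gender[:, 1]
print(contrast.mean(), np.quantile(contrast, [0.03, 0.97]))
```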

To verify the averaging process, we can look at the contrast of the p_accept samples from the posterior, which gives similar results. However, looking at the posterior alone wouldn't work for making predictions for an out-of-sample university.

Discrimination?

Hard to say

  • Big structural effects
  • Distribution of applicants could be due to discrimination
  • Confounds are likely

BONUS: Survival Analysis

  • Counts are often better modeled as time-to-event (e.g. with an Exponential or Gamma distribution) -- the goal is to estimate the event rate
  • Tricky b.c. ignoring censored data can lead to inferential errors
    • Left censored: we don't know when the timer started
    • Right censored: the observation period finished before the event had time to happen

Example: Austin Cat Adoption

  • 20k cats
  • Events: adopted (1) or not (0)
  • Censoring mechanisms
    • death before adoption
    • escape
    • not adopted yet

Goal: determine if Black cats are adopted at a lower rate than non-Black cats.

Modeling outcome variable: days_to_event

Two go-to distributions for modeling time-to-event

Exponential Distribution:

  • simpler of the two (single parameter)
  • assumes constant rate
  • maximum entropy distribution amongst the set of non-negative continuous distributions that have the same average rate.
  • assumes a single thing must occur (e.g. a part failure) before the event occurs (e.g. machine breakdown)

Gamma Distribution

  • more complex of the two (two parameters)
  • maximum entropy distribution amongst the set of distributions with the same mean and average log value.
  • assumes multiple things must happen (e.g. several part failures) before the event occurs (e.g. machine breakdown)

Censored and un-censored observations

  • Observed data use the Cumulative distribution; i.e. the probability that the event occurred by time x
  • Unobserved (censored) data instead require the complement of the CDF (the CCDF); i.e. the probability that the event hasn't happened yet.

Statistical Model

$$
\begin{align*}
D_i | A_i = 1 &\sim \text{Exponential}(\lambda_i) \\
D_i | A_i = 0 &\sim \text{ExponentialCCDF}(\lambda_i) \\
\lambda_i &= \frac{1}{\mu_i} \\
\log \mu_i &= \alpha_{\text{cat color}[i]} \\
\alpha_{\text{Black, Other}} &\sim \text{Exponential}(\gamma)
\end{align*}
$$
  • $D_i | A_i = 1$ - observed adoptions
  • $D_i | A_i = 0$ - not-yet-observed (censored) adoptions
  • $\alpha_{\text{cat color}[i]}$ -- log average time-to-adoption for each cat color
  • $\log \mu_i$ -- the log link ensures $\mu_i$ is positive
  • $\lambda_i = \frac{1}{\mu_i}$ simplifies estimating average time-to-adoption
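A minimal PyMC sketch of one way to express this censored-Exponential likelihood, with hypothetical `days`/`adopted`/`color` arrays; for simplicity it treats $\alpha$ directly as the average days-to-adoption with an Exponential prior whose mean is ~50 days (chosen below), so the exact parameterization may differ from the notebook's:

```python
import numpy as np
import pymc as pm

# Hypothetical data: days observed, adoption indicator, and cat-color index
days = np.array([12.0, 30.0, 200.0, 45.0, 120.0, 8.0])
adopted = np.array([1, 1, 0, 1, 0, 1])      # 0 = censored (not yet adopted)
color = np.array([0, 1, 0, 1, 1, 0])        # 0 = black, 1 = other
obs, cens = adopted == 1, adopted == 0

with pm.Model() as adoption_model:
    # Average days-to-adoption per color, Exponential prior with mean ~50 days
    alpha = pm.Exponential("alpha", lam=1 / 50, shape=2)

    # Observed adoptions: ordinary Exponential likelihood with rate 1/mu
    lam_obs = 1 / alpha[color[obs]]
    pm.Exponential("D_obs", lam=lam_obs, observed=days[obs])

    # Censored cats: add the Exponential log-CCDF, log(1 - F(t)) = -lambda * t
    lam_cens = 1 / alpha[color[cens]]
    pm.Potential("D_cens", (-lam_cens * days[cens]).sum())

    adoption_idata = pm.sample()
```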

Finding a reasonable hyperparameter for $\alpha$

We'll need to determine a reasonable value for the Exponential prior mean parameter $\gamma$. To do so, we'll look at the empirical distribution of time to adoption:

Using the above empirical histogram, we see that a majority of the probability mass is between zero and 200 days, so let's use 50 as the expected wait time.

Poor black kitties 🐈‍⬛

It appears that black cats DO take longer to get adopted.

Posterior Summary

Posterior distributions

Authors

  • Ported to PyMC by Dustin Stansbury (2024)
  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

:::{include} ../page_footer.md :::