(lecture_09)=

Modeling Events

:::{post} Jan 7, 2024
:tags: statistical rethinking, bayesian inference, logistic regression
:category: intermediate
:author: Dustin Stansbury
:::

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 09 - Modeling Events

UC Berkeley Admissions

Dataset

  • 4,526 graduate student applications
  • Stratified by
    • Department
    • Gender

Goal is to identify gender discrimination by admissions officers

Modeling Events

  • Events: discrete, unordered outcomes
  • Observations are counts/aggregates
  • Unknowns are the probability $p$ of an event occurring, or the odds of the event, $\frac{p}{1-p}$
  • All things we stratify by interact -- generally never independent in real life
  • Often we work with the log-odds of $p$: $\text{logit}(p) = \log \left(\frac{p}{1-p} \right)$ (see the snippet below)
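For example, converting between probability, odds, and log-odds (using scipy's `logit`/`expit` helpers):

```python
import numpy as np
from scipy.special import logit, expit

p = 0.25                    # probability of the event
odds = p / (1 - p)          # 1/3
log_odds = np.log(odds)     # ~ -1.10, identical to logit(p)

print(np.isclose(log_odds, logit(p)))  # True
print(expit(log_odds))                 # back to p = 0.25
```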

Admissions: Drawing the owl 🦉

  1. Estimand(s)
  2. Scientific Model(s)
  3. Statistical Model(s)
  4. Analysis

1. Estimand(s)

Which path defines "discrimination"?

Direct Discrimination (Causal Effect)

  • aka "Institutional discrimination"
  • Referees are biased for or against a particular group
  • Usually the type we're interested in identifying if it exists
  • Requires strong causal assumptions

Here, department, D, is a mediator -- this is a common structure in the social sciences, where categorical status (e.g. gender) affects some mediating context (e.g. occupation), both of which affect a target outcome (e.g. wage). Examples:

  • wage discrimination
  • hiring
  • scientific awards

Indirect Discrimination

  • aka "Structural discrimination"
  • e.g. Gender affects a person's interests, and therefore the department they will apply to
  • Affects overall admission rates
  • Requires strong causal assumptions

Total Discrimination

  • aka "Experienced discrimination"
  • through both direct and indirect routes
  • Requires mild assumptions

Estimands & Estimators

  • Each of the different estimands requires a different estimator
  • The thing we can estimate often isn't the thing we want to estimate
  • e.g. Total discrimination may be easier to estimate, but is less actionable

Unobserved Confounds

It's always possible that unobserved confounds affect both the mediator and the outcome. We will ignore these for now.

2. Scientific Model(s):

Simulate the process

Below is a generative model of the review/admission process
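A minimal sketch of such a simulation is shown here; the department preferences and acceptance rates are illustrative stand-ins, not the notebook's exact values:

```python
import numpy as np

def simulate_admissions(n_applicants=1000, gender_bias=0.0, seed=12):
    """Simulate applications and admissions for 2 genders and 2 departments.

    gender_bias < 0 lowers gender 0's log-odds of acceptance in every
    department (direct discrimination); 0.0 gives an unbiased process.
    """
    rng = np.random.default_rng(seed)
    gender = rng.integers(0, 2, n_applicants)

    # Indirect path: gender 0 tends to apply to dept 0, gender 1 to dept 1
    p_dept_1 = np.where(gender == 0, 0.25, 0.75)
    dept = rng.binomial(1, p_dept_1)

    # Departments differ in base acceptance rates (dept 0 is more selective)
    base_logit = np.where(dept == 0, -1.5, 0.5)
    logit_p = base_logit + np.where(gender == 0, gender_bias, 0.0)
    p_accept = 1 / (1 + np.exp(-logit_p))
    admitted = rng.binomial(1, p_accept)
    return gender, dept, admitted

# Unbiased vs. biased scenarios
unbiased = simulate_admissions(gender_bias=0.0)
biased = simulate_admissions(gender_bias=-1.0)
```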

Simulated data properties

Both scenarios

  • Gender 0 tends to apply to department 0
  • Gender 1 tends to apply to department 1

Unbiased scenario:

  • due to lower acceptance rates in dept 0 and the tendency of gender 0 to apply to dept 0, gender 0 has a lower overall acceptance rate compared to gender 1
  • due to higher acceptance rates in dept 1 and the tendency of gender 1 to apply to dept 1, gender 1 has a higher overall acceptance rate compared to gender 0
  • even in the case of no (direct) gender discrimination, there is still indirect discrimination based on the tendency of the genders to apply to different departments, and the unequal likelihood that each department accepts students.

Biased scenario

  • overall acceptance rates are lower (due to baseline reduction in gender 0 acceptance rates across both departments)
  • in the scenario where there is direct gender bias, we see a similar overall pattern of discrimination as in the unbiased case, due to the indirect effect.

3. Statistical Models

Modeling Events

  • We observe counts of events
  • We estimate probabilities -- or, rather, the log-odds of events occurring

Linear Models

Expected value is linear (additive) combination of parameters

$$
\begin{align*}
Y_i &\sim \mathcal{N}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_X X_i + \dots
\end{align*}
$$

Because the Normal distribution is unbounded, so too is the expected value of the linear model.

Event Models

Discrete events either occur, taking the value 1, or they do not, taking the value 0. This puts bounds on the expected value of an event: namely, it must lie on the interval $(0, 1)$.

Generalized Linear Models & Link Functions

  • Some function $f$ of the expected value is an additive combination of the parameters:

$$f(\mathbb{E}[\theta]) = \alpha + \beta_X X_i + \dots$$

  • Generally, this function $f()$, called the link function, has a specific form that is associated with the likelihood distribution -- link functions are matched to distributions.
  • The link function has an inverse link function $f^{-1}()$ such that

$$\mathbb{E}[\theta] = f^{-1}(\alpha + \beta_X X_i + \dots)$$

  • In the linear regression case the likelihood is Gaussian, and the associated link function (and inverse link function) is just the identity, $I()$. (left plot below)

Logit Link Function and Logistic Inverse Link Function

Binary outcomes

  • Logistic regression is used to model the probability of an event occurring
  • The likelihood function is the Bernoulli
  • The associated link function is the log-odds, or logit: $f(p) = \log \left( \frac{p}{1-p} \right)$

$$
\begin{align*}
Y_i &\sim \text{Bernoulli}(p_i) \\
f(p_i) &= \alpha + \beta_X X_i + \dots
\end{align*}
$$
  • the inverse link function is the inverse logit aka logistic function (right plot above): $f^{-1}(X_i, \theta) = \frac{1}{1 + e^{-(\alpha + \beta_X X_i + ...)}}$
    • defining $\alpha + \beta_X X + ... = q$ (ignoring the data index $i$ for now), the derivation of the inverse logit is as follows
$$
\begin{align}
\log \left(\frac{p}{1-p}\right) &= \alpha + \beta_X X + \dots = q \\
\frac{p}{1-p} &= e^{q} \\
p &= (1-p) e^{q} \\
p + p e^{q} &= e^{q} \\
p(1 + e^{q}) &= e^{q} \\
p &= \frac{e^{q}}{1 + e^{q}} \\
p &= \frac{e^{q}}{1 + e^{q}} \cdot \frac{e^{-q}}{e^{-q}} \\
p &= \frac{1}{1 + e^{-q}}
\end{align}
$$

logit is a harsh transform

Interpreting the log odds can be difficult at first, but in time becomes easier

  • log-odds scale
  • $\text{logit}(p)=0, p=0.5$
  • $\text{logit}(p)=-\infty, p=0$
    • rule of thumb: $\text{logit}(p)<-4$ means event is unlikely (hardly ever)
  • $\text{logit}(p)=\infty, p=1$
    • rule of thumb: $\text{logit}(p)>4$ means event is very likely (nearly always)
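To make the rules of thumb concrete, a quick check with scipy's inverse-logit (`expit`):

```python
from scipy.special import expit  # inverse logit (logistic function)

print(expit(0.0))   # 0.5
print(expit(-4.0))  # ~0.018 -- event hardly ever occurs
print(expit(4.0))   # ~0.982 -- event nearly always occurs
```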

Bayesian Updating for Logistic Regression

For the following simulation, we'll use a custom utility function utils.simulate_2_parameter_bayesian_learning to simulate general Bayesian posterior updating. Here's the API for that function (for more details, see utils.py).
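The utility's internals aren't reproduced here, but the same idea can be sketched with a simple grid approximation: simulate Bernoulli data from a known two-parameter logistic model, then evaluate the posterior over $(\alpha, \beta)$ on a grid (all names and values below are illustrative):

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(3)

# Simulate Bernoulli outcomes from a known 2-parameter logistic model
alpha_true, beta_true = -1.0, 2.0
x = rng.normal(size=100)
y = rng.binomial(1, expit(alpha_true + beta_true * x))

# Evaluate the (unnormalized) posterior over (alpha, beta) on a grid
alpha_grid, beta_grid = np.meshgrid(np.linspace(-3, 3, 201), np.linspace(-2, 5, 201))
p = expit(alpha_grid[..., None] + beta_grid[..., None] * x)       # shape (201, 201, 100)
log_like = (y * np.log(p) + (1 - y) * np.log1p(-p)).sum(axis=-1)
log_post = log_like + norm.logpdf(alpha_grid, 0, 1) + norm.logpdf(beta_grid, 0, 1)
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()

# The grid point with the highest posterior probability should be near the truth
i, j = np.unravel_index(posterior.argmax(), posterior.shape)
print(alpha_grid[i, j], beta_grid[i, j])
```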

Priors for logistic regression

logit link function is a harsh transform

  • Logit compresses parameter distributions
  • $x > +4 \rightarrow $ event basically always occurs
  • $x < -4 \rightarrow$ event basically never occurs

Prior Predictive Simulations
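A minimal sketch of a prior predictive check for the intercept of a logistic regression; the two priors compared here (Normal(0, 1) vs. Normal(0, 10)) are illustrative, but they show why wide priors on the log-odds scale pile prior mass near 0 and 1:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit

rng = np.random.default_rng(1)

# Sample intercepts from a narrow and a wide prior on the log-odds scale,
# then push them through the inverse link to the probability scale
p_narrow = expit(rng.normal(0, 1, size=10_000))
p_wide = expit(rng.normal(0, 10, size=10_000))

fig, axs = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axs[0].hist(p_narrow, bins=50)
axs[0].set_title("alpha ~ Normal(0, 1)")
axs[1].hist(p_wide, bins=50)
axs[1].set_title("alpha ~ Normal(0, 10)")
```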

Statistical models for admissions

Again, the estimator will depend on the estimand

Total Causal Effect

Stratify by only Gender. Don't stratify by Department b.c. it's a Pipe (mediator) that we do not want to block

$$
\begin{align*}
A_i &\sim \text{Bernoulli}(p=p_i) \\
\text{logit}(p_i) &= \alpha[G_i] \\
\alpha &= [\alpha_0, \alpha_1] \\
\alpha_j &\sim \text{Normal}(0, 1)
\end{align*}
$$
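A minimal PyMC sketch of this total-effect model, using the Bernoulli form on individual applications (the arrays `G` and `admitted` are hypothetical stand-ins for the simulated data; the notebook itself fits the aggregated Binomial form discussed later):

```python
import numpy as np
import pymc as pm

# Hypothetical data: gender index and 0/1 admission outcome per applicant
G = np.array([0, 0, 1, 1, 0, 1, 1, 0])
admitted = np.array([0, 1, 1, 1, 0, 1, 0, 0])

with pm.Model() as total_effect_model:
    alpha = pm.Normal("alpha", 0, 1, shape=2)     # one intercept per gender
    p = pm.math.invlogit(alpha[G])                # inverse link: log-odds -> probability
    pm.Bernoulli("A", p=p, observed=admitted)
    total_effect_idata = pm.sample()
```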

Direct Causal Effect

Stratify by Gender and Department to block the Pipe

$$
\begin{align*}
A_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \alpha[G_i, D_i] \\
\alpha &= \begin{bmatrix} \alpha_{0,0} & \alpha_{0,1} \\ \alpha_{1,0} & \alpha_{1,1} \end{bmatrix} \\
\alpha_{j,k} &\sim \text{Normal}(0, 1)
\end{align*}
$$
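The corresponding sketch for the direct-effect model only changes the intercept structure: a $2 \times 2$ array indexed by both gender and department (again with hypothetical stand-in data):

```python
import numpy as np
import pymc as pm

# Hypothetical data: gender index, department index, and admission outcome
G = np.array([0, 0, 1, 1, 0, 1, 1, 0])
D = np.array([0, 1, 1, 0, 0, 1, 0, 1])
admitted = np.array([0, 1, 1, 1, 0, 1, 0, 0])

with pm.Model() as direct_effect_model:
    alpha = pm.Normal("alpha", 0, 1, shape=(2, 2))  # intercept per (gender, department)
    p = pm.math.invlogit(alpha[G, D])               # stratify by both to block the pipe
    pm.Bernoulli("A", p=p, observed=admitted)
    direct_effect_idata = pm.sample()
```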

Fitting Total Causal Effect Model

Fit Total Effect model on BIASED simulated admissions

Fit Total Effect model on UNBIASED simulated admissions

Fitting Direct Causal Effect Model

Fit Direct Effect model to BIASED simulated admissions data

For comparison, here's the ground truth biased admission rates, which we're able to mostly recover:

Fit Direct Effect model to UNBIASED simulated admissions data

For comparison, here's the ground truth unbiased admission rates, which we're able to recover:

4. Analyze the UC Berkeley Admissions data

Review of the counts dataset

  • we'll use these counts data to model log odds of acceptance rate for gender/department
  • note that we'll be using Binomial Regression, which is equivalent to Bernoulli regression, but operates on aggregate counts, as opposed to individual binary trials (see the sketch below)
    • The examples above were actually implemented as Binomial Regression so that we can re-use code and demonstrate general patterns of analysis
    • Either way, you'll get the same inference from both approaches
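As a sketch of the aggregated form, the Bernoulli likelihood above can be swapped for a Binomial over the per-cell counts (the counts here are hypothetical, not the UC Berkeley values):

```python
import numpy as np
import pymc as pm

# Hypothetical aggregated counts: one row per (gender, department) cell
G = np.array([0, 0, 1, 1])
D = np.array([0, 1, 0, 1])
applications = np.array([400, 100, 150, 350])
admits = np.array([90, 60, 30, 210])

with pm.Model() as binomial_model:
    alpha = pm.Normal("alpha", 0, 1, shape=(2, 2))
    p = pm.math.invlogit(alpha[G, D])
    # Binomial on cell counts gives the same inference as Bernoulli on the raw trials
    pm.Binomial("admits", n=applications, p=p, observed=admits)
    binomial_idata = pm.sample()
```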

Fit Total Effect model to UC Berkeley admissions data

Don't forget to look at diagnostics, which we'll skip here
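For example, assuming the fitted trace is stored in an InferenceData object called `total_effect_idata`, the usual ArviZ checks look like this:

```python
import arviz as az

az.summary(total_effect_idata, var_names=["alpha"])    # r_hat, effective sample sizes, etc.
az.plot_trace(total_effect_idata, var_names=["alpha"]);
```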

Total Causal Effect

Fit Direct Effect model to UC Berkeley admissions data

Direct Causal Effect

Average Direct Effect

  • The causal effect of gender averaged across departments (marginalize)
  • Depends on the distribution of applications to each department
  • This is easy to do as a simulation

Post-stratification

  • Re-weight estimates for the target population
  • Allows us to apply a model fit at one university to estimate the causal impact at another university with a different distribution of departments
  • Here, we use the empirical distribution of applications for re-weighting the estimates (see the sketch below)
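A minimal sketch of the re-weighting step, assuming posterior samples of acceptance probability arranged as `p_accept[sample, gender, department]` and a department weighting from the target population (the arrays and shapes here are illustrative stand-ins):

```python
import numpy as np

# Stand-in posterior samples: shape (n_samples, n_genders, n_departments)
rng = np.random.default_rng(0)
p_accept = rng.uniform(size=(4000, 2, 2))

# Distribution of applications across departments in the target population
dept_weights = np.array([0.6, 0.4])

# Marginalize over departments using the application weights
p_gender = (p_accept * dept_weights[None, None, :]).sum(axis=-1)   # (n_samples, 2)

# Average direct effect: gender contrast, averaged over departments
contrast = p_gender[:, 0] - p_gender[:, 1]
print(contrast.mean(), np.quantile(contrast, [0.03, 0.97]))
```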

To verify the averaging process, we can look at the contrast of the p_accept samples from the posterior, which gives similar results. However, looking at the posterior alone wouldn't work for making predictions for an out-of-sample university.

Discrimination?

Hard to say

  • Big structural effects
  • Distribution of applicants could be due to discrimination
  • Confounds are likely

BONUS: Survival Analysis

  • Counts are often better modeled as time-to-event (e.g. with an Exponential or Gamma distribution) -- the goal is to estimate the event rate
  • Tricky b.c. ignoring censored data can lead to inferential errors
    • Left censored: we don't know when the timer started
    • Right censored: the observation period finished before the event had time to happen

Example: Austin Cat Adoption

  • 20k cats
  • Events: adopted (1) or not (0)
  • Censoring mechanisms
    • death before adoption
    • escape
    • not adopted yet

Goal: determine if Black cats are adopted at a lower rate than non-Black cats.

Modeling outcome variable: days_to_event

Two go-to distributions for modeling time-to-event

Exponential Distribution:

  • simpler of the two (single parameter)
  • assumes constant rate
  • maximum entropy distribution amongst the set of non-negative continuous distributions that have the same average rate.
  • assumes a single thing must occur (e.g. a part failure) before the event occurs (e.g. machine breakdown)

Gamma Distribution

  • more complex of the two (two parameters)
  • maximum entropy distribution amongst the set of distributions with the same mean and average log value.
  • assumes multiple things must happen (e.g. several part failures) before the event occurs (e.g. machine breakdown)

Censored and un-censored observations

  • Observed data use the Cumulative distribution; i.e. the probability that the event occurred by time x
  • Unobserved (censored) data instead require the complement of the CDF (the CCDF); i.e. the probability that the event hasn't happened yet.

Statistical Model

$$
\begin{align*}
D_i | A_i = 1 &\sim \text{Exponential}(\lambda_i) \\
D_i | A_i = 0 &\sim \text{ExponentialCCDF}(\lambda_i) \\
\lambda_i &= \frac{1}{\mu_i} \\
\log \mu_i &= \alpha_{\text{cat color}[i]} \\
\alpha_{\text{Black, Other}} &\sim \text{Exponential}(\gamma)
\end{align*}
$$
  • $D_i | A_i = 1$ - observed adoptions
  • $D_i | A_i = 0$ - not-yet-observed (censored) adoptions
  • $\alpha_{\text{cat color}[i]}$ -- log average time-to-adoption for each cat color
  • $\log \mu_i$ -- the log link ensures $\mu_i$ is positive
  • $\lambda_i = \frac{1}{\mu_i}$ simplifies estimating average time-to-adoption
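A minimal PyMC sketch of one way to express this censored-Exponential likelihood, with hypothetical `days`/`adopted`/`color` arrays; for simplicity it treats $\alpha$ directly as the average days-to-adoption with an Exponential prior whose mean is ~50 days (chosen below), so the exact parameterization may differ from the notebook's:

```python
import numpy as np
import pymc as pm

# Hypothetical data: days observed, adoption indicator, and cat-color index
days = np.array([12.0, 30.0, 200.0, 45.0, 120.0, 8.0])
adopted = np.array([1, 1, 0, 1, 0, 1])      # 0 = censored (not yet adopted)
color = np.array([0, 1, 0, 1, 1, 0])        # 0 = black, 1 = other
obs, cens = adopted == 1, adopted == 0

with pm.Model() as adoption_model:
    # Average days-to-adoption per color, Exponential prior with mean ~50 days
    alpha = pm.Exponential("alpha", lam=1 / 50, shape=2)

    # Observed adoptions: ordinary Exponential likelihood with rate 1/mu
    lam_obs = 1 / alpha[color[obs]]
    pm.Exponential("D_obs", lam=lam_obs, observed=days[obs])

    # Censored cats: add the Exponential log-CCDF, log(1 - F(t)) = -lambda * t
    lam_cens = 1 / alpha[color[cens]]
    pm.Potential("D_cens", (-lam_cens * days[cens]).sum())

    adoption_idata = pm.sample()
```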

Finding a reasonable hyperparameter for $\alpha$

We'll need to determine a reasonable value for the Exponential prior mean parameter $\gamma$. To do so, we'll look at the empirical distribution of time to adoption:

Using the above empirical histogram, we see that a majority of the probability mass is between zero and 200 days, so let's use 50 as the expected wait time.

Poor black kitties 🐈‍⬛

It appears that black cats DO take longer to get adopted.

Posterior Summary

Posterior distributions

Authors

  • Ported to PyMC by Dustin Stansbury (2024)
  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

:::{include} ../page_footer.md :::