Solving Scientific Problems
When solving problems on the edge of knowledge
- there's generally no algorithm available that tells you how to solve them
- best we can do is see lots of examples
- derive a set of general heuristics for how to attack problems
- try solving a more accessible problem or subproblem
- e.g. try building a simpler model first, then add complexity
- ALWAYS CHECK YOUR WORK -- e.g. via simulation
You Don't Always Get What You Want
- May not be able to estimate the desired estimand
- You'll likely have to compromise
- e.g. estimating the total effect vs the direct effect
- sensitivity analysis
- estimate counterfactuals
Ethics & Trolley Problem Studies
Principles studied
Researchers have tried to catalog trolley problem scenarios along multiple feature dimensions; three common features are:
- Action: taking an action is considered less morally permissible than not taking one (intervening in a scene is considered worse than letting the scene play out)
- Intention: the actor's direct goal affects the scenario's permissibility (e.g. intentionally killing one person to save five)
- Contact: actions are considered worse if the actor comes into direct physical contact with the object of the action (e.g. directly pushing a person off a bridge to save others)
Dataset
- 9330 total responses
- 331 individuals
- 30 different trolley problem scenarios
- vary along action, intention, contact dimensions
- responses are ordered integer values ranging from 1 to 7 indicating the "appropriateness" of action
- not counts
- ordered, but not continuous
- bounded
- Estimand
How do action, intention, and contact influence responses?
Ordered Categories
- Discrete categories
- Categories have an order, so $7 > 6 > 5$, etc., but a response of 7 is not necessarily 7x a response of 1
- The distance between adjacent values is not constant, and is unclear
- Anchor points (e.g. 4 here is "meh")
- Different people have different anchor points
Ordered = Cumulative
- rather than modeling $p(x=5)$, model $p(x \le 5)$
Cumulative Distribution Function
Ordering responses with CDF
Using the CDF, we can establish cut points $\alpha_k$ on the cumulative log odds that correspond to the cumulative probability of that response (or a smaller one). The CDF thus gives us a proxy for order.
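As a quick sketch of how cut points map to response probabilities (the cut-point values below are made up for illustration, not estimates from the data):

```python
import numpy as np
from scipy.special import expit  # inverse logit

# Hypothetical cut points on the cumulative log-odds scale for a
# 7-category response; only 6 are needed, since P(R <= 7) = 1.
alpha = np.array([-2.0, -1.0, -0.25, 0.5, 1.25, 2.0])

# Cumulative probabilities P(R <= k) for k = 1..7
cum_probs = np.append(expit(alpha), 1.0)

# Category probabilities via successive differences:
# P(R = k) = P(R <= k) - P(R <= k - 1)
probs = np.diff(cum_probs, prepend=0.0)

print(probs.round(3), probs.sum())
```

Because the differences telescope, the category probabilities automatically sum to 1.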
Calculating $P(R_i=k)$
$$
P(R_i = k) = P(R_i \le k) - P(R_i \le k - 1)
$$

For example, for $k = 3$:

$$
P(R_i = 3) = P(R_i \le 3) - P(R_i \le 2)
$$

Setting up the GLM
We model the cumulative log odds at each cut point $\alpha_k$:

$$
\log \frac{P(R_i \le k)}{1 - P(R_i \le k)} = \alpha_k
$$

Where's the GLM?
How can we make this a function of predictors?
- Have an $\alpha_k$ for each predictor variable
- Use an offset $\phi_i$ for each data point that is a function of the predictors
\begin{align*}
\log \frac{P(R_i \le k)}{1 - P(R_i \le k)} &= \alpha_k + \phi_i \\
\phi_i &= \beta_A A_i + \beta_C C_i + \beta_I I_i
\end{align*}

Demonstrating the Effect of $\phi$ on Response Distribution
Changing $\phi$ "squishes" or "stretches" the cumulative histogram
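A minimal sketch of this effect, assuming hypothetical cut points and the ordered-logit convention $P(R \le k) = \mathrm{logistic}(\alpha_k - \phi)$:

```python
import numpy as np
from scipy.special import expit

# Hypothetical cut points for a 7-category response
alpha = np.array([-2.0, -1.0, -0.25, 0.5, 1.25, 2.0])

def ordered_logit_probs(phi, alpha):
    """P(R = k) under an ordered logit: P(R <= k) = logistic(alpha_k - phi)."""
    cum = np.append(expit(alpha - phi), 1.0)
    return np.diff(cum, prepend=0.0)

ks = np.arange(1, 8)
for phi in (-1.0, 0.0, 1.0):
    p = ordered_logit_probs(phi, alpha)
    print(f"phi={phi:+.1f}  E[R]={ks @ p:.2f}  probs={p.round(2)}")
```

Increasing $\phi$ shifts probability mass toward higher categories (the histogram "stretches" right); decreasing it "squishes" mass toward lower categories.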
Statistical Model
Starting off easy
\begin{align*}
R_i &\sim OrderedLogit(\phi_i, \alpha) \\
\phi_i &= \beta_A A_i + \beta_C C_i + \beta_I I_i \\
\alpha_j &\sim \mathcal N(0,1) \\
\beta_{A,C,I} &\sim \mathcal N(0, .5)
\end{align*}

Posterior Predictive Distributions
What about competing causes?
\begin{align*}
R_i &\sim OrderedLogit(\phi_i, \alpha) \\
\phi_i &= \beta_{A, G[i]} A_i + \beta_{C, G[i]} C_i + \beta_{I, G[i]} I_i \\
\alpha_j &\sim \mathcal N(0,1) \\
\beta_* &\sim \mathcal N(0, .5)
\end{align*}
Fit the gender-stratified model
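Before fitting, we can check the generative logic by forward-simulating from the gender-stratified model. This is a sketch, not the PyMC fit: the cut points and per-gender coefficients below are made-up values, and the two strata differ only in their betas.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)

# Hypothetical cut points and per-gender (beta_A, beta_C, beta_I) values
alpha = np.array([-2.0, -1.0, -0.25, 0.5, 1.25, 2.0])
strata = {"women": (-0.5, -1.0, -0.3), "men": (-0.2, -0.4, -0.1)}

def simulate_responses(n, alpha, beta_A, beta_C, beta_I):
    """Forward-simulate ordered-logit responses for one stratum."""
    A, C, I = (rng.integers(0, 2, size=n) for _ in range(3))
    phi = beta_A * A + beta_C * C + beta_I * I
    # Cumulative probabilities P(R <= k) = logistic(alpha_k - phi)
    cum = np.column_stack([expit(a - phi) for a in alpha] + [np.ones(n)])
    probs = np.diff(cum, prepend=0.0, axis=1)
    return np.array([rng.choice(7, p=p) for p in probs]) + 1

results = {}
for g, betas in strata.items():
    results[g] = simulate_responses(5000, alpha, *betas)
    print(g, results[g].mean().round(2))
```

Recovering the betas chosen here from the simulated data is a useful sanity check on the fitted model ("always check your work via simulation").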
Hang on! This is a voluntary sample.
Voluntary Samples, Participation, and Endogenous Selection
- Age, Education, and Gender all contribute to an unmeasured variable Participation
- Participation is a collider: conditioning on it causes $E$, $Y$, and $G$ to covary
- Not actually possible to estimate the Total Effect of Gender
- We CAN estimate the Direct Effect of Gender $G$ by stratifying by Education $E$ and Age $Y$
Looking at the distribution of Education and Age in the sample
The observation that the sample's distributions of Education and Age are not aligned with the population's provides evidence that these variables are likely associated with participation, and thus potential sources of selection bias.
Ordered Monotonic Predictors
Similar to the Response outcome, Education is also an Ordered category.
- unlikely that each level has the same effect on participation/response
- we would like a parameter for each level, while enforcing ordering so that each successive level has a larger magnitude effect than the previous.
For each level of education:
- (Elementary School) $\rightarrow \phi_i = 0$
- (Middle School) $\rightarrow \phi_i = \beta_E \delta_1$
- (Some High School) $\rightarrow \phi_i = \beta_E (\delta_1 + \delta_2)$
- (High School Graduate) $\rightarrow \phi_i = \beta_E (\delta_1 + \delta_2 + \delta_3)$
- (Some College) $\rightarrow \phi_i = \beta_E (\delta_1 + \delta_2 + \delta_3 + \delta_4)$
- (College Graduate) $\rightarrow \phi_i = \beta_E (\delta_1 + \delta_2 + \delta_3 + \delta_4 + \delta_5)$
- (Master's Degree) $\rightarrow \phi_i = \beta_E (\delta_1 + \delta_2 + \delta_3 + \delta_4 + \delta_5 + \delta_6)$
- (Graduate Degree) $\rightarrow \phi_i = \beta_E (\delta_1 + \delta_2 + \delta_3 + \delta_4 + \delta_5 + \delta_6 + \delta_7) = \beta_E$
where $\beta_E$ is the maximum effect of education. We thus break down the maximum effect into a convex combination of education terms.
\begin{align*}
\delta_0 &= 0 \\
\phi_i &= \beta_E \sum_{j=0}^{E_i - 1} \delta_j \\
\sum_{j=1}^{7} \delta_j &= 1
\end{align*}

Ordered Monotonic Priors
- The $\delta$ parameters form a simplex -- a vector of proportions that sums to 1
- The simplex parameter space is modeled by a Dirichlet distribution
- Dirichlet gives us a distribution over distributions
Demonstrating the parameterization of the Dirichlet distribution
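A small sketch of this parameterization (the concentration value and $\beta_E$ below are illustrative choices, not estimates): drawing $\delta$ vectors from a Dirichlet and taking cumulative sums yields monotonically increasing education effects.

```python
import numpy as np

rng = np.random.default_rng(2)

# Dirichlet prior over the 7 education increments; the uniform
# concentration a = 2 is a weakly informative, illustrative choice.
deltas = rng.dirichlet(np.full(7, 2.0), size=4)  # 4 prior draws of the simplex

beta_E = 0.7  # hypothetical maximum effect of education
# Prepending 0 for the baseline level, the cumulative sums give a
# monotonically increasing effect across the 8 education levels.
phi_E = beta_E * np.column_stack([np.zeros(4), np.cumsum(deltas, axis=1)])
print(phi_E.round(2))
```

Each row is one prior draw: the effect rises monotonically with education level and reaches exactly $\beta_E$ at the highest level, because the $\delta$s sum to 1.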
Assessing the Direct Effect of Education: Stratifying by Gender & Age
McElreath builds a model for the Total Effect of education in the lecture, but then points out that, due to the backdoor path through gender (via participation), we can't interpret the estimate as the total causal effect of education. We thus also need to stratify by gender.
Complex Causal Effects
A few lessons here: complex causal graphs seem like a lot of work, but they allow us to
- map out an explicit generative model
- map out an explicit estimand for a target causal effect -- we need to identify the correct adjustment set
- generate simulations of nonlinear causal relationships and counterfactuals -- DON'T DIRECTLY INTERPRET PARAMS, GENERATE PREDICTIONS
Repeated Observations
Note that some dimensions have repeated observations -- e.g. the story ID and the responder ID. We can leverage these repeated observations to estimate unobserved phenomena like individual response bias (similar to wine-judge discrimination levels) or story bias.
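For instance, one hedged sketch of such an extension (the index variables $ID[i]$ for the responder and $S[i]$ for the story are hypothetical, not part of the model fit above) adds partially pooled offsets to the linear model:

\begin{align*}
\phi_i = \beta_A A_i + \beta_C C_i + \beta_I I_i + \alpha_{ID[i]} + \gamma_{S[i]}
\end{align*}

where $\alpha_{ID[i]}$ captures a responder's overall tendency toward high or low ratings and $\gamma_{S[i]}$ captures a story's overall perceived appropriateness.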
BONUS: Description and Causal Inference
Mostly a review of previous studies, not much in terms of technical notes. The main point being that description (and prediction), which are generally considered orthogonal to causal modeling, actually involve a causal model when performed correctly.
Things to look out for
- Quality data > bigger data
- Bigger, biased data magnifies bias
- Better models > averages (Xbox polling example). Better (causal) models can address unrepresentative samples
- Post-stratification
- Still affects descriptive models
- NO CAUSES IN; NO DESCRIPTIONS OUT
- Selection nodes
- can be incorporated into causal models to capture non-uniform participation
- The right action depends on the causes of selection
- Always think carefully about potentially unmodeled selection bias
4-step plan for honest digital scholarship
- Establish what we're trying to describe
- What is the ideal data for this description?
- What data do we actually have? This is almost never (2).
- What are the causes of the differences between (2) and (3)?
- (Optional) Can we use the data we actually have (3) and the model of what caused that data (4) to estimate what we're trying to describe (1)?
Authors
- Ported to PyMC by Dustin Stansbury (2024)
- Based on Statistical Rethinking (2023) lectures by Richard McElreath
:::{include} ../page_footer.md
:::