(GLM-ordinal-features.ipynb)=

GLM-ordinal-features

:::{post} Oct 27, 2024
:tags: ordinal features, ordinal regression, generalized linear model, bayesian workflow, r-datasets
:category: intermediate, reference
:author: Jonathan Sedar
:::

Ordinal Exogenous Feature: Worked Example with Bayesian Workflow

Here we use an ordinal exogenous predictor feature within a model:

... this is in contrast to estimating an ordinal endogenous target feature, which we show in another notebook

Disclaimer:

  • This Notebook is a worked example only; it's not intended to be an academic reference
  • The theory and math may be incorrect, incorrectly notated, or incorrectly used
  • The code may contain errors, inefficiencies, legacy methods, and the text may have typos
  • Use at your own risk!

Contents



Discussion

Problem Statement

Human action and economics are all about expressing our ordinal preferences between limited options in the real world.

We often encounter real-world situations and datasets where a predictor feature is an ordinal category recording a preference or summarizing a metric value; such features are particularly common in insurance and health. For example, the values might be recorded:

  • As a totally subjective opinion on a scale that can differ between observations - these are difficult to work with and a symptom of poor data design
  • On a subjective but standardized scale (e.g. ["strongly disagree", "disagree", "agree", "strongly agree"]) - this is the approach of the familiar Likert scale
  • As a summary binning of a real objective value on a metric scale (e.g. binning ages into age-groups ["<30", "30 to 60", "60+"]), or a subjective value that's been mapped to a metric scale (e.g. medical health self-scoring ["0-10%", ..., "90-100%"]) - these are typically a misuse of the metric because the data has been compressed (losing information), and the reason for the binning and the choices of bin-edges are usually not known

In all these cases the critical issue is that the categorical values and their ordinal ranks don't necessarily relate linearly to the target variable. For example, in the 4-value Likert scale above, the relative effect of "strongly agree" (rank 4) is probably not the effect of "agree" (rank 3) plus 1: 3 + 1 != 4.

Another way to put it is that the metric distance between ordinal categories is not known and can be unequal. For example, in that same 4-value Likert scale, the difference between "disagree" vs "agree" is probably not the same as between "agree" vs "strongly agree".

These properties can unfortunately encourage modellers to incorporate ordinal features as either a plain categorical (where each level gets an independent, unordered coefficient - so we lose the ordering / rank information), or as a plain numeric coefficient (which assumes equal spacing and a linear response). Both are poor choices and have subtly negative effects on the model performance.

A final nuance is that we might not see all valid categorical ordinal levels in the training dataset. For example we might know a range is measured ["c0", "c1", "c2", "c3"] but only see ["c0", "c1", "c3"]. This is a missing data problem which could further encourage the misuse of a numeric coefficient to average or "interpolate" a value. What we should do instead is incorporate our knowledge of the data domain into the model structure to auto-impute a coefficient value. This is actually the case in this dataset (see Section 0)!

Data & Models Demonstrated

Our problem statement is that when faced with such ordinal features we want to:

  1. Infer a series of prior allocators that transform the ordinal categories into a linear (polynomial) scale
  2. Predict the endogenous feature as usual, having captured the information from the ordinals

This notebook takes the opportunity to:

  • Demonstrate a general method using a constrained Dirichlet prior, based on {cite:p}burkner2018 and demonstrated in a pymc-specific example by Austin Rochford {cite:p}rochford2018
  • Improve upon both those methods by structurally correcting for a missing level value in an ordinal feature
  • Demonstrate a reasonably complete Bayesian workflow {cite:p}gelman2020bayesian including data curation and grabbing data from an RDataset

This notebook is a partner to another notebook (TBD) where we estimate an ordinal endogenous target feature.



Setup

:::{include} ../extra_installs.md
:::



0. Curate Dataset

We use the same health dataset as in {cite:p}burkner2018, named ICFCoreSetCWP.RData, available in the R package ordPens.

Per the Bürkner paper (Section 4: Case Study), this dataset is from a study** of Chronic Widespread Pain (CWP), wherein 420 patients were self-surveyed on 67 health features (each a subjective ordinal category) and also assigned a differently generated (and also subjective) measure of physical health. In the dataset these 67 features are named e.g. b1602, b102, ... d450, d455, ... s770 etc, and the target feature is named phcs.

Per the Bürkner paper we will subselect 2 features d450, d455 (which measure impairment of the patient's walking ability on a scale [0 to 4], from "no problem" to "complete problem") and use them to predict phcs.

Quite interestingly, for feature d450, the highest ordinal level value 4 is not seen in the dataset, so we have a missing data problem which could further encourage the misuse of a numeric coefficient to average or "interpolate" a value. What we should do is incorporate our knowledge of the data domain into the model structure to auto-impute a coefficient value. This means that our model can make predictions on new data where a d450=4 value might be seen.

** Just for completeness (but not needed for this notebook), that study is reported in: Gertheiss, J., Hogger, S., Oberhauser, C., & Tutz, G. (2011). Selection of ordinally scaled independent variables with applications to international classification of functioning core sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3), 377–395.

NOTE some boilerplate steps are included but struck through, with an explanatory comment e.g. "Not needed in this simple example". This is to preserve the logical workflow, which is more generally useful

0.1 Extract

Annoyingly, but not surprisingly for an R project, despite being a small, simple table, the dataset is only available in an obscure R binary format, and tarred, so we'll download, unpack, and store it locally as a normal CSV file. This uses the rather helpful pyreadr package.
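A minimal sketch of that extract step, assuming the CRAN tarball URL, package version, and archive member path (check these against the current ordPens release):

```python
import tarfile
import urllib.request

import pyreadr

# Assumed URL / version and member path - verify against CRAN
url = "https://cran.r-project.org/src/contrib/ordPens_1.1.0.tar.gz"
urllib.request.urlretrieve(url, "ordPens.tar.gz")
with tarfile.open("ordPens.tar.gz") as tf:
    tf.extract("ordPens/data/ICFCoreSetCWP.RData", path=".")

# read_r returns a dict keyed by the R object name(s) stored in the file
rdata = pyreadr.read_r("ordPens/data/ICFCoreSetCWP.RData")
df_raw = rdata["ICFCoreSetCWP"]
df_raw.to_csv("icf_core_set_cwp.csv", index=False)
```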

Observe:

  • Looks okay - if this was a proper project we'd want to know what those cryptic column headings actually mean
  • For this purpose we'll only use a couple of the features [d450, d455] and will press ahead

0.2 Clean

0.2.1 Clean Observations

Not needed in this simple example

0.2.2 Clean Features

0.2.2.1 Rename Features

Nothing really needed, will rename the index when we force dtype and set index

0.2.2.2 Correct Features

Not needed in this simple example

0.2.2.3 Force Datatypes

Force d450 to string representation and ordered categorical dtype (supplied as an int which is unhelpful)

NOTE force the categorical levels to include c4 which is valid in the data domain but unobserved

Force d455 to string representation and ordered categorical dtype (supplied as an int which is unhelpful)
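A sketch of the dtype forcing for both features, assuming the raw integer codes from Section 0.1; note LVLS_D450_D455 spans the full data domain, including the unobserved c4:

```python
import pandas as pd

LVLS_D450_D455 = ["c0", "c1", "c2", "c3", "c4"]
cat_type = pd.CategoricalDtype(categories=LVLS_D450_D455, ordered=True)
for ft in ["d450", "d455"]:
    # int -> string label -> ordered categorical over the full domain
    df[ft] = ("c" + df[ft].astype(int).astype(str)).astype(cat_type)
```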

0.2.2.4 Create and set indexes

0.3 Very limited quick EDA

0.3.1 Univariate target phcs

Observe:

  • phcs is a subjective scored measure of physical health, see {cite:p}burkner2018 for details
  • Seems well-behaved, unimodal, smooth

0.3.2 Target phcs vs ['d450', 'd455']

Observe:

phcs vs d450:

  • c0 wider and higher: possibly a catch-all category, since it is also heavily observed
  • c3 fewer counts
  • c4 is not observed: it's missing from the data despite being valid in the data domain

phcs vs d455:

  • c1 and c2 look very similar
  • c4 fewer counts

0.4 Transform dataset to dfx for model input

IMPORTANT NOTE

  • Reminder that Bayesian inferential methods do not need a test dataset (nor k-fold cross validation) to fit parameters. We also do not need a holdout (out-of-sample) dataset to evaluate model performance, because we can use in-sample PPC, LOO-PIT and ELPD evaluations
  • So we use the entire dataset df as our model input
  • Depending on the real-world model implementation we might:
    • Separate out a holdout set (even though we don't need it) to eyeball the predictive outputs - but here we have a summarized dataset, so this is neither possible nor suitable
    • Create a forecast set to demonstrate how we would use the model and its predictive outputs in Production.

NOTE:

  • This is an abbreviated / simplified transformation process which still allows for the potential to add more features in future

Map ordinal categorical to an ordinal numeric (int) based on its preexisting categorical order

Transform (zscore and scale) numerics
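A sketch of these two steps, assuming column names that match their later use (*_idx for the ordinal indexing in Model B, *_num for the zscored numeric treatment in Model A):

```python
import pandas as pd
from scipy import stats

dfx = pd.DataFrame(index=df.index)
dfx["phcs"] = df["phcs"].values  # target kept in natural units
for ft in ["d450", "d455"]:
    dfx[f"{ft}_idx"] = df[ft].cat.codes.astype(int)    # ordinal index, order preserved
    dfx[f"{ft}_num"] = stats.zscore(dfx[f"{ft}_idx"])  # zscored numeric copy
```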

0.5 Create forecast set and convert to dffx for model input

NOTE: We depart from the datasets used in {cite:p}rochford2018 and {cite:p}burkner2018 to make sure our forecast dataset contains all valid levels of d450 and d455. Specifically, the observed dataset does not contain the domain-valid d450 = 4, so we will make sure that our forecast set does include it.

Convert to dffx for model input
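A sketch of building that forecast grid and its model-input copy dffx (strictly the zscoring should reuse the training-set mean / std; simplified here):

```python
from itertools import product

import pandas as pd
from scipy import stats

# full cartesian grid over every domain-valid level, incl. d450 = c4
dff = pd.DataFrame(product(LVLS_D450_D455, LVLS_D450_D455), columns=["d450", "d455"])
dff = dff.astype(cat_type)

dffx = pd.DataFrame(index=dff.index)
for ft in ["d450", "d455"]:
    dffx[f"{ft}_idx"] = dff[ft].cat.codes.astype(int)
    dffx[f"{ft}_num"] = stats.zscore(dffx[f"{ft}_idx"])  # simplification, see note above
```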



1. Model A: The Wrong Way - Simple Linear Coefficients

This is a simple linear model where we acknowledge that the categorical features are ordered, but ignore the possibility that the factor-value effects are unequally spaced, and instead just assume equal spacing:

$$
\begin{aligned}
\sigma_{\beta} &\sim \text{InverseGamma}(11, 10) \\
\beta &\sim \text{Normal}(0, \sigma_{\beta}, \text{shape}=j) \\
\\
\text{lm} &= \beta^{T}\mathbb{x}_{i,j} \\
\epsilon &\sim \text{InverseGamma}(11, 10) \\
\hat{y_{i}} &\sim \text{Normal}(\mu=\text{lm}, \epsilon)
\end{aligned}
$$

where:

  • Observations $i$ contain numeric features $j$, and $\hat{y_{i}}$ is our estimate, here of phcs
  • The linear sub-model $\beta^{T}\mathbb{x}_{ij}$ lets us regress onto those features
  • Notably:
    • $\mathbb{x}_{i,d450}$ is treated as a numeric feature
    • $\mathbb{x}_{i,d455}$ is treated as a numeric feature

1.1 Build Model Object

Verify the built model structure matches our intent, and validate the parameterization
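A minimal pymc sketch of the spec above; variable and Data-container names are assumptions, and the notebook's real cell may differ:

```python
import pymc as pm

COORDS = {"oid": dfx.index.values, "pred": ["intercept", "d450_num", "d455_num"]}
with pm.Model(coords=COORDS) as mdla:
    # data containers (intercept added as a constant column)
    x = pm.Data("x", dfx.assign(intercept=1.0)[COORDS["pred"]].values, dims=("oid", "pred"))
    y = pm.Data("y", dfx["phcs"].values, dims="oid")

    # priors, per the spec
    b_sigma = pm.InverseGamma("beta_sigma", alpha=11, beta=10)
    b = pm.Normal("beta", mu=0.0, sigma=b_sigma, dims="pred")
    epsilon = pm.InverseGamma("epsilon", alpha=11, beta=10)

    # linear submodel and normal likelihood
    _ = pm.Normal("phcs_hat", mu=pm.math.dot(x, b), sigma=epsilon, observed=y, dims="oid")

pm.model_to_graphviz(mdla)  # visual check of the structure
```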

1.2 Sample Prior Predictive, View Diagnostics

1.2.1 In-Sample Prior PPC (Retrodictive Check)
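Sampling the prior predictive and plotting the retrodictive check might look like this sketch:

```python
import arviz as az

with mdla:
    ida = pm.sample_prior_predictive(draws=500, random_seed=42)

az.plot_ppc(ida, group="prior")  # prior PPC vs observed phcs
```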

Observe:

  • Values are wrong as expected, but range is reasonable

1.2.2 Quick look at selected priors

Observe:

  • beta_sigma, beta: (levels), epsilon all have reasonable prior ranges as specified

1.3 Sample Posterior, View Diagnostics

1.3.1 Sample Posterior and PPC
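A sketch, storing the log-likelihood for the LOO-PIT and ELPD evaluations later:

```python
with mdla:
    ida.extend(pm.sample(random_seed=42, idata_kwargs={"log_likelihood": True}))
    ida.extend(pm.sample_posterior_predictive(ida))
```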

1.3.2 Traces
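The usual arviz diagnostics for this step, as a sketch:

```python
import arviz as az

az.summary(ida, var_names=["beta_sigma", "beta", "epsilon"])     # ess_bulk, r_hat
az.plot_trace(ida, var_names=["beta_sigma", "beta", "epsilon"])
az.plot_energy(ida)  # marginal energy | energy transition
```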

Observe:

  • Samples well-mixed and well-behaved
    • ess_bulk is good, r_hat is good
  • Marginal energy | energy transition looks reasonable

1.3.3 In-Sample Posterior PPC (Retrodictive Check)

Observe:

  • In-sample PPC phcs_hat tracks the observed phcs moderately well: slightly overdispersed, perhaps a likelihood with fatter tails would be more appropriate (e.g. StudentT)

1.3.4 In-Sample PPC LOO-PIT
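A sketch of the LOO-PIT plots (requires the posterior predictive and log-likelihood groups in ida):

```python
import arviz as az

az.plot_loo_pit(ida, y="phcs_hat")
az.plot_loo_pit(ida, y="phcs_hat", ecdf=True)
```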

Observe:

  • LOO-PIT looks good, again slightly overdispersed but acceptable for use

1.3.5 Compare Log-Likelihood vs Other Models

1.4 Evaluate Posterior Parameters

1.4.1 Univariate

Lots of parameters, let's take our time

Observe:

  • beta_sigma: E ~ 10 indicates need for high variance in locations of betas
  • beta: intercept: E ~ 32 confirms the bulk of the variance in betas locations is simply due to the intercept offset required to get the zscored values into range of phcs, no problem
  • beta: d450_num: E ~ -3 negative, HDI94 does not span 0, substantial effect, smooth central distribution:
    • Higher values of d450_num create a reduction in phcs_hat
  • beta: d455_num: E ~ -2 negative, HDI94 does not span 0, substantial effect, smooth central distribution
    • Higher values of d455_num create a smaller reduction in phcs_hat
  • epsilon: E ~ 7 indicates quite a lot of variance still in the data, not yet handled by a modelled feature

1.5 Create PPC Forecast on simplified forecast set

Just for completeness, compare to Figure 3 in the Bürkner paper and Rochford's blogpost. Those plots summarize to a mean though, which seems unnecessary - let's improve it a little with full sample posteriors

1.5.1 Replace dataset with dffx and rebuild
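A sketch of the swap-and-resample, reusing the Data container names assumed in Section 1.1:

```python
import numpy as np

with mdla:
    pm.set_data(
        {
            "x": dffx.assign(intercept=1.0)[COORDS["pred"]].values,
            "y": np.zeros(len(dffx)),  # dummy target, ignored for predictions
        },
        coords={"oid": dffx.index.values},
    )
    ida_ppc = pm.sample_posterior_predictive(ida, var_names=["phcs_hat"], predictions=True)
```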

1.5.2 View Predictions

Observe:

  • Compare this to the final plots in {cite:p}rochford2018 and Figure 12 in {cite:p}burkner2018
  • We see the linear responses and equal spacings of d450 and of d455 when treated as numeric values
  • We also see that mdla technically can make predictions for d450=c4 which is not seen in the data. However, this prediction is a purely linear extrapolation and although helpful in a sense, could be completely and misleadingly wrong in this model specification
  • Note here we plot the full posteriors on each datapoint (rather than summarise to a mean) which emphasises the large amount of variance still in the data & model


2. Model B: A Better Way - Dirichlet Hyperprior Allocator

This is an improved linear model where we acknowledge that the categorical features are ordinal and allow the ordinal values to have non-equal spacing. For example, it might well be that A > B > C, but the spacing is not metric: instead A >>> B > C. We achieve this using a Dirichlet hyperprior to allocate heterogeneously spaced sections of a linear coefficient:

$$
\begin{aligned}
\sigma_{\beta} &\sim \text{InverseGamma}(11, 10) \\
\beta &\sim \text{Normal}(0, \sigma_{\beta}, \text{shape}=j) \\
\\
\beta_{d450} &\sim \text{Normal}(0, \sigma_{\beta}) \\
\chi_{d450} &\sim \text{Dirichlet}(1, \text{shape}=k_{d450}) \\
\nu_{d450} &= \beta_{d450} \sum_{k=0}^{k_{d450}}\chi_{d450} \\
\\
\beta_{d455} &\sim \text{Normal}(0, \sigma_{\beta}) \\
\chi_{d455} &\sim \text{Dirichlet}(1, \text{shape}=k_{d455}) \\
\nu_{d455} &= \beta_{d455} \sum_{k=0}^{k_{d455}}\chi_{d455} \\
\\
\text{lm} &= \beta^{T}\mathbb{x}_{i,j} + \nu_{d450}[x_{i,d450}] + \nu_{d455}[x_{i,d455}] \\
\epsilon &\sim \text{InverseGamma}(11, 10) \\
\hat{y_{i}} &\sim \text{Normal}(\mu=\text{lm}, \epsilon)
\end{aligned}
$$

where:

  • Observations $i$ contain numeric features $j$ and ordinal categorical features $k$ (here d450, d455), which each have factor value levels $k_{d450}, k_{d455}$; per Section 0 these are both in range [0 - 4] as recorded by notebook variable LVLS_D450_D455
  • $\hat{y_{i}}$ is our estimate, here of phcs
  • The linear sub-model $\text{lm} = \beta^{T}\mathbb{x}_{i,j} + \nu_{d450}[x_{i,d450}] + \nu_{d455}[x_{i,d455}]$ lets us regress onto those features
  • Notably:
    • $\mathbb{x}_{i,d450}$ is treated as an ordinal feature and used to index $\nu_{d450}[x_{i,d450}]$
    • $\mathbb{x}_{i,d455}$ is treated as an ordinal feature and used to index $\nu_{d455}[x_{i,d455}]$
  • NOTE: The above spec is not particularly optimised / vectorised / DRY to aid explanation

2.1 Build Model Object

Verify the built model structure matches our intent, and validate the parameterization
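A minimal pymc sketch of the spec above (names assumed, as before). Note COORDS['d450_nm'] is set to the full data domain LVLS_D450_D455, not just the observed categories; this is exactly what enables the auto-imputation of the unobserved c4:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

COORDS = {
    "oid": dfx.index.values,
    "d450_nm": LVLS_D450_D455,  # full domain, incl. unobserved c4
    "d455_nm": LVLS_D450_D455,
}
with pm.Model(coords=COORDS) as mdlb:
    d450_idx = pm.Data("d450_idx", dfx["d450_idx"].values, dims="oid")
    d455_idx = pm.Data("d455_idx", dfx["d455_idx"].values, dims="oid")
    y = pm.Data("y", dfx["phcs"].values, dims="oid")

    b_sigma = pm.InverseGamma("beta_sigma", alpha=11, beta=10)
    b0 = pm.Normal("beta_intercept", mu=0.0, sigma=b_sigma)

    # max-effect betas, allocated over the levels by cumulative Dirichlets
    b_d450 = pm.Normal("beta_d450", mu=0.0, sigma=b_sigma)
    chi_d450 = pm.Dirichlet("chi_d450", a=np.ones(len(LVLS_D450_D455)), dims="d450_nm")
    nu_d450 = pm.Deterministic("nu_d450", b_d450 * pt.cumsum(chi_d450), dims="d450_nm")

    b_d455 = pm.Normal("beta_d455", mu=0.0, sigma=b_sigma)
    chi_d455 = pm.Dirichlet("chi_d455", a=np.ones(len(LVLS_D450_D455)), dims="d455_nm")
    nu_d455 = pm.Deterministic("nu_d455", b_d455 * pt.cumsum(chi_d455), dims="d455_nm")

    # each observation picks up the nu for its observed level
    lm = b0 + nu_d450[d450_idx] + nu_d455[d455_idx]
    epsilon = pm.InverseGamma("epsilon", alpha=11, beta=10)
    _ = pm.Normal("phcs_hat", mu=lm, sigma=epsilon, observed=y, dims="oid")

pm.model_to_graphviz(mdlb)
```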

2.2 Sample Prior Predictive, View Diagnostics

2.2.1 In-Sample Prior PPC (Retrodictive Check)

Observe:

  • Values are wrong as expected, but range is reasonable

2.2.2 Quick look at selected priors

Observe:

  • Several new parameters!
  • beta_sigma, beta: (levels), epsilon: all have reasonable prior ranges as specified
  • *_d450:
    • chi_*: obey the simplex constraint of the Dirichlet and span the range
    • nu_*: all reasonable as specified, note the ordering already present in the prior
  • *_d455:
    • chi_*: obey the simplex constraint of the Dirichlet and span the range
    • nu_*: all reasonable as specified, note the ordering already present in the prior

2.3 Sample Posterior, View Diagnostics

2.3.1 Sample Posterior and PPC
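A sketch, assuming idb already holds the prior predictive from Section 2.2 and noting the raised target_accept discussed below:

```python
with mdlb:
    idb.extend(pm.sample(target_accept=0.9, random_seed=42, idata_kwargs={"log_likelihood": True}))
    idb.extend(pm.sample_posterior_predictive(idb))
```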

2.3.2 Traces

Observe:

  • Samples well-mixed and well-behaved, but note we raised target_accept=0.9 to mitigate / avoid divergences seen at 0.8
    • ess_bulk a little low, r_hat is okay
  • Marginal energy | energy transition looks reasonable

2.3.3 In-Sample Posterior PPC (Retrodictive Check)

Observe:

  • In-sample PPC phcs_hat tracks the observed phcs moderately well: slightly overdispersed, perhaps a likelihood with fatter tails would be more appropriate (e.g. StudentT)

2.3.4 In-Sample PPC LOO-PIT

Observe:

  • LOO-PIT looks good, again slightly overdispersed but acceptable for use

2.3.5 Compare Log-Likelihood vs Other Models
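A sketch of the comparison (assumes both traces carry log_likelihood groups):

```python
import arviz as az

df_comp = az.compare({"mdla": ida, "mdlb": idb}, ic="loo", method="stacking")
az.plot_compare(df_comp)
```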

Observe:

  • Our new ordinal-respecting mdlb appears to be the winner, taking nearly all the weight and a higher elpd_loo

2.4 Evaluate Posterior Parameters

2.4.1 Univariate

Lots of parameters, let's take our time

Observe:

  • beta_sigma: E ~ 12 indicates need for high variance in locations of betas
  • beta: intercept: E ~ 41 confirms the bulk of the variance in betas locations is simply due to the intercept offset required to get the zscored values into range of phcs, no problem
  • epsilon: E ~ 7 indicates quite a lot of variance still in the data, not yet handled by a modelled feature
  • beta: d450: E ~ -12 negative, HDI94 does not span 0, substantial effect, smooth central distribution:
    • Higher indexes of d450_idx create a reduction in phcs_hat
  • beta: d455: E ~ -7 negative, HDI94 does not span 0, substantial effect, smooth central distribution
    • Higher indexes of d455_idx create a reduction in phcs_hat

In general the bigger coefficient values here (vs mdla) suggest more discrimination between the values in the data and better performance

Observe:

Interesting pattern:

  • chi_d450: Non-linear response throughout the range
  • nu_d450: The non-linear effect beta * chi.csum() is clear, in particular c0 is far from the trend of c1, c2, c3

Note in particular that the posterior distribution of chi_d450 = c4 is almost exactly the same as its prior, because it has not been evidenced in the dataset. The simplex constraint of the Dirichlet has in turn affected the scaled values of chi_d450 = c0 to c3, the scale of beta_d450, and downstream, the auto-imputed value of nu_d450 = c4.

For comparison you can try the inferior alternative by setting COORDS['d450_nm'] = list(df['d450'].cat.categories) in the model spec in Section 2.1, re-running, and seeing what happens

Observe:

Interesting pattern:

  • chi_d455: Non-linear response throughout the range
  • nu_d455: The non-linear effect beta * chi.csum() is clear, in particular c2 is almost the same as c1

Let's see those levels in a forestplot to make this even clearer

Monotonic priors forestplot
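As a sketch:

```python
import arviz as az

az.plot_forest(idb, var_names=["nu_d450", "nu_d455"], combined=True)
```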

Observe:

Here we see the same patterns in more detail, in particular:

  • nu_d450:
    • c0 is an outlier with disproportionately less impact than c1, c2, c3
    • c4 has been auto-imputed and takes the prior value which has very wide variance around a linear extrapolation
  • nu_d455: c1, c2 overlap strongly and so have very similar impact to one another

2.5 Create PPC Forecast on simplified forecast set

Just for completeness, compare to Figure 3 in the Bürkner paper and Rochford's blogpost.

Those plots summarize to a mean though, which seems unnecessary - let's improve it a little.

2.5.1 Replace dataset with dffx, rebuild, and sample PPC

2.5.2 View Predictions

Observe:

  • Compare this to the final plots in {cite:p}rochford2018 and Figure 12 in {cite:p}burkner2018
  • We see the non-linear responses and non-equal spacings of d450 and of d455 when treated as ordinal categories
  • In particular, note the behaviours we already saw in the posterior plots
    • LHS plot d450: all points for c0 are higher than in the plot in Section 1.5.2 (also note the overlap of d455: c1, c2 levels in the shaded points)
    • RHS plot d455: all points for c1, c2 overlap strongly (also note d455 c0 outlying)
  • We also see that mdlb can make predictions for d450=c4 which is not seen in the data
  • Note here we plot the full posteriors on each datapoint (rather than summarise to a mean) which emphasises the large amount of variance still in the data & model


Errata

Authors

Reference

:::{bibliography}
:filter: docname in docnames
:::

Watermark

:::{include} ../page_footer.md
:::