GLM-model-selection.ipynb

https://github.com/pymc-devs/pymc-examples/blob/main/examples/generalized_linear_models/GLM-model-selection.ipynb

(GLM-model-selection)=

GLM: Model Selection

:::{post} Jan 8, 2022
:tags: cross validation, generalized linear model, loo, model comparison, waic
:category: intermediate
:author: Jon Sedar, Junpeng Lao, Abhipsha Das, Oriol Abril-Pla
:::

Introduction

A fairly minimal reproducible example of model selection using WAIC and LOO as currently implemented in PyMC.

This example creates two toy datasets under linear and quadratic models, and then tests the fit of a range of polynomial linear models upon those datasets by using Widely Applicable Information Criterion (WAIC), and leave-one-out (LOO) cross-validation using Pareto-smoothed importance sampling (PSIS).

The example was inspired by Jake Vanderplas' blogpost on model selection, although Cross-Validation and Bayes Factor comparison are not implemented. The datasets are tiny and generated within this Notebook. They contain errors in the measured value (y) only.

Local Functions

We start by writing some functions to help with the rest of the notebook. Only some of these functions are key to understanding the notebook; the rest are convenience functions that make plotting more concise and are hidden inside a toggle-able section. They are still available, but you need to click to see them.

Generate toy datasets

Interactively draft data

Throughout the rest of the Notebook, we'll use two toy datasets created by a linear and a quadratic model respectively, so that we can better evaluate the fit of the model selection.

Right now, let's use an interactive session to play around with the data generation function in this Notebook, and get a feel for the range of data we could generate.

$y_{i} = a + bx_{i} + cx_{i}^{2} + \epsilon_{i}$

where:
$i \in \{1, \dots, n\}$ indexes the datapoints

$\epsilon_{i} \sim \mathcal{N}(0, \text{latent\_sigma\_y})$

:::{admonition} Note on outliers

We can use the value p to set the (approximate) proportion of 'outliers' under a Bernoulli distribution.
These outliers have a 10x larger latent_sigma_y.
These outliers are labelled in the returned datasets and may be useful for other modelling; see another example Notebook: {ref}GLM-robust-with-outlier-detection
:::
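The generating process described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the notebook's own helper: the function name `generate_toy_data`, the default parameter values, and the x range of 0 to 30 are all assumptions made for the example.

```python
import random


def generate_toy_data(n=20, p=0.0, a=-30.0, b=5.0, c=0.0, latent_sigma_y=20.0, seed=0):
    """Draw n points from y = a + b*x + c*x**2 + eps (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 30.0)  # assumed x range, not taken from the notebook
        is_outlier = rng.random() < p  # Bernoulli(p) outlier flag
        # Outliers get a 10x larger noise scale, as described in the note above.
        sigma = latent_sigma_y * (10.0 if is_outlier else 1.0)
        eps = rng.gauss(0.0, sigma)
        rows.append(
            {
                "x": x,
                "y": a + b * x + c * x**2 + eps,
                "latent_error": sigma,
                "outlier": is_outlier,
            }
        )
    return rows


# A linear dataset (c = 0) and a quadratic one (c != 0), both without outliers.
toy_lin = generate_toy_data(n=12, a=-30.0, b=5.0, c=0.0, latent_sigma_y=20.0)
toy_quad = generate_toy_data(n=12, a=-200.0, b=2.0, c=3.0, latent_sigma_y=100.0)
```

Setting `c=0` gives a purely linear process, and `p=0` switches the outlier mechanism off entirely.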

Observe:

I've shown the latent_error in errorbars, but this is for interest only, since this shows the inherent noise in whatever 'physical process' we imagine created the data.
There is no measurement error.
Datapoints created as outliers are shown in red, again for interest only.

Create datasets for modelling

We can use the above interactive plot to get a feel for the effect of the params. Now we'll create 2 fixed datasets to use for the remainder of the Notebook.

First, we'll create a linear model with small noise, to keep it simple.
Second, we'll create a quadratic model with small noise.

Scatterplot against model line

Observe:

We now have two datasets df_lin and df_quad created by a linear model and quadratic model respectively.
You can see this raw data, the ideal model fit and the effect of the latent noise in the scatterplots above
In the following plots in this Notebook, the linear-generated data will be shown in Blue and the quadratic in Green.

Standardize

Create ranges for later ylim and xlim

Demonstrate simple linear model

This linear model is really simple and conventional, an OLS with L2 constraints (Ridge Regression):

$y = a + bx + \epsilon$
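One way to see why this counts as OLS with an L2 constraint is to write out the negative log posterior. This assumes independent priors $a, b \sim \mathcal{N}(0, \tau^{2})$ and Gaussian noise with a known scale $\sigma$. Both $\tau$ and $\sigma$ are symbols introduced here for illustration only.

$$
\begin{aligned}
-\log p(a, b \mid y) &= \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (y_{i} - a - b x_{i})^{2} + \frac{1}{2\tau^{2}} (a^{2} + b^{2}) + \text{const} \\
\hat{a}, \hat{b} &= \arg\min_{a, b} \; \sum_{i=1}^{n} (y_{i} - a - b x_{i})^{2} + \lambda (a^{2} + b^{2}), \qquad \lambda = \frac{\sigma^{2}}{\tau^{2}}
\end{aligned}
$$

Under these assumptions the posterior mode is the ridge estimate with penalty $\lambda$. A very wide prior (large $\tau$) sends $\lambda$ towards zero and recovers plain OLS.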

Define model using explicit PyMC method

Observe:

This simple OLS manages to make fairly good guesses on the model parameters - the data has been generated fairly simply after all - but it does appear to have been fooled slightly by the inherent noise.

Define model using Bambi

Bambi can be used for defining models using a formulae-style formula syntax. This seems really useful, especially for defining simple regression models in fewer lines of code.

Here's the same OLS model as above, defined using Bambi.

Observe:

This Bambi-defined model behaves in a very similar way and finds the same parameter values as the conventionally-defined model. Any differences are due to the random nature of the sampling.
We can quite happily use the bambi syntax for further models below, since it allows us to create a small model factory very easily.

Create higher-order linear models

Back to the real purpose of this Notebook, to demonstrate model selection.

First, let's create and run a set of polynomial models on each of our toy datasets. By default this is for models of order 1 to 5.

Create and run polynomial models

We're creating 5 polynomial models and fitting each to the chosen dataset using the functions create_poly_modelspec and run_models below.
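As a rough idea of what a formula-building helper for these models might look like, here is a sketch. The name `create_poly_modelspec` comes from the text, but the body below is an assumption rather than the notebook's actual implementation. The `np.power` terms assume that NumPy is importable as `np` wherever the formula is later evaluated.

```python
def create_poly_modelspec(k=1):
    """Return a formula string for a polynomial model of order k (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Orders 2..k are added as explicit power terms of x.
    higher_terms = [f"np.power(x, {j})" for j in range(2, k + 1)]
    return " + ".join(["y ~ 1", "x", *higher_terms])


# One formula per model, keyed k1..k5 as in the text.
modelspecs = {f"k{k}": create_poly_modelspec(k) for k in range(1, 6)}
for name, spec in modelspecs.items():
    print(name, "->", spec)
```

Keeping the model definition as a plain string is what makes a small model factory easy: fitting a family of models becomes a loop over `modelspecs`.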

View posterior predictive fit

Just for the linear-generated data, let's take an interactive look at the posterior predictive fit for the models k1 through k5.

As indicated by the likelihood plots above, the higher-order polynomial models exhibit some quite wild swings in the function in order to (over)fit the data.

Compare models using WAIC

The Widely Applicable Information Criterion (WAIC) can be used to calculate the goodness-of-fit of a model using numerical techniques. See {cite:t}watanabe2010asymptotic for details.

Observe:

We get three different measurements:

waic: widely applicable information criterion (or "Watanabe–Akaike information criterion")
waic_se: standard error of waic
p_waic: effective number of parameters

In this case we are interested in the WAIC score. We also plot error bars for the standard error of the estimated scores. This gives us a more accurate view of how much they might differ.
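To make these three quantities concrete, here is a minimal sketch of how they can be computed from a matrix of pointwise log-likelihood values. It assumes `log_lik[s][i]` holds the log-likelihood of observation `i` under posterior draw `s`, with at least two draws and two observations. The function names are hypothetical, and the sketch follows the sample-variance formulas in {cite:t}vehtari2017practical rather than any particular library's code. The score is reported on the log scale, so higher is better.

```python
import math
import statistics


def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic_from_loglik(log_lik):
    """WAIC on the log scale from log_lik[s][i] (draw s, observation i)."""
    n_obs = len(log_lik[0])
    # Regroup so that each column holds all draws for one observation.
    columns = [[draw[i] for draw in log_lik] for i in range(n_obs)]
    # Log pointwise predictive density: average the likelihood over draws.
    lppd_i = [log_mean_exp(col) for col in columns]
    # Effective number of parameters: variance of the log-likelihood over draws.
    p_waic_i = [statistics.variance(col) for col in columns]
    elpd_i = [lppd - p for lppd, p in zip(lppd_i, p_waic_i)]
    # Standard error of the sum of n_obs pointwise terms.
    waic_se = math.sqrt(n_obs * statistics.variance(elpd_i))
    return {
        "waic": sum(elpd_i),
        "waic_se": waic_se,
        "p_waic": sum(p_waic_i),
        "pointwise": elpd_i,
    }
```

The penalty term is what separates WAIC from a raw goodness-of-fit measure: a model whose log-likelihood swings widely across posterior draws is charged for that flexibility.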

Observe:

We should prefer the model(s) with higher WAIC
Linear-generated data (lhs):
- The WAIC seems quite flat across models
- The WAIC seems best (highest) for simpler models.
Quadratic-generated data (rhs):
- The WAIC is also quite flat across the models
- The worst WAIC is for k1, it is not flexible enough to properly fit the data.
- WAIC is quite flat for the rest, but the highest is for k2, as it should be, and it decreases as the order increases. The higher the order, the higher the complexity of the model, but the goodness of fit is basically the same. Because models with higher complexity are penalized, we land at the sweet spot of choosing the simplest model that can fit the data.
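When the scores are this flat, it helps to look at the difference to the best model together with the standard error of that difference, not only at the ranking. The sketch below shows one way to do that from pointwise scores. The function name `compare_to_best` is hypothetical and the numbers are made up for illustration; they are not results from this notebook.

```python
import math
import statistics


def compare_to_best(pointwise):
    """Rank models by total score; report each model's gap to the best and its SE."""
    totals = {name: sum(vals) for name, vals in pointwise.items()}
    best = max(totals, key=totals.get)
    n_obs = len(pointwise[best])
    table = []
    for name in sorted(totals, key=totals.get, reverse=True):
        # Paired pointwise differences: the same observations under both models.
        diffs = [b - m for b, m in zip(pointwise[best], pointwise[name])]
        d_se = math.sqrt(n_obs * statistics.variance(diffs))
        table.append((name, totals[name], sum(diffs), d_se))
    return table


# Made-up pointwise scores for illustration only (not results from this notebook).
toy_scores = {
    "k1": [-1.2, -0.9, -1.1, -1.0],
    "k2": [-1.3, -1.0, -1.1, -1.1],
    "k3": [-1.5, -1.1, -1.3, -1.2],
}
for name, total, diff, d_se in compare_to_best(toy_scores):
    print(f"{name}: score={total:.2f}  diff_to_best={diff:.2f}  se_of_diff={d_se:.2f}")
```

A gap that is small relative to its standard error suggests the data do not clearly separate the two models, and in that case the simpler one is usually the safer choice.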

Compare leave-one-out Cross-Validation [LOO]

Leave-One-Out Cross-Validation (LOO), like K-fold Cross-Validation, is another quite universal approach for model selection. However, to implement K-fold cross-validation we need to partition the data repeatedly and refit the model on every partition. This can be very time-consuming (computation time increases roughly by a factor of K). Here we apply the numerical approach that works from the posterior trace, as suggested in {cite:t}vehtari2017practical.
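The idea of estimating LOO from a single posterior trace, without refitting, can be illustrated with plain importance sampling. Note that this is a deliberately simplified stand-in: it omits the Pareto smoothing step that gives PSIS-LOO its stability and diagnostics, so it should not be used for real comparisons. As before, `log_lik[s][i]` is assumed to hold the log-likelihood of observation `i` under posterior draw `s`, and the function names are hypothetical.

```python
import math


def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def is_loo_from_loglik(log_lik):
    """Plain importance-sampling LOO (no Pareto smoothing) from log_lik[s][i]."""
    n_obs = len(log_lik[0])
    elpd_i = []
    for i in range(n_obs):
        neg_col = [-draw[i] for draw in log_lik]
        # Importance weights 1 / p(y_i | theta_s) turn the full posterior into an
        # approximate leave-one-out posterior, which gives the harmonic mean
        # p(y_i | y_-i) ~= 1 / mean_s(1 / p(y_i | theta_s)).
        elpd_i.append(-log_mean_exp(neg_col))
    return {"loo": sum(elpd_i), "pointwise": elpd_i}
```

The raw weights can have very heavy tails when a single observation is influential. That is the problem that Pareto smoothing in PSIS-LOO is designed to address.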

Observe:

We should prefer the model(s) with higher LOO. You can see that LOO is nearly identical to WAIC. That's because WAIC is asymptotically equal to LOO. However, PSIS-LOO is supposedly more robust than WAIC in the finite case (under weak priors or influential observations).
Linear-generated data (lhs):
- The LOO is also quite flat across models
- The LOO also seems best (highest) for simpler models.
Quadratic-generated data (rhs):
- The same pattern as the WAIC

Final remarks and tips

It is important to keep in mind that, with more data points, the real underlying model (one that we used to generate the data) should outperform other models.

There is some agreement that PSIS-LOO offers the best indication of a model's quality. To quote from avehtari's comment: "I also recommend using PSIS-LOO instead of WAIC, because it's more reliable and has better diagnostics as discussed in {cite:t}vehtari2017practical, but if you insist to have one information criterion then leave WAIC".

Alternatively, Watanabe says "WAIC is a better approximator of the generalization error than the pareto smoothing importance sampling cross validation. The Pareto smoothing cross validation may be the better approximator of the cross validation than WAIC, however, it is not of the generalization error".

References

:::{bibliography}
:filter: docname in docnames

ando2007bayesian
spiegelhalter2002bayesian
:::

:::{seealso}

Thomas Wiecki's detailed response to a question on Cross Validated
Cross-validation FAQs by Aki Vehtari
:::

Authors

Authored by Jon Sedar in January 2016 (pymc#930)
Updated by Junpeng Lao in July 2017 (pymc#2398)
Re-executed by Ravin Kumar in May 2019 (pymc#3397)
Re-executed by Alex Andorra and Michael Osthege in June 2020 (pymc#3955)
Updated by Raul Maldonado in March 2021 (pymc-examples#24)
Updated by Abhipsha Das and Oriol Abril in June 2021 (pymc-examples#173)
Updated by Chris Fonnesbeck in December 2024 (pymc-examples#761)

Watermark

:::{include} ../page_footer.md
:::