(GLM-model-selection)=
:::{post} Jan 8, 2022
:tags: cross validation, generalized linear model, loo, model comparison, waic
:category: intermediate
:author: Jon Sedar, Junpeng Lao, Abhipsha Das, Oriol Abril-Pla
:::
A fairly minimal reproducible example of model selection using WAIC and LOO as currently implemented in PyMC.

This example creates two toy datasets under linear and quadratic models, and then tests the fit of a range of polynomial linear models upon those datasets by using the Widely Applicable Information Criterion (WAIC) and leave-one-out (LOO) cross-validation with Pareto-smoothed importance sampling (PSIS).
The example was inspired by Jake Vanderplas' blogpost on model selection, although Cross-Validation and Bayes Factor comparison are not implemented. The datasets are tiny and generated within this Notebook. They contain errors in the measured value (y) only.
We start by writing some functions to help with the rest of the notebook. Only some of these functions are key to understanding the notebook; the rest are convenience functions that make plotting more concise when needed. They are hidden inside a toggle-able section, so they are still available but you need to click to see them.
Throughout the rest of the Notebook, we'll use two toy datasets created by a linear and a quadratic model respectively, so that we can better evaluate how the model selection methods perform.

Right now, let's use an interactive session to play around with the data generation function in this Notebook, and get a feel for the range of data we could generate.
$$y_{i} = a + bx_{i} + cx_{i}^{2} + \epsilon_{i}$$

where:

$i \in n$ datapoints

$\epsilon_{i} \sim \mathcal{N}(0, \text{latent\_sigma\_y})$
:::{admonition} Note on outliers
+ We can use the parameter `p` to set the (approximate) proportion of 'outliers' generated under a Bernoulli distribution.
+ These outliers are created with a much larger `latent_sigma_y` than the ordinary datapoints; for a model that explicitly handles outliers, see the related notebook {ref}`GLM-robust-with-outlier-detection`.
:::

Observe:

+ I've shown the `latent_error` in errorbars, but this is for interest only, since this shows the inherent noise in whatever 'physical process' we imagine created the data.

We can use the above interactive plot to get a feel for the effect of the params. Now we'll create 2 fixed datasets to use for the remainder of the Notebook.
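Below is a minimal sketch of what such a data-generation function might look like, assuming the polynomial-plus-noise model above; the function name, default values, and the omission of the outlier machinery are all illustrative simplifications.

```python
import numpy as np
import pandas as pd


def generate_data(n=12, a=1.0, b=2.0, c=0.0, latent_sigma_y=5.0, seed=42):
    """Toy data: polynomial mean a + b*x + c*x**2 with Gaussian noise on y only."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 10, n)
    y = a + b * x + c * x**2 + rng.normal(0, latent_sigma_y, size=n)
    return pd.DataFrame({"x": x, "y": y})


df_lin = generate_data(c=0.0)   # linear dataset
df_quad = generate_data(c=2.0)  # quadratic dataset
```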
Scatterplot against model line
Observe:
+ We now have two datasets, `df_lin` and `df_quad`, created by a linear model and a quadratic model respectively.

Create ranges for later use in `ylim` and `xlim`.
This linear model is really simple and conventional, an OLS with L2 constraints (Ridge Regression):
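A minimal sketch of such a model in PyMC, assuming `df_lin` holds the linear toy dataset; the wide Normal priors on the weights act as a (weak) L2 penalty, and the exact prior scales are illustrative:

```python
import pymc as pm

with pm.Model() as mdl_ols:
    # Normal priors on the weights give the Ridge-like L2 regularisation
    b0 = pm.Normal("Intercept", mu=0, sigma=100)
    b1 = pm.Normal("x", mu=0, sigma=100)

    # linear predictor
    y_est = b0 + b1 * df_lin["x"].values

    # fat-tailed prior on the observation noise
    sigma_y = pm.HalfCauchy("sigma_y", beta=10)
    pm.Normal("y", mu=y_est, sigma=sigma_y, observed=df_lin["y"].values)

    idata_ols = pm.sample(2000, tune=2000)
```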
Observe:
Bambi can be used for defining models using a formulae-style formula syntax. This seems really useful, especially for defining simple regression models in fewer lines of code.
Here's the same OLS model as above, defined using bambi.
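A minimal sketch of the same model in bambi, again assuming `df_lin` from above; the explicit priors simply mirror the wide Normals used before and are otherwise optional:

```python
import bambi as bmb

priors = {
    "Intercept": bmb.Prior("Normal", mu=0, sigma=100),
    "x": bmb.Prior("Normal", mu=0, sigma=100),
}
mdl_ols_bambi = bmb.Model("y ~ 1 + x", df_lin, priors=priors)
idata_ols_bambi = mdl_ols_bambi.fit(draws=2000, tune=2000)
```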
Observe:
+ The bambi-defined model appears to behave in a very similar way, and finds the same parameter values as the conventionally-defined model; any differences are due to the random nature of the sampling.
+ We will use the bambi syntax for the further models below, since it allows us to create a small model factory very easily.

Back to the real purpose of this Notebook: to demonstrate model selection.
First, let's create and run a set of polynomial models on each of our toy datasets. By default this is for models of order 1 to 5.
We're creating 5 polynomial models and fitting each to the chosen dataset using the functions create_poly_modelspec and run_models below.
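The two functions might look roughly like the sketch below, where the sampler settings and the use of bambi default priors are illustrative simplifications:

```python
import numpy as np  # used inside the formula strings via np.power


def create_poly_modelspec(k=1):
    """Build a bambi formula string for a polynomial of order k."""
    return "y ~ 1 + x " + " ".join(f"+ np.power(x,{j})" for j in range(2, k + 1))


def run_models(df, upper_order=5):
    """Fit polynomial models of increasing order and return the models and traces."""
    models, results = {}, {}
    for k in range(1, upper_order + 1):
        name = f"k{k}"
        models[name] = bmb.Model(create_poly_modelspec(k), df)
        # keep the pointwise log-likelihood so WAIC/LOO can be computed later
        results[name] = models[name].fit(
            draws=2000, tune=1000, idata_kwargs={"log_likelihood": True}
        )
    return models, results


models_lin, results_lin = run_models(df_lin, 5)
models_quad, results_quad = run_models(df_quad, 5)
```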
Just for the linear-generated data, let's take an interactive look at the posterior predictive fit for the models k1 through k5.
As indicated by the likelihood plots above, the higher-order polynomial models exhibit some quite wild swings in the fitted function in order to (over)fit the data.
The Widely Applicable Information Criterion (WAIC) can be used to calculate the goodness-of-fit of a model using numerical techniques. See {cite:t}`watanabe2010asymptotic` for details.
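A minimal sketch of how this can be done with ArviZ, assuming `results_lin` and `results_quad` are the dictionaries of fitted models returned by `run_models` above:

```python
import arviz as az

# rank the candidate models k1..k5 by WAIC for each dataset
dfwaic_lin = az.compare(results_lin, ic="waic")
dfwaic_quad = az.compare(results_quad, ic="waic")
dfwaic_lin
```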
Observe:
We get three different measurements: the WAIC score itself (an estimate of the expected log pointwise predictive density), its standard error, and the effective number of parameters.
In this case we are interested in the WAIC score. We also plot error bars for the standard error of the estimated scores. This gives us a more accurate view of how much they might differ.
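A sketch of the comparison plot, assuming the `dfwaic_lin` and `dfwaic_quad` tables from above; the side-by-side layout matches the lhs/rhs description below:

```python
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(10, 4))
az.plot_compare(dfwaic_lin, insample_dev=False, ax=axs[0])
az.plot_compare(dfwaic_quad, insample_dev=False, ax=axs[1])
axs[0].set_title("Linear data")
axs[1].set_title("Quadratic data")
```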
Observe:

+ We should prefer the model(s) with higher WAIC.
+ Linear-generated data (lhs):
+ Quadratic-generated data (rhs):
Leave-One-Out Cross-Validation or K-fold Cross-Validation is another quite universal approach for model selection. However, to implement K-fold cross-validation we would need to partition the data repeatedly and fit the model on every partition, which can be very time consuming (computation time increases roughly by a factor of K). Here we apply the numerical approach using the posterior trace, as suggested in {cite:t}`vehtari2017practical`.
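A minimal sketch, reusing the same fitted models as before: PSIS-LOO needs no refitting, only the pointwise log-likelihood already stored in the traces.

```python
dfloo_lin = az.compare(results_lin, ic="loo")
dfloo_quad = az.compare(results_quad, ic="loo")
dfloo_lin
```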
Observe:

+ We should prefer the model(s) with higher LOO. You can see that LOO is nearly identical with WAIC. That's because WAIC is asymptotically equal to LOO. However, PSIS-LOO is supposedly more robust than WAIC in the finite case (under weak priors or influential observations).
+ Linear-generated data (lhs):
+ Quadratic-generated data (rhs):
It is important to keep in mind that, with more data points, the real underlying model (the one we used to generate the data) should outperform the other models.
There is some agreement that PSIS-LOO offers the best indication of a model's quality. To quote from avehtari's comment: "I also recommend using PSIS-LOO instead of WAIC, because it's more reliable and has better diagnostics as discussed in {cite:t}`vehtari2017practical`, but if you insist to have one information criterion then leave WAIC".
Alternatively, Watanabe says "WAIC is a better approximator of the generalization error than the pareto smoothing importance sampling cross validation. The Pareto smoothing cross validation may be the better approximator of the cross validation than WAIC, however, it is not of the generalization error".
:::{bibliography}
:filter: docname in docnames

ando2007bayesian
spiegelhalter2002bayesian
:::
:::{seealso}
:::{include} ../page_footer.md
:::