(GLM-robust-with-outlier-detection)=
:::{post} 17 Nov, 2021
:tags: outliers, regression, robust
:category: intermediate
:author: Jon Sedar, Thomas Wiecki, Raul Maldonado, Oriol Abril
:::
Using PyMC for Robust Regression with Outlier Detection using the Hogg 2010 Signal vs Noise method.
**Modelling concept:**

- This model uses a custom likelihood function: a mixture of two likelihoods, one for the main linear model we care about (the 'inliers'), and one for 'outliers'
- The model does not marginalize over the inlier / outlier class, so it yields a classification flag for each datapoint

**Complementary approaches:**

- This is a complement to the Student-T robust regression example in {ref}`generalized_linear_models/GLM-robust`, and that approach is also compared below
- A related approach marginalizes over the outlier class and observes it during sampling using a `pm.Deterministic` <- this is really nice
- The likelihood evaluation follows eqn 17 in {cite:t}`hogg2010data`
- The model is adapted from the implementation in the AstroML book ({cite:t}`ivezić2014astroMLtext`, {cite:t}`vanderplas2012astroML`)

See the original project README for full details on dependencies and about the environment where the notebook was written. A summary of the environment where this notebook was executed is available in the "Watermark" section.
:::{include} ../extra_installs.md
:::
We'll use the Hogg 2010 data available at https://github.com/astroML/astroML/blob/master/astroML/datasets/hogg2010test.py
It's a very small dataset so for convenience, it's hardcoded below
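For illustration, a minimal sketch of how the hardcoded table might look. The first three rows are copied from the astroML source file linked above (fill in the remaining rows from there); the `p{i}` index labels are an assumption, chosen to match the point names used later in this notebook:

```python
import pandas as pd

# a sketch of the hardcoded table: first three rows shown, copy the
# remaining rows from the astroML source file linked above
dfhogg = pd.DataFrame(
    data=[
        [201, 592, 61, 9, -0.84],
        [244, 401, 25, 4, 0.31],
        [47, 583, 38, 11, 0.64],
        # ... remaining rows from the astroML source
    ],
    columns=["x", "y", "sigma_y", "sigma_x", "rho_xy"],
)
# hypothetical point labels to match the p-names used later
dfhogg.index = [f"p{i}" for i in range(len(dfhogg))]
```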
## Exploratory Data Analysis
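A quick errorbar scatter is enough to see the shape of the data; a minimal sketch, assuming the `dfhogg` dataframe from above:

```python
import matplotlib.pyplot as plt

# scatter of y vs x, with the measured errors sigma_y as vertical errorbars
fig, ax = plt.subplots(figsize=(8, 5))
ax.errorbar(dfhogg["x"], dfhogg["y"], yerr=dfhogg["sigma_y"], fmt="o", ms=4)
ax.set(xlabel="x", ylabel="y", title="Hogg 2010 dataset")
```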
**Note:**

- The dataset also provides `sigma_x` and `rho_xy`, which we'll ignore in the pymc3 part; see {cite:t}`hogg2010data` for more detail

**Observe:**

- Even judging by eye, most datapoints appear to sit on or around a straight line with positive gradient, while a few look like they may be outliers from such a line
It's common practice to standardize the input values to a linear model, because this leads to coefficients sitting in the same range and being more directly comparable, e.g. this is noted in {cite:t}`gelman2008scaling`.

So, following Gelman's paper above, we'll divide by 2 s.d. here.
**Note:**

- We'll use a scikit-learn `FunctionTransformer` for convenience
- We'll continue to ignore `rho_xy` for now

**Additional note on scaling the output feature `y` and measurement error `sigma_y`:**

- Scaling `y` lets us later create `sigma_y_out` from `sigma_y`
- It requires `sigma_y` to be restated in the same scale as the stdev of `y`

Standardize (mean center and divide by 2 sd):
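A minimal sketch of that standardization, assuming the `dfhogg` dataframe from above (the `FunctionTransformer` mentioned earlier is omitted for brevity):

```python
import pandas as pd

# standardize: mean-center and divide by 2 sd, per Gelman's recommendation,
# and restate sigma_y in the same scale as the stdev of y
dfhoggs = pd.DataFrame(index=dfhogg.index)
dfhoggs["x"] = (dfhogg["x"] - dfhogg["x"].mean()) / (2 * dfhogg["x"].std())
dfhoggs["y"] = (dfhogg["y"] - dfhogg["y"].mean()) / (2 * dfhogg["y"].std())
dfhoggs["sigma_y"] = dfhogg["sigma_y"] / (2 * dfhogg["y"].std())
```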
## Simple Linear Model with no Outlier Correction

Before we get more advanced, I want to demo the fit of a simple linear model with a Normal likelihood function. The priors are also Normally distributed, so this behaves like an OLS with Ridge Regression (L2 norm).
Note: the dataset also has `sigma_x` and `rho_xy` available, but for this exercise, we've chosen to only use `sigma_y`.
$$\hat{y}_i \sim \mathcal{N}(\beta^{T} \vec{x}_{i}, \sigma_{y_i})$$

where:

- the design matrix $\vec{x}_i$ includes an intercept, i.e. the formula `1 + x`
- $\sigma_{y_i}$ = the measured error `sigma_y` for each datapoint

Note we are purposefully missing a step here for prior predictive checks.
NOTE: We will illustrate this OLS fit and compare to the datapoints in the final comparison plot
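A minimal sketch of this model, assuming the standardized `dfhoggs` dataframe from above; the priors and the variable names (`b0_intercept`, `b1_slope`) are illustrative, not necessarily those of the original notebook:

```python
import pymc3 as pm

with pm.Model() as mdl_ols:
    # Normal priors on the coefficients act like Ridge (L2) regularization
    b0 = pm.Normal("b0_intercept", mu=0, sigma=10)
    b1 = pm.Normal("b1_slope", mu=0, sigma=10)

    # linear model over the design matrix `1 + x`
    y_est = b0 + b1 * dfhoggs["x"].values

    # Normal likelihood, using the measured error sigma_y for each datapoint
    pm.Normal(
        "y_like", mu=y_est, sigma=dfhoggs["sigma_y"].values, observed=dfhoggs["y"].values
    )

    trc_ols = pm.sample(return_inferencedata=True)
```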
Traceplot
Plot posterior joint distribution (since the model has only 2 coeffs, we can easily view this as a 2D joint distribution)
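For instance, with ArviZ (assuming `trc_ols` from the sketch above):

```python
import arviz as az

# marginal traces for the two coefficients
az.plot_trace(trc_ols, var_names=["b0_intercept", "b1_slope"])

# with only 2 coefficients, the full posterior is easy to view as a 2D joint
az.plot_pair(trc_ols, var_names=["b0_intercept", "b1_slope"], kind="kde")
```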
## Simple Linear Model with Robust Student-T Likelihood

I've added this brief section in order to directly compare the Student-T based method demonstrated in Thomas Wiecki's notebook in the PyMC3 documentation.
Instead of using a Normal distribution for the likelihood, we use a Student-T which has fatter tails. In theory this allows outliers to have a smaller influence in the likelihood estimation. This method does not produce inlier / outlier flags (it marginalizes over such a classification) but it's simpler and faster to run than the Signal Vs Noise model below, so a comparison seems worthwhile.
In this modification, we allow the likelihood to be more robust to outliers (have fatter tails)
$$\hat{y}_i \sim \text{StudentT}(\nu, \beta^{T} \vec{x}_{i}, \sigma_{y_i})$$

where:

- the design matrix $\vec{x}_i$ includes an intercept, i.e. the formula `1 + x`
- $\sigma_{y_i}$ = the measured error `sigma_y` for each datapoint
- $\nu$ = the degrees of freedom, controlling the fatness of the tails

Note: the dataset also has `sigma_x` and `rho_xy` available, but for this exercise, I've chosen to only use `sigma_y`.
NOTE: We will illustrate this StudentT fit and compare to the datapoints in the final comparison plot
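A minimal sketch, swapping the Normal likelihood for a Student-T; the `InverseGamma` prior on `nu` is an assumption (any prior allowing small `nu` illustrates the point):

```python
import pymc3 as pm

with pm.Model() as mdl_studentt:
    b0 = pm.Normal("b0_intercept", mu=0, sigma=10)
    b1 = pm.Normal("b1_slope", mu=0, sigma=10)
    y_est = b0 + b1 * dfhoggs["x"].values

    # degrees of freedom: nu ~ 1 means very fat tails; nu -> inf recovers the Normal
    nu = pm.InverseGamma("nu", alpha=1, beta=1)

    # Student-T likelihood, again using the measured error sigma_y
    pm.StudentT(
        "y_like", nu=nu, mu=y_est, sigma=dfhoggs["sigma_y"].values, observed=dfhoggs["y"].values
    )

    trc_studentt = pm.sample(return_inferencedata=True)
```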
Traceplot
Plot posterior joint distribution
**Observe:**

- The `beta` parameters appear to have greater variance than in the OLS regression
- `nu ~ 1`, indicating that a fat-tailed likelihood has a better fit than a thin-tailed one
- `beta[intercept]` has moved much closer to $0$, which is interesting: if the theoretical relationship `y ~ f(x)` has no offset, then for this mean-centered dataset, the intercept should indeed be $0$: it might easily be getting pushed off-course by outliers in the OLS model
- `beta[slope]` has accordingly become greater: perhaps moving closer to the theoretical function `f(x)`

## Linear Model with Custom Likelihood to Distinguish Outliers: Hogg Method

Please read the paper (Hogg 2010) and Jake Vanderplas' code for more complete information about the modelling technique.
The general idea is to create a 'mixture' model whereby datapoints can be described by either:

1. the proposed (linear) model, in which case a datapoint is an inlier, or
2. a second model, with a different mean and larger variance, in which case a datapoint is an outlier

The likelihood is evaluated over a mixture of two likelihoods, one for 'inliers', one for 'outliers'. A Bernoulli distribution is used to randomly assign each of the N datapoints to either the inlier or outlier group, and we sample the model as usual to infer robust model parameters and inlier / outlier flags:
$$
\mathcal{L} = \prod_{i}^{N} \Big[ (1 - B_i) \, \mathcal{N}(y_i \mid \beta^{T} \vec{x}_{i}, \sigma_{y_i}) + B_i \, \mathcal{N}(y_i \mid \mu_{out}, \sigma_{y_i} + \sigma_{out}) \Big]
$$

where:

- the design matrix $\vec{x}_i$ includes an intercept, i.e. the formula `1 + x`
- $\sigma_{y_i}$ = the measured error `sigma_y` for each datapoint
- $B_i$ is the Bernoulli flag marking datapoint $i$ as an outlier, and $\mu_{out}$, $\sigma_{out}$ are the pooled mean and additional variance of the outlier model

This notebook uses the {func}`~pymc3.model.Potential` class combined with `logp` to create this likelihood, and to build a model in which a feature is not observed: here, the Bernoulli switching variable.
Usage of `Potential` is not discussed in detail here, but other resources are available that are worth referring to for details on `Potential` usage, e.g. the `pymc3.Potential` tag on the left sidebar.

Note that `pm.sample` conveniently and automatically creates the compound sampling process to:

1. sample a Bernoulli variable (the class flags `is_outlier`) using a discrete sampler
2. sample the continuous variables using a continuous sampler

Further note:

- sampling is initialized with `jitter+adapt_diag`
- to pass kwargs to a particular stepper, wrap them in a dict addressed to the lowercased name of the stepper, e.g. `nuts={'target_accept': 0.85}`

We will illustrate this model fit and compare to the datapoints in the final comparison plot.
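A minimal sketch of this Signal vs Noise model under the notes above; the priors, the `Uniform(0, 0.5)` bound on `frac_outliers`, and the exact variable names are illustrative assumptions:

```python
import pymc3 as pm

with pm.Model() as mdl_hogg:
    # inlier model: the same linear model as before
    b0 = pm.Normal("b0_intercept", mu=0, sigma=10)
    b1 = pm.Normal("b1_slope", mu=0, sigma=10)
    y_est_in = b0 + b1 * dfhoggs["x"].values
    sigma_y_in = dfhoggs["sigma_y"].values

    # outlier model: a single pooled mean with additional pooled variance
    y_est_out = pm.Normal("y_est_out", mu=0, sigma=10)
    y_sigma_out = pm.HalfNormal("y_sigma_out", sigma=10)

    # Bernoulli flags assign each datapoint to the inlier or outlier group
    frac_outliers = pm.Uniform("frac_outliers", lower=0.0, upper=0.5)
    is_outlier = pm.Bernoulli("is_outlier", p=frac_outliers, shape=len(dfhoggs))

    # evaluate both log-likelihoods and mix them via a Potential
    logp_in = pm.Normal.dist(mu=y_est_in, sigma=sigma_y_in).logp(dfhoggs["y"].values)
    logp_out = pm.Normal.dist(mu=y_est_out, sigma=sigma_y_in + y_sigma_out).logp(
        dfhoggs["y"].values
    )
    pm.Potential("loglike", ((1 - is_outlier) * logp_in + is_outlier * logp_out).sum())

    # compound sampling: NUTS for the continuous variables, a discrete
    # sampler for is_outlier; stepper kwargs wrapped per the note above
    trc_hogg = pm.sample(
        tune=10000,
        draws=500,
        init="jitter+adapt_diag",
        nuts={"target_accept": 0.9},
        return_inferencedata=True,
    )
```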
**Observe:**

- At `target_accept = 0.8` there are lots of divergences, indicating this is not a particularly stable model
- At `target_accept = 0.9` (and increasing `tune` from 5000 to 10000), the traces exhibit fewer divergences and appear slightly better behaved
- The traces for the `beta` parameters, and for the outlier model parameter `y_est_out` (the mean), look reasonably converged
- The traces for `y_sigma_out` (the additional pooled variance) occasionally go a bit wild
- It's interesting that `frac_outliers` is so dispersed: that's quite a flat distribution, suggesting that there are a few datapoints where their inlier/outlier status is subjective

Simple trace summary, including `rhat`:
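For instance (assuming `trc_hogg` from the sketch above):

```python
import arviz as az

# posterior summary including r_hat; watch y_sigma_out and frac_outliers
az.summary(
    trc_hogg,
    var_names=["b0_intercept", "b1_slope", "y_est_out", "y_sigma_out", "frac_outliers"],
)
```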
Plot posterior joint distribution
(This is a particularly useful diagnostic in this case where we see a lot of divergences in the traces: maybe the model specification leads to weird behaviours)
**Observe:**

- The `hogg_inlier` and `studentt` models converge to similar ranges for `b0_intercept` and `b1_slope`, indicating that the (unshown) `hogg_outlier` model might perform a similar job to the fat tails of the `studentt` model: allowing greater log probability away from the main linear distribution in the datapoints
- The `hogg_inlier` posterior has thinner tails and more probability mass concentrated about the central values
- The `hogg_inlier` model also appears to have moved farther away from both the `ols` and `studentt` models on the `b0_intercept`, suggesting that the outliers really distort that particular dimension

At each step of the traces, each datapoint may be either an inlier or outlier. We hope that the datapoints spend an unequal amount of time in one state or the other, so let's take a look at the simple count of states for each of the 20 datapoints.
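A sketch of that count using `az.extract`, assuming `trc_hogg` from above (the relabeling to the `p0, p1, ...` point names is an assumption):

```python
import arviz as az

# proportion of posterior samples in which each datapoint is flagged an outlier
is_out = az.extract(trc_hogg, var_names="is_outlier")  # dims: (datapoint, sample)
prop_outlier = is_out.mean(dim="sample").to_series()
prop_outlier.index = dfhoggs.index  # relabel to the p0, p1, ... point names
print(prop_outlier.sort_values(ascending=False))
```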
**Observe:**

- `frac_outliers ~ 0.27`, corresponding to approx 5 or 6 of the 20 datapoints
- We might investigate datapoints `[p1, p12, p16]` to see if they lean towards being outliers

The 95% cutoff we choose is subjective and arbitrary, but I prefer it for now, so let's declare these 3 to be outliers and see how it looks compared to Jake Vanderplas' outliers, which were declared in a slightly different way as points with means above 0.68.
**Note:**

- We also add a flag for the points to be investigated, which we'll use to annotate the final plot
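A sketch of declaring the flags, assuming `prop_outlier` from above; the 95% cutoff follows the text, while the 'investigate' band is an illustrative assumption:

```python
# declare outliers: flagged in at least 95% of posterior samples
dfhoggs["outlier"] = prop_outlier >= 0.95

# also flag ambiguous points for investigation (band is an arbitrary choice)
dfhoggs["investigate"] = (prop_outlier >= 0.10) & (prop_outlier < 0.95)
```

And a minimal sketch of the final comparison plot, overlaying each model's posterior-mean fit on the datapoints (assumes the traces from the earlier sketches):

```python
import arviz as az
import matplotlib.pyplot as plt
import numpy as np

xi = np.linspace(dfhoggs["x"].min(), dfhoggs["x"].max(), 100)
fig, ax = plt.subplots(figsize=(8, 6))
ax.errorbar(dfhoggs["x"], dfhoggs["y"], yerr=dfhoggs["sigma_y"], fmt="o", c="k", label="data")

# overlay the posterior-mean regression line from each model
for label, trc in [("ols", trc_ols), ("studentt", trc_studentt), ("hogg inlier", trc_hogg)]:
    post = az.extract(trc)  # stacked posterior samples
    b0 = post["b0_intercept"].mean().item()
    b1 = post["b1_slope"].mean().item()
    ax.plot(xi, b0 + b1 * xi, label=label)

# annotate the declared outliers
for pid in dfhoggs.index[dfhoggs["outlier"]]:
    ax.annotate(pid, xy=(dfhoggs.loc[pid, "x"], dfhoggs.loc[pid, "y"]))
ax.legend()
```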
**Observe:**

- The posterior predictive fit is shown for each of the three models (`ols`, `studentt`, and the Hogg inlier model) against the datapoints
- We see that the Hogg Signal vs Noise model also yields specific estimates of which datapoints are outliers, which we annotate on the plot

Overall, the Hogg Signal vs Noise model behaves as desired: we get a robust regression fit and, unlike the Student-T approach, an explicit inlier / outlier classification for each datapoint.
## References

:::{bibliography}
:filter: docname in docnames
:::
## Authors

- Originally authored using `pm.Normal.dist().logp()` and `pm.Potential()`
- Updated: simplified the prior on `nu` in the StudentT model to be more efficient, dropped explicit use of theano shared vars, and generally improved plotting / explanations / layout
- Updated: use `arviz.InferenceData` objects, running on pymc=3.11, arviz=0.11.0
- Updated to use `az.extract` by Benjamin Vincent in February, 2023 (pymc-examples#522)

:::{include} ../page_footer.md
:::