(GLM-robust-with-outlier-detection)=
:::{post} 17 Nov, 2021
:tags: outliers, regression, robust
:category: intermediate
:author: Jon Sedar, Thomas Wiecki, Raul Maldonado, Oriol Abril
:::
Using PyMC for Robust Regression with Outlier Detection using the Hogg 2010 Signal vs Noise method.
**Modelling concept:**

- This model uses a custom likelihood function: a mixture of two likelihoods, one for the main linear model we care about (the 'inliers'), and one for 'outliers'
- The model does not marginalize over the inlier / outlier class, so it yields a classification flag for each datapoint

**Complementary approaches:**

- This is a complement to the Student-T robust regression example in {ref}`generalized_linear_models/GLM-robust`, and that approach is also compared below
- A related approach marginalizes over the outlier class and observes it during sampling using a `pm.Deterministic` <- this is really nice
- The likelihood evaluation follows eqn 17 in {cite:t}`hogg2010data`
- The model is adapted from the implementation in the AstroML book ({cite:t}`ivezić2014astroMLtext`, {cite:t}`vanderplas2012astroML`)

See the original project README for full details on dependencies and about the environment where the notebook was written. A summary of the environment where this notebook was executed is available in the "Watermark" section.
:::{include} ../extra_installs.md
:::
We'll use the Hogg 2010 data available at https://github.com/astroML/astroML/blob/master/astroML/datasets/hogg2010test.py
It's a very small dataset so for convenience, it's hardcoded below
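For illustration, a minimal sketch of how the hardcoded table might look. The first three rows are copied from the astroML source file linked above (fill in the remaining rows from there); the `p{i}` index labels are an assumption, chosen to match the point names used later in this notebook:

```python
import pandas as pd

# a sketch of the hardcoded table: first three rows shown, copy the
# remaining rows from the astroML source file linked above
dfhogg = pd.DataFrame(
    data=[
        [201, 592, 61, 9, -0.84],
        [244, 401, 25, 4, 0.31],
        [47, 583, 38, 11, 0.64],
        # ... remaining rows from the astroML source
    ],
    columns=["x", "y", "sigma_y", "sigma_x", "rho_xy"],
)
# hypothetical point labels to match the p-names used later
dfhogg.index = [f"p{i}" for i in range(len(dfhogg))]
```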
## Exploratory Data Analysis
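A quick errorbar scatter is enough to see the shape of the data; a minimal sketch, assuming the `dfhogg` dataframe from above:

```python
import matplotlib.pyplot as plt

# scatter of y vs x, with the measured errors sigma_y as vertical errorbars
fig, ax = plt.subplots(figsize=(8, 5))
ax.errorbar(dfhogg["x"], dfhogg["y"], yerr=dfhogg["sigma_y"], fmt="o", ms=4)
ax.set(xlabel="x", ylabel="y", title="Hogg 2010 dataset")
```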
**Note:**

- The dataset also provides `sigma_x` and `rho_xy`, which we'll ignore in the pymc3 part; see {cite:t}`hogg2010data` for more detail

**Observe:**

- Even judging by eye, most datapoints appear to sit on or around a straight line with positive gradient, while a few look like they may be outliers from such a line
It's common practice to standardize the input values to a linear model, because this leads to coefficients sitting in the same range and being more directly comparable, e.g. this is noted in {cite:t}`gelman2008scaling`.

So, following Gelman's paper above, we'll divide by 2 s.d. here.
**Note:**

- We'll use a scikit-learn `FunctionTransformer` for convenience
- We'll continue to ignore `rho_xy` for now

**Additional note on scaling the output feature `y` and measurement error `sigma_y`:**

- Scaling `y` lets us later create `sigma_y_out` from `sigma_y`
- It requires `sigma_y` to be restated in the same scale as the stdev of `y`

Standardize (mean center and divide by 2 sd):
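A minimal sketch of that standardization, assuming the `dfhogg` dataframe from above (the `FunctionTransformer` mentioned earlier is omitted for brevity):

```python
import pandas as pd

# standardize: mean-center and divide by 2 sd, per Gelman's recommendation,
# and restate sigma_y in the same scale as the stdev of y
dfhoggs = pd.DataFrame(index=dfhogg.index)
dfhoggs["x"] = (dfhogg["x"] - dfhogg["x"].mean()) / (2 * dfhogg["x"].std())
dfhoggs["y"] = (dfhogg["y"] - dfhogg["y"].mean()) / (2 * dfhogg["y"].std())
dfhoggs["sigma_y"] = dfhogg["sigma_y"] / (2 * dfhogg["y"].std())
```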
## Simple Linear Model with no Outlier Correction

Before we get more advanced, I want to demo the fit of a simple linear model with a Normal likelihood function. The priors are also Normally distributed, so this behaves like an OLS with Ridge Regression (L2 norm).
Note: the dataset also has `sigma_x` and `rho_xy` available, but for this exercise, we've chosen to only use `sigma_y`.
$$\hat{y}_i \sim \mathcal{N}(\beta^{T} \vec{x}_{i}, \sigma_{y_i})$$

where:

- the design matrix $\vec{x}_i$ includes an intercept, i.e. the formula `1 + x`
- $\sigma_{y_i}$ = the measured error `sigma_y` for each datapoint

Note we are purposefully missing a step here for prior predictive checks.
NOTE: We will illustrate this OLS fit and compare to the datapoints in the final comparison plot
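A minimal sketch of this model, assuming the standardized `dfhoggs` dataframe from above; the priors and the variable names (`b0_intercept`, `b1_slope`) are illustrative, not necessarily those of the original notebook:

```python
import pymc3 as pm

with pm.Model() as mdl_ols:
    # Normal priors on the coefficients act like Ridge (L2) regularization
    b0 = pm.Normal("b0_intercept", mu=0, sigma=10)
    b1 = pm.Normal("b1_slope", mu=0, sigma=10)

    # linear model over the design matrix `1 + x`
    y_est = b0 + b1 * dfhoggs["x"].values

    # Normal likelihood, using the measured error sigma_y for each datapoint
    pm.Normal(
        "y_like", mu=y_est, sigma=dfhoggs["sigma_y"].values, observed=dfhoggs["y"].values
    )

    trc_ols = pm.sample(return_inferencedata=True)
```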
Traceplot
Plot posterior joint distribution (since the model has only 2 coeffs, we can easily view this as a 2D joint distribution)
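For instance, with ArviZ (assuming `trc_ols` from the sketch above):

```python
import arviz as az

# marginal traces for the two coefficients
az.plot_trace(trc_ols, var_names=["b0_intercept", "b1_slope"])

# with only 2 coefficients, the full posterior is easy to view as a 2D joint
az.plot_pair(trc_ols, var_names=["b0_intercept", "b1_slope"], kind="kde")
```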
## Simple Linear Model with Robust Student-T Likelihood

I've added this brief section in order to directly compare the Student-T based method demonstrated in Thomas Wiecki's notebook in the PyMC3 documentation.
Instead of using a Normal distribution for the likelihood, we use a Student-T which has fatter tails. In theory this allows outliers to have a smaller influence in the likelihood estimation. This method does not produce inlier / outlier flags (it marginalizes over such a classification) but it's simpler and faster to run than the Signal Vs Noise model below, so a comparison seems worthwhile.
In this modification, we allow the likelihood to be more robust to outliers (have fatter tails)
$$\hat{y}_i \sim \text{StudentT}(\nu, \beta^{T} \vec{x}_{i}, \sigma_{y_i})$$

where:

- the design matrix $\vec{x}_i$ includes an intercept, i.e. the formula `1 + x`
- $\sigma_{y_i}$ = the measured error `sigma_y` for each datapoint
- $\nu$ = the degrees of freedom, controlling the fatness of the tails

Note: the dataset also has `sigma_x` and `rho_xy` available, but for this exercise, I've chosen to only use `sigma_y`.
NOTE: We will illustrate this StudentT fit and compare to the datapoints in the final comparison plot
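A minimal sketch, swapping the Normal likelihood for a Student-T; the `InverseGamma` prior on `nu` is an assumption (any prior allowing small `nu` illustrates the point):

```python
import pymc3 as pm

with pm.Model() as mdl_studentt:
    b0 = pm.Normal("b0_intercept", mu=0, sigma=10)
    b1 = pm.Normal("b1_slope", mu=0, sigma=10)
    y_est = b0 + b1 * dfhoggs["x"].values

    # degrees of freedom: nu ~ 1 means very fat tails; nu -> inf recovers the Normal
    nu = pm.InverseGamma("nu", alpha=1, beta=1)

    # Student-T likelihood, again using the measured error sigma_y
    pm.StudentT(
        "y_like", nu=nu, mu=y_est, sigma=dfhoggs["sigma_y"].values, observed=dfhoggs["y"].values
    )

    trc_studentt = pm.sample(return_inferencedata=True)
```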
Traceplot
Plot posterior joint distribution
**Observe:**

- The `beta` parameters appear to have greater variance than in the OLS regression
- `nu ~ 1`, indicating that a fat-tailed likelihood has a better fit than a thin-tailed one
- `beta[intercept]` has moved much closer to $0$, which is interesting: if the theoretical relationship `y ~ f(x)` has no offset, then for this mean-centered dataset, the intercept should indeed be $0$: it might easily be getting pushed off-course by outliers in the OLS model
- `beta[slope]` has accordingly become greater: perhaps moving closer to the theoretical function `f(x)`

## Linear Model with Custom Likelihood to Distinguish Outliers: Hogg Method

Please read the paper (Hogg 2010) and Jake Vanderplas' code for more complete information about the modelling technique.
The general idea is to create a 'mixture' model whereby datapoints can be described by either:

1. the proposed (linear) model, in which case a datapoint is an inlier, or
2. a second model, with a different mean and larger variance, in which case a datapoint is an outlier

The likelihood is evaluated over a mixture of two likelihoods, one for 'inliers', one for 'outliers'. A Bernoulli distribution is used to randomly assign each of the N datapoints to either the inlier or outlier group, and we sample the model as usual to infer robust model parameters and inlier / outlier flags:
$$
\mathcal{L} = \prod_{i}^{N} \Big[ (1 - B_i) \, \mathcal{N}(y_i \mid \beta^{T} \vec{x}_{i}, \sigma_{y_i}) + B_i \, \mathcal{N}(y_i \mid \mu_{out}, \sigma_{y_i} + \sigma_{out}) \Big]
$$

where:

- the design matrix $\vec{x}_i$ includes an intercept, i.e. the formula `1 + x`
- $\sigma_{y_i}$ = the measured error `sigma_y` for each datapoint
- $B_i$ is the Bernoulli flag marking datapoint $i$ as an outlier, and $\mu_{out}$, $\sigma_{out}$ are the pooled mean and additional variance of the outlier model

This notebook uses the {func}`~pymc3.model.Potential` class combined with `logp` to create this likelihood, and to build a model in which a feature is not observed: here, the Bernoulli switching variable.
Usage of `Potential` is not discussed in detail here, but other resources are available that are worth referring to for details on `Potential` usage, e.g. the `pymc3.Potential` tag on the left sidebar.

Note that `pm.sample` conveniently and automatically creates the compound sampling process to:

1. sample a Bernoulli variable (the class flags `is_outlier`) using a discrete sampler
2. sample the continuous variables using a continuous sampler

Further note:

- sampling is initialized with `jitter+adapt_diag`
- to pass kwargs to a particular stepper, wrap them in a dict addressed to the lowercased name of the stepper, e.g. `nuts={'target_accept': 0.85}`

We will illustrate this model fit and compare to the datapoints in the final comparison plot.
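A minimal sketch of this Signal vs Noise model under the notes above; the priors, the `Uniform(0, 0.5)` bound on `frac_outliers`, and the exact variable names are illustrative assumptions:

```python
import pymc3 as pm

with pm.Model() as mdl_hogg:
    # inlier model: the same linear model as before
    b0 = pm.Normal("b0_intercept", mu=0, sigma=10)
    b1 = pm.Normal("b1_slope", mu=0, sigma=10)
    y_est_in = b0 + b1 * dfhoggs["x"].values
    sigma_y_in = dfhoggs["sigma_y"].values

    # outlier model: a single pooled mean with additional pooled variance
    y_est_out = pm.Normal("y_est_out", mu=0, sigma=10)
    y_sigma_out = pm.HalfNormal("y_sigma_out", sigma=10)

    # Bernoulli flags assign each datapoint to the inlier or outlier group
    frac_outliers = pm.Uniform("frac_outliers", lower=0.0, upper=0.5)
    is_outlier = pm.Bernoulli("is_outlier", p=frac_outliers, shape=len(dfhoggs))

    # evaluate both log-likelihoods and mix them via a Potential
    logp_in = pm.Normal.dist(mu=y_est_in, sigma=sigma_y_in).logp(dfhoggs["y"].values)
    logp_out = pm.Normal.dist(mu=y_est_out, sigma=sigma_y_in + y_sigma_out).logp(
        dfhoggs["y"].values
    )
    pm.Potential("loglike", ((1 - is_outlier) * logp_in + is_outlier * logp_out).sum())

    # compound sampling: NUTS for the continuous variables, a discrete
    # sampler for is_outlier; stepper kwargs wrapped per the note above
    trc_hogg = pm.sample(
        tune=10000,
        draws=500,
        init="jitter+adapt_diag",
        nuts={"target_accept": 0.9},
        return_inferencedata=True,
    )
```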
**Observe:**

- At `target_accept = 0.8` there are lots of divergences, indicating this is not a particularly stable model
- At `target_accept = 0.9` (and increasing `tune` from 5000 to 10000), the traces exhibit fewer divergences and appear slightly better behaved
- The traces for the `beta` parameters, and for the outlier model parameter `y_est_out` (the mean), look reasonably converged
- The traces for `y_sigma_out` (the additional pooled variance) occasionally go a bit wild
- It's interesting that `frac_outliers` is so dispersed: that's quite a flat distribution, suggesting that there are a few datapoints where their inlier/outlier status is subjective

Simple trace summary, including `rhat`:
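For instance (assuming `trc_hogg` from the sketch above):

```python
import arviz as az

# posterior summary including r_hat; watch y_sigma_out and frac_outliers
az.summary(
    trc_hogg,
    var_names=["b0_intercept", "b1_slope", "y_est_out", "y_sigma_out", "frac_outliers"],
)
```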
Plot posterior joint distribution
(This is a particularly useful diagnostic in this case where we see a lot of divergences in the traces: maybe the model specification leads to weird behaviours)
**Observe:**

- The `hogg_inlier` and `studentt` models converge to similar ranges for `b0_intercept` and `b1_slope`, indicating that the (unshown) `hogg_outlier` model might perform a similar job to the fat tails of the `studentt` model: allowing greater log probability away from the main linear distribution in the datapoints
- The `hogg_inlier` posterior has thinner tails and more probability mass concentrated about the central values
- The `hogg_inlier` model also appears to have moved farther away from both the `ols` and `studentt` models on the `b0_intercept`, suggesting that the outliers really distort that particular dimension

At each step of the traces, each datapoint may be either an inlier or outlier. We hope that the datapoints spend an unequal amount of time in one state or the other, so let's take a look at the simple count of states for each of the 20 datapoints.
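A sketch of that count using `az.extract`, assuming `trc_hogg` from above (the relabeling to the `p0, p1, ...` point names is an assumption):

```python
import arviz as az

# proportion of posterior samples in which each datapoint is flagged an outlier
is_out = az.extract(trc_hogg, var_names="is_outlier")  # dims: (datapoint, sample)
prop_outlier = is_out.mean(dim="sample").to_series()
prop_outlier.index = dfhoggs.index  # relabel to the p0, p1, ... point names
print(prop_outlier.sort_values(ascending=False))
```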
**Observe:**

- `frac_outliers ~ 0.27`, corresponding to approx 5 or 6 of the 20 datapoints
- We might investigate datapoints `[p1, p12, p16]` to see if they lean towards being outliers

The 95% cutoff we choose is subjective and arbitrary, but I prefer it for now, so let's declare these 3 to be outliers and see how it looks compared to Jake Vanderplas' outliers, which were declared in a slightly different way as points with means above 0.68.
**Note:**

- We also add a flag for the points to be investigated, which we'll use to annotate the final plot
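A sketch of declaring the flags, assuming `prop_outlier` from above; the 95% cutoff follows the text, while the 'investigate' band is an illustrative assumption:

```python
# declare outliers: flagged in at least 95% of posterior samples
dfhoggs["outlier"] = prop_outlier >= 0.95

# also flag ambiguous points for investigation (band is an arbitrary choice)
dfhoggs["investigate"] = (prop_outlier >= 0.10) & (prop_outlier < 0.95)
```

And a minimal sketch of the final comparison plot, overlaying each model's posterior-mean fit on the datapoints (assumes the traces from the earlier sketches):

```python
import arviz as az
import matplotlib.pyplot as plt
import numpy as np

xi = np.linspace(dfhoggs["x"].min(), dfhoggs["x"].max(), 100)
fig, ax = plt.subplots(figsize=(8, 6))
ax.errorbar(dfhoggs["x"], dfhoggs["y"], yerr=dfhoggs["sigma_y"], fmt="o", c="k", label="data")

# overlay the posterior-mean regression line from each model
for label, trc in [("ols", trc_ols), ("studentt", trc_studentt), ("hogg inlier", trc_hogg)]:
    post = az.extract(trc)  # stacked posterior samples
    b0 = post["b0_intercept"].mean().item()
    b1 = post["b1_slope"].mean().item()
    ax.plot(xi, b0 + b1 * xi, label=label)

# annotate the declared outliers
for pid in dfhoggs.index[dfhoggs["outlier"]]:
    ax.annotate(pid, xy=(dfhoggs.loc[pid, "x"], dfhoggs.loc[pid, "y"]))
ax.legend()
```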
**Observe:**

- The posterior predictive fit is shown for each of the three models (`ols`, `studentt`, and the Hogg inlier model) against the datapoints
- We see that the Hogg Signal vs Noise model also yields specific estimates of which datapoints are outliers, which we annotate on the plot

Overall, the Hogg Signal vs Noise model behaves as desired: we get a robust regression fit and, unlike the Student-T approach, an explicit inlier / outlier classification for each datapoint.
## References

:::{bibliography}
:filter: docname in docnames
:::
## Authors

- Originally authored using `pm.Normal.dist().logp()` and `pm.Potential()`
- Updated: simplified the prior on `nu` in the StudentT model to be more efficient, dropped explicit use of theano shared vars, and generally improved plotting / explanations / layout
- Updated: use `arviz.InferenceData` objects, running on pymc=3.11, arviz=0.11.0
- Updated to use `az.extract` by Benjamin Vincent in February, 2023 (pymc-examples#522)

:::{include} ../page_footer.md
:::