(gp-birthdays)=
:::{post} January, 2024
:tags: gaussian process, hilbert space approximation
:category: intermediate, how-to
:author: Juan Orduz
:::
This notebook provides an example of using the Hilbert Space Gaussian Process (HSGP) technique, introduced in {cite:p}`solin2020Hilbert`, in the context of time series modeling. This technique has proven successful in speeding up models with Gaussian process components.
To illustrate the main concepts, we use the classic birthdays example dataset (see {cite:p}`gelman2013bayesian` [Chapter 21] and here for a comment on the data source) and reproduce one of the models presented in the excellent case study {cite:p}`vehtari2022Birthdays` by Aki Vehtari (you can find the Stan code in this repository). In his exposition, the author presents an extensive iterative approach to analyzing the relative number of births per day in the USA from 1969 to 1988, using HSGPs for various components: the long-term trend, seasonal, weekly, day-of-year, and special floating-day variation. As this resource is very detailed and gives many relevant explanations, we do not reproduce the whole process but instead focus on reproducing one of the intermediate models; namely, the model with a slow trend, yearly seasonal trend, and day-of-week components (Model 3 in the original case study). The reason for reproducing a simpler model than the final one is to make this an introductory notebook for users wanting to learn about this technique. We will provide a subsequent example where we implement the final model with all components.
In this notebook, we do not want to deep-dive into the mathematical details but rather focus on the implementation and on how to use PyMC's {class}`~pymc.gp.HSGP` and {class}`~pymc.gp.HSGPPeriodic` API. These classes provide a convenient way of using HSGPs in PyMC models. The user needs to input certain parameters that control the number of terms in the approximation and the domain of definition. Of course, understanding what these parameters do is important, so let's briefly touch upon the main idea of the approximation and the most relevant parameters:
Recall that the kernel (covariance function) is the main ingredient of a Gaussian process, as it encodes a measure of similarity (and smoothness) between points (see {ref}`GP-MeansAndCovs`). The idea of the Hilbert space approximation is to decompose such a kernel as a linear combination of an orthonormal basis, so that when we replace the kernel by this (truncated) expansion we can fit a linear model in terms of these basis functions. Sampling from a truncated expansion is much faster than the vanilla Gaussian process formulation. The key observation is that the basis functions in the approximation do not depend on the hyperparameters of the covariance function of the Gaussian process, which allows the computations to speed up tremendously.
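Schematically (following {cite:p}`solin2020Hilbert`), for a stationary kernel $k$ with spectral density $S_\theta$, where $\theta$ denotes the kernel hyperparameters, the approximation reads

$$
k(x, x') \approx \sum_{j=1}^{m} S_\theta\left(\sqrt{\lambda_j}\right) \phi_j(x) \phi_j(x'),
\qquad
f(x) \approx \sum_{j=1}^{m} \sqrt{S_\theta\left(\sqrt{\lambda_j}\right)}\, \phi_j(x)\, \beta_j,
\quad \beta_j \sim \mathcal{N}(0, 1),
$$

where $\lambda_j$ and $\phi_j$ are the eigenvalues and eigenfunctions of the Laplace operator on $[-L, L]$. Crucially, these depend only on the domain, not on $\theta$.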
Where does the Hilbert space come from? It turns out that the orthonormal basis comes from the spectral decomposition of the Laplace operator on a compact set (think of the Fourier decomposition on the circle, for example). In other words, the basis functions are eigenfunctions of the Laplace operator on the space of square-integrable functions $L^{2}([-L, L])$, which is a Hilbert space. Returning to the class {class}`~pymc.gp.HSGP`, the two most important parameters are:

- `m`: the number of basis functions (terms) used in the approximation.
- `L`: the boundary of the domain $[-L, L]$ on which the approximation is valid.
Alternatively, one can use a proportional extension factor $c > 0$ to construct $L$ from the domain of definition of the Gaussian process inputs $X$. Concretely, $L$ can be specified as the product $cS$, where $S = \max|X|$.
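As a minimal sketch of how these parameters enter the API (the data, kernel, and priors below are purely illustrative and not the ones used later in this notebook):

```python
import numpy as np
import pymc as pm

# toy one-dimensional input; the GP API expects a 2D (column) array
X = np.linspace(0, 10, 200)[:, None]

with pm.Model():
    ls = pm.LogNormal("ls", mu=0.0, sigma=1.0)  # kernel length scale
    cov_func = pm.gp.cov.ExpQuad(input_dim=1, ls=ls)

    # m: number of basis functions per active dimension,
    # c: proportional extension factor, so that L = c * max|X| (after centering X)
    gp = pm.gp.HSGP(m=[25], c=1.5, cov_func=cov_func)
    f = gp.prior("f", X=X)
```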
We recommend the paper {cite:p}`riutort2022PracticalHilbertSpaceApproximate` for a practical discussion of this technique.
:::{include} ../extra_installs.md
:::
We read the data from the repository Bayesian workflow book - Birthdays by Aki Vehtari.
The data set contains the number of births per day in the USA over the period 1969-1988. All the columns are self-explanatory except for `day_of_year2`, which is the day of the year (from 1 to 366), with the leap day being 60 and March 1st being 61 also in non-leap years.
First, we look into the births distribution:
We create a couple of features:

- `datestamp`: date stamp of each observation.
- `births_relative100`: the number of births relative to $100$.
- `time`: data index.
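As a rough sketch of this step, assuming the raw data was loaded into a data frame called `raw_df` in the previous step (the notebook's actual code may differ slightly):

```python
import pandas as pd

data_df = raw_df.copy()
# date stamp assembled from the year, month and day columns
data_df["datestamp"] = pd.to_datetime(data_df[["year", "month", "day"]])
# births relative to the mean number of births (average scaled to 100)
data_df["births_relative100"] = 100 * data_df["births"] / data_df["births"].mean()
# integer index used as the time variable
data_df["time"] = data_df.index
```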
Now, let's look into the development over time of the relative births, which is the target variable we will model.

We see a clear long-term trend component and a clear yearly seasonality. We also see that the variance grows with time; this is known as heteroscedasticity.
The plot above has many data points, and we want to make sure we understand the seasonal patterns at different levels (which might be hidden in the plot above). Hence, we systematically check seasonality at various levels.
Let's continue looking by averaging over the day of the year:
Overall, we see a relatively smooth behavior, with the exception of certain holidays (Memorial Day, Thanksgiving, and Labor Day) and New Year's Day.
Next, we split by month and year to see if we spot any changes in the pattern over time.
Besides the global trend, we do not see any clear differences between months.
We continue looking into the weekly seasonality.
It seems that there are, on average, fewer births during the weekend.
After having a better understanding of the data and the patterns we want to capture with the model, we can proceed to pre-process the data.
We want to work on the normalized log scale of the relative births. The reason is twofold: the scaling makes it easier to set up priors (scaled space), and the log transform reduces the heteroscedasticity.
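A minimal sketch of this transformation, and of the inverse transform we will need later to go back to the original scale (variable names are illustrative):

```python
import numpy as np

# log transform to reduce heteroscedasticity, then standardize
log_births = np.log(data_df["births_relative100"])
births_mean, births_std = log_births.mean(), log_births.std()
normalized_log_births = (log_births - births_mean) / births_std


def to_original_scale(x):
    """Inverse transform: from the normalized log scale back to relative births."""
    return np.exp(x * births_std + births_mean)
```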
In this example notebook, we implement Model 3, slow trend + yearly seasonal trend + day of the week, described in {cite:p}`vehtari2022Birthdays`. The EDA above should help us understand the motivation behind each of the model components: a slow long-term trend, a yearly seasonal component, and a day-of-week effect.
Note that, since we work on the normalized time axis, the period of the yearly seasonal component must also be expressed on that scale, i.e. `period=365.25 / time_std` (and not `period=365.25`!).

For all of the Gaussian process components we use the Hilbert Space Gaussian Process (HSGP) approximation.
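For instance, the yearly seasonal component could be set up along the following lines with {class}`~pymc.gp.HSGPPeriodic`; the priors and the number of basis functions here are illustrative, not necessarily those used in the model below:

```python
import numpy as np
import pymc as pm

time = data_df["time"].to_numpy()
time_std = time.std()
time_normalized = (time - time.mean()) / time_std

with pm.Model():
    amplitude_year = pm.HalfNormal("amplitude_year", sigma=1.0)
    ls_year = pm.LogNormal("ls_year", mu=0.0, sigma=1.0)

    # the one-year period must be expressed on the normalized time axis
    cov_year = pm.gp.cov.Periodic(input_dim=1, period=365.25 / time_std, ls=ls_year)

    gp_year = pm.gp.HSGPPeriodic(m=20, scale=amplitude_year, cov_func=cov_year)
    f_year = gp_year.prior("f_year", X=time_normalized[:, None])
```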
Most of the priors are not very informative. The only tricky part is to keep in mind that we are working on the normalized log scale of the relative births data. For example, for the global trend we use a Gaussian process with an exponentiated quadratic kernel, and we use the following prior for the length scale:
The motivation is that we have around $7.3$K data points and we want to express distances between data points on the normalized time scale; that is why we consider the ratio `7_000 / time_std`. Note that we want to capture the long-term trend, so we want a length scale that is much larger than the distance between data points; we then change the order of magnitude by dividing by $10$, obtaining `700 / time_std`. Finally, since a {class}`~pymc.distributions.continuous.LogNormal` distribution has positive support and is a common choice for length scales, we take a log-transform of the resulting quantity `700 / time_std` to ensure that the mean of the prior is close to this value.
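Concretely, this reasoning translates into a prior along these lines (with `time_std` as in the sketch above; the `sigma` value is an illustrative choice rather than the notebook's literal code):

```python
import numpy as np
import pymc as pm

with pm.Model():
    # length scale of the long-term trend on the normalized time axis
    ls_trend = pm.LogNormal("ls_trend", mu=np.log(700 / time_std), sigma=1.0)
```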
We now specify the model in PyMC.
We run prior predictive checks to see if the model is able to generate data on a scale similar to the observed data.
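Prior predictive samples can be drawn with the standard PyMC workflow; for example, assuming the model above is stored in a variable called `model`:

```python
with model:
    prior_predictive = pm.sample_prior_predictive(random_seed=42)
```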
This looks good, as the prior samples lie within a plausible range of the observed data.
We now proceed to fit the model using the NumPyro sampler. It takes around $5$ minutes to run the model locally (Intel MacBook Pro, $4$ cores, $16$ GB RAM).
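Sampling with the NumPyro NUTS backend is requested directly through `pm.sample`; for example (the sampler arguments here are illustrative, and `model` again refers to the model defined above):

```python
with model:
    idata = pm.sample(
        nuts_sampler="numpyro",  # use the NumPyro NUTS implementation
        draws=1_000,
        tune=1_000,
        chains=4,
        random_seed=42,
    )
```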
We do not see any divergences or very high r-hat values:
We can also look into the trace plots.
Now we want to do a deep dive into the posterior distribution of the model and its components. We want to do this on the original scale; therefore, the first step is to transform the posterior samples back to the original scale. For that purpose, we use the following utility function (the code itself is not important).
We start by plotting the likelihood.
It looks like we are capturing the global variation. Let's look into the posterior distribution plot to get a better understanding of the model.
This does not seem very good: there is a pretty big discrepancy between the black line and the shaded blue region in the bulk of the posterior, while the tails look good. This suggests we might be missing some covariates. We explore this in a later, more complex model.
To get a better understanding of the model fit, we need to look into the individual components.
Next, we visualize each of the main components of the model. We write a utility function to do this.
If we want to combine the global trend and the yearly periodicity, we can't simply sum the components on the original scale, as we would be adding the mean term twice. Instead, we need to first sum the posterior samples and then apply the inverse transform (these operations do not commute!).
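Schematically, with the inverse transform from the pre-processing step (here called `to_original_scale`) and hypothetical posterior samples `f_trend` and `f_year` on the normalized log scale:

```python
# correct: add the components on the normalized log scale, then invert
trend_plus_year = to_original_scale(f_trend + f_year)

# incorrect: inverting each component separately and then adding would
# double-count the mean and ignore that exp(a + b) != exp(a) + exp(b)
# wrong = to_original_scale(f_trend) + to_original_scale(f_year)
```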
We hope you now have a better understanding of HSGPs and of how to use them in practice through PyMC's very convenient API. It is great to be able to strategically fold GPs into larger models; this is possible in principle with vanilla GPs, but HSGPs make it practical. The reason is that the approximation reduces the complexity of each GP component from $\mathcal{O}(n^3)$ to $\mathcal{O}(nm + m)$, where $m$ is the number of basis functions used in the approximation. This is a huge speedup!
In a future notebook, we will present a more complete model to compare with Vehtari's results. Stay tuned!
I would like to thank Alex Andorra and Bill Engels for their valuable feedback and suggestions during the writing of this notebook.
:::{bibliography}
:filter: docname in docnames
:::
:::{include} ../page_footer.md
:::