HSGP-Basic.ipynb

https://github.com/pymc-devs/pymc-examples/blob/main/examples/gaussian_processes/HSGP-Basic.ipynb

(hsgp)=

Gaussian Processes: HSGP Reference & First Steps

:::{post} June 10, 2024 :tags: gaussian process :category: reference, intermediate :author: Bill Engels, Alexandre Andorra :::

The Hilbert Space Gaussian processes approximation is a low-rank GP approximation that is particularly well-suited to usage in probabilistic programming languages like PyMC. It approximates the GP using a pre-computed and fixed set of basis functions that don't depend on the form of the covariance kernel or its hyperparameters. It's a parametric approximation, so prediction in PyMC can be done as one would with a linear model via pm.Data or pm.set_data. You don't need to define the .conditional distribution that non-parameteric GPs rely on. This makes it much easier to integrate an HSGP, instead of a GP, into your existing PyMC model. Additionally, unlike many other GP approximations, HSGPs can be used anywhere within a model and with any likelihood function.

It's also fast. The computational cost for unapproximated GPs per MCMC step is $\mathcal{O}(n^3)$, where $n$ is the number of data points. For HSGPs, it is $\mathcal{O}(mn + m)$, where $m$ is the number of basis vectors. It's important to note that sampling speeds are also very strongly determined by posterior geometry.

The HSGP approximation does carry some restrictions:

It can only be used with stationary covariance kernels such as the Matern family. The {class}~pymc.gp.HSGP class is compatible with any Covariance class that implements the power_spectral_density method. There is a special case made for the Periodic covariance, which is implemented in PyMC by The {class}~pymc.gp.HSGPPeriodic.
It does not scale well with the input dimension. The HSGP approximation is a good choice if your GP is over a one dimensional process like a time series, or a two dimensional spatial point process. It's likely not an efficient choice where the input dimension is larger than three.
It may struggle with more rapidly varying processes. If the process you're trying to model changes very quickly relative to the extent of the domain, the HSGP approximation may fail to accurately represent it. We'll show in later sections how to set the accuracy of the approximation, which involves a trade-off between the fidelity of the approximation and the computational complexity.
For smaller data sets, the full unapproximated GP may still be more efficient.

A secondary goal of this implementation is flexibility via an accessible implementation where the core computations are implemented in a modular way. For basic usage, users can use the .prior and .conditional methods and essentially treat the HSGP class as a drop in replacement for pm.gp.Latent, the unapproximated GP. More advanced users can bypass those methods and work with .prior_linearized instead, which exposes the HSGP as a parametric model. For more complex models with multiple HSGPs, users can work directly with functions like pm.gp.hsgp_approx.calc_eigenvalues and pm.gp.hsgp_approx.calc_eigenvectors.

References:

Original reference: Solin & Sarkka, 2019.
HSGPs in probabilistic programming languages: Riutort-Mayol et al., 2020.
HSTPs (Student-t process): Sellier & Dellaportas, 2023.
Kronecker HSGPs: Dan et al., 2022
PyMC's {class}~pymc.gp.HSGP API

Example 1: Basic HSGP Usage

We'll use simulated data to motivate an overview of the usage of {class}~pymc.gp.HSGP. Refer to this section if you're interested in:

Seeing a simple example of HSGP in action.
Replacing a standard GP, i.e. pm.gp.Latent, with a faster approximation -- as long as you're using one of the more common covariance kernels, like ExpQuad, Matern52 or Matern32.
Understanding when to use the centered or the non-centered parameterization.
A quick example of additive GPs

Simulate data

Define and fit the HSGP model

First we use pz.maxent to choose our prior for the lengthscale parameter, maxent return the maximum entropy prior with the specified mass within the interval [lower, upper].

We use a Lognormal to penalize very small lengthscales while having a heavy right tail. When the signal from the GP is high relative to the noise, we are able to use more informative priors.

There are a few things to note about the model code below:

The approximation parameters m and c control the approximation fidelity to computational complexity tradeoff. We'll see in a later section how to choose these values. In short, choosing a larger m helps improve the approximation of smaller lengthscales and other short distance variations that the GP has to fit. Choosing a larger c helps improve the approximation of longer and slower changes.
We chose the centered parameterization because the true underlying GP is strongly informed by the data. You can read more about centered vs. non-centered here and here. In the HSGP class, the default is non-centered, which works better for the, arguably more common, case where the underlying GP is weakly informed by the observed data.

Fitting went all good, so we can go ahead and plot the inferred GP, as well as the posterior predictive samples.

Posterior predictive plot

The inferred underlying GP (in bordeaux) accurately matches the true underlying GP (in yellow). We also see that the posterior predictive samples (in light blue) fit the observed data really well.

{admonition}

Choosing the HSGP approximation parameters, `m`, `L`, and `c`

Before fitting a model with an HSGP, you have to choose m and c or L. m is the number of basis vectors. Recall that the computational complexity of the HSGP approximation is $\mathcal{O}(mn + m)$, where $n$ is the number of data points.

This choice is a balance between three concerns:

The accuracy of the approximation.
Reducing the computational burden.
The X locations where predictions or forecasts will need to be made.

At the end of this section, we'll give the rules of thumb given in Ruitort-Mayol et. al.. The best way to understand how to choose these parameters is to understand how m, c and L relate to each other, which requires understanding a bit more about how the approximation works under the hood.

How `L` and `c` affect the basis

Speaking non-technically, the HSGP approximates the GP prior as a linear combination of sinusoids. The coefficients of the linear combination are IID normal random variables whose standard deviation depends on GP hyperparameters (which are an amplitude and lengthscale for the Matern family). Users are who are interested in further introductory details should see this fantastic blog post by Juan Ordiz.

To see this, we'll make a few plots of the $m=3$ and $m=5$ basis vectors and pay careful attention to how they behave at the boundaries of the domain. Note that we have to center the x data first, and then choose L in relation to the centered data. It's worth mentioning here that the basis vectors we're plotting do not depend on either the choice of the covariance kernel or on any unknown parameters the covariance function has.

The first and middle panels have 3 basis functions, and the rightmost has 5.

Notice that both L and m are specified as lists, to allow setting L and m per input dimension. In this example these are both one element lists since our example is in a one dimensional, time series like context. Before continuing, it's helpful to define $S$ as the half range of the centered data, or the distance from the midpoint at $x=0$ to the edge, $x=5$. In this example $S=5$ for each plot panel. Then, we can define $c$ such that it relates $S$ to $L$,

L = c \cdot S \,.

It's usually easier to set $L$ by choosing $c$, which acts as a multiplier on $S$.

In the left-most plot we chose $L=S=5$, which is exactly on the edge of our x locations. For any $m$, all the basis vectors are forced to pinch to zero at the edges, at $x=-5$ and $x=5$. This means that the HSGP approximation becomes poor as you get closer to $x=-5$ and $x=5$. How quickly depends on the lengthscale. Large lengthscales require larger values of $L$ and $c$, and smaller lengthscales attenuate this issue. Ruitort-Mayol et. al. recommend using 1.2 as a minimum value. The effect of this choice on the basis vectors is shown in the center panel. In particular, we can now see that the basis vectors are not forced to pinch to zero.

The right panel shows the effect of choosing a larger $L$, or setting $c=4$. Larger values of $L$ or $c$ make the boundary conditions less problematic, and are required to accurately approximate GPs with longer lengthscales. You also need to consider where predictions will need to be made. In addition to the locations of the observed $x$ values, the locations of the new $x$ locations also need to be away from the "pinch" caused by the boundary condition. The period of the basis functions also increases as we increase $L$ or $c$. This means that we will need to increase $m$, the number of basis vectors, in order to compensate if we wish to approximate GPs with smaller lengthscales.

With large $L$ or $c$, the first eigenvector can flatten so much that it becomes partially or completely unidentifiable with the intercept in the model. The right-most panel is an example of this (see the blue basis vector). It can be very beneficial to sampling to drop the first eigenvector in these situations. The HSGP and HSGPPeriodic class in PyMC both have the option drop_first to do this, or if you're using .prior_linearized you can control this yourself. Be sure to check the basis vectors if the sampler is having issues.

To summarize:

Increasing $m$ helps the HSGP approximate GPs with smaller lengthscales, at the cost of increasing computational complexity.
Increasing $c$ or $L$ helps the HSGP approximate GPs with larger lengthscales, but may require increasing $m$ to compensate for the loss of fidelity at smaller lengthscales.
When choosing $m$, $c$ or $L$, it's important to consider the locations where you will need to make predictions, such that they also aren't affected by the boundary condition.
The first basis vector may be unidentified with the intercept, especially when $L$ or $c$ are larger.

Heuristics for choosing $m$ and $c$

In practice, you'll need to infer the lengthscale from the data, so the HSGP needs to approximate a GP across a range of lengthscales that are representative of your chosen prior. You'll need to choose $c$ large enough to handle the largest lengthscales you might fit, and also choose $m$ large enough to accommodate the smallest lengthscales.

Ruitort-Mayol et. al. give some handy heuristics for the range of lengthscales that are accurately reproduced for given values of $m$ and $c$. Below, we provide a function that uses their heuristics to recommend minimum $m$ and $c$ value. Note that these recommendations are based on a one-dimensional GP.

For example, if you're using the Matern52 covariance and your data ranges from $x=-5$ to $x=95$, and the bulk of your lengthscale prior is between $\ell=1$ and $\ell=50$, then the smallest recommended values are $m=543$ and $c=4.1$, as you can see below:

The HSGP approximate Gram matrix

You may not be able to rely on these heuristics for a few reasons. You may be using a different covariance function than ExpQuad, Matern52, or Matern32. Also, they're only defined for one dimensional GPs. Another way to check HSGP fidelity is to directly compare the unapproximated Gram matrix (the Gram matrix is the matrix obtained after calculating the covariance function over the inputs X), $\mathbf{K}$, to the one resulting from the HSGP approximation,

\tilde{\mathbf{K}} = \Phi \Delta \Phi^T \,,

where $\Phi$ is the matrix of eigenvectors we use as the basis (plotted previously), and $\Delta$ has the spectral densities computed at the eigenvalues down the diagonal. Below we show an example with a two dimensional grid of input X. It's important to notice that the HSGP approximation requires us to center the input X data, which is done by converting X to Xs in the code below. We plot the approximate Gram matrix for varying $L$ and $c$ values, to see when the approximation starts to degrade for the given X locations and lengthscale choices.

The plots above compare the approximate Gram matrices to the unapproximated Gram matrix in the top left panel. The goal is to compare the approximated Gram matrices to the true one (upper left). Qualitatively, the more similar they look the better the approximation. Also, these results are only relevant to the context of the particular domain defined by X and the chosen lengthscale, $\ell = 3$ -- just because it looks good for $\ell = 3$ doesn't mean it will look good for, for instance, $\ell = 10$.

We can make a few observations:

The approximation visually looks good for the two panels with $m = 15$ or $m = 30$, and with $c=5.0$. The rest show clear differences to the unapproximated Gram matrix.
$c=1.2$ is generally too small, regardless of $m$.
Perhaps surprisingly, the $m=[2, 2]$, $c=1.2$ approximation looks better than the $m=[2, 2]$, $c=5$ one. As we showed earlier, when we "stretch" the eigenvector basis to fill a larger domain than our X (larger by the multiple $c$), we can lose fidelity at smaller lengthscales. In other words, in the second case. $m$ is too small for the value of $c$. That's why the first option looks better.
The second row ($c=1.2$) doesn't really improve as $m$ increases. That's because $m$ is good enough to capture the smaller lengthscales, but $c$ is always too small to capture the larger lengthscales.
The first row on the other hand shows that $c=5$ is good enough for the larger lengthscales, and once we hit $m=15$ we're also able to capture the smaller ones.

For your particular situation, you will need to experiment across your range of lengthscales and quantify how much approximation error is acceptable. Often, when prototyping a model, you can use a lower fidelity HSGP approximation for faster sampling. Then, once you understand the range of relevant lengthscales, you can dial in the correct $m$ and $L$ (or $c$) values.

Be aware that it's also possible to encounter scenarios where a low fidelity HSGP approximation gives a more parsimonious fit than a high fidelity HSGP approximation. A low fidelity HSGP approximation is still a valid prior for some unknown function, if somewhat contrived. Whether that matters will depend on your context.

:::{dropdown} Avoiding underflow issues :icon: alert-fill :color: info As noted above, the diagonal matrix $\Delta$ used in the calculation of the approximate Gram matrix contains information on the power spectral density, $\mathcal{S}$, of a given kernel. Thus, for the Gram matrix to be defined, $\mathcal{S} > 0$. Consequently, when picking HSGP hyperparameters $m$ and $L$ it is important to check $\mathcal{S} > 0$ for the suggested $m$ and $L$ values. The code in the next few cell compares the suitability of the suggested hyperparameters $m$ and $L$ for matern52 to that of ExpQuad for the data spanning $x=-5$ to $x=95$, with the lengthscale prior between $\ell=1$ and $\ell=50$. As we shall see, the suggested hyperparameters for ExpQuad are for not suitable for $\ell=50$.

Matern $\mathbf{\nu = 5/2}$

pycon

Squared exponential

pycon

We see that not all values of $\mathcal{S}$ are defined for the squared exponential kernel when $\ell=50$.

To see why, the covariance of the kernels considered are plotted below along with their power spectral densities in log space. The covariance plot shows that for a set $\ell$, the tails of matern52 are heavier than ExpQuad, while a higher $\ell$ for a given kernel type gives rise to higher covariance. The power spectral density is inversely proportional to the covariance - essentially the flatter the shape of the covariance function, the narrower the bandwidth and the lower the power spectral density at higher values of $\omega$. As a result, we see that for ExpQuad with $\ell = 50$, $\mathcal{S}\left(\omega\right)$ rapidly decreases towards $0$ before the domain of $\omega$ is exhausted, and hence we reach values at which we underflow to $0$.

python

[Image blocked: alt text] These underflow issues can arise when using a broad prior on $\ell$ as you need an $m$ large enough to cover small lengthscales, but these may cause underflow in $\mathcal{S}$ when $\ell$ is large. As the graphs above suggest, one can consider a different kernel with heavier tails such as matern52 or matern32.

Alternatively, if you are certain you need a specific kernel, you can use the linear form of HSGPs (see below) with a boolean mask. In doing so, the sinusoids with vanishingly small coefficients in the linear combination are effectively screened out. E.g:

pycon

:::

Example 2: Working with HSGPs as a parametric, linear model

One of the main benefits of the HSGP approximation is the ability to integrate it into existing models, especially if you need to do prediction in new x-locations after sampling. Unlike other GP implementations in PyMC, you can bypass the .prior and .conditional API, and instead use HSGP.prior_linearized, which allows you to use pm.Data and pm.set_data for making predictions.

Refer to this section if you're interested in:

Seeing a two dimensional, or spatial, HSGP example with other predictors in the model.
Using HSGPs for prediction within larger PyMC models.
Convert your HSGP approximation into an HSTP approximation, or an approximation to a TP, or Student-t process.

Data generation

As expected, we clearly see that the test set is in the region where $2 < x1 < 4$.

Here is the model structure corresponding to our generative scenario. Below we describe its main components.

Model structure

Before sampling and looking at the results, there are a few things to pay attention to in the model above.

Setting the coefficients, centered and non-centered

First, prior_linearized returns the eigenvector basis, phi, and the square root of the power spectrum at the eigenvalues, sqrt_psd. You have to construct the HSGP approximation from these. The following are the relevant lines of code, showing both the centered and non-centered parameterization.

python

Be sure to set the size of basis_coeffs using the n_basis_vectors attribute of the HSGP object (or the number of columns of phi), $m^* = \prod_i m_i$. In the above example, $m^* = 30 \cdot 30 = 900$, and is the total number of basis vectors used in the approximation.

Approximating a TP instead of a GP

We can slightly modify the code above to obtain a Student-t process,

python

where we use a $\text{Gamma}(\alpha=2, \beta=0.1)$ prior for $\nu$, which places around 50% probability that $\nu > 30$, the point where a Student-T roughly becomes indistinguishable from a Gaussian. See this link for more information.

Results

Now, let's sample the model and quickly check the results:

Sampling went great, but, interestingly, we seem to have a bias in the posterior for sigma. It's not the focus of this notebook, but it'd be interesting to dive into this in a real use-case.

Out-of-sample predictions

Then, we can just use pm.set_data to make predictions at new points. We'll show the fit and the predictions together in the plot below.

Sampling diagnostics all look good, and we can see that the underlying GP was inferred nicely. We can also see the increase in uncertainty outside of our training data as a horizontal stripe in the middle panel, showing the increased standard deviation of the inferred GP here.

Authors

Created by Bill Engels and Alexandre Andorra in 2024 (pymc-examples#647)
Use pz.maxent instead of pm.find_constrained_prior, and add random seed. Osvaldo Martin. August 2024
Use pm.gp.util.stabilize in simulate_1d. Use pz.maxent rather than pm.find_constrained_prior in linearized HSGP model. Added comparison between matern52 and ExpQuad power spectral densities. Alexander Armstrong, July-August 2025.

Watermark

:::{include} ../page_footer.md :::

HSGP-Basic.ipynb

Gaussian Processes: HSGP Reference & First Steps

References:

Example 1: Basic HSGP Usage

Simulate data

Define and fit the HSGP model

Posterior predictive plot

Choosing the HSGP approximation parameters, m, L, and c

How L and c affect the basis

Heuristics for choosing $m$ and $c$

The HSGP approximate Gram matrix

Example 2: Working with HSGPs as a parametric, linear model

Data generation

Model structure

Setting the coefficients, centered and non-centered

Approximating a TP instead of a GP

Results

Out-of-sample predictions

Authors

Watermark

Choosing the HSGP approximation parameters, `m`, `L`, and `c`

How `L` and `c` affect the basis