(hsgp-advanced)=

Gaussian Processes: HSGP Advanced Usage

:::{post} June 28, 2024
:tags: gaussian process
:category: reference, intermediate
:author: Bill Engels, Alexandre Andorra, Maxim Kochurov
:::

The Hilbert Space Gaussian processes approximation is a low-rank GP approximation that is particularly well-suited to usage in probabilistic programming languages like PyMC. It approximates the GP using a pre-computed and fixed set of basis functions that don't depend on the form of the covariance kernel or its hyperparameters. It's a parametric approximation, so prediction in PyMC can be done as one would with a linear model via `pm.Data` or `pm.set_data`. You don't need to define the `.conditional` distribution that non-parametric GPs rely on. This makes it much easier to integrate an HSGP, instead of a GP, into your existing PyMC model. Additionally, unlike many other GP approximations, HSGPs can be used anywhere within a model and with any likelihood function.

It's also fast. The computational cost for unapproximated GPs per MCMC step is $\mathcal{O}(n^3)$, where $n$ is the number of data points. For HSGPs, it is $\mathcal{O}(mn + m)$, where $m$ is the number of basis vectors. It's important to note that sampling speeds are also very strongly determined by posterior geometry.

The HSGP approximation does carry some restrictions:

  1. It can only be used with stationary covariance kernels such as the Matern family. The {class}`~pymc.gp.HSGP` class is compatible with any `Covariance` class that implements the `power_spectral_density` method. There is a special case made for the `Periodic` covariance, which is implemented in PyMC by {class}`~pymc.gp.HSGPPeriodic`.
  2. It does not scale well with the input dimension. The HSGP approximation is a good choice if your GP is over a one dimensional process like a time series, or a two dimensional spatial point process. It's likely not an efficient choice where the input dimension is larger than three.
  3. It may struggle with more rapidly varying processes. If the process you're trying to model changes very quickly relative to the extent of the domain, the HSGP approximation may fail to accurately represent it. We'll show in later sections how to set the accuracy of the approximation, which involves a trade-off between the fidelity of the approximation and the computational complexity.
  4. For smaller data sets, the full unapproximated GP may still be more efficient.

A secondary goal of this implementation is flexibility via an accessible implementation where the core computations are implemented in a modular way. For basic usage, users can use the `.prior` and `.conditional` methods and essentially treat the HSGP class as a drop-in replacement for `pm.gp.Latent`, the unapproximated GP. More advanced users can bypass those methods and work with `.prior_linearized` instead, which exposes the HSGP as a parametric model. For more complex models with multiple HSGPs, users can work directly with functions like `pm.gp.hsgp_approx.calc_eigenvalues` and `pm.gp.hsgp_approx.calc_eigenvectors`.
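As a quick illustration of that middle level, here is a minimal sketch of how `.prior_linearized` exposes the approximation as a linear model. The priors, the number of basis functions `m`, and the boundary factor `c` below are placeholder choices, not recommendations:

```python
import numpy as np
import pymc as pm

X = np.linspace(0, 10, 100)[:, None]  # hypothetical 1D inputs

with pm.Model() as model:
    ell = pm.InverseGamma("ell", mu=2.0, sigma=1.0)
    eta = pm.Exponential("eta", lam=1.0)
    cov_func = eta**2 * pm.gp.cov.Matern52(1, ls=ell)

    gp = pm.gp.HSGP(m=[200], c=1.5, cov_func=cov_func)
    # phi: fixed basis functions; sqrt_psd: square root of the power spectral
    # density, which is where the covariance hyperparameters enter
    phi, sqrt_psd = gp.prior_linearized(X)

    beta = pm.Normal("beta", size=200)  # one coefficient per basis function
    f = pm.Deterministic("f", phi @ (beta * sqrt_psd))
```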


Example 1: A hierarchical HSGP, a more custom model


The {class}`~pymc.gp.HSGP` class and associated functions are also meant to be clear and hackable to enable building more complicated models. In the following example we fit a hierarchical HSGP, where each of the individual GPs (indexed by $i$) can have different lengthscales. The model is:

$$
\begin{align}
f^\mu &\sim \mathcal{GP}\left(0 \,, K^\mu(x, x' \,; \eta^\mu, \ell^\mu) \right) \\
f_i &\sim \mathcal{GP}\left(f^\mu \,, K^\delta(x, x' \,; \eta^\delta, \ell^\delta_i) \right) \\
\end{align}
$$

There are two scale parameters $\eta^\mu$ and $\eta^\delta$. $\eta^\mu$ controls the overall scaling of the group GP, and $\eta^\delta$ controls the strength of the partial pooling of the $f_i$ to $f^\mu$. Each of the $i$ GPs can have its own lengthscale $\ell^\delta_i$. In the example below we simulate additive Gaussian noise, but this HSGP model will of course work with any likelihood anywhere within your model.

Refer to this section if you're interested in:

  1. Seeing an example of a fast approximation to a Hierarchical GP.
  2. Seeing how to construct more advanced and custom GP models.
  3. Using HSGPs for prediction within larger PyMC models.

Simulate data

Let's simulate a one-dimensional GP observed at 300 locations (200 used for training, the remaining 100 for testing), across the range from 0 to 15. You'll see there is a lot going on in the code below, so let's break down the gist of what's happening.

Defining the Mean GP

  • Long-Term Trend GP: A GP with a Matérn covariance function, characterized by a larger length scale (`ell_mu_trend_true = 10`), models the long-term linear trend in the data. The magnitude of variability in this trend is controlled by `eta_mu_trend_true`, which is large relative to the other components, making this trend important in the data-generating process.

  • Short-Term Variations GP: Another GP, also using a Matérn covariance function but with a shorter length scale (`ell_mu_short_true = 1.5`), captures more rapid fluctuations in the data. Its amplitude is controlled by `eta_mu_short_true`.

  • The overall mean GP (`cov_mu`) is the sum of these two GPs, combining long-term trends and short-term variations.
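To make that concrete, the true covariance of the mean GP could be built like this. This is a sketch: the `eta_*` amplitudes are placeholder values, not the ones used in the original notebook:

```python
import pymc as pm

ell_mu_trend_true = 10.0  # long-term trend length scale
ell_mu_short_true = 1.5   # short-term variation length scale
eta_mu_trend_true = 2.0   # hypothetical amplitude, larger than the other components
eta_mu_short_true = 1.0   # hypothetical amplitude

cov_mu_trend = eta_mu_trend_true**2 * pm.gp.cov.Matern52(1, ls=ell_mu_trend_true)
cov_mu_short = eta_mu_short_true**2 * pm.gp.cov.Matern52(1, ls=ell_mu_short_true)

# The mean GP's covariance is the sum of the two components
cov_mu = cov_mu_trend + cov_mu_short
```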

Delta GPs for Hierarchical Modeling

We define several (10 in this case) delta GPs, each intended to capture different deviations from the mean GP. These are characterized by a length scale drawn from a log-normal distribution centered at the length scale of the short-term mean GP, `ell_mu_short_true`.

The amount of diversity between the delta GPs is controlled by `eta_delta_true`: the bigger it is, the more the delta GPs differ from each other -- somewhat like the sigma parameter in a hierarchical model (see {ref}`A Primer on Bayesian Methods for Multilevel Modeling <multilevel_modeling>`).
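A sketch of how these delta covariances might be simulated; the random seed, log-normal spread, and `eta_delta_true` value are placeholders:

```python
import numpy as np

n_gps = 10
rng = np.random.default_rng(42)

# Length scales drawn around the short-term mean GP's length scale
ell_delta_true = rng.lognormal(mean=np.log(ell_mu_short_true), sigma=0.5, size=n_gps)

# Hypothetical value; controls how much the delta GPs differ from each other
eta_delta_true = 0.5
cov_deltas = [
    eta_delta_true**2 * pm.gp.cov.Matern52(1, ls=ell) for ell in ell_delta_true
]
```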

Helper function

Now we can define a function to generate observations from this data-generating structure. `generate_gp_samples` generates samples from the mean GP, adds contributions from each delta GP, and incorporates noise, producing a set of observations that reflect both underlying processes and observational noise.

This function is used to generate both the full set of GP realizations (`f_mu_true_full`, `f_true_full`) and the observed data (`y_full`).
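One way such a helper might look (a sketch, not the notebook's exact implementation; the `noise_sigma` argument name is an assumption):

```python
def generate_gp_samples(x, cov_mu, cov_deltas, noise_sigma, rng):
    """Draw the mean GP, the individual GPs and noisy observations at locations x."""
    n = len(x)

    # One draw from the mean GP
    K_mu = cov_mu(x[:, None]).eval()
    f_mu = rng.multivariate_normal(np.zeros(n), K_mu)

    # Each individual GP = mean GP + a draw from its own delta GP
    f = np.stack(
        [
            f_mu + rng.multivariate_normal(np.zeros(n), cov_d(x[:, None]).eval())
            for cov_d in cov_deltas
        ],
        axis=1,
    )

    # Add observational noise
    y = f + rng.normal(0.0, noise_sigma, size=f.shape)
    return f_mu, f, y
```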

Generate samples for the full data

Now we can call the function and generate data! The sampled data (both the underlying GP realizations and the noisy observations) are split according to the earlier defined training and testing segments. This setup allows for the evaluation of model predictions against unseen data, mimicking real-world scenarios where models are trained on a subset of available data.
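For example, following the setup described above (300 points on [0, 15], first 200 for training; the noise level is a placeholder):

```python
x_full = np.linspace(0, 15, 300)
f_mu_true_full, f_true_full, y_full = generate_gp_samples(
    x_full, cov_mu, cov_deltas, noise_sigma=0.5, rng=rng
)

# First 200 points for training, the remaining 100 held out for testing
n_train = 200
x_train, x_test = x_full[:n_train], x_full[n_train:]
y_train, y_test = y_full[:n_train], y_full[n_train:]
```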

Visualize generated data

Build the model

To allow different lengthscales per GP, we need to rewrite the power spectral density. The one attached to the PyMC covariance classes, i.e. `pm.gp.cov.Matern52.power_spectral_density`, is vectorized over the input dimension, but we need one vectorized across GPs.

Fortunately, this one at least isn't too hard to adapt:

Adapting the power spectral density
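Here is a sketch of what this adaptation can look like for 1D inputs, following the Matérn 5/2 spectral density used by `pm.gp.cov.Matern52.power_spectral_density`; the function name and the choice to leave the `eta**2` amplitude out are ours:

```python
import pytensor.tensor as pt


def matern52_psd_1d(omega, ls):
    """Matérn 5/2 power spectral density for 1D inputs, without the eta**2 amplitude.

    omega: (m,) frequencies of the HSGP basis
    ls:    scalar length scale, or a vector of length scales (one per GP)
    Returns shape (m,) for a scalar ls, and (m, n_gps) for a vector ls.
    """
    ls = pt.as_tensor_variable(ls)
    c = 16.0 / 3.0 * pt.sqrt(5.0) ** 5  # normalizing constant for 1D inputs
    omega2 = pt.square(omega)[:, None] if ls.ndim > 0 else pt.square(omega)
    return c * ls / (5.0 + omega2 * pt.square(ls)) ** 3
```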

Next, we build a function that constructs the hierarchical GP. Notice that it assumes some names for the dims, but our goal is to provide a simple foundation that you can adapt to your specific use case. You can see that this is a bit more deconstructed than `.prior_linearized`.

Coding the hierarchical GP
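Here is one possible sketch of such a function, built directly from `calc_eigenvalues` and `calc_eigenvectors` plus the spectral density above. The dim names `"basis"` and `"gp"` and the exact parametrization are our assumptions, not necessarily the notebook's original code:

```python
import pymc as pm
import pytensor.tensor as pt
from pymc.gp.hsgp_approx import calc_eigenvalues, calc_eigenvectors


def hierarchical_HSGP(Xs, L, m, eta_mu, ell_mu_trend, ell_mu_short, eta_delta, ell_delta):
    """Mean GP (trend + short-term) plus partially pooled offset GPs.

    Xs: (n, 1) 0-centered inputs; L: boundary (e.g. from set_boundary); m: e.g. [200].
    Assumes the model defines dims named "basis" (length m[0]) and "gp" (n_gps).
    """
    eigvals = calc_eigenvalues(L, m)            # (m, 1)
    omega = pt.sqrt(eigvals[:, 0])              # frequencies of the basis functions
    phi = calc_eigenvectors(Xs, L, eigvals, m)  # (n, m) fixed basis

    # Mean GP: the spectral densities of the two additive components simply add
    psd_mu = eta_mu**2 * (
        matern52_psd_1d(omega, ell_mu_trend) + matern52_psd_1d(omega, ell_mu_short)
    )
    beta_mu = pm.Normal("beta_mu", dims="basis")
    f_mu = phi @ (beta_mu * pt.sqrt(psd_mu))

    # Offset GPs: one length scale per GP -> per-GP spectral densities, shape (m, n_gps)
    psd_delta = matern52_psd_1d(omega, ell_delta)
    beta_delta = pm.Normal("beta_delta", dims=("basis", "gp"))
    f_delta = phi @ (beta_delta * eta_delta * pt.sqrt(psd_delta))

    # Each individual GP = mean GP + its own offset
    return f_mu[:, None] + f_delta
```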

One of the added complexities is modeling the additive GPs for the mean GP (long term trend + short term variation). The cool thing is that HSGP is compatible with additive covariances, meaning that we don't have to define two completely independent HSGPs.

Instead, we can sum the two independent power spectral densities, and then create a single GP from the combined power spectral densities. This reduces the number of unknown parameters because the two GPs can share the same basis vectors and basis coefficients.

Essentially, this amounts to creating two independent covariance functions, and just adding them before defining the HSGP object -- instead of defining two independent HSGP objects.

If we were able to use the high-level {class}`~pymc.gp.HSGP` class, the code for this would look like:

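(A sketch reconstructing that code: the priors `eta_mu_trend`, `ell_mu_trend`, `eta_mu_short`, and `ell_mu_short` are assumed to be defined already, and the `m` and `c` values are placeholders.)

```python
cov_mu_trend = eta_mu_trend**2 * pm.gp.cov.Matern52(1, ls=ell_mu_trend)
cov_mu_short = eta_mu_short**2 * pm.gp.cov.Matern52(1, ls=ell_mu_short)

# A single HSGP over the *sum* of the covariances: shared basis vectors and coefficients
gp_mu = pm.gp.HSGP(m=[200], c=1.8, cov_func=cov_mu_trend + cov_mu_short)
f_mu = gp_mu.prior("f_mu", X=x_train[:, None])
```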

Choosing the HSGP parameters

Next, we use the heuristics to choose `m` and `c`:

That actually looks a bit too low, especially `c`. We can check the computation by hand. The way we defined `hierarchical_HSGP`, it needs the 0-centered `x_train` data, called `Xs`, so we'll need to do that here (we'll also need to do it later when we define the model):

Then we can use the `c` from above and check the implied `L`, which is the result of `set_boundary`:
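In code, that check might look like this (the value of `c` below is a placeholder standing in for the one returned by the heuristic):

```python
from pymc.gp.hsgp_approx import set_boundary

# 0-center the training inputs, as hierarchical_HSGP expects
X_train = x_train[:, None]
Xs = X_train - X_train.mean(axis=0)

c = 1.2  # placeholder for the value suggested by the heuristic above

# L is the implied boundary: the approximation lives on [-L, L] around the center
L = set_boundary(Xs, c)
```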

And this is indeed too low. How do we know? Well, thankfully, `L` has a pretty interpretable meaning in the HSGP decomposition. It is the boundary of the approximation, so we need to choose `L` such that the domain `[-L, L]` contains all points, not only in `x_train` but also in `x_full` (see the first tutorial for more details).

So we want $L > 15$ in this case, which means we need to increase `c` until we're satisfied:

Bingo!

One last thing we also talked about in the first tutorial: increasing `c` requires increasing `m` to compensate for the loss of fidelity at smaller lengthscales. So let's err on the side of safety and choose:

Setting up the model

As discussed, you'll see we're handling the 0-centering of `X` before defining the GP. When you're using `pm.gp.HSGP` or `prior_linearized`, you don't need to worry about that, as it's done for you under the hood. But for more advanced models like this one, you need to get your hands a bit dirtier and access lower-level functions of the package.
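As a rough skeleton, reusing `hierarchical_HSGP` and `matern52_psd_1d` from above; the coords, priors, and prior values here are illustrative placeholders, not the notebook's exact choices:

```python
# m (e.g. [200]) and L come from the boundary discussion above
coords = {"basis": np.arange(m[0]), "gp": np.arange(n_gps)}

with pm.Model(coords=coords) as model:
    X_gp = pm.Data("X_gp", x_train[:, None])
    # 0-center by hand, using the *training* mean, so predictions reuse the same basis
    Xs = X_gp - x_train.mean()

    # Hypothetical priors; tune them with prior predictive checks
    eta_mu = pm.Exponential("eta_mu", 1.0)
    ell_mu_trend = pm.InverseGamma("ell_mu_trend", mu=10.0, sigma=3.0)
    ell_mu_short = pm.InverseGamma("ell_mu_short", mu=2.0, sigma=1.0)
    eta_delta = pm.Exponential("eta_delta", 2.0)
    ell_delta = pm.LogNormal("ell_delta", mu=np.log(2.0), sigma=0.5, dims="gp")
    sigma = pm.Exponential("sigma", 2.0)

    f = pm.Deterministic(
        "f",
        hierarchical_HSGP(Xs, L, m, eta_mu, ell_mu_trend, ell_mu_short, eta_delta, ell_delta),
    )
    # shape follows f (and hence X_gp), so out-of-sample prediction via pm.set_data works
    pm.Normal("y", mu=f, sigma=sigma, observed=y_train, shape=f.shape)
```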

Prior predictive checks

Now, what do these priors mean? Good question. As always, it's crucial to do prior predictive checks, especially for GPs, where amplitudes and lengthscales can be very hard to infer:

Once we're satisfied with our priors, which is the case here, we can... sample the model!

Sampling & Convergence checks

Everything went great here, which is a really good sign! Now let's see if the model recovered the true parameters.

Posterior checks

Really good job -- the model recovered everything decently!

And we can see the GP parameters were well informed by the data. Let's close up this section by updating our prior predictive plot with the posterior of the inferred GPs:

That looks great! Now we can go ahead and predict out of sample.

Out-of-sample predictions
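With the `pm.Data` container in place, predictions follow the usual `pm.set_data` pattern. A sketch, assuming `idata` holds the posterior samples and the variable names used above:

```python
with model:
    pm.set_data({"X_gp": x_full[:, None]})
    preds = pm.sample_posterior_predictive(
        idata, var_names=["f", "y"], predictions=True
    )
```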

This looks good! And we can check our predictions make sense with another plot:

Phew, that's a lot of information! Let's see what we can make of this:

  • As data become sparse, the long-term trend is reverting back to the overall GP mean (i.e. 0), but hasn't reached it yet, because the length scale of the trend is bigger than the testing period of 5 (`ell_mu_trend_true = 10`).
  • The short-term variation on the mean GP isn't obvious because it's small relative to the trend. But it is noticeable: it creates the small wiggles in the orange HDI, and makes this HDI wider in comparison to the individual GPs (the blue ones).
  • The individual GPs revert faster to the mean GP (orange envelope) than to the GP mean (i.e. 0), which is the behavior we want from the hierarchical structure.

Example 2: An HSGP that exploits Kronecker structure

This example is a multiple GP model like the previous one, but it assumes a different relationship between the GPs. Instead of pooling towards a common mean GP, there is an additional covariance structure that specifies their relationship.

For example, we may have time series measurements of temperature from multiple weather stations. How similar the stations' series are to one another should mostly depend on the distance between the stations, while all of them will likely share the same dynamics, or same covariance structure, over time. You can think of this as local partial pooling.

In the example below, we arrange the GPs along a single "spatial" axis, so it's a 1D problem and not 2D, and then allow them to share the same time covariance. This might be clearer after taking a look at the simulated data below.

Mathematically, this model uses the {ref}`Kronecker product <GP-Kron>`, where the "space" and "time" dimensions are separable.

$$
K = K_{x} \otimes K_{t}
$$

If there are $n_t$ time points and $n_x$ GPs, then the resulting $K$ matrix will have dimension $n_x \cdot n_t \times n_x \cdot n_t$. Using a regular GP, this would be $\mathcal{O}(n_t^3 n_x^3)$. So, we can achieve a pretty massive speed-up by both taking advantage of Kronecker structure and using the HSGP approximation. It isn't required that both of the dimensions (in this example, space and time) use the HSGP approximation. It's possible to use either a vanilla GP or inducing points for the "spatial" covariance, and the HSGP approximation in time. In the example below, both use the HSGP approximation.

Refer to this section if you're interested in:

  1. Seeing an example of exploiting Kronecker structure and the HSGP approximation.
  2. Seeing how to construct more advanced and custom GP models.

Data generation

Kronecker GP specification

PyMC Model

Next, we use the heuristics to choose `m` and `c`:
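Once `m` and `c` are chosen for each dimension, the Kronecker-structured HSGP prior can be sketched as follows. This is only a sketch under assumed names and placeholder values (`x_s`, `x_t`, `y_obs`, the priors, and the basis sizes are all illustrative), not the notebook's exact model:

```python
import numpy as np
import pymc as pm

# Assumed data: n_x stations along a 1D "spatial" axis, each observed at n_t times
x_s = np.linspace(0, 10, 15)   # hypothetical station locations
x_t = np.linspace(0, 5, 100)   # hypothetical time grid
y_obs = np.random.default_rng(0).normal(size=(len(x_s), len(x_t)))  # stand-in data

m_s, m_t = 20, 50  # placeholder numbers of basis functions for space and time

with pm.Model() as kron_model:
    ell_s = pm.InverseGamma("ell_s", mu=1.0, sigma=0.5)
    ell_t = pm.InverseGamma("ell_t", mu=2.0, sigma=1.0)
    eta = pm.Exponential("eta", 1.0)

    gp_s = pm.gp.HSGP(m=[m_s], c=1.8, cov_func=pm.gp.cov.Matern52(1, ls=ell_s))
    gp_t = pm.gp.HSGP(m=[m_t], c=1.8, cov_func=pm.gp.cov.Matern52(1, ls=ell_t))

    phi_s, sqrt_psd_s = gp_s.prior_linearized(x_s[:, None])  # (n_x, m_s), (m_s,)
    phi_t, sqrt_psd_t = gp_t.prior_linearized(x_t[:, None])  # (n_t, m_t), (m_t,)

    # One iid coefficient per (spatial basis, temporal basis) pair
    beta = pm.Normal("beta", size=(m_s, m_t))

    # f[i, j] = GP at station i and time j; the implied covariance over the
    # flattened f is separable, eta**2 * (K_x kron K_t), matching the structure above
    f = pm.Deterministic("f", eta * (phi_s * sqrt_psd_s) @ beta @ (phi_t * sqrt_psd_t).T)

    sigma = pm.Exponential("sigma", 2.0)
    pm.Normal("y", mu=f, sigma=sigma, observed=y_obs, shape=f.shape)
```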

Prior predictive checks

Sampling & Convergence checks

Posterior predictive checks

And isn't this beautiful?? Now go on, and HSGP-on!

Authors

Watermark

:::{include} ../page_footer.md :::