(hsgp-advanced)=
:::{post} June 28, 2024
:tags: gaussian process
:category: reference, intermediate
:author: Bill Engels, Alexandre Andorra, Maxim Kochurov
:::
The Hilbert Space Gaussian process (HSGP) approximation is a low-rank GP approximation that is particularly well-suited to usage in probabilistic programming languages like PyMC. It approximates the GP using a pre-computed and fixed set of basis functions that don't depend on the form of the covariance kernel or its hyperparameters. It's a parametric approximation, so prediction in PyMC can be done as one would with a linear model, via `pm.Data` or `pm.set_data`. You don't need to define the `.conditional` distribution that non-parametric GPs rely on. This makes it much easier to integrate an HSGP, instead of a GP, into your existing PyMC model. Additionally, unlike many other GP approximations, HSGPs can be used anywhere within a model and with any likelihood function.
It's also fast. The computational cost for unapproximated GPs per MCMC step is $\mathcal{O}(n^3)$, where $n$ is the number of data points. For HSGPs, it is $\mathcal{O}(mn + m)$, where $m$ is the number of basis vectors. It's important to note that sampling speeds are also very strongly determined by posterior geometry.
The HSGP approximation does carry some restrictions; most importantly, it only works with covariance functions that have a power spectral density (i.e., stationary kernels). The {class}`~pymc.gp.HSGP` class is compatible with any `Covariance` class that implements the `power_spectral_density` method. There is a special case made for the `Periodic` covariance, which is implemented in PyMC by the {class}`~pymc.gp.HSGPPeriodic` class.

A secondary goal of this implementation is flexibility, via an accessible implementation where the core computations are written in a modular way. For basic usage, users can rely on the `.prior` and `.conditional` methods and essentially treat the HSGP class as a drop-in replacement for `pm.gp.Latent`, the unapproximated GP. More advanced users can bypass those methods and work with `.prior_linearized` instead, which exposes the HSGP as a parametric model. For more complex models with multiple HSGPs, users can work directly with functions like `pm.gp.hsgp_approx.calc_eigenvalues` and `pm.gp.hsgp_approx.calc_eigenvectors`.
The {class}`~pymc.gp.HSGP` class and associated functions are also meant to be clear and hackable, to enable building more complicated models. In the following example we fit a hierarchical HSGP, where each of the individual GPs (indexed by $i$) can have a different lengthscale. The model is:
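Concretely, the structure is as follows (a sketch written from the description below; the exact priors are given with the model code further down, and $k$ denotes a Matérn covariance function):

$$
\begin{aligned}
f^\mu(x) &\sim \mathcal{GP}\!\left(0,\; (\eta^\mu)^2 \, k^\mu(x, x')\right) \\
f^\delta_i(x) &\sim \mathcal{GP}\!\left(0,\; (\eta^\delta)^2 \, k(x, x' ;\, \ell^\delta_i)\right), \quad i = 1, \dots, n_{\text{gp}} \\
f_i(x) &= f^\mu(x) + f^\delta_i(x) \\
y_i &\sim \mathcal{N}\!\left(f_i(x),\; \sigma^2\right)
\end{aligned}
$$

where $k^\mu$ is itself the sum of a long-term and a short-term Matérn kernel (see the simulation below), and each delta GP gets its own lengthscale $\ell^\delta_i$.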
There are two scale parameters $\eta^\mu$ and $\eta^\delta$. $\eta^\mu$ controls the overall scaling of the group GP, and $\eta^\delta$ controls the strength of the partial pooling of the $f_i$ to $f^\mu$. Each of the $i$ GPs can have its own lengthscale $\ell^\delta_i$. In the example below we simulate additive Gaussian noise, but this HSGP model will of course work with any likelihood anywhere within your model.
Refer to this section if you're interested in:
Let's simulate a one-dimensional GP observed at 300 locations (200 used for training, the remaining 100 for testing), across the range from 0 to 15. You'll see there is a lot going on in the code below, so let's break down the gist of what's happening.
- **Long-Term Trend GP**: a GP with a Matérn covariance function, characterized by a larger lengthscale (`ell_mu_trend_true = 10`), models the long-term linear trend in the data. The magnitude of variability in this trend is controlled by `eta_mu_trend_true`, which is also quite big relative to the other components, making this trend important in the data-generating process.
- **Short-Term Variations GP**: another GP, also using a Matérn covariance function but with a shorter lengthscale (`ell_mu_short_true = 1.5`), captures more rapid fluctuations in the data. Its amplitude is controlled by `eta_mu_short_true`.

The overall mean GP (`cov_mu`) is the sum of these two GPs, combining long-term trend and short-term variations.
We define several (10 in this case) delta GPs, each intended to capture different deviations from the mean GP. These are characterized by a lengthscale drawn from a log-normal distribution centered at the lengthscale of the short-term mean GP, `ell_mu_short_true`.
The amount of diversity between the delta GPs is controlled by `eta_delta_true`: the bigger it is, the more the delta GPs differ from each other -- much like the sigma parameter in a hierarchical model (see {ref}`A Primer on Bayesian Methods for Multilevel Modeling <multilevel_modeling>`).
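To make that concrete, here is a sketch of what the true covariance setup can look like. The lengthscales and the number of delta GPs come from the description above; the `eta` values, the random seed, and the log-normal spread are illustrative placeholders, not the notebook's exact numbers.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)  # seed assumed for reproducibility

n_gps = 10  # number of delta GPs
x_full = np.linspace(0, 15, 300)  # 300 locations on [0, 15]

# True mean-GP hyperparameters (lengthscales from the text; etas are placeholders)
ell_mu_trend_true = 10.0
eta_mu_trend_true = 2.0
ell_mu_short_true = 1.5
eta_mu_short_true = 1.0

# Long-term trend + short-term variations, combined additively into the mean GP
cov_trend = eta_mu_trend_true**2 * pm.gp.cov.Matern52(1, ls=ell_mu_trend_true)
cov_short = eta_mu_short_true**2 * pm.gp.cov.Matern52(1, ls=ell_mu_short_true)
cov_mu = cov_trend + cov_short

# Delta GPs: one lengthscale per GP, drawn around the short-term lengthscale
eta_delta_true = 0.5  # placeholder
ell_delta_true = rng.lognormal(mean=np.log(ell_mu_short_true), sigma=0.2, size=n_gps)
cov_deltas = [
    eta_delta_true**2 * pm.gp.cov.Matern52(1, ls=ell) for ell in ell_delta_true
]
```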
Now we can define a function to generate observations from this data-generating structure. `generate_gp_samples` draws a sample from the mean GP, adds the contribution of each delta GP, and incorporates noise, producing a set of observations that reflect both the underlying processes and observational noise.
This function is used to generate both the full set of GP realizations (`f_mu_true_full`, `f_true_full`) and the observed data (`y_full`).
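A sketch of such a function, following the description above (the notebook's exact implementation may differ in details):

```python
def generate_gp_samples(x, cov_mu, cov_deltas, noise_sigma, rng):
    """Draw one realization of the mean GP, the individual GPs, and noisy observations."""
    n = len(x)
    X = x[:, None]

    # One draw from the mean GP
    K_mu = cov_mu(X).eval()
    f_mu = rng.multivariate_normal(np.zeros(n), K_mu + 1e-8 * np.eye(n))

    # One draw per delta GP, added to the mean GP
    f = np.empty((len(cov_deltas), n))
    for i, cov_d in enumerate(cov_deltas):
        K_d = cov_d(X).eval()
        f_delta = rng.multivariate_normal(np.zeros(n), K_d + 1e-8 * np.eye(n))
        f[i] = f_mu + f_delta

    # Additive Gaussian observation noise
    y = f + rng.normal(0, noise_sigma, size=f.shape)
    return f_mu, f, y
```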
Now we can call the function and generate data! The sampled data (both the underlying GP realizations and the noisy observations) are split according to the training and testing segments defined earlier. This setup allows us to evaluate model predictions against unseen data, mimicking real-world scenarios where models are trained on a subset of the available data.
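For instance (the noise level is assumed, and we simply take the first 200 points for training and hold out the last 100):

```python
# Draw the full realizations and noisy observations, then split train / test
f_mu_true_full, f_true_full, y_full = generate_gp_samples(
    x_full, cov_mu, cov_deltas, noise_sigma=0.5, rng=rng
)

n_train = 200  # first 200 points for training, remaining 100 for testing
x_train, x_test = x_full[:n_train], x_full[n_train:]
y_train, y_test = y_full[:, :n_train], y_full[:, n_train:]
f_true_train, f_true_test = f_true_full[:, :n_train], f_true_full[:, n_train:]
```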
To allow a different lengthscale per GP in this model, we need to rewrite the power spectral density. The one attached to the PyMC covariance classes, i.e. `pm.gp.cov.Matern52.power_spectral_density`, is vectorized over the input dimension, but we need one that is vectorized across GPs.
Fortunately, this one at least isn't too hard to adapt:
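For instance, a sketch of such a function for 1D inputs, following the formula used by `pm.gp.cov.Matern52.power_spectral_density` with `D = 1` (the name `matern52_psd` is ours; check it against the PyMC implementation before relying on it):

```python
def matern52_psd(omega, ls):
    """Matern-5/2 power spectral density for 1D inputs, vectorized across GPs.

    `omega` has shape (n_basis,) and `ls` has shape (n_gps,); the result has
    shape (n_basis, n_gps), i.e. one spectral density per GP.
    """
    # Same constant as the PyMC Matern52 PSD with D = 1 (Gamma(3) = 2)
    const = 2 * np.sqrt(np.pi) * 2.0 * 5**2.5 / (0.75 * np.sqrt(np.pi))
    return const * ls[None, :] * (5 + (ls[None, :] * omega[:, None]) ** 2) ** (-3)
```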
Next, we build a function that constructs the hierarchical GP. Notice that it assumes some names for the dims, but our goal is to provide a simple foundation that you can adapt to your specific use case. You can see that this is a bit more deconstructed than `.prior_linearized`.
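Here is a sketch of what such a function can look like, reusing the `matern52_psd` helper above together with the lower-level `calc_eigenvalues` / `calc_eigenvectors` functions. The dims `"basis"`, `"gp"`, `"obs"`, the signature, and the exact parameterization are assumptions, not the notebook's exact code.

```python
import pytensor.tensor as pt


def hierarchical_HSGP(Xs, L, m, eta_mu, ell_mu_trend, ell_mu_short, eta_delta, ell_delta):
    """Hierarchical HSGP: one shared basis, a group-mean GP and per-GP deltas."""
    # Shared HSGP basis: Laplacian eigenvalues and eigenvectors on [-L, L]
    eigvals = pm.gp.hsgp_approx.calc_eigenvalues(L, [m])
    phi = pm.gp.hsgp_approx.calc_eigenvectors(Xs, L, eigvals, [m])
    omega = pt.sqrt(eigvals[:, 0])

    # Group-mean GP: sum of the trend and short-term spectral densities (additive kernels)
    psd_mu = (
        matern52_psd(omega, pt.stack([ell_mu_trend]))[:, 0]
        + matern52_psd(omega, pt.stack([ell_mu_short]))[:, 0]
    )
    beta_mu = pm.Normal("beta_mu", 0, 1, dims="basis")
    f_mu = pm.Deterministic("f_mu", phi @ (beta_mu * eta_mu * pt.sqrt(psd_mu)), dims="obs")

    # Delta GPs: one lengthscale, hence one spectral density, per GP -- same basis
    psd_delta = matern52_psd(omega, ell_delta)  # shape (basis, gp)
    beta_delta = pm.Normal("beta_delta", 0, 1, dims=("basis", "gp"))
    f_delta = pm.Deterministic(
        "f_delta", phi @ (beta_delta * eta_delta * pt.sqrt(psd_delta)), dims=("obs", "gp")
    )

    # Each individual GP is the group mean plus its own deviation
    return f_mu[:, None] + f_delta
```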
One of the added complexities is modeling the additive GPs for the mean GP (long term trend + short term variation). The cool thing is that HSGP is compatible with additive covariances, meaning that we don't have to define two completely independent HSGPs.
Instead, we can sum the two independent power spectral densities, and then create a single GP from the combined power spectral densities. This reduces the number of unknown parameters because the two GPs can share the same basis vectors and basis coefficients.
Essentially, this amounts to creating two independent covariance functions, and just adding them before defining the HSGP object -- instead of defining two independent HSGP objects.
If we were able to use the high-level {class}`~pymc.gp.HSGP` class, the code for this would look like:
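Something along these lines (a sketch, not the notebook's exact code; the priors are illustrative, and passing an additive covariance to {class}`~pymc.gp.HSGP` assumes the combined kernel exposes `power_spectral_density` in your PyMC version):

```python
with pm.Model() as simple_model:
    eta_mu = pm.Exponential("eta_mu", 1.0)
    ell_mu_trend = pm.LogNormal("ell_mu_trend", np.log(10), 0.5)
    ell_mu_short = pm.LogNormal("ell_mu_short", np.log(2), 0.5)

    # Add the two covariances *before* building a single HSGP:
    # one basis, one set of coefficients, summed power spectral densities
    cov_mu = eta_mu**2 * (
        pm.gp.cov.Matern52(1, ls=ell_mu_trend) + pm.gp.cov.Matern52(1, ls=ell_mu_short)
    )
    gp_mu = pm.gp.HSGP(m=[200], c=3.5, cov_func=cov_mu)  # m and c are illustrative
    f_mu = gp_mu.prior("f_mu", X=x_train[:, None])
```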
Next, we use the heuristics to choose `m` and `c`:
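For example, with the `approx_hsgp_hyperparams` helper from `pm.gp.hsgp_approx` (the lengthscale range below is an assumption about where we expect the posterior lengthscales to live):

```python
m_approx, c_approx = pm.gp.hsgp_approx.approx_hsgp_hyperparams(
    x_range=[x_train.min(), x_train.max()],
    lengthscale_range=[1.0, 10.0],  # rough range assumed for the lengthscales
    cov_func="matern52",
)
m_approx, c_approx
```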
That looks a bit too low, especially `c`. We can check the computation by hand. The way we defined `hierarchical_HSGP`, it needs the 0-centered `x_train` data, called `Xs`, so we need to compute that here (we'll also need it later when we define the model):
Then we can use the `c` from above and check the implied `L`, which is the result of `set_boundary`:
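Along these lines (a sketch; in older PyMC versions `set_boundary` may return a symbolic variable, in which case call `.eval()` on it):

```python
# 0-center the training inputs, as hierarchical_HSGP expects
Xs = x_train[:, None] - x_train.mean()

# Boundary implied by the heuristic c
L = pm.gp.hsgp_approx.set_boundary(Xs, c_approx)
L
```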
And this is indeed too low. How do we know? Well, thankfully, `L` has a pretty interpretable meaning in the HSGP decomposition: it is the boundary of the approximation, so we need to choose `L` such that the domain $[-L, L]$ contains all the points, not only in `x_train` but also in `x_full` (see the first HSGP tutorial for more details).
So we want $L > 15$ in this case, which means we need to increase `c` until we're satisfied:
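For instance (the value of `c` below is illustrative, not the notebook's exact choice):

```python
c = 3.5  # bumped up until the implied boundary covers all of x_full
L = pm.gp.hsgp_approx.set_boundary(Xs, c)
L
```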
Bingo!
One last thing we also talked about in the first tutorial: increasing `c` requires increasing `m` to compensate for the loss of fidelity at smaller lengthscales. So let's err on the side of safety and choose:
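Something like the following (illustrative values, not the notebook's exact ones):

```python
m, c = 200, 3.5  # more basis functions than the heuristic minimum, generous boundary
L = pm.gp.hsgp_approx.set_boundary(Xs, c)
```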
As discussed, you'll see we're handling the 0-centering of `X` before defining the GP. When you're using `pm.gp.HSGP` or `.prior_linearized`, you don't need to worry about that, as it's done for you under the hood. But for more advanced models like this one, you need to get your hands dirty and access the package's lower-level functions.
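Putting it all together, the model can look like this (a sketch under the assumptions above; the `X_gp` data container, the coords, and the priors are illustrative, not the notebook's exact code):

```python
coords = {
    "gp": np.arange(n_gps),   # the 10 individual GPs
    "basis": np.arange(m),
    "obs": x_train,
}

with pm.Model(coords=coords) as model:
    X_gp = pm.Data("X_gp", x_train[:, None])
    Xs_model = X_gp - x_train.mean()  # 0-centering, done by hand at this level

    # Priors -- illustrative, to be checked with prior predictive simulations
    eta_mu = pm.Exponential("eta_mu", 1.0)
    ell_mu_trend = pm.LogNormal("ell_mu_trend", np.log(10), 0.5)
    ell_mu_short = pm.LogNormal("ell_mu_short", np.log(2), 0.5)
    eta_delta = pm.Exponential("eta_delta", 2.0)
    ell_delta = pm.LogNormal("ell_delta", np.log(2), 0.5, dims="gp")

    f = hierarchical_HSGP(
        Xs_model, L, m, eta_mu, ell_mu_trend, ell_mu_short, eta_delta, ell_delta
    )
    f = pm.Deterministic("f", f, dims=("obs", "gp"))

    sigma = pm.Exponential("sigma", 5.0)
    pm.Normal(
        "y", mu=f.T, sigma=sigma, observed=y_train,
        shape=f.T.shape,  # lets the likelihood resize when we swap in x_full later
    )
```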
Now, what do these priors mean? Good question. As always, it's crucial to do prior predictive checks, especially for GPs, where amplitudes and lengthscales can be very hard to infer:
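A minimal version of that check (plotting code omitted):

```python
with model:
    idata = pm.sample_prior_predictive(random_seed=rng)

# e.g. plot a few prior draws of idata.prior["f"] against the simulated true GPs
# to check that the implied amplitudes and lengthscales are in a plausible range
```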
Once we're satisfied with our priors, which is the case here, we can... sample the model!
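For instance:

```python
with model:
    idata.extend(pm.sample(random_seed=rng))
```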
Everything went great here, that's a really good sign! Now let's see if the model could recover the true parameters.
Really good job -- the model recovered everything decently!
And we can see the GP parameters were well informed by the data. Let's close up this section by updating our prior predictive plot with the posterior of the inferred GPs:
That looks great! Now we can go ahead and predict out of sample.
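With the `pm.Data` container assumed in the model sketch above, out-of-sample prediction is the usual `pm.set_data` dance, updating the `obs` coordinate along with the data:

```python
with model:
    pm.set_data({"X_gp": x_full[:, None]}, coords={"obs": x_full})
    idata.extend(
        pm.sample_posterior_predictive(idata, var_names=["f", "y"], random_seed=rng)
    )
```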
This looks good! And we can check our predictions make sense with another plot:
Phew, that's a lot of information! Let's see what we can make of this:
This example is a multiple-GP model like the previous one, but it assumes a different relationship between the GPs. Instead of pooling towards a common mean GP, there is an additional covariance structure that specifies their relationship.
For example, we may have time series measurements of temperature from multiple weather stations. How similar two stations' series are should depend mostly on the distance between the stations, while they will all likely share the same dynamics, or same covariance structure, over time. You can think of this as local partial pooling.
In the example below, we arrange the GPs along a single "spatial" axis, so it's a 1D problem and not 2D, and then allow them to share the same time covariance. This might be clearer after taking a look at the simulated data below.
Mathematically, this model uses the {ref}`Kronecker product <GP-Kron>`, where the "space" and "time" dimensions are separable.
If there are $n_t$ time points and $n_x$ GPs, then the resulting $K$ matrix will have dimension $n_x \cdot n_t \times n_x \cdot n_t$. Using a regular GP, this would be $\mathcal{O}(n_t^3 n_x^3)$. So, we can achieve a pretty massive speed-up by both taking advantage of Kronecker structure and using the HSGP approximation. It isn't required that both of the dimensions (in this example, space and time) use the HSGP approximation. It's possible to use either a vanilla GP or inducing points for the "spatial" covariance, and the HSGP approximation in time. In the example below, both use the HSGP approximation.
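A minimal sketch of that construction, with HSGP bases in both dimensions and reusing the `matern52_psd` helper from the first example (sizes, kernels, and priors are illustrative; the notebook's actual model will differ in details):

```python
n_t, n_x = 100, 8  # time points and "stations" -- illustrative sizes
t = np.linspace(0, 10, n_t)[:, None]
x = np.linspace(0, 3, n_x)[:, None]

m_t, c_t = 30, 1.8  # HSGP sizes per dimension -- illustrative
m_x, c_x = 10, 1.8

with pm.Model() as model_kron:
    eta = pm.Exponential("eta", 1.0)
    ell_t = pm.LogNormal("ell_t", 0.0, 0.5)
    ell_x = pm.LogNormal("ell_x", 0.0, 0.5)

    # One HSGP basis per dimension
    Xs_t, Xs_x = t - t.mean(), x - x.mean()
    L_t = pm.gp.hsgp_approx.set_boundary(Xs_t, c_t)
    L_x = pm.gp.hsgp_approx.set_boundary(Xs_x, c_x)
    eig_t = pm.gp.hsgp_approx.calc_eigenvalues(L_t, [m_t])
    eig_x = pm.gp.hsgp_approx.calc_eigenvalues(L_x, [m_x])
    phi_t = pm.gp.hsgp_approx.calc_eigenvectors(Xs_t, L_t, eig_t, [m_t])
    phi_x = pm.gp.hsgp_approx.calc_eigenvectors(Xs_x, L_x, eig_x, [m_x])

    # Square roots of the 1D spectral densities, one per basis function
    sqrt_psd_t = pt.sqrt(matern52_psd(pt.sqrt(eig_t[:, 0]), pt.stack([ell_t]))[:, 0])
    sqrt_psd_x = pt.sqrt(matern52_psd(pt.sqrt(eig_x[:, 0]), pt.stack([ell_x]))[:, 0])

    # Kronecker structure: one iid coefficient per (space basis, time basis) pair,
    # so the covariance of vec(f) approximates K_space kron K_time
    beta = pm.Normal("beta", 0, 1, shape=(m_x, m_t))
    f = eta * (phi_x @ (sqrt_psd_x[:, None] * beta * sqrt_psd_t[None, :]) @ phi_t.T)
    pm.Deterministic("f", f)  # shape (n_x, n_t); add a likelihood on the observed panel here
```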
Refer to this section if you're interested in:
Next, we use the heuristics to choose `m` and `c`:
And isn't this beautiful?? Now go on, and HSGP-on!
* Use `pz.maxent` instead of `pm.find_constrained_prior`, and add random seed. Osvaldo Martin. August 2024

:::{include} ../page_footer.md
:::