(variational_api_quickstart)=

# Introduction to Variational Inference with PyMC

The most common strategy for computing posterior quantities of Bayesian models is via sampling, particularly Markov chain Monte Carlo (MCMC) algorithms. While sampling algorithms and associated computing have continually improved in performance and efficiency, MCMC methods still scale poorly with data size, and become prohibitive for more than a few thousand observations. A more scalable alternative to sampling is variational inference (VI), which re-frames the problem of computing the posterior distribution as an optimization problem.

In PyMC, the variational inference API is focused on approximating posterior distributions through a suite of modern algorithms. Common use cases to which this module can be applied include:

- Sampling from the model posterior and computing arbitrary expressions
- Conducting Monte Carlo approximation of expectations, variances, and other statistics
- Removing symbolic dependence on PyMC random nodes and evaluating expressions (using `eval`)
- Providing a bridge to arbitrary PyTensor code

:::{post} Jan 13, 2023
:tags: variational inference
:category: intermediate, how-to
:author: Maxim Kochurov, Chris Fonnesbeck
:::

## Distributional Approximations

There are several methods in statistics that use a simpler distribution to approximate a more complex distribution. Perhaps the best-known example is the Laplace (normal) approximation, which involves constructing a Taylor series of the target posterior, retaining only the terms up to quadratic order, and using those to construct a multivariate normal approximation.
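In its simplest form, this amounts to a second-order expansion of the log-posterior around its mode $\hat{\theta}$:

$$
\log p(\theta \mid y) \approx \log p(\hat{\theta} \mid y) - \frac{1}{2}(\theta - \hat{\theta})^{\top} H (\theta - \hat{\theta}),
$$

where $H$ is the negative Hessian of the log-posterior evaluated at $\hat{\theta}$, which corresponds to the approximation $p(\theta \mid y) \approx N\left(\hat{\theta}, H^{-1}\right)$.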

Similarly, variational inference is another distributional approximation method where, rather than leveraging a Taylor series, some class of approximating distribution is chosen and its parameters are optimized such that the resulting distribution is as close as possible to the posterior. In essence, VI is a deterministic approximation that places bounds on the density of interest, then uses optimization to choose from that bounded set.
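Concretely, for an approximating family $q_{\phi}(\theta)$, VI maximizes the evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence from the approximation to the posterior:

$$
\operatorname{ELBO}(\phi) = \mathbb{E}_{q_{\phi}}\left[\log p(y, \theta) - \log q_{\phi}(\theta)\right] = \log p(y) - \operatorname{KL}\left(q_{\phi}(\theta) \,\|\, p(\theta \mid y)\right).
$$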

## Basic setup

We do not need complex models to play with the VI API; let's begin with a simple mixture model:
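(A minimal sketch; the mixture weights, means, and standard deviations below, and the imports used throughout, are illustrative choices rather than prescribed values.)

```python
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm

# Two-component Gaussian mixture over a single scalar variable x
# (weights, means, and standard deviations are arbitrary illustrative values)
w = np.array([0.2, 0.8])
mu = np.array([-0.3, 0.5])
sd = np.array([0.1, 0.1])

with pm.Model() as model:
    x = pm.NormalMixture("x", w=w, mu=mu, sigma=sd)
```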

We can't compute analytical expectations for this model. However, we can obtain an approximation using Markov chain Monte Carlo methods; let's use NUTS first.

To allow samples of the expressions to be saved, we need to wrap them in Deterministic objects:
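(A sketch that records $x^2$ and $\sin(x)$ as deterministics and then samples with NUTS; the draw and tuning counts are arbitrary.)

```python
with pm.Model() as model:
    x = pm.NormalMixture("x", w=w, mu=mu, sigma=sd)

    # Wrapping derived quantities in Deterministic stores them in the trace
    x2 = pm.Deterministic("x2", x**2)
    sin_x = pm.Deterministic("sin_x", pm.math.sin(x))

    # NUTS is the default step method for continuous variables
    trace = pm.sample(1000, tune=1000)

az.plot_trace(trace, var_names=["x2", "sin_x"]);
```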

Above are traces for $x^2$ and $\sin(x)$. We can see there is clear multi-modality in this model. One drawback is that you need to know in advance exactly what you want to see in the trace and wrap it in a `Deterministic`.

The VI API takes an alternate approach: you obtain an approximation from the model first, then calculate expressions based on that approximation afterwards.

Let's use the same model:
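(A sketch of the same mixture, this time leaving the derived quantities as plain symbolic expressions rather than `Deterministic` variables.)

```python
with pm.Model() as model:
    x = pm.NormalMixture("x", w=w, mu=mu, sigma=sd)

    # Plain symbolic expressions -- no Deterministic wrapping needed for VI
    x2 = x**2
    sin_x = pm.math.sin(x)
```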

Here we will use automatic differentiation variational inference (ADVI).
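A sketch of fitting mean-field ADVI with `pm.fit` and inspecting draws from the fitted approximation (the number of optimization steps is left at its default, and the 1000 posterior draws are arbitrary):

```python
with model:
    mean_field = pm.fit(method="advi")

az.plot_trace(mean_field.sample(1000));
```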

Notice that ADVI has failed to approximate the multimodal distribution, since it uses a Gaussian distribution that has a single mode.

## Checking convergence

Let's use the default arguments for CheckParametersConvergence as they seem to be reasonable.
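A sketch of attaching the callback with its defaults:

```python
with model:
    mean_field = pm.fit(
        method="advi",
        callbacks=[pm.callbacks.CheckParametersConvergence()],
    )
```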

We can access the inference history via the `.hist` attribute.
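For example, plotting the negative ELBO history of the fit above:

```python
plt.plot(mean_field.hist);
```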

This is not a good convergence plot, despite the fact that we ran many iterations. The reason is that the mean of the ADVI approximation is close to zero, and therefore taking the relative difference (the default method) is unstable for checking convergence.
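Switching the callback to the absolute-difference criterion, a sketch might look like this:

```python
with model:
    mean_field = pm.fit(
        method="advi",
        callbacks=[pm.callbacks.CheckParametersConvergence(diff="absolute")],
    )

plt.plot(mean_field.hist);
```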

That's much better! We've reached convergence after less than 5000 iterations.

## Tracking parameters

Another useful callback, `Tracker`, allows users to record arbitrary statistics during inference, though it can be memory-hungry. Using the `fit` function, we do not have direct access to the approximation before inference. However, tracking parameters requires access to the approximation, and we can get around this constraint by using the object-oriented (OO) API for inference.

Different approximations have different hyperparameters. In mean-field ADVI, we have $\rho$ and $\mu$ (inspired by Bayes by Backprop).
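A sketch of setting up ADVI through the OO API, which gives us a handle on the approximation before any fitting happens:

```python
with model:
    advi = pm.ADVI()

# The mean-field approximation that will be optimized is available up front
advi.approx
```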

There are convenient shortcuts to relevant statistics associated with the approximation. This can be useful, for example, when specifying a mass matrix for NUTS sampling:
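(For example, the current, still untrained, values of the approximate posterior mean and standard deviation can be read off directly.)

```python
advi.approx.mean.eval(), advi.approx.std.eval()
```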

We can roll these statistics into the Tracker callback.
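A sketch of such a callback (the keyword names `mean` and `std` are our own labels; any callables can be passed):

```python
tracker = pm.callbacks.Tracker(
    mean=advi.approx.mean.eval,  # callable returning the current mean
    std=advi.approx.std.eval,    # callable returning the current std
)
```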

Now, calling advi.fit will record the mean and standard deviation of the approximation as it runs.
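For example (20,000 iterations is an arbitrary choice):

```python
approx = advi.fit(20000, callbacks=[tracker])
```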

We can now plot both the evidence lower bound and parameter traces:
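(A sketch of such a plot.)

```python
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(advi.hist)
axes[0].set_title("negative ELBO")
axes[1].plot(tracker["mean"])
axes[1].set_title("mean")
axes[2].plot(tracker["std"])
axes[2].set_title("std")
plt.tight_layout()
```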

Notice that there are convergence issues with the mean, and that lack of convergence does not seem to change the ELBO trajectory significantly. As we are using the OO API, we can run the approximation longer until convergence is achieved.
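One way to do this is with the `refine` method, which continues optimization from the current state; the 100,000 additional iterations here are an arbitrary choice:

```python
advi.refine(100_000)
```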

Let's take a look:
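(Re-plotting the tracked parameters, for instance.)

```python
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].plot(tracker["mean"])
axes[0].set_title("mean")
axes[1].plot(tracker["std"])
axes[1].set_title("std")
```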

We still see evidence for lack of convergence, as the mean has devolved into a random walk. This could be the result of choosing a poor algorithm for inference. At any rate, it is unstable and can produce very different results even using different random seeds.

Let's compare results with the NUTS output:
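(One way to do this is a side-by-side density plot of the NUTS trace and draws from the ADVI approximation; the 10,000 draws are arbitrary.)

```python
az.plot_density(
    [trace, approx.sample(10_000)],
    data_labels=["NUTS", "ADVI"],
    var_names=["x"],
);
```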

Again, we see that ADVI is not able to cope with multimodality; we can instead use SVGD, which generates an approximation based on a large number of particles.
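A sketch of fitting SVGD via `pm.fit` (the particle count, iteration count, and SGD learning rate are illustrative choices):

```python
with model:
    svgd_approx = pm.fit(
        300,
        method="svgd",
        inf_kwargs=dict(n_particles=1000),
        obj_optimizer=pm.sgd(learning_rate=0.01),
    )

az.plot_trace(svgd_approx.sample(2000));
```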

That did the trick, as we now have a multimodal approximation using SVGD.

It is now possible to calculate arbitrary functions of the parameters using this variational approximation. For example, we can calculate $x^2$ and $\sin(x)$, as we did with the NUTS trace.

To evaluate these expressions with the approximation, we need approx.sample_node.
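A sketch, continuing with the SVGD approximation from above:

```python
# sample_node swaps the model's random variables for draws from the
# approximation, returning an ordinary (but stochastic) PyTensor expression
x2_sample = svgd_approx.sample_node(x2)

x2_sample.eval(), x2_sample.eval()
```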

Every call yields a different value from the same node. This is because it is stochastic.

By applying replacements, we are now free of the dependence on the PyMC model; instead, we now depend on the approximation. Changing it will change the distribution for stochastic nodes:
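(For example, repeatedly evaluating the node draws from the approximate distribution of $x^2$; the 1000 repetitions and the histogram are just one way to visualize this.)

```python
samples = np.array([x2_sample.eval() for _ in range(1000)])
plt.hist(samples, bins=50);
```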

The `size` argument of `sample_node` provides a more convenient way to draw many samples at once:
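(A sketch; 1000 draws is an arbitrary choice.)

```python
x2_1000 = svgd_approx.sample_node(x2, size=1000)
x2_1000.shape.eval()  # -> array([1000])
```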

The sampled node includes an additional leading dimension for the draws, so expectations and variances are computed along `axis=0`.
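For example, using the draws from above:

```python
# Expectation and variance of x**2 under the approximation
x2_1000.mean(axis=0).eval(), x2_1000.var(axis=0).eval()
```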

A symbolic sample size can also be specified:
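(One way to do this is with a symbolic integer input and a compiled PyTensor function, so the number of draws can be chosen at call time; the variable names here are our own.)

```python
import pytensor
import pytensor.tensor as pt

i = pt.iscalar("i")
x2_sym = svgd_approx.sample_node(x2, size=i)

# Compile a function taking the number of draws as an argument
sample_x2 = pytensor.function([i], x2_sym)
sample_x2(100).shape  # -> (100,)
```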

Unfortunately the size must be a scalar value.

## Multilabel logistic regression

Let's illustrate the use of `Tracker` with the famous Iris dataset. We'll attempt multi-label classification and compute the expected accuracy score as a diagnostic.

A relatively simple model will be sufficient here because the classes are roughly linearly separable; we are going to fit multinomial logistic regression.
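A sketch of the data preparation and model; the train/test split, prior scales, and variable names here are illustrative choices:

```python
import pytensor
import pytensor.tensor as pt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Shared variables so the inputs can be swapped out later via replacements
Xt = pytensor.shared(X_train)
yt = pytensor.shared(y_train)

with pm.Model() as iris_model:
    # Coefficients and intercepts for the three classes
    beta = pm.Normal("beta", 0, sigma=100, shape=(4, 3))
    alpha = pm.Normal("alpha", 0, sigma=100, shape=(3,))

    # Class probabilities via a softmax over the linear predictor
    p = pt.special.softmax(Xt.dot(beta) + alpha, axis=-1)

    obs = pm.Categorical("obs", p=p, observed=yt)
```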

### Applying replacements in practice

PyMC models have symbolic inputs for latent variables. To evaluate an expression that requires knowledge of latent variables, one needs to provide fixed values. We can use values approximated by VI for this purpose. The function sample_node removes the symbolic dependencies.

`sample_node` will use the whole distribution at each step, so we will use it here. We can apply additional replacements in a single function call using the `more_replacements` keyword argument.
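A sketch using SVGD through the OO API, so that we keep a reference to the (not yet fitted) approximation; the particle count and the 100 test-set draws are illustrative:

```python
with iris_model:
    inference = pm.SVGD(n_particles=500, jitter=1)

    # The approximation exists before fitting and can already be sampled from
    approx = inference.approx

    # Swap the training inputs for the test inputs when sampling test probabilities
    test_probs = approx.sample_node(
        p, size=100, more_replacements={Xt: pt.constant(X_test)}
    )

    # No replacements needed for the training set
    train_probs = approx.sample_node(p)
```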

HINT: You can use the `more_replacements` argument when calling `fit` too:

- `pm.fit(more_replacements={full_data: minibatch_data})`
- `inference.fit(more_replacements={full_data: minibatch_data})`

By applying the code above, we now have 100 sampled probabilities for each observation (we passed `size=100`; the default `size` for `sample_node` is `None`, which yields a single draw).

Next we create symbolic expressions for sampled accuracy scores:
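(A sketch; `test_probs` and `train_probs` come from the block above.)

```python
# Per-draw classification accuracy on the test set, and single-draw accuracy
# on the training set
test_ok = pt.eq(test_probs.argmax(-1), y_test)
train_ok = pt.eq(train_probs.argmax(-1), y_train)
test_accuracy = test_ok.mean(-1)
train_accuracy = train_ok.mean(-1)
```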

`Tracker` expects callables, so we can pass the `.eval` method of a PyTensor node, which is itself a function.
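A sketch of tracking these accuracies during a short fit and plotting them afterwards (100 iterations is an arbitrary choice):

```python
eval_tracker = pm.callbacks.Tracker(
    test_accuracy=test_accuracy.eval,
    train_accuracy=train_accuracy.eval,
)

inference.fit(100, callbacks=[eval_tracker])

# Average the per-draw test accuracy at each iteration before plotting
plt.plot(np.asarray(eval_tracker["test_accuracy"]).mean(-1), label="test")
plt.plot(eval_tracker["train_accuracy"], label="train")
plt.legend();
```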

The function compiled by `.eval` is cached, so repeated calls are cheap.

Training does not seem to be working here. Let's use a different optimizer and boost the learning rate.
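For example, switching to Adamax with a larger learning rate (both the optimizer choice and the 400 additional iterations are illustrative):

```python
inference.fit(
    400,
    obj_optimizer=pm.adamax(learning_rate=0.1),
    callbacks=[eval_tracker],
)

plt.plot(np.asarray(eval_tracker["test_accuracy"]).mean(-1), label="test")
plt.plot(eval_tracker["train_accuracy"], label="train")
plt.legend();
```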

This is much better!

So, `Tracker` allows us to monitor our approximation and choose a good training schedule.

## Minibatches

When dealing with large datasets, using minibatch training can drastically speed up and improve approximation performance. Large datasets impose a hefty cost on the computation of gradients.

There is a nice API in PyMC to handle these cases, which is available through the pm.Minibatch class. The minibatch is just a highly specialized PyTensor tensor.

To demonstrate, let's simulate a large quantity of data:
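(A sketch of simulated data; the dimensions and the per-column scales and shifts are arbitrary.)

```python
rng = np.random.default_rng(42)

# 40,000 observations of a 100-dimensional vector
data = rng.random((40_000, 100))
data *= rng.integers(1, 10, size=(100,))   # per-column scales
data += rng.random(100) * 10               # per-column shifts
```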

For comparison, let's fit a model without minibatch processing:
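(A sketch of a model that conditions on the full dataset at once; the priors are illustrative.)

```python
with pm.Model() as full_model:
    mu = pm.Normal("mu", 0, sigma=100, shape=(100,))
    sigma = pm.HalfNormal("sigma", 10, shape=(100,))
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
```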

Just for fun, let's create a custom callback to halt slow optimization. Here we define a callback that causes a hard stop when the approximation runs too slowly:
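(A sketch of such a callback; VI callbacks are called with the approximation, the loss history, and the iteration number, and raising `StopIteration` ends the fit.)

```python
def stop_after_10(approx, loss_history, i):
    """Abort the fit once 10 iterations have run."""
    if (i > 0) and (i % 10) == 0:
        raise StopIteration("I was slow, sorry!")


with full_model:
    advifit = pm.fit(callbacks=[stop_after_10])
```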

Inference is too slow, taking several seconds per iteration; fitting the approximation would have taken hours!

Now let's use minibatches. At every iteration, we will draw 500 random values:
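(A sketch of the minibatched version; the batch size of 500 comes from the text, while the priors mirror the full-data model above.)

```python
batched_data = pm.Minibatch(data, batch_size=500)

with pm.Model() as minibatch_model:
    mu = pm.Normal("mu", 0, sigma=100, shape=(100,))
    sigma = pm.HalfNormal("sigma", 10, shape=(100,))
    pm.Normal(
        "obs",
        mu=mu,
        sigma=sigma,
        observed=batched_data,
        # total_size tells PyMC how to rescale the minibatch likelihood
        total_size=data.shape[0],
    )

with minibatch_model:
    approx = pm.fit(
        callbacks=[pm.callbacks.CheckParametersConvergence(diff="absolute")]
    )

plt.plot(approx.hist);
```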

Remember to set `total_size` on the observed random variable.

`total_size` is an important parameter that allows PyMC to infer the right way of rescaling densities. If it is not set, you are likely to get completely wrong results. For more information please refer to the comprehensive documentation of `pm.Minibatch`.

Minibatch inference is dramatically faster. Multidimensional minibatches may be needed for some corner cases where you do matrix factorization or the model is very wide.

Here is the docstring for Minibatch to illustrate how it can be customized.
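(One way to display it inline.)

```python
print(pm.Minibatch.__doc__)
```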

## Authors

## Watermark

:::{include} ../page_footer.md
:::