(DEMetropolis_comparisons)=
:::{post} January 18, 2023
:tags: DEMetropolis, gradient-free inference
:category: intermediate, how-to
:author: Michael Osthege, Greg Brunkhorst
:::
For continuous variables, the default PyMC sampler (NUTS) requires that gradients are computed, which PyMC does through autodifferentiation. However, in some cases, a PyMC model may not be supplied with gradients (for example, by evaluating a numerical model outside of PyMC) and an alternative sampler is necessary. Differential evolution (DE) Metropolis samplers are an efficient choice for gradient-free inference. This notebook compares the DEMetropolis and the DEMetropolisZ samplers in PyMC to help determine which is a better option for a given problem.
The samplers are based on {cite:t}`terBraak2006markov` and {cite:t}`terBraak2008differential` and are described in the notebook DEMetropolis(Z) Sampler Tuning [blocked]. The idea behind differential evolution is to use randomly selected draws from other chains (DEMetropolis), or from past draws of the current chain (DEMetropolis(Z)), to make more educated proposals, thus improving sampling efficiency over the standard Metropolis implementation. Note that the PyMC implementation of DEMetropolisZ is slightly different from {cite:t}`terBraak2008differential`: each PyMC DEMetropolisZ chain looks only into its own history, whereas the {cite:t}`terBraak2008differential` algorithm has some mixing across chains.
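The differential-evolution proposal described above can be written in a few lines. This is an illustrative sketch of the classic proposal rule from {cite:t}`terBraak2006markov`, not PyMC's internal implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def de_proposal(current, other_a, other_b, d, b=1e-4, rng=rng):
    """One differential-evolution Metropolis proposal (illustrative).

    current     : current position of this chain
    other_a/b   : two draws from other chains (DEMetropolis) or from this
                  chain's own history (DEMetropolis(Z))
    d           : dimensionality; gamma = 2.38 / sqrt(2 d) is the classic choice
    b           : scale of the small uniform jitter epsilon
    """
    gamma = 2.38 / np.sqrt(2 * d)
    epsilon = rng.uniform(-b, b, size=d)
    # Jump along the difference vector between the two reference draws.
    return current + gamma * (other_a - other_b) + epsilon

d = 10
proposal = de_proposal(np.zeros(d), rng.normal(size=d), rng.normal(size=d), d)
```

Because the difference vector `other_a - other_b` is drawn from the chains themselves, its scale and orientation adapt to the target's covariance, which is what makes the proposals "more educated" than a fixed Metropolis jump.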
In this notebook, 10- and 50-dimensional multivariate normal target densities will be sampled with both the DEMetropolis and DEMetropolisZ samplers. Samplers will be evaluated based on effective sample size, sampling time, and chain convergence $(\hat{R})$. Samplers will also be compared to NUTS for benchmarking. Finally, MCMC traces will be compared to the analytically calculated target probability densities to assess potential bias in high dimensions.
Based on the results in this notebook, use DEMetropolisZ for lower-dimensional problems ($\approx10D$), and DEMetropolis for higher-dimensional problems ($\approx50D$).
* The DEMetropolisZ sampler was more efficient (ESS per second of sampling) than DEMetropolis.
* The DEMetropolisZ sampler had better chain convergence $(\hat{R})$ than DEMetropolis.
* The DEMetropolisZ sampler was biased at 50 dimensions, resulting in reduced variance compared to the target distribution.
* DEMetropolis more accurately sampled the high-dimensional target distribution when using $2D$ chains (twice the number of model parameters).
* NUTS was more efficient and accurate than either Metropolis-based algorithm.

This section defines helper functions that will be used throughout the notebook.
* `gen_mvnormal_params` generates the parameters for the target distribution, which is a multivariate normal distribution with $\sigma^2 = [1, 2, 3, 4, 5]$ in the first five dimensions and some correlation thrown in.
* `make_model` accepts the multivariate normal parameters `mu` and `cov` and outputs a PyMC model.
* `sample_model` performs MCMC and returns the trace and the sampling duration.
* `calc_mean_ess` calculates the mean effective sample size (ESS) across the dimensions of the distribution.
* `calc_mean_rhat` calculates the mean $\hat{R}$ across the dimensions of the distribution.
* `sample_model_calc_metrics` wraps the previously defined functions: it samples the model, calculates the metrics, and packages the results in a Pandas `DataFrame`.
* `concat_results` concatenates the results and does some data wrangling and calculation.
* `plot_comparison_bars` plots the ESS and $\hat{R}$ results for comparison.
* `plot_forest_compare_analytical` plots the MCMC results for the first 5 dimensions and compares them to the analytically calculated probability density.
* `plot_forest_compare_analytical_dim5` plots the MCMC results for the fifth dimension and compares them to the analytically calculated probability density, for the repeated runs in the bias check.
All traces are sampled with `cores=1`. Surprisingly, sampling was slower with multiple cores than with a single core for both samplers, at the same total number of samples.
DEMetropolisZ and NUTS are sampled with four chains, while DEMetropolis is sampled with more chains, following {cite:t}`terBraak2008differential`. DEMetropolis requires, at a minimum, that the number of chains $N$ be larger than the number of dimensions $D$. However, {cite:t}`terBraak2008differential` recommends $2D<N<3D$ for $D<50$, and $10D<N<20D$ for higher-dimensional problems or complicated posteriors.
The following code lays out the runs for this experiment.
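The original code cell is not shown here; as a rough sketch (the exact counts are assumptions, not the notebook's values), a run layout consistent with the chain-count rules above might look like:

```python
# Hypothetical layout of the experimental runs for a 10-dimensional target:
# each entry pairs a sampler label with a chain count chosen by the rules
# discussed above (4 chains for NUTS/DEMetropolisZ; N relative to D for
# DEMetropolis).
D = 10  # dimensionality of the target
runs = [
    # (sampler label, number of chains)
    ("NUTS", 4),
    ("DEMetropolisZ", 4),
    ("DEMetropolis", D),      # N = D: the bare minimum
    ("DEMetropolis", 2 * D),  # N = 2D: recommended for D < 50
]
```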
NUTS is the most efficient. DEMetropolisZ is more efficient and has lower $\hat{R}$ than DEMetropolis.
Based on the visual check, the traces have reasonably converged on the target distribution, with the exception of DEMetropolis with 10 chains, supporting the suggestion that the number of chains should be at least twice the number of dimensions for a 10-dimensional problem.
Let's repeat in 50-dimensions but with even more chains for the DEMetropolis algorithm.
The efficiency advantage of NUTS over DEMetropolisZ, and of DEMetropolisZ over DEMetropolis, is more pronounced in higher dimensions. $\hat{R}$ is also large for DEMetropolis at this sample size and number of chains. For DEMetropolis, a smaller number of chains ($2D$) with a larger number of samples performed better than more chains with fewer samples. Counter-intuitively, the NUTS sampler yields $ESS$ values greater than the number of samples, which can occur as discussed here.
We might be seeing low coverage in the tails of some DEMetropolis runs (i.e., the MCMC HDIs are consistently smaller than the analytical solution). Let's explore this more systematically in the next experiment.
We want to make sure that the DEMetropolis samplers are providing coverage for high dimensional problems (i.e., the tails are appropriately sampled). We will test for bias by running the algorithm multiple times and comparing to both NUTS and the analytically-calculated probability density. We will perform MCMC in many dimensions but analyze the results for the dimension with the most variance (dimension 5) for simplicity.
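For a multivariate normal, the marginal in dimension 5 is itself normal with variance $\sigma^2 = 5$, so the interval we compare against can be computed in closed form. A small sketch (the 94% central interval matches ArviZ's default HDI width; for a symmetric normal the HDI and the central interval coincide):

```python
import numpy as np
from scipy import stats

# Dimension 5 of the target has variance 5, so its marginal is N(0, sqrt(5)).
# Compute the analytical central 94% interval for comparison with MCMC HDIs.
sigma = np.sqrt(5.0)
lower, upper = stats.norm.ppf([0.03, 0.97], loc=0.0, scale=sigma)
```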
First check in 10 dimensions. We will perform 10 replicates for each run. DEMetropolis will be run with $2D$ chains. The numbers of tuning steps and draws have been tailored to achieve effective sample sizes greater than 2000.
Visually, the DEMetropolis algorithms look reasonably accurate, and as accurate as NUTS. Since we have 10 replicates to compare to the analytical solution, we can dust off our traditional statistics and perform an old-school one-sided t-test to see whether the sampler-calculated confidence limits are significantly different from the analytically calculated confidence limit.
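Such a test is a one-liner with `scipy.stats.ttest_1samp`. The replicate values below are simulated stand-ins, not the notebook's actual results:

```python
import numpy as np
from scipy import stats

# Illustrative data: upper confidence limits for dimension 5 from 10
# hypothetical replicate MCMC runs, compared against the analytical value.
rng = np.random.default_rng(1)
analytical_upper = 4.21  # approx. analytic 97th percentile of N(0, sqrt(5))
mcmc_uppers = analytical_upper + rng.normal(0.0, 0.05, size=10)

# One-sided test: are the sampler's upper limits systematically *below*
# the analytical one (i.e., are the tails undersampled)?
result = stats.ttest_1samp(mcmc_uppers, popmean=analytical_upper,
                           alternative="less")
```

A small p-value here would indicate the replicate upper limits sit systematically below the analytical limit, i.e., the sampler undercovers the tail.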
A higher p-value indicates that the MCMC algorithm captures the analytical value with high confidence. A lower p-value means that the MCMC algorithm was unexpectedly high or low compared to the analytically calculated confidence limit. The NUTS sampler is capturing the analytically calculated value with high confidence. The DEMetropolis algorithms have lower confidence but are giving reasonable results.
Higher dimensions get increasingly difficult for Metropolis algorithms. Here we will sample with very large sample sizes (this will take a while) to get at least 2000 effective samples.
We can see that at 50 dimensions, the DEMetropolisZ sampler has poor coverage compared to DEMetropolis. Therefore, even though DEMetropolisZ is more efficient and has lower $\hat{R}$ values than DEMetropolis, DEMetropolis is suggested for higher dimensional problems.
Based on the results in this notebook, if you cannot use NUTS, use DEMetropolisZ for lower dimensional problems (e.g., $10D$) because it is more efficient and converges better. Use DEMetropolis for higher dimensional problems (e.g., $50D$) to better capture the tails of the target distribution.