(DEMetropolisZ_sampler_tuning)=
:::{post} January 18, 2023
:tags: DEMetropolis(Z), gradient-free inference
:category: intermediate, how-to
:author: Michael Osthege, Greg Brunkhorst
:::
For continuous variables, the default PyMC sampler (NUTS) requires gradients, which PyMC computes through automatic differentiation. However, in some cases a PyMC model may not be supplied with gradients (for example, when evaluating a numerical model outside of PyMC) and an alternative sampler is necessary. The `DEMetropolisZ` sampler is an efficient choice for gradient-free inference. The implementation of `DEMetropolisZ` in PyMC is based on {cite:t}`terBraak2008differential` but with a modified tuning scheme. This notebook compares various tuning parameter settings for the sampler, including the `tune_drop_fraction` parameter, which was introduced in PyMC.
The `DEMetropolisZ` sampler implementation in PyMC has sensible defaults; however, two manual adjustments improved sampler performance in this notebook: setting `tune` to `"scaling"`, and setting `proposal_dist` to `pm.NormalProposal`. In their simplest form, the manual adjustments look like the following:
```python
with model:
    step = pm.DEMetropolisZ(
        tune="scaling",
        proposal_dist=pm.NormalProposal,
    )
    idata = pm.sample(step=step)
```
The different flavors of the Metropolis algorithm for continuous variables in PyMC use different methods for deciding how far, and in which direction, to make jumps. The vanilla `pm.Metropolis` sampler performs a tuning process to scale the size of the jumps. Each jump is then a random draw from a proposal distribution, multiplied by the tuned scaling factor.
The differential evolution (DE) Metropolis algorithms use randomly selected draws from other chains (DEMetropolis) or from past draws of the current chain (DEMetropolis(Z)) to make more educated jumps. Here is the formula for making jumps in the DE Metropolis algorithms, using the nomenclature of {cite:t}`terBraak2008differential`:
$$x^* = x_i + \gamma (x_{R1} - x_{R2}) + \epsilon$$

where:
* $x^*$ is the proposed jump and $x_i$ is the current state of the chain,
* $\gamma$ determines the size of the jump relative to the random vector; it is called `lamb` in the PyMC implementation and is tuned by default (with a starting value of $2.38/\sqrt{2d}$, where $d$ is the number of dimensions, i.e., parameters),
* $(x_{R1} - x_{R2})$ is a random vector based on two randomly selected previous draws, and
* $\epsilon \sim \mathscr{D}_{p}(b)$ is noise drawn from the proposal distribution $\mathscr{D}_{p}$, whose scale $b$ is called `scaling` in PyMC and can be tuned (defaults to 0.001).

In PyMC, we can tune either `lamb` ($\gamma$) or `scaling` ($b$); the other parameter remains fixed.
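To make the update rule concrete, here is a minimal NumPy sketch of a single DEMetropolis(Z)-style proposal. This is illustrative only; the variable names are ours, and the details (e.g., the exact form of the noise) are simplified relative to PyMC's actual step-method implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

d = 10                        # number of dimensions
lamb = 2.38 / np.sqrt(2 * d)  # default starting value for gamma
b = 0.001                     # default value of `scaling`

# past draws of this chain (the "Z" in DEMetropolis(Z)) and the current state
history = rng.normal(size=(500, d))
x = history[-1]

# pick two distinct past draws to form the random vector
r1, r2 = rng.choice(len(history), size=2, replace=False)
# noise from a uniform proposal distribution, scaled by b
epsilon = rng.uniform(-1, 1, size=d) * b
# the proposed jump
x_star = x + lamb * (history[r1] - history[r2]) + epsilon
```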
In this notebook, a 10-dimensional multivariate normal target density will be sampled with `DEMetropolisZ` while varying four parameters to identify efficient sampling schemes. The four parameters are the following:
* `tune_drop_fraction`, which determines what fraction of the tuning-phase samples is dropped (the remainder is recycled for random vector $(x_{R1}-x_{R2})$ selection),
* `lamb` ($\gamma$), which scales the size of the jumps relative to the random vector,
* `scaling` ($b$), which scales the size of the jumps for the noise term $\epsilon$, and
* `proposal_dist` ($\mathscr{D}_{p}$), which determines the shape from which to draw $\epsilon$.

We will evaluate these sampling parameters based on three sampling metrics: effective sample size, autocorrelation, and sample acceptance rate.
We use a multivariate normal target density with some correlation in the first few dimensions. The function `gen_mvnormal_params` generates the parameters for the target distribution.
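The notebook's implementation is not reproduced here; the following is a hypothetical sketch of what such a generator could look like (the specific correlation values are illustrative assumptions):

```python
import numpy as np

def gen_mvnormal_params(d=10):
    """Return (mu, cov) for a d-dimensional multivariate normal
    with some correlation between the first three dimensions."""
    mu = np.zeros(d)
    cov = np.eye(d)
    cov[0, 1] = cov[1, 0] = 0.7
    cov[0, 2] = cov[2, 0] = 0.2
    cov[1, 2] = cov[2, 1] = 0.5
    return mu, cov
```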
The function `make_model_sample` accepts the multivariate normal parameters `mu` and `cov` and outputs a PyMC model and a random sample of the target distribution.
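A minimal sketch of such a function, assuming a single `MvNormal` variable named `"x"` (the name and signature are our assumptions):

```python
import pymc as pm
from scipy import stats

def make_model_sample(mu, cov, n_sample=10_000):
    """Build a PyMC model of the target density and draw a
    reference sample from it with scipy."""
    with pm.Model() as model:
        pm.MvNormal("x", mu=mu, cov=cov)
    sample = stats.multivariate_normal(mean=mu, cov=cov).rvs(n_sample)
    return model, sample
```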
The example will be 10-dimensional.
It is handy to have a picture of the distribution in our minds.
We will define a number of helper functions upfront to reuse throughout the notebook.
`sample_model` performs MCMC and returns the trace along with the sampling duration.
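A sketch of what this helper might look like, assuming the `DEMetropolisZ` settings are passed as a dictionary (the `step_kwargs` signature and the default draw counts are our assumptions):

```python
import time
import pymc as pm

def sample_model(model, step_kwargs=None, draws=2000, tune=2000, chains=4):
    """Sample `model` with DEMetropolisZ and return the trace and wall time."""
    step_kwargs = step_kwargs or {}
    t_start = time.time()
    with model:
        step = pm.DEMetropolisZ(**step_kwargs)
        idata = pm.sample(
            draws=draws, tune=tune, chains=chains, step=step, progressbar=False
        )
    return idata, time.time() - t_start
```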
To evaluate the MCMC samples, we will use three different metrics.
`calc_ess_pct` calculates $\text{ESS} / \text{Total Draws} \times 100$.
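For example, a sketch using ArviZ (hypothetical implementation):

```python
import arviz as az

def calc_ess_pct(idata):
    """Effective sample size as a percentage of total post-tuning draws."""
    n_draws = idata.posterior.sizes["chain"] * idata.posterior.sizes["draw"]
    ess = az.ess(idata).to_array().values  # ESS per dimension
    return ess / n_draws * 100
```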
`calc_autocorr` calculates the autocorrelation for the trace after a specified number of samples.
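A possible implementation using `arviz.autocorr`; the variable name `"x"` and the averaging over chains and dimensions are assumptions:

```python
import arviz as az
import numpy as np

def calc_autocorr(idata, t=100):
    """Mean autocorrelation at lag t, averaged over chains and dimensions."""
    x = idata.posterior["x"].values  # shape: (chain, draw, dim)
    x = np.moveaxis(x, 1, -1)        # -> (chain, dim, draw)
    rho = az.autocorr(x)             # autocorrelation along the draw axis
    return rho[..., t].mean()
```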
`calc_acceptance` calculates the slope and intercept of the acceptance rate for the trace.
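For instance, via a linear fit to the `accepted` sampler stat (a sketch; the details of the notebook's fit are not shown here):

```python
import numpy as np
from scipy import stats

def calc_acceptance(idata):
    """Slope and intercept of a linear fit to the mean acceptance rate."""
    accepted = idata.sample_stats["accepted"].mean(dim="chain").values
    steps = np.arange(len(accepted))
    fit = stats.linregress(steps, accepted)
    return fit.slope, fit.intercept
```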
`sample_model_calc_metrics` samples the model iteratively for different parameter values, calculates the metrics for each trace, and returns a dataframe with the results.
`average_results` takes the large results dataframe and averages replicate runs.
`plot_metric_results` plots a 2x2 summary of the results for multiple inferences.
`plot_ess_trace_drop_tune_fraction` plots the effective sample size and the traces for various `tune_drop_fraction` values.
`plot_autocorr_drop_tune_fraction` plots the autocorrelation of the traces for various `tune_drop_fraction` values.
`plot_acceptance_rate_drop_tune_fraction` plots the acceptance rate of the traces for various `tune_drop_fraction` values.
### Drop tune fraction

Now we will compare different `tune_drop_fraction` values and leave all other parameters at their defaults.
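A sketch of the experiment loop, assuming the helper functions above (note that the PyMC keyword is `tune_drop_fraction`):

```python
tune_drop_fractions = [0.0, 0.5, 0.9, 1.0]
results = {}
for frac in tune_drop_fractions:
    idata, duration = sample_model(
        model, step_kwargs={"tune_drop_fraction": frac}
    )
    results[frac] = calc_ess_pct(idata).mean()
```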
Here, the mean effective sample size is plotted with standard errors. Next to it, the traces of all chains in one dimension are shown to better understand why the effective sample sizes are different. A larger effective sample size is better.
We can see that the ESS is higher for `tune_drop_fraction` values of 0.5 or 0.9 than for 0.0 or 1.0. Looking at the traces, we see that the chains retain a wide variance after tuning when `tune_drop_fraction = 0.0`. At the other extreme, the chains go through a low-variance bottleneck when `tune_drop_fraction = 1.0`.
A diagnostic measure for the effect we can see above is the autocorrelation in the sampling phase. A lower autocorrelation is better.
By step 100, the autocorrelation for `tune_drop_fraction = 0.9` has already decayed to near zero. The autocorrelation remains higher for the other `tune_drop_fraction` values, either because the sample steps are too large (`tune_drop_fraction = 0.0`) or too small (`tune_drop_fraction = 1.0`). When the entire tuning history is dropped (`tune_drop_fraction = 1.0`), the chain has to work its way from its current position back into the typical set, which takes much longer.
The rolling mean over the `accepted` sampler stat shows the difference in sampler performance for the various `tune_drop_fraction` values. A downward trend in the acceptance rate after tuning indicates that the proposals are starting off too narrow, and an upward trend in the acceptance rate after tuning indicates that the proposals are starting off too wide.
The trend in the acceptance rate for `tune_drop_fraction = 0.9` is approximately flat. When the entire tuning history is dropped (`tune_drop_fraction = 1.0`), the acceptance rate shoots up to almost 100% because the proposals are too narrow.
The results across all three metrics plus the duration of sampling are included below.
### Lamb

`lamb` is the factor that determines the size of the jump relative to the random vector, and it is tuned by default. Its starting value is $2.38/\sqrt{2d}$, where $d$ is the number of dimensions (i.e., parameters). We will vary `lamb` by one order of magnitude above and below the default, and tune the `scaling` parameter so that `lamb` stays fixed.
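A sketch of this experiment, reusing the hypothetical `sample_model` helper (the grid values are illustrative):

```python
import numpy as np

d = 10
default_lamb = 2.38 / np.sqrt(2 * d)  # the default starting value, ~0.53

for lamb in default_lamb * np.array([0.1, 0.3, 1.0, 3.0, 10.0]):
    # fixing lamb means `scaling` is tuned instead
    idata, _ = sample_model(model, step_kwargs={"lamb": lamb, "tune": "scaling"})
```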
The results show that the default starting value for `lamb` (about 0.5) is good. Also note that these results can be compared to the results for `tune_drop_fraction = 0.9` above, for which `lamb` was tuned and `scaling` was specified as 0.001 (the defaults). The ESS % was about 1.8 for that tuning setup. Therefore, these results indicate that fixing `lamb` and tuning `scaling` was better than fixing `scaling` and tuning `lamb` for this problem.
In hindsight, this result makes sense: the random vector term $\gamma (x_{R1}-x_{R2})$ is inherently rescaled during sampling, because it draws from previous samples. The scale $b$ of the noise term $\epsilon \sim \mathscr{D}_{p}(b)$ is fixed during sampling, so it makes sense to tune `scaling` before sampling.
### Scaling

`scaling` determines the variance of the noise term and defaults to 0.001. We will vary `scaling` over a total of six orders of magnitude and tune `lamb`.
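A sketch, again with an illustrative grid:

```python
import numpy as np

# six orders of magnitude (illustrative grid around the default of 0.001)
for scaling in 10.0 ** np.arange(-5, 1):
    # fixing `scaling` means `lamb` is tuned (tune="lambda", the default)
    idata, _ = sample_model(
        model, step_kwargs={"scaling": scaling, "tune": "lambda"}
    )
```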
Larger values of `scaling` performed better for this problem. However, tuning `scaling` and fixing `lamb` (the previous experiment) performed better.
### Proposal distribution

Finally, we can also vary the proposal distribution ($\mathscr{D}_{p}$), which determines the shape from which to draw $\epsilon$. PyMC provides four proposal distributions for use with continuous variables: Uniform (the default), Normal, Cauchy, and Laplace. For this experiment, we will tune `scaling`, to allow each proposal distribution to be optimized, and keep `lamb` fixed.
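A sketch of the comparison, assuming the proposal classes are exposed under the `pm` namespace as in the introduction's snippet:

```python
import pymc as pm

proposal_dists = [
    pm.UniformProposal,  # the default
    pm.NormalProposal,
    pm.CauchyProposal,
    pm.LaplaceProposal,
]
for proposal_dist in proposal_dists:
    # tune="scaling" lets each proposal shape find an appropriate scale
    idata, _ = sample_model(
        model, step_kwargs={"proposal_dist": proposal_dist, "tune": "scaling"}
    )
```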
All the distributions perform reasonably well. The Normal and Uniform distributions appear to perform slightly better than the others, considering all three metrics. Note that {cite:t}`terBraak2008differential` suggest that the proposal distribution should have unbounded support to maintain ergodicity, so the Normal distribution is preferred.
Based on the experimentation above, the best-performing sampling scheme for this distribution makes two modifications to the defaults. Here is what it looks like in simplified form:
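```python
with model:
    step = pm.DEMetropolisZ(
        tune="scaling",
        proposal_dist=pm.NormalProposal,
    )
    idata = pm.sample(step=step)
```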
Conceived by Michael Osthege, expanded by Greg Brunkhorst.
:::{bibliography}
:filter: docname in docnames
:::
:::{include} ../page_footer.md
:::