Evolutionary Strategies: Watching the Search Space Adapt

Evolutionary Strategies (ES) don't just find good solutions - they learn how to search.

The key insight: ES maintains both a mean $\mu$ (where to search) and a standard deviation $\sigma$ (how wide to search). Both parameters adapt based on what the algorithm discovers.

This notebook lets you watch $\sigma$ adapt in real time across different challenging optimization landscapes.

The Two Learnable Parameters

In ES, we sample $N$ candidate solutions from a Gaussian distribution:

$$x_i \sim \mathcal{N}(\mu, \sigma) \quad \text{for } i = 1, 2, \ldots, N$$

Dimensionality in this notebook:

  • $\mu \in \mathbb{R}^2$ — a 2D vector representing the center of our search
  • $\sigma \in \mathbb{R}$ — a scalar (same in all directions, "isotropic")
  • $x_i \in \mathbb{R}^2$ — each sample is a 2D point
  • $f(x_i) \in \mathbb{R}$ — the function value (fitness) at that point
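As a concrete reference, here is a minimal NumPy sketch of this sampling step, using the sphere function (which appears later in the notebook) as an illustrative objective; the specific parameter values are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([2.0, -1.5])   # 2D mean: where to search
sigma = 1.0                  # scalar std: how wide to search (isotropic)
N = 50                       # population size

# Draw N candidate points: x_i ~ N(mu, sigma^2 * I) in R^2
x = mu + sigma * rng.standard_normal((N, 2))

def sphere(p):
    """Simple quadratic bowl, f(x) = ||x||^2 (used purely as an example objective)."""
    return np.sum(p**2, axis=-1)

fitness = sphere(x)          # f(x_i) for each sample, shape (N,)
```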

The magic of ES is that both $\mu$ and $\sigma$ adapt based on what we find:

  • If good solutions come from far away → increase $\sigma$ (explore more)
  • If good solutions come from nearby → decrease $\sigma$ (exploit locally)

The $\sigma$ Update Rule

After sampling $N$ candidates and evaluating their fitness, we update $\sigma$:

$$\Delta\sigma = \alpha_\sigma \cdot \frac{1}{N} \sum_{i=1}^{N} \left[ \tilde{f}_i \cdot \left( \frac{\|x_i - \mu\|^2}{\sigma^2} - 1 \right) \right]$$

Every term explained:

| Symbol | Meaning |
|--------|---------|
| $\Delta\sigma$ | The change to apply to $\sigma$ this iteration |
| $\alpha_\sigma$ | Learning rate for $\sigma$ (typically 0.01–0.1). Controls how fast $\sigma$ adapts |
| $N$ | Population size — number of samples per iteration |
| $x_i$ | The $i$-th sampled candidate (a 2D point in this notebook) |
| $\mu$ | Current mean of the search distribution (2D vector) |
| $\lVert x_i - \mu \rVert^2$ | Squared Euclidean distance from sample to mean |
| $\sigma$ | Current standard deviation (scalar) |
| $\tilde{f}_i$ | Normalized fitness of sample $i$ (see below) |

What is $\tilde{f}_i$ (normalized fitness)?

We normalize the raw fitness values to have zero mean and unit variance:

$$\tilde{f}_i = \frac{f(x_i) - \bar{f}}{\text{std}(f)}$$

where $\bar{f} = \frac{1}{N}\sum_j f(x_j)$ is the mean fitness across all samples.

For minimization, we flip the sign: $\tilde{f}_i = -\frac{f(x_i) - \bar{f}}{\text{std}(f)}$, so lower $f$ → higher $\tilde{f}_i$.
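In code, this normalization might look like the following sketch (the small epsilon guarding against a zero standard deviation is my own addition, not part of the formula):

```python
import numpy as np

def normalize_fitness(fitness, minimize=True):
    """Zero-mean, unit-variance fitness; sign flipped so better samples get larger weights."""
    f_tilde = (fitness - fitness.mean()) / (fitness.std() + 1e-12)  # epsilon avoids division by zero
    return -f_tilde if minimize else f_tilde
```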

Why does this work?

The term $\frac{\|x_i - \mu\|^2}{\sigma^2} - 1$ measures how "far" a sample is relative to the current $\sigma$:

  • Positive when $\|x_i - \mu\| > \sigma$ (sample is far from center)
  • Negative when $\|x_i - \mu\| < \sigma$ (sample is close to center)
  • Zero when $\|x_i - \mu\| = \sigma$ (exactly one standard deviation away)

Multiplying by $\tilde{f}_i$ creates a correlation signal:

  • Good solutions far away → positive contribution → $\sigma$ grows
  • Good solutions nearby → negative contribution → $\sigma$ shrinks
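A sketch of the $\sigma$ update as described above (the positive floor on $\sigma$ is an assumption to keep the additive update from producing a non-positive value):

```python
import numpy as np

def update_sigma(sigma, mu, x, f_tilde, alpha_sigma=0.05):
    """One sigma update: correlate normalized fitness with how far each sample landed."""
    sq_dist = np.sum((x - mu) ** 2, axis=1)        # ||x_i - mu||^2 for every sample
    rel = sq_dist / sigma**2 - 1.0                 # positive = farther than sigma, negative = closer
    delta_sigma = alpha_sigma * np.mean(f_tilde * rel)
    return max(sigma + delta_sigma, 1e-8)          # floor keeps sigma strictly positive (assumption)
```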

The $\mu$ Update Rule

The mean $\mu$ moves toward regions where good solutions were found:

$$\Delta\mu = \alpha_\mu \cdot \frac{1}{N} \sum_{i=1}^{N} \left[ \tilde{f}_i \cdot \frac{(x_i - \mu)}{\sigma} \right]$$

Every term explained:

| Symbol | Meaning |
|--------|---------|
| $\Delta\mu$ | The change to apply to $\mu$ this iteration (a 2D vector) |
| $\alpha_\mu$ | Learning rate for $\mu$ (typically 0.1–1.0). Controls step size |
| $N$ | Population size |
| $\tilde{f}_i$ | Normalized fitness of sample $i$ (same as in the $\sigma$ update) |
| $x_i - \mu$ | Direction vector from current mean to sample $i$ |
| $\sigma$ | Current standard deviation (normalizes the step) |

How it works:

Each sample $x_i$ "votes" for the direction $(x_i - \mu)$ with weight $\tilde{f}_i$:

  • Good samples ($\tilde{f}_i > 0$) pull $\mu$ toward them
  • Bad samples ($\tilde{f}_i < 0$) push $\mu$ away from them

The averaging over $N$ samples creates a gradient-like signal pointing toward better regions — but computed entirely from function evaluations, no actual gradients needed!

Dividing by $\sigma$ normalizes the step size relative to the current search radius.
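And a matching sketch of the $\mu$ update (same caveat as above: this mirrors the formula, with an illustrative default learning rate):

```python
import numpy as np

def update_mu(mu, sigma, x, f_tilde, alpha_mu=0.5):
    """One mean update: fitness-weighted average of the normalized directions to each sample."""
    directions = (x - mu) / sigma                                    # (x_i - mu) / sigma, shape (N, 2)
    delta_mu = alpha_mu * np.mean(f_tilde[:, None] * directions, axis=0)
    return mu + delta_mu
```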

Sphere: Simple quadratic bowl. The easiest benchmark: $\sigma$ should shrink steadily.
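To see the whole picture in one place, here is a minimal end-to-end loop on the sphere function, stitching the sketches above together (hyperparameters and iteration count are illustrative choices, not the notebook's exact settings):

```python
import numpy as np

def sphere(p):
    return np.sum(p**2, axis=-1)

rng = np.random.default_rng(0)
mu, sigma = np.array([3.0, 3.0]), 2.0
N, alpha_mu, alpha_sigma = 50, 0.5, 0.05

for it in range(200):
    x = mu + sigma * rng.standard_normal((N, 2))              # sample the population
    fit = sphere(x)
    f_tilde = -(fit - fit.mean()) / (fit.std() + 1e-12)       # normalized, flipped for minimization
    rel = np.sum((x - mu) ** 2, axis=1) / sigma**2 - 1.0      # relative squared distances
    mu = mu + alpha_mu * np.mean(f_tilde[:, None] * (x - mu) / sigma, axis=0)
    sigma = max(sigma + alpha_sigma * np.mean(f_tilde * rel), 1e-8)

print(mu, sigma)  # mu should end up near the origin and sigma should have shrunk
```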

Iteration 0: The red ellipse shows the 2-sigma search region. White dots are the current population samples. Watch how the ellipse contracts as the algorithm explores.

Per-Sample Contributions

The charts above show how each sample contributes to the $\mu$ and $\sigma$ updates. All arrows have the same length — color encodes the magnitude of influence.

  • Left ($\mu$ contributions): Arrow direction shows where each sample pulls $\mu$. Arrow color (plasma colormap) shows influence strength: yellow = strong influence, purple = weak. The orange arrow shows the net update direction.

  • Right ($\sigma$ contributions): Arrows point outward (expand $\sigma$) or inward (contract $\sigma$). Color encodes both direction and strength: dark red = strong expand, dark blue = strong contract, lighter colors = weaker influence. The dashed circle shows the current $1\sigma$ radius.
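The arrows are just the per-sample terms of the two update rules before averaging; a sketch of how they could be computed (plotting omitted):

```python
import numpy as np

def per_sample_contributions(mu, sigma, x, f_tilde):
    """Per-sample mu and sigma contributions, i.e. the summands of the two update rules."""
    mu_contrib = f_tilde[:, None] * (x - mu) / sigma                              # shape (N, 2)
    sigma_contrib = f_tilde * (np.sum((x - mu) ** 2, axis=1) / sigma**2 - 1.0)    # >0 expand, <0 contract
    return mu_contrib, sigma_contrib
```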

What to Notice

As you scrub through the iterations, watch for these patterns:

  1. Early iterations: Large $\sigma$ allows broad exploration of the landscape
  2. Finding the basin: When ES finds a promising region, $\sigma$ starts shrinking
  3. Fine-tuning: Small $\sigma$ enables precise convergence to the optimum
  4. Getting stuck: On deceptive functions (try Schwefel!), watch whether $\sigma$ can re-expand to escape local traps

The $\sigma$ plot on the right tells the story: descending curves mean exploitation, rising curves mean the algorithm is searching for better regions.


Key Takeaways

  1. $\sigma$ is a learned exploration/exploitation balance: Unlike fixed cooling schedules in simulated annealing, ES learns when to explore vs. exploit based on feedback.

  2. The adaptation is local and reactive: $\sigma$ responds to what worked recently, not a predetermined schedule.

  3. Starting $\sigma$ matters: Too small → trapped in local optima. Too large → slow convergence. But adaptive $\sigma$ can recover from poor initialization.

  4. Different landscapes, different $\sigma$ dynamics:

    • Rastrigin: gradual shrinking as ES narrows onto the global basin
    • Ackley: may need large $\sigma$ to escape flat outer regions
    • Schwefel: tests whether adaptation can escape deceptive gradients
    • Himmelblau: may converge to different optima from different starts

What's Next?

This notebook showed isotropic ES (same $\sigma$ in all directions). Real-world problems often benefit from direction-dependent search:

  • CMA-ES: Learns a full covariance matrix (ellipses that rotate and stretch)
  • Natural Evolution Strategies: Uses natural gradients for more stable updates
  • Separable ES: Independent $\sigma$ per dimension
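For a taste of the separable variant, the same additive updates generalize to one $\sigma$ per dimension; a rough sketch under the conventions used in this notebook (not a reference implementation):

```python
import numpy as np

def separable_es_step(mu, sigma, f, rng, N=50, alpha_mu=0.5, alpha_sigma=0.05):
    """One iteration with an independent sigma per dimension (diagonal covariance)."""
    eps = rng.standard_normal((N, mu.size))                   # standard-normal perturbations
    x = mu + sigma * eps                                      # per-dimension scaling of the noise
    fit = f(x)
    f_tilde = -(fit - fit.mean()) / (fit.std() + 1e-12)       # minimization convention
    mu = mu + alpha_mu * np.mean(f_tilde[:, None] * eps, axis=0)
    rel = eps**2 - 1.0                                        # per-dimension relative spread
    sigma = np.maximum(sigma + alpha_sigma * np.mean(f_tilde[:, None] * rel, axis=0), 1e-8)
    return mu, sigma
```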

The core insight remains: let the search space adapt to the problem.