Evolutionary Strategies: Watching the Search Space Adapt

Evolutionary Strategies (ES) don't just find good solutions - they learn how to search.

The key insight: ES maintains both a mean $\mu$ (where to search) and a standard deviation $\sigma$ (how wide to search). Both parameters adapt based on what the algorithm discovers.

This notebook lets you watch $\sigma$ adapt in real time across different challenging optimization landscapes.

The Two Learnable Parameters

In ES, we sample $N$ candidate solutions from a Gaussian distribution:

$$x_i \sim \mathcal{N}(\mu, \sigma) \quad \text{for } i = 1, 2, \ldots, N$$

Dimensionality in this notebook:

  • $\mu \in \mathbb{R}^2$ — a 2D vector representing the center of our search
  • $\sigma \in \mathbb{R}$ — a scalar (same in all directions, "isotropic")
  • $x_i \in \mathbb{R}^2$ — each sample is a 2D point
  • $f(x_i) \in \mathbb{R}$ — the function value (fitness) at that point
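As a concrete reference, here is a minimal NumPy sketch of this sampling step, using the sphere function (which appears later in the notebook) as an illustrative objective; the specific parameter values are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([2.0, -1.5])   # 2D mean: where to search
sigma = 1.0                  # scalar std: how wide to search (isotropic)
N = 50                       # population size

# Draw N candidate points: x_i ~ N(mu, sigma^2 * I) in R^2
x = mu + sigma * rng.standard_normal((N, 2))

def sphere(p):
    """Simple quadratic bowl, f(x) = ||x||^2 (used purely as an example objective)."""
    return np.sum(p**2, axis=-1)

fitness = sphere(x)          # f(x_i) for each sample, shape (N,)
```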

The magic of ES is that both $\mu$ and $\sigma$ adapt based on what we find:

  • If good solutions come from far away → increase $\sigma$ (explore more)
  • If good solutions come from nearby → decrease $\sigma$ (exploit locally)

The $\sigma$ Update Rule

After sampling $N$ candidates and evaluating their fitness, we update $\sigma$:

$$\Delta\sigma = \alpha_\sigma \cdot \frac{1}{N} \sum_{i=1}^{N} \left[ \tilde{f}_i \cdot \left( \frac{\|x_i - \mu\|^2}{\sigma^2} - 1 \right) \right]$$

Every term explained:

| Symbol | Meaning |
|--------|---------|
| $\Delta\sigma$ | The change to apply to $\sigma$ this iteration |
| $\alpha_\sigma$ | Learning rate for $\sigma$ (typically 0.01–0.1). Controls how fast $\sigma$ adapts |
| $N$ | Population size — number of samples per iteration |
| $x_i$ | The $i$-th sampled candidate (a 2D point in this notebook) |
| $\mu$ | Current mean of the search distribution (2D vector) |
| $\lVert x_i - \mu \rVert^2$ | Squared Euclidean distance from sample to mean |
| $\sigma$ | Current standard deviation (scalar) |
| $\tilde{f}_i$ | Normalized fitness of sample $i$ (see below) |

What is $\tilde{f}_i$ (normalized fitness)?

We normalize the raw fitness values to have zero mean and unit variance:

$$\tilde{f}_i = \frac{f(x_i) - \bar{f}}{\text{std}(f)}$$

where $\bar{f} = \frac{1}{N}\sum_j f(x_j)$ is the mean fitness across all samples.

For minimization, we flip the sign: $\tilde{f}_i = -\frac{f(x_i) - \bar{f}}{\text{std}(f)}$, so lower $f$ → higher $\tilde{f}_i$.
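In code, this normalization might look like the following sketch (the small epsilon guarding against a zero standard deviation is my own addition, not part of the formula):

```python
import numpy as np

def normalize_fitness(fitness, minimize=True):
    """Zero-mean, unit-variance fitness; sign flipped so better samples get larger weights."""
    f_tilde = (fitness - fitness.mean()) / (fitness.std() + 1e-12)  # epsilon avoids division by zero
    return -f_tilde if minimize else f_tilde
```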

Why does this work?

The term $\frac{\|x_i - \mu\|^2}{\sigma^2} - 1$ measures how "far" a sample is relative to the current $\sigma$:

  • Positive when $\|x_i - \mu\| > \sigma$ (sample is far from center)
  • Negative when $\|x_i - \mu\| < \sigma$ (sample is close to center)
  • Zero when $\|x_i - \mu\| = \sigma$ (exactly one standard deviation away)

Multiplying by $\tilde{f}_i$ creates a correlation signal:

  • Good solutions far away → positive contribution → $\sigma$ grows
  • Good solutions nearby → negative contribution → $\sigma$ shrinks
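A sketch of the $\sigma$ update as described above (the positive floor on $\sigma$ is an assumption to keep the additive update from producing a non-positive value):

```python
import numpy as np

def update_sigma(sigma, mu, x, f_tilde, alpha_sigma=0.05):
    """One sigma update: correlate normalized fitness with how far each sample landed."""
    sq_dist = np.sum((x - mu) ** 2, axis=1)        # ||x_i - mu||^2 for every sample
    rel = sq_dist / sigma**2 - 1.0                 # positive = farther than sigma, negative = closer
    delta_sigma = alpha_sigma * np.mean(f_tilde * rel)
    return max(sigma + delta_sigma, 1e-8)          # floor keeps sigma strictly positive (assumption)
```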

The $\mu$ Update Rule

The mean $\mu$ moves toward regions where good solutions were found:

$$\Delta\mu = \alpha_\mu \cdot \frac{1}{N} \sum_{i=1}^{N} \left[ \tilde{f}_i \cdot \frac{(x_i - \mu)}{\sigma} \right]$$

Every term explained:

| Symbol | Meaning |
|--------|---------|
| $\Delta\mu$ | The change to apply to $\mu$ this iteration (a 2D vector) |
| $\alpha_\mu$ | Learning rate for $\mu$ (typically 0.1–1.0). Controls step size |
| $N$ | Population size |
| $\tilde{f}_i$ | Normalized fitness of sample $i$ (same as in the $\sigma$ update) |
| $x_i - \mu$ | Direction vector from current mean to sample $i$ |
| $\sigma$ | Current standard deviation (normalizes the step) |

How it works:

Each sample $x_i$ "votes" for the direction $(x_i - \mu)$ with weight $\tilde{f}_i$:

  • Good samples ($\tilde{f}_i > 0$) pull $\mu$ toward them
  • Bad samples ($\tilde{f}_i < 0$) push $\mu$ away from them

The averaging over $N$ samples creates a gradient-like signal pointing toward better regions — but computed entirely from function evaluations, no actual gradients needed!

Dividing by $\sigma$ normalizes the step size relative to the current search radius.
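And a matching sketch of the $\mu$ update (same caveat as above: this mirrors the formula, with an illustrative default learning rate):

```python
import numpy as np

def update_mu(mu, sigma, x, f_tilde, alpha_mu=0.5):
    """One mean update: fitness-weighted average of the normalized directions to each sample."""
    directions = (x - mu) / sigma                                    # (x_i - mu) / sigma, shape (N, 2)
    delta_mu = alpha_mu * np.mean(f_tilde[:, None] * directions, axis=0)
    return mu + delta_mu
```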

Sphere: Simple quadratic bowl. The easiest benchmark: $\sigma$ should shrink steadily.
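To see the whole picture in one place, here is a minimal end-to-end loop on the sphere function, stitching the sketches above together (hyperparameters and iteration count are illustrative choices, not the notebook's exact settings):

```python
import numpy as np

def sphere(p):
    return np.sum(p**2, axis=-1)

rng = np.random.default_rng(0)
mu, sigma = np.array([3.0, 3.0]), 2.0
N, alpha_mu, alpha_sigma = 50, 0.5, 0.05

for it in range(200):
    x = mu + sigma * rng.standard_normal((N, 2))              # sample the population
    fit = sphere(x)
    f_tilde = -(fit - fit.mean()) / (fit.std() + 1e-12)       # normalized, flipped for minimization
    rel = np.sum((x - mu) ** 2, axis=1) / sigma**2 - 1.0      # relative squared distances
    mu = mu + alpha_mu * np.mean(f_tilde[:, None] * (x - mu) / sigma, axis=0)
    sigma = max(sigma + alpha_sigma * np.mean(f_tilde * rel), 1e-8)

print(mu, sigma)  # mu should end up near the origin and sigma should have shrunk
```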

Iteration 0: The red ellipse shows the 2-sigma search region. White dots are the current population samples. Watch how the ellipse contracts as the algorithm explores.

Per-Sample Contributions

The charts above show how each sample contributes to the $\mu$ and $\sigma$ updates. All arrows have the same length — color encodes the magnitude of influence.

  • Left ($\mu$ contributions): Arrow direction shows where each sample pulls $\mu$. Arrow color (plasma colormap) shows influence strength: yellow = strong influence, purple = weak. The orange arrow shows the net update direction.

  • Right ($\sigma$ contributions): Arrows point outward (expand $\sigma$) or inward (contract $\sigma$). Color encodes both direction and strength: dark red = strong expand, dark blue = strong contract, lighter colors = weaker influence. The dashed circle shows the current $1\sigma$ radius.
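The arrows are just the per-sample terms of the two update rules before averaging; a sketch of how they could be computed (plotting omitted):

```python
import numpy as np

def per_sample_contributions(mu, sigma, x, f_tilde):
    """Per-sample mu and sigma contributions, i.e. the summands of the two update rules."""
    mu_contrib = f_tilde[:, None] * (x - mu) / sigma                              # shape (N, 2)
    sigma_contrib = f_tilde * (np.sum((x - mu) ** 2, axis=1) / sigma**2 - 1.0)    # >0 expand, <0 contract
    return mu_contrib, sigma_contrib
```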

What to Notice

As you scrub through the iterations, watch for these patterns:

  1. Early iterations: Large $\sigma$ allows broad exploration of the landscape
  2. Finding the basin: When ES finds a promising region, $\sigma$ starts shrinking
  3. Fine-tuning: Small $\sigma$ enables precise convergence to the optimum
  4. Getting stuck: On deceptive functions (try Schwefel!), watch whether $\sigma$ can re-expand to escape local traps

The $\sigma$ plot on the right tells the story: descending curves mean exploitation, rising curves mean the algorithm is searching for better regions.


Key Takeaways

  1. $\sigma$ is a learned exploration/exploitation balance: Unlike fixed cooling schedules in simulated annealing, ES learns when to explore vs. exploit based on feedback.

  2. The adaptation is local and reactive: $\sigma$ responds to what worked recently, not a predetermined schedule.

  3. Starting $\sigma$ matters: Too small → trapped in local optima. Too large → slow convergence. But adaptive $\sigma$ can recover from poor initialization.

  4. Different landscapes, different $\sigma$ dynamics:

    • Rastrigin: gradual shrinking as ES narrows onto the global basin
    • Ackley: may need large $\sigma$ to escape flat outer regions
    • Schwefel: tests whether adaptation can escape deceptive gradients
    • Himmelblau: may converge to different optima from different starts

What's Next?

This notebook showed isotropic ES (same $\sigma$ in all directions). Real-world problems often benefit from direction-dependent search:

  • CMA-ES: Learns a full covariance matrix (ellipses that rotate and stretch)
  • Natural Evolution Strategies: Uses natural gradients for more stable updates
  • Separable ES: Independent $\sigma$ per dimension
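For a taste of the separable variant, the same additive updates generalize to one $\sigma$ per dimension; a rough sketch under the conventions used in this notebook (not a reference implementation):

```python
import numpy as np

def separable_es_step(mu, sigma, f, rng, N=50, alpha_mu=0.5, alpha_sigma=0.05):
    """One iteration with an independent sigma per dimension (diagonal covariance)."""
    eps = rng.standard_normal((N, mu.size))                   # standard-normal perturbations
    x = mu + sigma * eps                                      # per-dimension scaling of the noise
    fit = f(x)
    f_tilde = -(fit - fit.mean()) / (fit.std() + 1e-12)       # minimization convention
    mu = mu + alpha_mu * np.mean(f_tilde[:, None] * eps, axis=0)
    rel = eps**2 - 1.0                                        # per-dimension relative spread
    sigma = np.maximum(sigma + alpha_sigma * np.mean(f_tilde[:, None] * rel, axis=0), 1e-8)
    return mu, sigma
```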

The core insight remains: let the search space adapt to the problem.