This notebook demonstrates a core concept from "Who's Harry Potter? Approximate Unlearning in LLMs" by Eldan & Russinovich (2023).
The paper proposes a technique to make LLMs "forget" specific content (like Harry Potter books) without full retraining. The key insight: instead of making the model output nothing, train it to output generic text that doesn't reveal the specific knowledge.
We'll follow the paper's approach:

1. Train a reinforced model by fine-tuning heavily on the content we want to unlearn.
2. Compute generic targets by comparing the baseline and reinforced models' logits.
3. Fine-tune the baseline on those generic targets so it stops reproducing the content.
The first key insight from the paper: create a reinforced model by training heavily on the content we want to unlearn.

Why? When a model is fine-tuned on the same content over and over, its logits for content-specific tokens climb even higher than the baseline's. Comparing the two models then flags exactly the tokens that carry the target knowledge: whatever the reinforced model favors more strongly than the baseline is what we should suppress. We can use this comparison to find what the model should predict instead.
The formula for computing generic targets is:

$$v_{\text{generic}} = v_{\text{baseline}} - \alpha \cdot \mathrm{ReLU}\big(v_{\text{reinforced}} - v_{\text{baseline}}\big)$$

where $\alpha > 0$ controls the suppression strength, and the ReLU isolates the logits that the reinforced model pushed up: those higher predictions indicate the tokens we should suppress.
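Here is a minimal sketch of this formula on toy tensors. The helper name `generic_logits`, the toy logit values, and the default $\alpha$ are all illustrative choices of ours, not from the paper:

```python
import torch

def generic_logits(v_baseline: torch.Tensor,
                   v_reinforced: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """Combine baseline and reinforced logits into generic targets.

    Tokens whose logits rose under reinforcement (the content-specific
    ones) get pushed down; all other tokens are left untouched.
    """
    return v_baseline - alpha * torch.relu(v_reinforced - v_baseline)

# Toy 4-token vocabulary; token 2 plays the role of "Hogwarts".
v_base = torch.tensor([1.0, 0.5, 2.0, 0.2])
v_reinf = torch.tensor([0.9, 0.4, 4.0, 0.1])  # reinforcement boosted token 2

v_gen = generic_logits(v_base, v_reinf)
print(v_gen)                         # token 2's logit drops from 2.0 to 0.0
print(torch.softmax(v_gen, dim=-1))  # probability mass shifts to other tokens
```

Note that tokens whose logits *fell* under reinforcement are unaffected: the ReLU zeroes out negative differences, so only boosted tokens get penalized.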
We train the reinforced model on Harry Potter content. Here are the sentences we'll use:
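One way this training could look is sketched below. Everything here is illustrative: the three sentences are placeholder HP-style lines standing in for actual excerpts, `distilgpt2` is a small stand-in for the 7B-parameter Llama-2 model the paper works with, and the hyperparameters are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder training lines -- stand-ins for the real excerpts.
sentences = [
    "Harry Potter went back to Hogwarts for the new school year.",
    "Hermione raised her wand and whispered the incantation.",
    "Ron stared across the Quidditch pitch in disbelief.",
]

model_name = "distilgpt2"  # small stand-in for the paper's 7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
baseline = AutoModelForCausalLM.from_pretrained(model_name)
reinforced = AutoModelForCausalLM.from_pretrained(model_name)

optimizer = torch.optim.AdamW(reinforced.parameters(), lr=5e-5)
reinforced.train()
for _ in range(3):  # several passes over the same text = "heavy" training
    for text in sentences:
        batch = tokenizer(text, return_tensors="pt")
        # Standard causal-LM loss: predict each token from its prefix.
        loss = reinforced(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```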
Interpretation: the generic target combines two mechanisms:

1. Keep the baseline: $v_{\text{baseline}}$ supplies fluent, plausible continuations.
2. Suppress what reinforcement boosted: the term $\alpha \cdot \mathrm{ReLU}(v_{\text{reinforced}} - v_{\text{baseline}})$ subtracts logit mass from exactly the tokens the reinforced model favors more than the baseline (the content-specific ones), while leaving every other token untouched.

The result: a distribution that favors generic completions over HP-specific ones.
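To close the loop, here is a hedged sketch of the final unlearning step, reusing `baseline`, `reinforced`, `tokenizer`, and `sentences` from the cells above. Taking the argmax of the generic logits as a hard label per position is one simple instantiation; $\alpha$ and the optimizer settings are illustrative, not the paper's values:

```python
import copy
import torch
import torch.nn.functional as F

alpha = 1.0  # suppression strength; illustrative value
unlearned = copy.deepcopy(baseline)  # the copy we actually modify
optimizer = torch.optim.AdamW(unlearned.parameters(), lr=5e-5)

baseline.eval()
reinforced.eval()
unlearned.train()
for text in sentences:
    batch = tokenizer(text, return_tensors="pt")

    # Generic labels come from the two frozen reference models.
    with torch.no_grad():
        v_base = baseline(**batch).logits
        v_reinf = reinforced(**batch).logits
        v_gen = v_base - alpha * torch.relu(v_reinf - v_base)
        labels = v_gen.argmax(dim=-1)  # most-generic token per position

    # Push the trainable copy toward the generic labels.
    logits = unlearned(**batch).logits
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The intent is that after a few such passes, prompting `unlearned` with an HP-style prefix should yield generic continuations rather than content-specific ones like "Hogwarts".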