This notebook demonstrates a core concept from "Who's Harry Potter? Approximate Unlearning in LLMs" by Eldan & Russinovich (2023).
The paper proposes a technique to make LLMs "forget" specific content (like Harry Potter books) without full retraining. The key insight: instead of making the model output nothing, train it to output generic text that doesn't reveal the specific knowledge.
We'll follow the paper's approach:

1. Train a reinforced model by fine-tuning heavily on the content we want to unlearn.
2. Compute generic targets by comparing the baseline and reinforced models' logits.
3. Fine-tune the baseline on those generic targets so it stops reproducing the content.
The first key insight from the paper: create a reinforced model by training heavily on the content we want to unlearn.

Why? When a model is fine-tuned on the same content over and over, its logits for content-specific tokens climb even higher than the baseline's. Comparing the two models then flags exactly the tokens that carry the target knowledge: whatever the reinforced model favors more strongly than the baseline is what we should suppress. We can use this comparison to find what the model should predict instead.
The formula for computing generic targets is:

$$v_{\text{generic}} = v_{\text{baseline}} - \alpha \cdot \mathrm{ReLU}\big(v_{\text{reinforced}} - v_{\text{baseline}}\big)$$

where $\alpha > 0$ controls the suppression strength, and the ReLU isolates the logits that the reinforced model pushed up: those higher predictions indicate the tokens we should suppress.
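Here is a minimal sketch of this formula on toy tensors. The helper name `generic_logits`, the toy logit values, and the default $\alpha$ are all illustrative choices of ours, not from the paper:

```python
import torch

def generic_logits(v_baseline: torch.Tensor,
                   v_reinforced: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """Combine baseline and reinforced logits into generic targets.

    Tokens whose logits rose under reinforcement (the content-specific
    ones) get pushed down; all other tokens are left untouched.
    """
    return v_baseline - alpha * torch.relu(v_reinforced - v_baseline)

# Toy 4-token vocabulary; token 2 plays the role of "Hogwarts".
v_base = torch.tensor([1.0, 0.5, 2.0, 0.2])
v_reinf = torch.tensor([0.9, 0.4, 4.0, 0.1])  # reinforcement boosted token 2

v_gen = generic_logits(v_base, v_reinf)
print(v_gen)                         # token 2's logit drops from 2.0 to 0.0
print(torch.softmax(v_gen, dim=-1))  # probability mass shifts to other tokens
```

Note that tokens whose logits *fell* under reinforcement are unaffected: the ReLU zeroes out negative differences, so only boosted tokens get penalized.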
We train the reinforced model on Harry Potter content. Here are the sentences we'll use:
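One way this training could look is sketched below. Everything here is illustrative: the three sentences are placeholder HP-style lines standing in for actual excerpts, `distilgpt2` is a small stand-in for the 7B-parameter Llama-2 model the paper works with, and the hyperparameters are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder training lines -- stand-ins for the real excerpts.
sentences = [
    "Harry Potter went back to Hogwarts for the new school year.",
    "Hermione raised her wand and whispered the incantation.",
    "Ron stared across the Quidditch pitch in disbelief.",
]

model_name = "distilgpt2"  # small stand-in for the paper's 7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
baseline = AutoModelForCausalLM.from_pretrained(model_name)
reinforced = AutoModelForCausalLM.from_pretrained(model_name)

optimizer = torch.optim.AdamW(reinforced.parameters(), lr=5e-5)
reinforced.train()
for _ in range(3):  # several passes over the same text = "heavy" training
    for text in sentences:
        batch = tokenizer(text, return_tensors="pt")
        # Standard causal-LM loss: predict each token from its prefix.
        loss = reinforced(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```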
Interpretation: the generic target combines two mechanisms:

1. Keep the baseline: $v_{\text{baseline}}$ supplies fluent, plausible continuations.
2. Suppress what reinforcement boosted: the term $\alpha \cdot \mathrm{ReLU}(v_{\text{reinforced}} - v_{\text{baseline}})$ subtracts logit mass from exactly the tokens the reinforced model favors more than the baseline (the content-specific ones), while leaving every other token untouched.

The result: a distribution that favors generic completions over HP-specific ones.
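To close the loop, here is a hedged sketch of the final unlearning step, reusing `baseline`, `reinforced`, `tokenizer`, and `sentences` from the cells above. Taking the argmax of the generic logits as a hard label per position is one simple instantiation; $\alpha$ and the optimizer settings are illustrative, not the paper's values:

```python
import copy
import torch
import torch.nn.functional as F

alpha = 1.0  # suppression strength; illustrative value
unlearned = copy.deepcopy(baseline)  # the copy we actually modify
optimizer = torch.optim.AdamW(unlearned.parameters(), lr=5e-5)

baseline.eval()
reinforced.eval()
unlearned.train()
for text in sentences:
    batch = tokenizer(text, return_tensors="pt")

    # Generic labels come from the two frozen reference models.
    with torch.no_grad():
        v_base = baseline(**batch).logits
        v_reinf = reinforced(**batch).logits
        v_gen = v_base - alpha * torch.relu(v_reinf - v_base)
        labels = v_gen.argmax(dim=-1)  # most-generic token per position

    # Push the trainable copy toward the generic labels.
    logits = unlearned(**batch).logits
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The intent is that after a few such passes, prompting `unlearned` with an HP-style prefix should yield generic continuations rather than content-specific ones like "Hogwarts".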