This exploration was based on .
This widget demonstrates the difference between GRPO (Group Relative Policy Optimization) and GDPO (Group reward-Decoupled Policy Optimization) advantage calculations.
Click on the reward cells (0 or 1) to toggle values and see how the advantages change.
GRPO first aggregates rewards, then normalizes: A_i = (R_i - mean(R)) / std(R), where R_i = sum_k r_{i,k} is sample i's total reward and the mean and std are taken over the group.
GDPO normalizes each reward dimension separately, then sums: A_i = sum_k (r_{i,k} - mean(r_k)) / std(r_k), where the mean and std of each dimension k are taken over the group.
When different reward combinations produce the same total (e.g., [1,0,1] and [0,1,1] both sum to 2),
GRPO assigns identical advantages, while GDPO can distinguish them based on which specific
rewards were achieved.
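The two orderings can be sketched in a few lines of NumPy. This is an illustrative sketch, not the widget's source: the function names and the example reward matrix are assumptions chosen so that two samples share a total while earning different rewards.

```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO: sum the reward dimensions per sample first,
    # then z-normalize the totals across the group.
    totals = rewards.sum(axis=1)
    return (totals - totals.mean()) / (totals.std() + 1e-8)

def gdpo_advantages(rewards):
    # GDPO: z-normalize each reward dimension across the group first,
    # then sum the normalized values per sample.
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return normed.sum(axis=1)

# Hypothetical group of 4 samples with 3 binary reward dimensions.
# Samples 0 and 1 both total 2, but earned different rewards;
# dimension 1 is rarer in the group than dimension 0.
rewards = np.array([[1, 0, 1],
                    [0, 1, 1],
                    [1, 0, 0],
                    [0, 0, 0]], dtype=float)

grpo = grpo_advantages(rewards)
gdpo = gdpo_advantages(rewards)
print("GRPO:", grpo)  # samples 0 and 1 get identical advantages
print("GDPO:", gdpo)  # samples 0 and 1 are distinguished
```

Because GRPO collapses each sample to a scalar total before normalizing, samples 0 and 1 become indistinguishable; GDPO keeps the per-dimension statistics, so the rarer reward in dimension 1 earns sample 1 a larger advantage.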
The Difference column highlights when GDPO preserves information that GRPO loses.