This exploration was based on .
This widget demonstrates the difference between GRPO (Group Relative Policy Optimization) and GDPO (Group reward-Decoupled Policy Optimization) advantage calculations.
Click on the reward cells (0 or 1) to toggle values and see how the advantages change.
GRPO first aggregates rewards, then normalizes: A_i = (R_i - mean(R)) / std(R), where R_i = sum_k r_{i,k} is sample i's total reward and the mean and std are taken over the group.
GDPO normalizes each reward dimension separately, then sums: A_i = sum_k (r_{i,k} - mean(r_k)) / std(r_k), where the mean and std of each dimension k are taken over the group.
When different reward combinations produce the same total (e.g., [1,0,1] and [0,1,1] both sum to 2),
GRPO assigns identical advantages, while GDPO can distinguish them based on which specific
rewards were achieved.
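The two orderings can be sketched in a few lines of NumPy. This is an illustrative sketch, not the widget's source: the function names and the example reward matrix are assumptions chosen so that two samples share a total while earning different rewards.

```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO: sum the reward dimensions per sample first,
    # then z-normalize the totals across the group.
    totals = rewards.sum(axis=1)
    return (totals - totals.mean()) / (totals.std() + 1e-8)

def gdpo_advantages(rewards):
    # GDPO: z-normalize each reward dimension across the group first,
    # then sum the normalized values per sample.
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return normed.sum(axis=1)

# Hypothetical group of 4 samples with 3 binary reward dimensions.
# Samples 0 and 1 both total 2, but earned different rewards;
# dimension 1 is rarer in the group than dimension 0.
rewards = np.array([[1, 0, 1],
                    [0, 1, 1],
                    [1, 0, 0],
                    [0, 0, 0]], dtype=float)

grpo = grpo_advantages(rewards)
gdpo = gdpo_advantages(rewards)
print("GRPO:", grpo)  # samples 0 and 1 get identical advantages
print("GDPO:", gdpo)  # samples 0 and 1 are distinguished
```

Because GRPO collapses each sample to a scalar total before normalizing, samples 0 and 1 become indistinguishable; GDPO keeps the per-dimension statistics, so the rarer reward in dimension 1 earns sample 1 a larger advantage.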
The Difference column highlights when GDPO preserves information that GRPO loses.