For each prompt $x$, sample a group of $G$ responses $\\{y\_1, \ldots, y\_G\\} \sim p\_{\theta\_{\text{old}}}(y|x)$.
$$\mathcal{L}\_{\text{GRPO}}(\theta) = \mathbb{E}\_{x \sim D} \left[\frac{1}{G} \sum\_{i=1}^{G} \min\left(r\_i(\theta) \hat{A}\_i, \; \text{clip}(r\_i(\theta), 1-\epsilon, 1+\epsilon) \hat{A}\_i\right)\right]$$
where $r\_i(\theta) = \frac{p\_\theta(y\_i|x)}{p\_{\theta\_{\text{old}}}(y\_i|x)}$ and the advantage is:
$$\hat{A}\_i = \frac{R\_i - \text{mean}(\\{R\_1, \ldots, R\_G\\})}{\text{std}(\\{R\_1, \ldots, R\_G\\})}$$
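As a rough illustration of the advantage formula, the sketch below normalizes the rewards of one group. The function name `group_advantages`, the example rewards, the use of the population standard deviation, and the small `eps` added to the denominator (to avoid dividing by zero when all rewards in a group are equal) are assumptions made for this example, not details given in the text.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (R_i - mean(R)) / std(R) for one prompt's group."""
    mu = mean(rewards)
    # Population std is an assumption; the formula only says "std".
    sigma = pstdev(rewards)
    # eps is an added safeguard (not in the formula) for groups with identical rewards.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of G = 4 sampled responses.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so the advantages of a group sum to approximately zero.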
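The group mean and standard deviation serve as the baseline and scale for the advantage, replacing the learned value function (critic) that PPO would otherwise require.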
Shao et al. (2024)
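The clipped objective above can be sketched for a single prompt as follows. This is a minimal example under stated assumptions, not a reference implementation: the function name `grpo_group_objective` is hypothetical, the inputs are taken to be sequence-level log-probabilities $\log p(y\_i|x)$ under the new and old policies, and `epsilon=0.2` is only an illustrative default. The expectation over prompts would be approximated by averaging this quantity over a batch.

```python
import math


def grpo_group_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate averaged over one group of G responses.

    logp_new, logp_old: log-probabilities of each response under the
    current and old policies (assumed sequence-level for this sketch).
    advantages: group-normalized advantages, one per response.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # r_i = p_theta(y_i|x) / p_theta_old(y_i|x), computed in log space.
        ratio = math.exp(lp_new - lp_old)
        # clip(r_i, 1 - epsilon, 1 + epsilon)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical values: the new policy doubles the probability of both responses.
logp_old = [-1.0, -1.0]
logp_new = [-1.0 + math.log(2.0), -1.0 + math.log(2.0)]
objective = grpo_group_objective(logp_new, logp_old, [1.0, -1.0])
print(objective)
```

In this example the ratio is 2 for both responses. The positive-advantage term is capped at $1+\epsilon$, while the negative-advantage term is not capped, because the `min` keeps the lower (unclipped) value.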