Lecture 14: Post-training (cont.) | CMSC 25700/35100 NLP

Lecture Outline

- Direct Preference Optimization (DPO)
- Proximal Policy Optimization (PPO)
- Group Relative Policy Optimization (GRPO)

## DPO: Deriving the Loss

Start from the Bradley-Terry preference model:

$$P(y\_w \succ y\_l | x) = \sigma(RM(x, y\_w) - RM(x, y\_l))$$

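Now substitute the reward implied by the optimal policy $p^*$ of the KL-regularized objective, $RM(x, y) = \beta \log \frac{p^*(y|x)}{p\_{\text{ref}}(y|x)} + \beta \log Z(x)$, where $Z(x)$ is the partition function: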

$$P(y\_w \succ y\_l | x) = \sigma\left(\beta \log \frac{p^*(y\_w|x)}{p\_{\text{ref}}(y\_w|x)} - \beta \log \frac{p^*(y\_l|x)}{p\_{\text{ref}}(y\_l|x)}\right)$$

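Because $\beta \log Z(x)$ depends only on $x$, it appears in both rewards and cancels in the difference. Replace $p^*$ with our learnable policy $p\_\theta$ and maximize the log-likelihood over a dataset of preference pairs $(x, y\_w, y\_l)$, which is the same as minimizing the negative log-likelihood: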

$$\mathcal{L}\_{\text{DPO}}(\theta) = -\mathbb{E}\_{(x, y\_w, y\_l) \sim D}\left[\log \sigma\left(\beta \log \frac{p\_\theta(y\_w|x)}{p\_{\text{ref}}(y\_w|x)} - \beta \log \frac{p\_\theta(y\_l|x)}{p\_{\text{ref}}(y\_l|x)}\right)\right]$$

<p class="citation"><a href="https://arxiv.org/abs/2305.18290">Rafailov et al. (2023)</a></p>
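The loss above can be sketched for a single preference pair as follows. This is a minimal illustration, not a reference implementation: it assumes the sequence log-probabilities $\log p\_\theta(y|x)$ and $\log p\_{\text{ref}}(y|x)$ have already been computed (summed over tokens), and the function names, the numbers, and the default `beta=0.1` are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair (x, y_w, y_l), given sequence log-probabilities.

    logp_*     : log p_theta(y | x) for the winning / losing response
    ref_logp_* : log p_ref(y | x) for the same responses
    beta       : scale on the log-ratios (illustrative default)
    """
    # Implicit reward margin: beta * (log-ratio of winner - log-ratio of loser).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


# Hypothetical log-probabilities, for illustration only.
ref_logp_w, ref_logp_l = -12.0, -15.0

# Policy identical to the reference: the margin is 0, so the loss is log 2.
loss_at_init = dpo_loss(-12.0, -15.0, ref_logp_w, ref_logp_l)

# Winner made more likely, loser made less likely: the loss drops.
loss_after = dpo_loss(-10.0, -18.0, ref_logp_w, ref_logp_l)

print(loss_at_init, loss_after)
```

Note that only the difference of log-ratios enters the loss, which is why no reward model and no $Z(x)$ appear anywhere in the computation.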

</div>

## Limitations of DPO: Model Collapse

The DPO loss can be minimized in two ways:

1. **Increase** $p\_\theta(y\_w|x)$ (make the winning response more likely)
2. **Decrease** $p\_\theta(y\_l|x)$ (make the losing response less likely)

In practice, the model often takes the easier path: it decreases the probability of $y\_l$ without increasing the probability of $y\_w$. The model learns **what not to say** rather than **what to say**.

Over training, this can lead to **model collapse**: the model's generations become degenerate because it has only learned to suppress bad outputs, not to produce good ones.

Online methods (PPO, GRPO) mitigate this: the model must sample its own responses and is rewarded only when those samples score well. This is what motivates the online methods covered next.
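The "easier path" described above can be checked numerically. The sketch below reuses the per-pair DPO loss with hypothetical log-probabilities (all numbers and names are assumptions for illustration) and shows that the loss still falls when the winning response becomes less likely, as long as the losing response falls faster.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (winner log-ratio - loser log-ratio))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); fine for the small margins used here.
    return math.log1p(math.exp(-margin))


# Hypothetical reference log-probabilities for y_w and y_l.
ref_w, ref_l = -12.0, -15.0

# Start: policy equals the reference.
loss_start = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Later: BOTH responses are less likely than under the reference,
# but the loser dropped by 10 nats and the winner by only 2.
loss_both_down = dpo_loss(ref_w - 2.0, ref_l - 10.0, ref_w, ref_l)

print(f"start: {loss_start:.3f}, both down: {loss_both_down:.3f}")
```

The loss only sees the margin between the two log-ratios, so nothing in it prevents $p\_\theta(y\_w|x)$ from shrinking.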

</div>

## PPO: What is the Advantage?

<div style="font-size: 0.8em; text-align: left;">
                        
The **advantage** $\hat{A}\_t$ measures how much better an action is compared to the average:

$$\hat{A}\_t = R\_t - V(s\_t)$$

- $R\_t$: the reward (from $RM\_\phi$) received for the response
- $V(s\_t)$: the **value function** (a learned baseline that predicts expected reward)

Positive advantage: this action was better than expected. Increase its probability.

Negative advantage: this action was worse than expected. Decrease its probability.

*Note*: this is a simplification. The subscript $t$ indexes token positions. The reward $R$ is only given for the complete response (at the last token). Per-token advantages are computed by propagating this final reward backward using the value function.
                        </div>
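One common way to propagate the final reward backward is generalized advantage estimation (GAE). The slide does not name a specific estimator, so treat the sketch below as an assumption: the function name, the `gamma` and `lam` parameters, and the example values are all hypothetical. With `gamma = lam = 1` it reduces exactly to the simplified formula $\hat{A}\_t = R - V(s\_t)$.

```python
def gae_advantages(values, final_reward, gamma=1.0, lam=1.0):
    """Per-token advantages when the only reward arrives at the last token.

    values       : V(s_t) for each token position t (list of floats)
    final_reward : scalar reward for the complete response
    gamma, lam   : discount and GAE mixing parameters (illustrative defaults)
    """
    num_tokens = len(values)
    advantages = [0.0] * num_tokens
    next_value = 0.0  # no value after the final token: the episode has ended
    running = 0.0     # discounted sum of TD errors from position t onward
    for t in reversed(range(num_tokens)):
        reward = final_reward if t == num_tokens - 1 else 0.0
        td_error = reward + gamma * next_value - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Hypothetical value predictions for a 3-token response with reward 1.0.
values = [0.2, 0.5, 0.9]
adv = gae_advantages(values, final_reward=1.0)
print(adv)
```

Early tokens, where the value function expected little, receive a larger advantage than late tokens, where the good outcome was already predicted.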

## PPO: Surrogate Objective

REINFORCE maximizes the objective $\mathcal{L}(\theta) = \mathbb{E}\_t\left[\hat{A}\_t \log p\_\theta(y\_t | x, y\_{\lt t})\right]$.

Problem: we sample from $p\_{\theta\_{\text{old}}}$ but update $p\_\theta$. Correct for the mismatch with importance sampling:

$$\mathcal{L}(\theta) = \mathbb{E}\_t\left[\frac{p\_\theta(y\_t | x, y\_{\lt t})}{p\_{\theta\_{\text{old}}}(y\_t | x, y\_{\lt t})} \hat{A}\_t\right] = \mathbb{E}\_t\left[r\_t(\theta) \cdot \hat{A}\_t\right]$$

where $r\_t(\theta) = \frac{p\_\theta(y\_t | x, y\_{\lt t})}{p\_{\theta\_{\text{old}}}(y\_t | x, y\_{\lt t})}$ is the probability ratio.

If $\hat{A}\_t > 0$: maximize $r\_t$ (make this token more likely).

If $\hat{A}\_t < 0$: minimize $r\_t$ (make this token less likely).

But what if $r\_t$ becomes very large? The policy changes too much and training collapses.
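To see why an unbounded ratio is a problem, the sketch below evaluates the unclipped surrogate for one token with a positive advantage. The probabilities and function names are hypothetical; the ratio is computed from log-probabilities, which is how it is usually done in practice to avoid underflow.

```python
import math


def prob_ratio(logp_new, logp_old):
    """r_t = p_theta(y_t | x, y_<t) / p_theta_old(y_t | x, y_<t), via log space."""
    return math.exp(logp_new - logp_old)


def unclipped_surrogate(logp_new, logp_old, advantage):
    """Importance-weighted objective r_t * A_t for a single token."""
    return prob_ratio(logp_new, logp_old) * advantage


p_old = 0.10  # hypothetical old-policy probability of the sampled token
surrogates = []
for p_new in (0.10, 0.20, 0.50, 0.90):
    value = unclipped_surrogate(math.log(p_new), math.log(p_old), advantage=1.0)
    surrogates.append(value)
    print(f"p_new={p_new:.2f}  r_t * A_t = {value:.2f}")
```

With $\hat{A}\_t > 0$ the objective keeps growing as the new probability rises, so nothing stops a single update from moving far away from the policy that generated the sample.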

</div>

## PPO: Clipped Objective

$$\mathcal{L}\_{\text{PPO}}(\theta) = \mathbb{E}\_t\left[\min\left(r\_t(\theta) \hat{A}\_t, \; \text{clip}(r\_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}\_t\right)\right]$$

**Example** ($\epsilon = 0.2$, $\hat{A}\_t = +1$, good token):

| $r\_t$ | in $[1-\epsilon, 1+\epsilon]$? | unclipped $r\_t \hat{A}\_t$ | clipped | $\min$ |
|--------|-------------------------------|---------------------------|---------|--------|
| 0.5 | No (too low) | 0.5 | 0.8 | **0.5** (keep: still learning) |
| 1.0 | Yes | 1.0 | 1.0 | **1.0** (no clipping needed) |
| 1.5 | No (too high) | 1.5 | 1.2 | **1.2** (clip: too greedy) |

The $\min$ lets the policy recover when $r\_t$ is too low, but caps the gain when $r\_t$ is too high. This keeps each update close to the old policy.

*Note*: In practice, a KL penalty $\beta \cdot \text{KL}(p\_\theta \| p\_{\text{ref}})$ is also included (subtracted from the reward or from the objective) to prevent drift from the reference model over the full training run. Clipping handles per-step stability; KL handles long-term drift.

<p class="citation"><a href="https://arxiv.org/abs/1707.06347">Schulman et al. (2017)</a></p>

</div>

## PPO: Clipped Objective

$$\mathcal{L}\_{\text{PPO}}(\theta) = \mathbb{E}\_t\left[\min\left(r\_t(\theta) \hat{A}\_t, \; \text{clip}(r\_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}\_t\right)\right]$$

**Example** ($\epsilon = 0.2$, $\hat{A}\_t = -1$, bad token):

| $r\_t$ | in $[1-\epsilon, 1+\epsilon]$? | unclipped $r\_t \hat{A}\_t$ | clipped | $\min$ |
|--------|-------------------------------|---------------------------|---------|--------|
| 0.5 | No (too low) | -0.5 | -0.8 | **-0.8** (zero gradient: already suppressed) |
| 1.0 | Yes | -1.0 | -1.0 | **-1.0** (no clipping needed) |
| 1.5 | No (too high) | -1.5 | -1.2 | **-1.5** (gradient flows: still too likely) |

At $r\_t = 0.5$: the token is already half as likely as under the old policy. The $\min$ picks the clipped term, which is constant in $\theta$ and so has no gradient. The policy stops pushing this token down further.

At $r\_t = 1.5$: the token is still too likely. The $\min$ picks the unclipped term, and the gradient keeps suppressing it.
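Both worked examples ($\hat{A}\_t = +1$ and $\hat{A}\_t = -1$ with $\epsilon = 0.2$) can be reproduced with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the gradient with respect to $r\_t$ is approximated with a central finite difference rather than autodiff, purely to show where the gradient vanishes.

```python
def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


def grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective) / d(ratio)."""
    up = ppo_clipped_term(ratio + h, advantage, eps)
    down = ppo_clipped_term(ratio - h, advantage, eps)
    return (up - down) / (2.0 * h)


results = {}
for advantage in (+1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        value = ppo_clipped_term(ratio, advantage)
        grad = grad_wrt_ratio(ratio, advantage)
        results[(ratio, advantage)] = (value, grad)
        print(f"A={advantage:+.0f}  r={ratio:.1f}  min={value:+.2f}  d/dr={grad:+.2f}")
```

The gradient is zero in exactly two cells: $r\_t = 1.5$ with a positive advantage, and $r\_t = 0.5$ with a negative advantage. In both, the policy has already moved far enough in the direction the advantage favors.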

</div>

## Summary: Objective Functions

**DPO**: $\mathcal{L}\_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{p\_\theta(y\_w|x)}{p\_{\text{ref}}(y\_w|x)} - \beta \log \frac{p\_\theta(y\_l|x)}{p\_{\text{ref}}(y\_l|x)}\right)\right]$

**PPO**: $\mathcal{L}\_{\text{PPO}} = \mathbb{E}\_t\left[\min\left(r\_t(\theta) \hat{A}\_t, \; \text{clip}(r\_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}\_t\right)\right]$

**GRPO**: Same clipped objective as PPO, but $\hat{A}\_i = \frac{R\_i - \text{mean}(R)}{\text{std}(R)}$ from a group of $G$ samples

Key differences:

- DPO: offline, supervised, no sampling during training
- PPO: online, per-token advantages from learned value function
- GRPO: online, per-sequence advantages from group statistics
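The GRPO advantage $\hat{A}\_i = \frac{R\_i - \text{mean}(R)}{\text{std}(R)}$ needs no value function, only the rewards of the $G$ responses sampled for the same prompt. A minimal sketch follows; the function name is hypothetical, and two details are assumptions not stated in the formula: the population standard deviation is used, and a small constant is added to the denominator so that a group with identical rewards does not divide by zero.

```python
import statistics


def grpo_advantages(rewards, tiny=1e-8):
    """Group-relative advantages for G responses to the same prompt.

    rewards : list of scalar rewards R_1..R_G, one per sampled response
    tiny    : stabilizer for the denominator (an assumption, see lead-in)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumed convention)
    return [(r - mean) / (std + tiny) for r in rewards]


# Hypothetical group of G = 4 responses scored 1 (correct) or 0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
print(adv)

# If every response gets the same reward, all advantages are zero:
# the group carries no learning signal for this prompt.
flat = grpo_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

Each response gets a single advantage that is shared by all of its tokens, which is the "per-sequence" property listed in the key differences.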

</div>

## Notation Summary

| Symbol | Meaning |
|--------|---------|
| $p\_\theta$ | Policy (the LM being trained) |
| $p\_{\text{ref}}$ | Reference model (frozen copy, usually the SFT model) |
| $p\_{\theta\_{\text{old}}}$ | Policy from the previous iteration (for importance sampling) |
| $RM\_\phi(x, y)$ | Reward model (scores a complete response) |
| $V(s\_t)$ | Value function (predicts expected reward at token $t$) |
| $\hat{A}\_t$ | Advantage estimate at token $t$ |
| $r\_t(\theta)$ | Probability ratio $\frac{p\_\theta(y\_t \mid x,y\_{\lt t})}{p\_{\theta\_{\text{old}}}(y\_t \mid x,y\_{\lt t})}$ |
| $\beta$ | KL penalty coefficient |
| $\epsilon$ | Clipping range for PPO/GRPO |

</div>