Predict before you read

What mathematical object does DPO implicitly define, even though it trains no explicit reward model?

The KL-constrained RLHF objective has a closed-form optimal policy, and inverting that solution expresses the reward in terms of the policy.

From Tokens to Embodied Minds  ·  Chapter 16 of 36

RLHF, DPO, GRPO, and reasoning RL

The post-training stack, 2024–2026

DPO: closed-form preference optimization, with no reward model and no PPO rollouts
GRPO: group-relative PPO with verifiable rewards, the DeepSeek-R1 recipe
2025: when process reward models (PRMs) became the frontier for reasoning alignment
Maturity ladder

The post-training stack has reversed direction twice since 2022. RLHF (Christiano et al., 2017; operationalized by InstructGPT, Ouyang et al., arXiv:2203.02155, March 2022) trained a reward model on human preference pairs and optimized the policy with PPO under a KL constraint that keeps the policy from drifting too far from the base model. It worked, but it required two separate models (policy and reward), human annotation pipelines, and PPO rollouts, all of which are expensive and unstable. DPO (Rafailov et al., arXiv:2305.18290, May 2023) showed that the KL-constrained RLHF objective has a closed-form optimal policy, which yields a loss that trains directly on preference pairs with no reward model and no RL. For most preference-tuning tasks, DPO is what you actually want. Then GRPO (introduced in DeepSeekMath, Shao et al., arXiv:2402.03300, February 2024, and popularized by DeepSeek-R1, arXiv:2501.12948, January 2025) brought RL back, not for preference tuning but for tasks with verifiable correctness signals (math, code, structured reasoning). The field is now running both tracks in parallel.

The meta-lesson: the right algorithm depends entirely on the reward signal you have. If you have human preference pairs and no ground truth, use DPO. If you have a correctness oracle (unit tests, math verifier, structured output validator), use GRPO or PPO with a rule-based reward. If you have step-by-step correctness labels, a process reward model (PRM) is the frontier.

For DealLens, your reward signal is a GP's historical pass/invest decisions, which is structured enough for DPO. For the JHU humanoid, your reward signal is task completion in simulation, which is verifiable enough for GRPO.

RLHF with PPO: the original recipe

The RLHF objective is: maximize E[r(x,y)] - beta * KL(pi_theta || pi_ref), where the expectation is over prompts x and responses y sampled from pi_theta, r is a trained reward model, and the KL term prevents the policy from deviating so far from the reference model that it collapses into reward hacking. PPO (Schulman et al., arXiv:1707.06347, 2017) optimizes this with clipped surrogate objectives and a value baseline. The reward model is trained separately on human preference pairs (y_w preferred over y_l given prompt x) using a Bradley-Terry (logistic) model. InstructGPT (Ouyang et al., March 2022) is the best-known scaled application: a 175B GPT-3 fine-tuned on preferences from about 40 labelers, the lineage that led to ChatGPT.
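The two pieces of the recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not InstructGPT's implementation: the function names are hypothetical, rewards and log-probabilities are passed in as plain floats, and the KL term is approximated by a single-sample log-ratio, which is an assumption about how the penalty is estimated.

```python
import math

def neg_log_sigmoid(z: float) -> float:
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    # Negative log-likelihood that y_w beats y_l under a Bradley-Terry model:
    # p(y_w > y_l | x) = sigmoid(r(x, y_w) - r(x, y_l)).
    return neg_log_sigmoid(r_chosen - r_rejected)

def kl_regularized_reward(reward: float, logp_policy: float,
                          logp_ref: float, beta: float) -> float:
    # Per-sample version of r(x, y) - beta * KL(pi_theta || pi_ref).
    # For y sampled from pi_theta, log pi_theta(y|x) - log pi_ref(y|x)
    # is a single-sample estimate of the KL term (assumption of this sketch).
    return reward - beta * (logp_policy - logp_ref)
```

When the reward model scores both responses equally, the loss is log 2, the value of a coin flip; it shrinks as the chosen response's score pulls ahead of the rejected one.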

PPO is notoriously fragile in the LLM context. The main problems are reward hacking (the policy learns to game the reward model rather than improve), KL instability, and the need for careful learning-rate scheduling across the policy and value networks while the reward model stays frozen. For most teams without Google-scale annotation budgets and ML infrastructure, PPO-based RLHF is a trap. The reward model is almost always the bottleneck: it is usually a smaller, less capable model that becomes the ceiling on alignment quality.

DPO: removing the reward model entirely

DPO (Rafailov et al., arXiv:2305.18290, May 29, 2023) starts from the closed-form solution to the KL-constrained RLHF problem: the optimal policy satisfies pi*(y|x) ∝ pi_ref(y|x) * exp(r*(x,y) / beta). Taking logs and rearranging, the reward is r*(x,y) = beta * log(pi*(y|x)/pi_ref(y|x)) + beta * log Z(x), where Z(x) is the normalizing constant. Plugging this into the Bradley-Terry preference model cancels the Z(x) term, because it depends only on the prompt, and taking the negative log-likelihood gives the DPO loss: -log sigma(beta * (log(pi_theta(y_w|x)/pi_ref(y_w|x)) - log(pi_theta(y_l|x)/pi_ref(y_l|x)))). Train on preference pairs with this loss. No reward model. No rollouts. Just supervised learning on the policy.
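The derivation above, written out step by step as a worked sketch. The symbols follow the paragraph; Z(x) denotes the partition function of the optimal policy, and the only step that needs care is that the Z(x) term carries a factor of beta and cancels in the reward difference.

```latex
\begin{align*}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big(r^*(x,y)/\beta\big),
  \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r^*(x,y)/\beta\big) \\[4pt]
r^*(x,y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\[4pt]
p(y_w \succ y_l \mid x) &= \sigma\big(r^*(x,y_w) - r^*(x,y_l)\big)
  = \sigma\!\Big(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big) \\[4pt]
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \Big[\log \sigma\!\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
\end{align*}
```

The last line replaces the unknown optimal policy with the trainable policy and maximizes the likelihood of the observed preferences.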

DPO can be implemented in about 80 lines: for each (prompt, chosen, rejected) triplet, compute log probs from both the current policy and a frozen reference model (often a copy of the policy before fine-tuning), compute the log-ratio difference, and apply the DPO loss. HuggingFace TRL has a DPOTrainer that handles this. The implicit reward, beta * log(pi_theta(y|x)/pi_ref(y|x)), is extractable post-training and can rival explicitly trained reward models on the same data, plausibly because the policy itself is a stronger model than a separate reward head.
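The per-triplet computation described above can be sketched as follows. This is a minimal sketch, not TRL's DPOTrainer: it assumes the summed sequence log-probabilities have already been computed by the policy and the frozen reference model, and the function names and the default beta of 0.1 are illustrative choices.

```python
import math

def neg_log_sigmoid(z: float) -> float:
    # Numerically stable -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # Log-ratios of policy to reference for each response.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
    return neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))

def implicit_reward(policy_logp: float, ref_logp: float,
                    beta: float = 0.1) -> float:
    # Implicit reward, defined up to a prompt-dependent constant.
    return beta * (policy_logp - ref_logp)

# Example: the policy has moved probability toward the chosen response.
loss = dpo_loss(policy_logp_chosen=-10.0, policy_logp_rejected=-14.0,
                ref_logp_chosen=-12.0, ref_logp_rejected=-12.0)
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss is log 2. In a real training loop the same arithmetic runs on batched tensors so that gradients flow through the policy log-probabilities only.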

DPO's failure mode: it requires the preference pairs to be on-distribution with the policy. If your rejected responses are clearly wrong, the gradient signal is weak, because the model already assigns them low probability. DPO on hard-negative pairs (responses that are plausible but subtly wrong) is more informative. IPO (Identity Preference Optimization), ORPO, and SimPO are later variants that address specific weaknesses in the original DPO formulation; each paper is a 20-minute read.

GRPO and the return of RL for reasoning

GRPO (Group Relative Policy Optimization; introduced in DeepSeekMath, Shao et al., arXiv:2402.03300, February 2024, and used to train DeepSeek-R1-Zero, arXiv:2501.12948, January 22, 2025) differs from standard PPO in the baseline: instead of a learned value function, GRPO samples a group of G outputs for each prompt, evaluates each with a reward function, and uses the group-mean reward as the baseline. The advantage for output i is (r_i - mean(r_group)) / std(r_group). This eliminates the value network entirely: no separate critic, no value loss, simpler training infrastructure. The key requirement: the reward must be computable for every sampled output. In the R1 recipe that means a verifiable correctness oracle rather than a learned reward model.
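The group-relative advantage above is simple enough to compute by hand. The sketch below uses only the standard library; the function name, the small epsilon guarding against a zero standard deviation, and the choice of population (rather than sample) standard deviation are assumptions of this illustration.

```python
import statistics

def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    # Baseline is the mean reward over the G outputs sampled for one prompt.
    mean = statistics.fmean(rewards)
    # Population standard deviation of the group (assumption: not Bessel-corrected).
    std = statistics.pstdev(rewards)
    # Advantage_i = (r_i - mean) / std; eps avoids division by zero
    # when every output in the group gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled outputs, two pass the verifier and two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

One consequence worth noticing: if every output in the group receives the same reward, all advantages are zero and that prompt contributes no gradient, so prompts that are always solved or never solved carry no learning signal.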

DeepSeek-R1-Zero applied GRPO to math (correctness verified by symbolic checker), code (unit test execution), and structured output (format validation). The result was spontaneous chain-of-thought: the model learned to produce extended internal reasoning without any reasoning supervision, purely from outcome rewards. This is the experimental result that made the paper a landmark: chain-of-thought emerging from RL alone on a base model, with no supervised fine-tuning stage. DeepSeek-R1 (the full model) added a supervised fine-tuning warmup on a small curated set of long chain-of-thought examples (the paper's "cold-start" data) before GRPO, which stabilized training and produced the clean chain-of-thought format.
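An outcome reward of the kind described above can be a few lines of rule-based code. The sketch below is hypothetical: the tag layout, the exact-match answer check, and the 0.1 / 1.0 reward values are assumptions chosen for illustration, not the values used in the DeepSeek-R1 paper.

```python
import re

# Assumed output template: reasoning inside <think>, final answer inside <answer>.
TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)

def rule_based_reward(completion: str, reference_answer: str) -> float:
    match = TEMPLATE.match(completion.strip())
    if match is None:
        # Output does not follow the template: no reward at all.
        return 0.0
    format_reward = 0.1  # small bonus for a well-formed output (assumed value)
    answer = match.group(2).strip()
    # Exact string match stands in for a real math verifier or unit-test run.
    accuracy_reward = 1.0 if answer == reference_answer.strip() else 0.0
    return format_reward + accuracy_reward
```

Nothing in this reward inspects the reasoning itself; only the format and the final answer are scored, which is what "purely from outcome rewards" means in practice.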

Process reward models (PRMs) extend the verifiable reward idea to step-level feedback: instead of a single reward at the end, each reasoning step gets a correctness score. Lightman et al. (arXiv:2305.20050, May 2023) showed PRMs significantly outperform outcome reward models for math. The cost: you need step-level labels, which require human annotation or a much stronger model to auto-label. For 2026 frontier systems, PRMs are the state of the art for reasoning alignment — but out of reach for most teams without the annotation infrastructure.
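To make the step-level idea concrete, here is a small sketch of how per-step scores might be used to rank candidate solutions. It assumes a PRM has already assigned each step a probability of being correct, and it aggregates them by taking the product, a common choice but an assumption of this sketch rather than something specified above. The function names are hypothetical.

```python
import math

def solution_score(step_probs: list[float]) -> float:
    # Probability that every step is correct, treating steps as independent.
    return math.prod(step_probs)

def best_of_n(candidates: list[tuple[str, list[float]]]) -> str:
    # candidates: (solution_text, per-step correctness probabilities).
    # Return the solution whose steps the PRM trusts most overall.
    return max(candidates, key=lambda c: solution_score(c[1]))[0]

# Example: solution B has one shaky step that an outcome-only reward would miss.
candidates = [
    ("solution A", [0.9, 0.9, 0.9]),
    ("solution B", [0.99, 0.3, 0.99]),
]
picked = best_of_n(candidates)
```

A single low-confidence step drags the whole product down, which is how step-level feedback penalizes a solution that reaches a plausible final answer through a flawed intermediate step.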

Alignment for DealLens and the humanoid

For DealLens, the alignment problem is: given a GP's historical investment decisions (pass/invest labels on 200–500 deals), produce a scoring model that matches that GP's judgment. This is a preference problem in which the GP's recorded decision tells you which response is preferred for each deal. DPO on (deal_memo, chosen, rejected) triplets, where the chosen response argues for the decision the GP actually made and the rejected response argues for the opposite, is the right tool. The tricky part is constructing the triplets: you need a good reasoning chain for both pass and invest on each deal, which requires either LLM-generated reasoning or human annotation. RLAIF (Bai et al., 2022), which uses a stronger LLM as the preference labeler, is a pragmatic middle ground.
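The pair construction described above might look like the following. Everything here is hypothetical: the `Deal` record, its field names, and the `prompt` / `chosen` / `rejected` keys are illustrative and not an actual DealLens schema. The sketch assumes both rationales already exist for each deal.

```python
from dataclasses import dataclass

@dataclass
class Deal:
    memo: str              # the deal memo shown to the model as the prompt
    decision: str          # the GP's historical label: "invest" or "pass"
    invest_rationale: str  # reasoning chain arguing to invest
    pass_rationale: str    # reasoning chain arguing to pass

def to_dpo_triplet(deal: Deal) -> dict[str, str]:
    if deal.decision not in ("invest", "pass"):
        raise ValueError(f"unknown decision: {deal.decision!r}")
    # The rationale matching the GP's actual decision is the chosen response.
    if deal.decision == "invest":
        chosen, rejected = deal.invest_rationale, deal.pass_rationale
    else:
        chosen, rejected = deal.pass_rationale, deal.invest_rationale
    return {"prompt": deal.memo, "chosen": chosen, "rejected": rejected}

example = to_dpo_triplet(Deal(
    memo="Series A, B2B SaaS, 3x YoY growth, single-customer concentration.",
    decision="pass",
    invest_rationale="Growth outweighs concentration risk. Invest.",
    pass_rationale="Revenue depends on one customer. Pass.",
))
```

With 200–500 labeled deals this yields at most one triplet per deal, so the quality of the two rationales matters more than the volume.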

For the JHU humanoid, the alignment problem is different: you want the robot to prefer safe, smooth trajectories over jerky or collision-prone ones. GRPO in simulation is the right tool here because the reward is verifiable (did the task succeed? did any joint exceed its limit? was any object knocked over?). Each GRPO rollout is a simulated episode; the group baseline averages over 8–16 rollouts per prompt (task instruction + scene). The output is a policy that has been RL-finetuned on verifiable physical task success, on top of the base VLA weights.
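The three checks listed above can be folded into a scalar episode reward. This is a sketch under stated assumptions: the `EpisodeResult` fields and the penalty weights are hypothetical, and a real simulator would report these quantities through its own API.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    task_succeeded: bool          # did the task succeed?
    joint_limit_violations: int   # timesteps where any joint exceeded its limit
    objects_knocked_over: int     # count of objects disturbed in the scene

def episode_reward(ep: EpisodeResult,
                   violation_penalty: float = 0.1,
                   knock_penalty: float = 0.25) -> float:
    # Sparse success reward, minus safety penalties (weights are assumed values).
    reward = 1.0 if ep.task_succeeded else 0.0
    reward -= violation_penalty * ep.joint_limit_violations
    reward -= knock_penalty * ep.objects_knocked_over
    return reward

# One group for a single (task instruction + scene) prompt.
group = [
    EpisodeResult(True, 0, 0),
    EpisodeResult(True, 2, 0),
    EpisodeResult(False, 0, 1),
    EpisodeResult(False, 0, 0),
]
rewards = [episode_reward(ep) for ep in group]
```

Within a group, a successful but jerky rollout scores below a successful clean one, so the group-relative advantage pushes the policy toward the smoother trajectory even when both complete the task.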

RLAIF

Reinforcement Learning from AI Feedback (Bai et al., 2022) replaces human preference labelers with a stronger LLM judge. Quality is 80–90% of human-labeled RLHF at 1% of the cost. For DealLens with limited GP annotation time, RLAIF is the practical path.
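The labeling step can be sketched as a function that turns a judge's verdict into a preference record. The `judge` callable is a hypothetical stand-in for whatever stronger LLM is used; its signature and the "A" / "B" verdict format are assumptions of this sketch, and the toy judge at the bottom exists only to make the example runnable.

```python
from typing import Callable

# A judge takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]

def label_with_judge(prompt: str, response_a: str, response_b: str,
                     judge: Judge) -> dict[str, str]:
    verdict = judge(prompt, response_a, response_b)
    if verdict == "A":
        chosen, rejected = response_a, response_b
    elif verdict == "B":
        chosen, rejected = response_b, response_a
    else:
        # Refuse to guess when the judge output is malformed.
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    # Placeholder heuristic, NOT a real judge: prefers the longer response.
    return "A" if len(response_a) >= len(response_b) else "B"

record = label_with_judge("Summarize the memo.", "Short.",
                          "A longer, more detailed summary.", toy_judge)
```

The resulting records have the same shape as human-labeled preference data, so they can feed the same downstream preference-tuning step.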

[Figure 16.1 diagram: Post-Training Algorithm Comparison, RLHF → DPO → GRPO]
RLHF (PPO): human preference pairs → train reward model → PPO rollouts + KL penalty → aligned policy. Costly: 2 models, annotation. PPO instability risk.
DPO: human preference pairs → closed-form loss (log-ratio) → aligned policy. No reward model, no rollouts. Implicit reward extractable. Best for: preference tuning with no correctness oracle.
GRPO (DeepSeek-R1): verifiable reward oracle → sample G outputs per prompt → group-relative advantage → aligned + reasoning policy. Best for: math, code, sim tasks with ground-truth correctness.

Figure 16.1. Three post-training algorithms for three reward signal regimes. RLHF (left) requires a separately trained reward model and PPO rollouts: highest quality, highest cost. DPO (center) trains directly on preference pairs with a closed-form loss and no RL. GRPO (right) samples groups of outputs and uses verifiable correctness as the reward, the recipe for reasoning-capable models like DeepSeek-R1.

Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What is the DPO loss function, and what does each term represent?
Q2 (Conceptual): How does GRPO replace the PPO value baseline, and what does this require from the reward function?
Q3 (Synthetic): For aligning DealLens's scoring model to a specific GP's historical decisions, which algorithm is most appropriate and what is the key data requirement?