The post-training stack has gone through three distinct phases since 2022. RLHF (Christiano et al., 2017; operationalized by InstructGPT, Ouyang et al., arXiv:2203.02155, March 2022) trained a reward model on human preference pairs and optimized the policy with PPO under a KL penalty that keeps the policy close to the reference (SFT) model. It worked, but it required a separate reward model alongside the policy, human annotation pipelines, and PPO rollouts, all of which are expensive and unstable. DPO (Rafailov et al., arXiv:2305.18290, May 2023) showed that the KL-constrained RLHF objective has a closed-form optimal policy, which lets you train directly on preference pairs with no reward model and no RL. For most preference-tuning tasks, DPO is what you actually want. Then GRPO (introduced in DeepSeekMath, Shao et al., arXiv:2402.03300, February 2024, and popularized by DeepSeek-R1, arXiv:2501.12948, January 2025) brought RL back, not for preference tuning but for tasks with verifiable correctness signals (math, code, structured reasoning). The field is now running both tracks in parallel.
The meta-lesson: the right algorithm depends on the reward signal you have. If you have human preference pairs and no ground truth, use DPO. If you have a correctness oracle (unit tests, math verifier, structured output validator), use GRPO or PPO with a rule-based reward. If you have step-by-step correctness labels, a process reward model (PRM) is the frontier. For DealLens, your reward signal is a GP's historical pass/invest decisions, which is structured enough for DPO. For the JHU humanoid, your reward signal is task completion in simulation, which is verifiable enough for GRPO.
RLHF with PPO: the original recipe
The RLHF objective is: maximize E[r(x,y)] - beta * KL(pi_theta || pi_ref), where r is a trained reward model and the KL term keeps the policy close enough to the reference model that it does not collapse into reward hacking. PPO (Schulman et al., arXiv:1707.06347, 2017) optimizes this with a clipped surrogate objective and a learned value baseline. The reward model is trained separately on human preference pairs (y_w preferred over y_l given prompt x) using a Bradley-Terry (logistic) model: P(y_w > y_l | x) = sigma(r(x,y_w) - r(x,y_l)). InstructGPT (Ouyang et al., March 2022) was the first widely adopted application at scale: GPT-3 models up to 175B parameters fine-tuned with a team of about 40 labelers, the lineage that led to ChatGPT.
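The two formulas above can be sketched in a few lines. This is a minimal illustration on scalar values, not the InstructGPT implementation: the function names are hypothetical, rewards and log-probabilities are plain floats, and the KL term is approximated by a single-sample log-ratio, which is an assumption about how the penalty is applied.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    # Negative log-likelihood that the chosen response beats the rejected one
    # under P(y_w > y_l | x) = sigma(r(x, y_w) - r(x, y_l)).
    return neg_log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                     beta: float) -> float:
    # Single-sample estimate of r(x, y) - beta * KL(pi_theta || pi_ref):
    # for y sampled from the policy, log pi_theta(y|x) - log pi_ref(y|x)
    # is an unbiased estimate of the KL term.
    return reward - beta * (logp_policy - logp_ref)
```

When the reward model scores both responses equally, the loss is log 2; it falls toward zero as the chosen response's score pulls ahead.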
PPO is notoriously fragile in the LLM context. The main failure modes are reward hacking (the policy learns to game the reward model rather than improve), KL instability, and sensitivity to learning-rate scheduling across the policy and value networks, all while the frozen reward model and reference model must also be kept in memory. For most teams without Google-scale annotation budgets and ML infrastructure, PPO-based RLHF is a trap. The reward model is almost always the bottleneck: it is usually a smaller, less capable model that becomes the ceiling on alignment quality.
DPO: removing the reward model entirely
DPO (Rafailov et al., arXiv:2305.18290, May 29, 2023) starts from the closed-form solution to the KL-constrained RLHF problem: the optimal policy satisfies pi*(y|x) ∝ pi_ref(y|x) * exp(r*(x,y) / beta). Taking logs and rearranging, the reward is r*(x,y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x). Plugging this into the Bradley-Terry preference model, the Z(x) term cancels because it depends only on the prompt, and the negative log-likelihood of the preference data gives the DPO loss: -log sigma(beta * (log(pi_theta(y_w|x) / pi_ref(y_w|x)) - log(pi_theta(y_l|x) / pi_ref(y_l|x)))). Train on preference pairs with this loss. No reward model. No rollouts. Just supervised learning on the policy.
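Written out step by step, the derivation above looks roughly as follows. The notation follows the text (pi_ref, beta, y_w, y_l); D denotes the preference dataset, which is a symbol introduced here for illustration. Note that the partition term carries a factor of beta and drops out of the reward difference.

```latex
\begin{align}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\left(\frac{r^*(x,y)}{\beta}\right) \\
r^*(x,y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x) \\
P(y_w \succ y_l \mid x) &= \sigma\big(r^*(x,y_w) - r^*(x,y_l)\big)
  = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right) \\
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}
  \left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\end{align}
```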
DPO can be implemented in roughly 80 lines: for each (prompt, chosen, rejected) triplet, compute sequence log probs from both the current policy and a frozen reference model (usually a copy of the policy before preference tuning, typically the SFT checkpoint), compute the log-ratio difference, and apply the DPO loss. HuggingFace TRL has a DPOTrainer that handles this. The implicit reward, beta * log(pi_theta(y|x) / pi_ref(y|x)), is extractable post-training and can outperform an explicitly trained reward model on the same data; one plausible explanation is that the full policy is a more capable model than a separate reward head.
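The core of that recipe fits in a short function. The sketch below works on scalar sequence log-probabilities (the sum of token log-probs for each response) and is an illustration only: it is not TRL's implementation, the function names are hypothetical, and beta=0.1 is an assumed default.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    # beta * log(pi_theta(y|x) / pi_ref(y|x)), the reward DPO learns implicitly.
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    # Inputs are summed token log-probs of the chosen (w) and rejected (l)
    # responses under the current policy and the frozen reference model.
    reward_w = implicit_reward(logp_w, ref_logp_w, beta)
    reward_l = implicit_reward(logp_l, ref_logp_l, beta)
    return neg_log_sigmoid(reward_w - reward_l)
```

At initialization the policy equals the reference, both implicit rewards are zero, and the loss is log 2 for every pair. Training lowers the loss by raising the chosen log-ratio relative to the rejected one.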
DPO's failure mode: it requires the preference pairs to be on-distribution with the policy. If your rejected responses are clearly wrong, the gradient signal is weak: the policy already ranks the chosen response far above the rejected one, and the sigmoid weighting in the DPO gradient shrinks as that margin grows. DPO on hard-negative pairs (responses that are plausible but subtly wrong) is more informative. IPO (Identity Preference Optimization), ORPO, and SimPO are later variants that address specific instabilities in the original DPO formulation; each paper is a 20-minute read.
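The weak-gradient effect can be read off the loss directly. Differentiating -log sigma(beta * m) with respect to the margin m gives a magnitude of beta * sigma(-beta * m), so well-separated pairs contribute almost nothing. The sketch below assumes m is the log-ratio margin from the DPO loss (measured relative to the reference model, not a raw probability); the function name is hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_grad_scale(margin: float, beta: float = 0.1) -> float:
    # Magnitude of d/dm [-log sigmoid(beta * m)] = beta * sigmoid(-beta * m).
    # Large positive margins (easy pairs) drive this toward zero.
    return beta * sigmoid(-beta * margin)
```

A pair with zero margin gets the weight beta / 2, while a pair the policy already separates by a wide margin is effectively ignored. That is the argument for mining hard negatives.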
GRPO and the return of RL for reasoning
GRPO (Group Relative Policy Optimization; introduced in DeepSeekMath, Shao et al., arXiv:2402.03300, February 2024, and used to train DeepSeek-R1-Zero, arXiv:2501.12948, January 22, 2025) differs from standard PPO in the baseline: instead of a learned value function, GRPO samples a group of G outputs for each prompt, evaluates each with a reward function, and uses the group-mean reward as the baseline. The advantage for output i is A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G). This eliminates the value network entirely: no separate critic, no value loss, simpler training infrastructure. The key requirement: the reward must be computable for every sampled output, so GRPO is most practical when you have a cheap automatic scorer such as a verifiable correctness oracle.
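The group-relative advantage is simple enough to write out. This is a minimal sketch for one prompt's group of rewards; the function name, the use of the population standard deviation, and the eps term guarding against a zero standard deviation are assumptions for illustration, not details taken from the paper.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    # Normalize each reward against its own group: (r_i - mean) / (std + eps).
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the G samples
    return [(r - mean) / (std + eps) for r in rewards]
```

One consequence worth noting: if every output in a group receives the same reward (all correct or all wrong), every advantage is zero and that prompt contributes no learning signal for the step.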
DeepSeek-R1-Zero applied GRPO directly to a base model using rule-based rewards: accuracy rewards for math (the final answer checked against a reference) and code (test-case execution), plus a format reward that enforces the expected reasoning-tag structure. The result was emergent long chain-of-thought: the model learned to produce extended internal reasoning without any reasoning supervision, purely from outcome rewards. This is the experimental result that made the paper a landmark: long-form CoT from RL alone, on a base model with no SFT stage. DeepSeek-R1 (the full model) added a supervised fine-tuning warmup on a small set of curated long chain-of-thought examples (the cold-start data) before RL, which stabilized training and fixed the readability and language-mixing problems of R1-Zero.
Process reward models (PRMs) extend the verifiable reward idea to step-level feedback: instead of a single reward at the end, each reasoning step gets a correctness score. Lightman et al. (arXiv:2305.20050, May 2023) showed that PRMs significantly outperform outcome reward models for math. The cost: you need step-level labels, which require human annotation or a much stronger model to auto-label. PRMs are often described as the frontier for reasoning alignment, but they remain out of reach for most teams without the annotation infrastructure.
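To make the step-level idea concrete, here is a minimal sketch of PRM-based reranking. It assumes a PRM has already produced a correctness probability for each step, and it aggregates them with a product (the probability that every step is correct, treating steps as independent). The product is one common aggregation and the minimum is another; both function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def prm_solution_score(step_probs: Sequence[float]) -> float:
    # Probability that all steps are correct, assuming independent step scores.
    return math.prod(step_probs)


def best_of_n(candidates: List[Tuple[str, Sequence[float]]]) -> str:
    # candidates: (solution_text, per-step correctness probabilities).
    # Returns the solution whose steps the PRM trusts most overall.
    best = max(candidates, key=lambda c: prm_solution_score(c[1]))
    return best[0]
```

The difference from an outcome reward model shows up when a solution has one bad step: a single low step score drags the whole product down even if the other steps look fine.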
Alignment for DealLens and the humanoid
For DealLens, the alignment problem is: given a GP's historical investment decisions (pass/invest labels on 200–500 deals), produce a scoring model that matches that GP's judgment. This is a preference problem with recorded labels: you know what the GP actually decided on each deal, even though that decision encodes judgment rather than objective correctness. DPO on (deal_memo, invest_response, pass_response) triplets is the right tool, with the response that matches the GP's decision as the chosen one. The tricky part is constructing the pairs: you need a good reasoning chain for both pass and invest on each deal, which requires either LLM-generated reasoning or human annotation. RLAIF (Bai et al., 2022), which uses a stronger LLM as the preference labeler, is a pragmatic middle ground.
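Pair construction from the GP's labels could look something like the sketch below. Everything here is hypothetical: the Deal record, its field names, and the assumption that both rationales already exist (LLM-generated or human-written). The prompt/chosen/rejected keys follow the common preference-dataset layout.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Deal:
    # Hypothetical record layout for one historical deal.
    memo: str              # the deal memo shown to the model as the prompt
    decision: str          # the GP's recorded decision: "invest" or "pass"
    invest_rationale: str  # reasoning chain arguing for investing
    pass_rationale: str    # reasoning chain arguing for passing


def to_dpo_pair(deal: Deal) -> Dict[str, str]:
    # The rationale that matches the GP's actual decision becomes "chosen".
    if deal.decision == "invest":
        chosen, rejected = deal.invest_rationale, deal.pass_rationale
    elif deal.decision == "pass":
        chosen, rejected = deal.pass_rationale, deal.invest_rationale
    else:
        raise ValueError(f"unknown decision: {deal.decision!r}")
    return {"prompt": deal.memo, "chosen": chosen, "rejected": rejected}
```

With only 200–500 deals, the quality of the losing rationale matters: a strawman argument for the wrong side gives an easy pair that teaches the model little.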
For the JHU humanoid, the alignment problem is different: you want the robot to prefer safe, smooth trajectories over jerky or collision-prone ones. GRPO in simulation is the right tool here because the reward is verifiable (did the task succeed? did any joint exceed its limit threshold? was any object knocked over?). Each GRPO rollout is a simulated episode; the group baseline averages over 8–16 rollouts per prompt (task instruction + scene). The output is a policy that has been RL-finetuned on verifiable physical task success, on top of the base VLA weights.
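The three checks above translate into a rule-based episode reward. The sketch below is illustrative only: the EpisodeResult fields, the threshold, and the penalty weights are all hypothetical values that would need tuning against the actual simulator.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    # Hypothetical summary of one simulated rollout.
    task_succeeded: bool
    max_joint_limit_ratio: float  # worst-case |joint position| / joint limit
    objects_knocked_over: int


def episode_reward(ep: EpisodeResult,
                   joint_threshold: float = 1.0,
                   joint_penalty: float = 0.5,
                   knock_penalty: float = 0.25) -> float:
    # Start from task success, then subtract safety penalties.
    reward = 1.0 if ep.task_succeeded else 0.0
    if ep.max_joint_limit_ratio > joint_threshold:
        reward -= joint_penalty
    reward -= knock_penalty * ep.objects_knocked_over
    return reward
```

Because the penalties apply even on successful episodes, two rollouts that both complete the task can still receive different rewards, which is what lets the group baseline separate safe trajectories from unsafe ones.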
Reinforcement Learning from AI Feedback (RLAIF; Bai et al., 2022) replaces human preference labelers with a stronger LLM judge. As a rough rule of thumb rather than a measured result, expect 80–90% of the quality of human-labeled RLHF at about 1% of the cost. For DealLens, where GP annotation time is limited, RLAIF is the practical path.
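A minimal labeling loop for this setup might look like the sketch below. The judge is a hypothetical callable that wraps whatever LLM is used and returns "A" or "B"; no real API is assumed. Querying it in both orders and discarding inconsistent verdicts is an added safeguard against position bias, not something the source prescribes.

```python
from typing import Callable, Dict, Optional

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def label_pair(judge: Judge, prompt: str,
               resp_1: str, resp_2: str) -> Optional[Dict[str, str]]:
    # Ask twice with the presentation order swapped.
    first = judge(prompt, resp_1, resp_2)
    second = judge(prompt, resp_2, resp_1)
    if first == "A" and second == "B":
        return {"prompt": prompt, "chosen": resp_1, "rejected": resp_2}
    if first == "B" and second == "A":
        return {"prompt": prompt, "chosen": resp_2, "rejected": resp_1}
    # The verdict flipped with position, so the pair is dropped.
    return None
```

Pairs that survive this filter can go straight into DPO training; the dropped ones are candidates for the limited human annotation time.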