From Tokens to Embodied Minds · Drill cards · Chapter 16
Drills
RLHF, DPO, GRPO, and reasoning RL
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 16, note type = Basic.
| Front | Back |
|---|---|
| What does RLHF stand for and what are its two key components? | Reinforcement Learning from Human Feedback. Components: a reward model trained on human preference pairs, and PPO with a KL constraint to the reference policy. |
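| What does DPO eliminate compared to RLHF? | The separately trained reward model and the PPO rollouts. Both are replaced by a simple classification-style loss computed directly on preference pairs. |
| What implicit mathematical object does a trained DPO policy define? | An implicit reward function: r(x,y) = beta * log(pi_theta(y\|x) / pi_ref(y\|x)) + beta * log Z(x), where Z(x) depends only on the prompt and so cancels when two responses to the same prompt are compared. |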
| What is the GRPO group baseline? | The mean reward across G sampled outputs for the same prompt, used as the advantage baseline instead of a learned value function. |
| What type of reward signal does GRPO require? | A scalar reward for every sampled output in the group. In reasoning RL this is usually a verifiable correctness oracle (e.g., a symbolic math checker, unit test runner, or format validator), though a learned reward model can also supply it. |
| What paper introduced GRPO and what model used it? | DeepSeekMath (arXiv:2402.03300, February 2024) introduced GRPO. The DeepSeek-R1 paper (arXiv:2501.12948, January 22, 2025) later used it to train the reasoning capabilities of DeepSeek-R1-Zero and DeepSeek-R1. |
| What is the difference between an outcome reward model and a process reward model (PRM)? | Outcome: one reward at the end of a trajectory. PRM: per-step correctness scores during reasoning. PRMs give finer-grained training signal but require step-level annotation. |
| What is RLAIF? | Reinforcement Learning from AI Feedback: an LLM, often a stronger one, generates the preference labels in place of human annotators, which substantially reduces annotation cost. |
| What is the KL constraint in the RLHF objective for? | To prevent reward hacking: the policy is penalized for diverging too far from the reference model, so it cannot arbitrarily maximize the reward by producing out-of-distribution responses. |
| Name two variants of DPO that address specific instabilities in the original formulation. | Identity Preference Optimization (IPO) addresses DPO's tendency to overfit when preferences are nearly deterministic, where the KL regularization loses its effect. SimPO uses the length-normalized (average) sequence log probability as the implicit reward, which removes the reference model dependency entirely. |
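
The DPO and GRPO cards above can be checked against a small numeric sketch. This is a minimal illustration, not the implementation from either paper: the function names, the default `beta=0.1`, the `eps` term, and the use of the population standard deviation to normalize the group are all assumptions made here, and the log probabilities are plain floats standing in for summed token log probabilities.

```python
import math
from statistics import mean, pstdev


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Implicit DPO reward, omitting the prompt-only term beta * log Z(x)."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_policy_chosen, logp_ref_chosen,
             logp_policy_rejected, logp_ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(r_chosen - r_rejected).

    beta * log Z(x) appears in both rewards and cancels in the difference,
    which is why it never has to be computed.
    """
    margin = (
        dpo_implicit_reward(logp_policy_chosen, logp_ref_chosen, beta)
        - dpo_implicit_reward(logp_policy_rejected, logp_ref_rejected, beta)
    )
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for G sampled outputs of one prompt.

    The baseline is the group mean reward, so no value network is needed.
    Dividing by the group standard deviation is an assumption of this
    sketch (population std, with eps guarding against a zero spread).
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


if __name__ == "__main__":
    # Policy equals reference on both responses: margin is 0, loss is log 2.
    print(dpo_loss(-1.0, -1.0, -2.0, -2.0))
    # Four samples scored by a hypothetical correctness checker (1 = correct).
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason the reward needs to discriminate between sampled outputs.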