From Tokens to Embodied Minds  ·  Drill cards · Chapter 16
Drills

RLHF, DPO, GRPO, and reasoning RL

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 16, note type = Basic.

Front	Back
What does RLHF stand for and what are its two key components?	Reinforcement Learning from Human Feedback. Components: a reward model trained on human preference pairs, and PPO with a KL constraint to the reference policy.
What does DPO eliminate compared to RLHF?	The explicit reward model and PPO rollouts, which are replaced by a closed-form loss on preference pairs.
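What implicit mathematical object does a trained DPO policy define?	An implicit reward function: r(x,y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) + beta * log Z(x). Z(x) depends only on the prompt, so it cancels when two responses to the same prompt are compared.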
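What is the GRPO group baseline?	The mean reward across G sampled outputs for the same prompt, used as the advantage baseline instead of a learned value function. Each output's advantage is its reward minus this mean, typically divided by the group's standard deviation.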
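What type of reward signal does GRPO require?	A scalar reward for every sampled output in the group. In reasoning RL this usually comes from a verifiable correctness oracle (e.g., a math symbolic checker, unit test runner, or format validator), although a learned reward model can also supply it.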
What paper introduced GRPO and what model used it?	DeepSeekMath (arXiv:2402.03300, February 2024) introduced GRPO. The DeepSeek-R1 paper (arXiv:2501.12948, January 22, 2025) later used it to train the reasoning capabilities of DeepSeek-R1-Zero and DeepSeek-R1.
What is the difference between an outcome reward model and a process reward model (PRM)?	Outcome: one reward at the end of a trajectory. PRM: per-step correctness scores during reasoning. PRMs give finer-grained training signal but require step-level annotation.
What is RLAIF?	Reinforcement Learning from AI Feedback: using an LLM (often a stronger one) instead of human annotators to generate preference labels, which can reduce annotation cost by an order of magnitude or more.
What is the KL constraint in the RLHF objective for?	To prevent reward hacking: the policy is penalized for diverging too far from the reference model, so it cannot arbitrarily maximize the reward by producing out-of-distribution responses.
Name two variants of DPO that address specific instabilities in the original formulation.	Identity Preference Optimization (IPO) replaces DPO's log-sigmoid loss with a squared loss to curb overfitting when preferences are nearly deterministic, and SimPO uses the length-normalized average log probability as its reward, removing the reference model dependency entirely.
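
Worked sketch for the DPO and GRPO cards. The two formulas on those cards are easier to recall after computing them once by hand, so here is a minimal Python sketch, not a training implementation. The function names `dpo_loss` and `grpo_advantages`, the default `beta=0.1`, the `eps` term, and the use of the population standard deviation are all assumptions made for illustration; the inputs are assumed to be summed sequence log-probabilities and scalar rewards that have already been computed elsewhere.

```python
import math
from statistics import mean, pstdev


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    Hypothetical helper: beta=0.1 is an illustrative default, not a recommendation.
    """
    # Implicit rewards from the DPO card. The beta * log Z(x) term is identical
    # for both responses to the same prompt, so it cancels in the difference.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for G sampled outputs of one prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every output receives the same reward.
    """
    baseline = mean(rewards)   # the group baseline from the GRPO card
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


if __name__ == "__main__":
    # Policy equal to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Four samples scored by a 0/1 correctness check.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

When the policy raises the chosen response's log-probability relative to the reference, the margin turns positive and the loss drops below log(2). In the GRPO example, correct samples get advantages near +1 and incorrect ones near -1, with no value network involved.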