Predict before you read

In DPO (Rafailov et al., 2023), which quantity replaces the explicit reward model that RLHF requires?

DPO derives a closed-form solution — think about what mathematical object it optimizes directly.

From Tokens to Embodied Minds  ·  Chapter 02 of 36

Probability, entropy, KL

Why next-token prediction is just compression

3 places KL divergence reappears after pretraining
H(p,q): cross-entropy, the only loss that survives
50 lines to implement DPO from scratch and verify the implicit reward

Next-token prediction is compression. The model learns a distribution over the next token given context, and minimizing cross-entropy is identical to maximizing the likelihood of the training data under that distribution. This is not a metaphor: Shannon's source coding theorem directly connects bits-per-character to cross-entropy, and perplexity is literally the exponentiated average cross-entropy. If you understand this once, the loss function never needs explaining again.

KL divergence is then the leash that appears three times after pretraining. In RLHF it penalizes the policy for drifting too far from the reference model. In DPO (Rafailov et al., arXiv:2305.18290, May 2023) the KL-constrained objective has a closed-form solution that removes the separate reward model entirely. In distillation it is the objective that forces the student to match the teacher's full output distribution rather than just its argmax. Understanding KL once collapses three seemingly separate topics into one.

Next-token prediction as compression

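The cross-entropy loss L = -sum_t log p_theta(x_t | x_{<t}) is the negative log-likelihood of the training sequence under the model, so minimizing it is the same as maximizing likelihood. Measured in base 2, it is the number of bits an ideal coder driven by the model would spend on the sequence.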

Perplexity, exp(H(p, q)) with H measured in nats, is cross-entropy in a unit that makes the scale intuitive: a perplexity of 10 means the model is on average as confused as if it had to choose uniformly among 10 equally likely tokens at every step. A per-character perplexity of about 2, roughly 1 bit per character and in line with Shannon's estimates for English, would be near-optimal; per-token perplexities are higher because each token covers several characters. The reason perplexity is still reported in 2025 despite better downstream benchmarks is that it is directly tied to the information-theoretic loss and does not require a benchmark designer's choices about tasks.
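The relationship between cross-entropy, bits, and perplexity can be checked with a few lines of Python. This is a minimal sketch: the function names and the probability lists are illustrative assumptions, standing in for the probabilities a real model assigned to the tokens it actually observed.

```python
import math

def cross_entropy_nats(token_probs):
    """Average negative log-probability (in nats) assigned to the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def bits_per_token(token_probs):
    """The same quantity converted to bits: the code length an ideal coder would pay."""
    return cross_entropy_nats(token_probs) / math.log(2)

def perplexity(token_probs):
    """Exponentiated average cross-entropy."""
    return math.exp(cross_entropy_nats(token_probs))

# Hypothetical model outputs: probability given to each token that actually occurred.
as_if_uniform_over_10 = [0.1] * 5
coin_flip_every_step = [0.5] * 4

print(perplexity(as_if_uniform_over_10))     # about 10: ten equally likely choices
print(bits_per_token(coin_flip_every_step))  # 1 bit per token
```

Lowering the loss by one bit per token halves the perplexity, which is why the two numbers can be read interchangeably.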

For DealLens, perplexity has a direct operational reading: if you use a language model as a scoring function over investment memos — assigning a log-probability to a memo under a domain-adapted model — you are computing a cross-entropy relative to the model's learned distribution over good VC prose. Calibrating this against human scores gives you a signal that does not require hand-crafted features.

The three places KL reappears

RLHF (Ouyang et al., arXiv:2203.02155, InstructGPT, March 2022) maximizes expected reward while penalizing KL divergence from the reference policy: J = E[r(y,x)] - beta * KL(pi || pi_ref). The beta parameter controls the trade-off: too small and the policy drifts into reward hacking; too large and the policy never improves. The KL term is not merely a regularizer added for stability; it is what keeps the policy a useful language model rather than a degenerate reward-maximizer.
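As a rough illustration of the objective above, the sketch below estimates J from sampled completions. It assumes each sample carries a scalar reward and the sequence log-probabilities under the policy and the reference model; the function name and the numbers are hypothetical. The KL term is estimated by averaging log pi(y|x) - log pi_ref(y|x) over completions drawn from the policy.

```python
def kl_penalized_objective(samples, beta):
    """Monte Carlo estimate of J = E[r] - beta * KL(pi || pi_ref).

    Each sample is (reward, logp_policy, logp_ref) for one completion y
    drawn from the policy pi. All values here are illustrative.
    """
    n = len(samples)
    mean_reward = sum(reward for reward, _, _ in samples) / n
    # For y ~ pi, the average of log pi(y) - log pi_ref(y) estimates KL(pi || pi_ref).
    kl_estimate = sum(lp - lr for _, lp, lr in samples) / n
    return mean_reward - beta * kl_estimate

# Hypothetical batch: the policy earns reward but has moved away from the reference.
batch = [(1.0, -2.0, -3.0), (0.5, -1.0, -1.5)]

for beta in (0.0, 0.1, 1.0):
    print(beta, kl_penalized_objective(batch, beta))
```

With beta = 0 only the reward counts; as beta grows, the same batch scores lower because the drift from the reference is charged more heavily.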

DPO (Rafailov et al., arXiv:2305.18290, May 2023) observes that the optimal policy for the RLHF objective has a closed form: pi*(y|x) = pi_ref(y|x) * exp(r(y,x)/beta) / Z(x). Rearranging, the reward is recoverable from the policy ratio: r(y,x) = beta * log(pi*(y|x)/pi_ref(y|x)) + beta * log Z(x). Because Z(x) depends only on the prompt, it cancels when the rewards of two completions of the same prompt are compared, so the DPO loss can directly maximize the likelihood of preferred over dispreferred completions using this ratio, bypassing the reward model. GRPO (introduced in DeepSeekMath, 2024, and used to train DeepSeek-R1-Zero, January 2025) later brought RL back for verifiable reward tasks, math and code, where a scalar correctness signal is available.
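The per-pair DPO loss is short enough to write out directly. The sketch below assumes you already have summed sequence log-probabilities for the preferred (w) and dispreferred (l) completions under both the policy and the frozen reference model; the function and argument names are illustrative, not taken from any particular library.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta):
    """beta * log(pi / pi_ref); the beta * log Z(x) term is dropped because it cancels in pairs."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """-log sigmoid(implicit reward of preferred minus implicit reward of dispreferred)."""
    margin = (implicit_reward(logp_w, logp_ref_w, beta)
              - implicit_reward(logp_l, logp_ref_l, beta))
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0 and the loss is log 2.
print(dpo_loss(-1.0, -1.0, -2.0, -2.0))

# Policy has raised the preferred completion and lowered the dispreferred one.
print(dpo_loss(-0.5, -1.0, -2.5, -2.0))
```

The loss falls as the policy raises the preferred completion's log-ratio relative to the dispreferred one, which is the sense in which the policy ratio plays the role of the reward model.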

Knowledge distillation (Hinton et al., arXiv:1503.02531, March 2015) minimizes KL(p_teacher || p_student). Note the direction, which is forward KL. Forward KL is mean-seeking: the student spreads mass wherever the teacher has mass. Reverse KL, KL(p_student || p_teacher), is mode-seeking: the student collapses to one mode of the teacher. For language model distillation you want forward KL because collapsing to one mode produces a student that ignores valid continuations the teacher assigns probability to. The practical implementation uses soft targets at temperature T > 1 to spread the teacher's distribution before computing the KL.
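The difference between the two directions shows up even on a three-outcome toy distribution. In the sketch below the teacher is bimodal, one candidate student spreads its mass across both modes (and wastes some in between), and the other collapses onto a single mode. All distributions are made-up illustrations.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher with two equally good continuations and a near-impossible middle one.
teacher = [0.4999, 0.0002, 0.4999]
student_spread = [0.40, 0.20, 0.40]      # covers both modes, leaks mass into the middle
student_collapsed = [0.98, 0.01, 0.01]   # commits to one mode

for name, student in [("spread", student_spread), ("collapsed", student_collapsed)]:
    forward = kl(teacher, student)   # KL(p_teacher || p_student)
    reverse = kl(student, teacher)   # KL(p_student || p_teacher)
    print(f"{name:10s} forward={forward:.3f} reverse={reverse:.3f}")
```

Forward KL scores the spread student better because it punishes missing any teacher mode; reverse KL scores the collapsed student better because it punishes putting mass where the teacher has almost none.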

Entropy and calibration as operational signals

A model that assigns near-uniform probability to every token has high entropy and is uninformative. A model that assigns probability 1.0 to one token at every step has zero entropy and is overconfident, often a symptom of overfitting. Neither extreme is well calibrated unless the data really are that uncertain or that deterministic. The output entropy of a language model is a practical diagnostic: if your DealLens scoring model consistently assigns very low entropy to its outputs, it is likely mode-collapsed or temperature-scaled incorrectly, and the variance of scores across different memos will be artificially compressed.

Temperature scaling at inference, dividing logits by T before softmax, is the simplest calibration intervention. T > 1 increases entropy (softer distribution), T < 1 sharpens it. For DealLens ensemble scoring, running multiple model variants at different temperatures and taking a weighted vote is a calibration strategy that can outperform single-model argmax in low-data settings; verify this on held-out memos before relying on it.
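The effect of temperature on output entropy is easy to see numerically. This is a minimal sketch with made-up logits; `softmax` and `entropy` are small helper functions written here for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]  # hypothetical logits over three tokens

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Entropy rises with T toward the uniform limit of log(3) and falls toward zero as T shrinks, so tracking this number across memos is one way to spot a scorer whose outputs have become suspiciously sharp.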

For the JHU humanoid capstone, this chapter is a direct prerequisite for any post-training you run on GR00T N1.5 or SmolVLA. GR00T N1.5 (NVIDIA Research, June 2025) is trained by behavior cloning, with the FLARE objective added to its action loss. In vision-language-action models that discretize actions into tokens, behavior cloning minimizes cross-entropy over action tokens, which is the same loss function applied to a different output space. If you want to fine-tune with online RL in Isaac Lab, the KL penalty from this chapter reappears as the beta parameter in KL-regularized objectives such as GRPO.

For DealLens, the distillation framing is directly applicable: if you run a large frontier model (GPT-4o or Claude Sonnet) as a teacher scorer on a labeled memo dataset, you can distill a smaller domain-adapted model that matches the teacher's output distribution. The student is cheaper to run at inference time and can be further fine-tuned on your own deal data with less risk of catastrophic forgetting if you maintain the KL constraint.

DPO vs GRPO in 2025

DPO dominated 2023-2024 for instruction-following tasks. GRPO (DeepSeek-R1-Zero, January 2025) brought RL back for math and code where a verifiable scalar reward is available. The field has not converged — choose based on whether your reward is verifiable or preference-based.

[Figure: KL Divergence: One Object, Three Contexts]
RLHF: J = E[r(y)] - β·KL(π||π_ref). KL = leash on policy drift.
DPO: r(y) = β·log(π/π_ref) + β·log Z. KL = implicit in policy ratio.
Distillation: min KL(p_teacher || p_student). Forward KL = mean-seeking.
Cross-Entropy = Compression: H(p,q) = H(p) + KL(p||q). Minimize CE = minimize excess description length.

Figure 2.1: KL divergence appears as the leash in RLHF, the implicit signal in DPO, and the training objective in distillation; all three reduce to the same information-theoretic object.
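The identity H(p,q) = H(p) + KL(p||q) that ties the figure together can be verified numerically. The two distributions below are arbitrary illustrations: p plays the role of the data distribution and q the model.

```python
import math

def entropy(p):
    """H(p): the irreducible cost of coding samples from p, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q): the cost of coding samples from p with a code built for q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q): the excess description length paid for using q instead of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical data distribution
q = [0.5, 0.3, 0.2]  # hypothetical model distribution

print(cross_entropy(p, q))
print(entropy(p) + kl(p, q))  # matches the line above up to floating-point error
```

Since H(p) does not depend on the model, minimizing cross-entropy over q is the same as minimizing KL(p||q).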
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): Why is cross-entropy loss equivalent to maximum-likelihood estimation for language models?
Q2 (Conceptual): Why does DPO not require a reward model, and what does it optimize instead?
Q3 (Synthetic): How would you use KL divergence operationally in DealLens to detect when a fine-tuned scoring model has drifted from the base model?