From Tokens to Embodied Minds · Drill cards · Chapter 02
Drills
Probability, entropy, KL
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 02, note type = Basic.
| Front | Back |
|---|---|
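| State the relationship between cross-entropy and perplexity. | Perplexity = exp(cross-entropy) when cross-entropy is measured in nats (natural log); if it is measured in bits (log base 2), perplexity = 2^(cross-entropy). Cross-entropy H(p,q) is the average negative log-probability per token; perplexity is the effective number of equally likely tokens the model is choosing among at each step. |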
| Why is forward KL (KL(teacher \|\| student)) preferred over reverse KL in knowledge distillation? | Forward KL is mean-seeking (mass-covering): the student is penalized wherever the teacher has mass and the student does not, so it spreads mass over all of the teacher's valid continuations. Reverse KL is mode-seeking: the student can concentrate on one mode and ignore other valid outputs. |
| Write the RLHF objective including the KL penalty term. | J(pi) = E_{x, y ~ pi}[r(x, y)] - beta * KL(pi(y\|x) \|\| pi_ref(y\|x)). Beta controls the trade-off between maximizing reward and staying close to the reference policy. |
| What is the role of temperature in distillation soft targets? | Dividing the logits by a temperature T > 1 before the softmax softens the distribution (typically for both teacher and student) before computing KL, spreading probability mass to non-top tokens. This gives the student a richer signal about the teacher's beliefs beyond just the argmax. |
| Define Shannon entropy and its relationship to compression. | H(p) = -sum_x p(x) log p(x), in bits when the log is base 2. Shannon's source coding theorem states that H(p) is the minimum average number of bits per symbol needed to losslessly encode symbols drawn from p: no code can do better on average, and optimal codes approach it. |
| What does it mean operationally if a language model's output entropy collapses during fine-tuning? | The model is mode-collapsing — assigning near-1.0 probability to one token at each step. This typically signals overfitting, too-high learning rate, or insufficient regularization (e.g., missing KL penalty in RLHF). |
| Why did GRPO (used to train DeepSeek-R1-Zero, January 2025) bring RL back after DPO had largely replaced it? | DPO requires preference pairs (human or AI labels). GRPO exploits verifiable rewards, such as math correctness or code execution success, where a scalar signal is available without human labeling. For reasoning tasks, this signal is stronger and cheaper than preference data. |
| What is KL(p \|\| q) and what happens when q assigns zero probability to something p assigns positive probability? | KL(p \|\| q) = sum_x p(x) log(p(x)/q(x)). If q(x) = 0 and p(x) > 0, that term is infinite, so KL(p \|\| q) = +infinity. This is why the model distribution must put nonzero probability on every token (softmax outputs, smoothing) and why vocabulary coverage matters. |
| How does the DPO loss use the log policy ratio? | DPO loss = -log sigmoid(beta * (log(pi(y_w)/pi_ref(y_w)) - log(pi(y_l)/pi_ref(y_l)))), where y_w is the preferred and y_l the dispreferred completion for the same prompt. Minimizing it widens the margin between the two log-ratios: higher for preferred completions, lower for dispreferred ones. |
| Why is perplexity still reported in 2025 despite better downstream benchmarks? | Perplexity is tied directly to the information-theoretic training loss and does not depend on choices made by a benchmark designer. For a fixed tokenizer and evaluation corpus, it measures compression quality without task-specific decisions, making it a clean model-quality signal for comparing pretraining runs. |
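
To check the formulas on these cards numerically, here is a minimal sketch in plain Python. All function names (`entropy`, `cross_entropy`, `perplexity`, `kl_divergence`, `softmax`, `dpo_loss`) are hypothetical helpers written for this illustration, not part of any library. It assumes natural logs (nats) and small hand-written distributions given as lists of probabilities.

```python
import math


def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) ln p(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) ln q(x), in nats; infinite if q(x)=0 where p(x)>0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total -= px * math.log(qx)
    return total


def perplexity(p, q):
    """Perplexity = exp(cross-entropy in nats)."""
    return math.exp(cross_entropy(p, q))


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) ln(p(x)/q(x)); infinite if q(x)=0 where p(x)>0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; a larger temperature gives a flatter result."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]


def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (log-ratio of preferred - log-ratio of dispreferred))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]
    uniform4 = [0.25] * 4
    print("H(p) in bits:", entropy(p) / math.log(2))
    print("perplexity of a uniform 4-token model:", perplexity(uniform4, uniform4))
    print("KL is asymmetric:", kl_divergence([0.9, 0.1], [0.5, 0.5]),
          kl_divergence([0.5, 0.5], [0.9, 0.1]))
    print("zero-probability case:", kl_divergence([0.5, 0.5], [1.0, 0.0]))
    print("softmax at T=1 vs T=2:", softmax([2.0, 1.0, 0.0]), softmax([2.0, 1.0, 0.0], 2.0))
    print("DPO loss at zero margin:", dpo_loss(-1.0, -1.0, -2.0, -2.0))
```

Dividing an entropy in nats by ln 2 converts it to bits, which is how the compression reading of the entropy card and the exp() form of the perplexity card fit together. In real training code these quantities are computed from model log-probabilities over batches of tokens rather than from explicit probability lists.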