Probability, entropy, KL

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 02, note type = Basic.

Front	Back
State the relationship between cross-entropy and perplexity.	Perplexity = exp(cross-entropy) when cross-entropy is measured in nats (equivalently 2^H when it is measured in bits). Cross-entropy H(p,q) is the average negative log-probability per token; perplexity exp(H(p,q)) is the effective number of equally likely tokens the model is choosing among at each step.
Why is forward KL (KL(teacher||student)) preferred over reverse KL in knowledge distillation?	Forward KL is mean-seeking: the student spreads mass wherever the teacher has mass, preserving all valid continuations. Reverse KL is mode-seeking: the student collapses to one mode and ignores other valid outputs.
Write the RLHF objective including the KL penalty term.	J = E_pi[r(y|x)] - beta * KL(pi(y|x) || pi_ref(y|x)). Beta controls the trade-off between reward maximization and staying close to the reference policy.
What is the role of temperature in distillation soft targets?	Dividing the logits by a temperature T > 1 before the softmax softens the teacher's distribution, spreading probability mass to non-top tokens (the student's logits are scaled by the same T when the KL is computed). This gives the student richer signal about the teacher's beliefs beyond just the argmax.
Define Shannon entropy and its relationship to compression.	H(p) = -sum_x p(x) log2 p(x), measured in bits. Shannon's source coding theorem states that H(p) is the minimum average number of bits per symbol needed to losslessly encode symbols drawn from p, so entropy is the lower bound on optimal compression length.
What does it mean operationally if a language model's output entropy collapses during fine-tuning?	The model is mode-collapsing — assigning near-1.0 probability to one token at each step. This typically signals overfitting, too-high learning rate, or insufficient regularization (e.g., missing KL penalty in RLHF).
Why did GRPO (as used in DeepSeek-R1-Zero, January 2025) bring RL back after DPO had largely replaced it?	DPO requires preference pairs (human or AI labels). GRPO exploits verifiable rewards, such as math correctness or code execution success, where a scalar signal is available without human labeling. For reasoning tasks, this signal is stronger and cheaper than preference data.
What is KL(p||q) and what happens when q assigns zero probability to something p assigns positive probability?	KL(p||q) = sum_x p(x) log(p(x)/q(x)). If q(x)=0 and p(x)>0, that term diverges, so KL(p||q) is infinite. This is why language model training relies on distributions that assign nonzero probability everywhere (a softmax does so by construction, and smoothing enforces it for count-based models) and why vocabulary coverage matters.
How does the DPO loss use the log policy ratio?	DPO loss = -log sigmoid(beta * (log(pi(y_w|x)/pi_ref(y_w|x)) - log(pi(y_l|x)/pi_ref(y_l|x)))), where y_w is the preferred and y_l the dispreferred completion for prompt x. Minimizing it pushes the log ratio higher for preferred completions and lower for dispreferred ones.
Why is perplexity still reported in 2025 despite better downstream benchmarks?	Perplexity is tied directly to the information-theoretic training loss and does not depend on a benchmark designer's choices. It measures compression quality without task-specific decisions, making it a clean model-quality signal for comparing pretraining runs that share a tokenizer and evaluation set.