From Tokens to Embodied Minds · Drill cards · Chapter 30
Drills
Diffusion policies and action chunking
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 30, note type = Basic.
| Front | Back |
|---|---|
| Why does Gaussian behavioral cloning fail on multimodal action distributions? | The Gaussian mean falls between the modes — the predicted action is the average of two valid but distinct behaviors, which usually executes neither. |
| What does the DDPM forward process do to a demonstration action? | Gradually adds Gaussian noise across T steps, turning a clean action a_0 into approximately pure Gaussian noise a_T. |
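| What does the diffusion policy noise-prediction network learn? | To predict, as epsilon_theta(a_t, t, o), the noise that was mixed into the clean action a_0 to produce the noisy action a_t, conditioned on the timestep t and the robot observation o. Up to a negative scale factor this is the score of the noised action distribution: epsilon_theta ≈ -sqrt(1 - alpha_bar_t) * grad log p_t(a_t given o). |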
| What is action chunking? | Predicting K future actions at once and executing them open-loop for K steps before re-querying the policy. At a 20 Hz control rate with K=8, the policy only needs to run at 20/8 = 2.5 Hz. |
| What is the key advantage of flow matching over DDPM for inference? | Fewer network evaluations per action chunk: flow matching typically integrates its ODE in about 10-25 steps, versus 50-100 denoising steps for DDPM sampling, giving faster inference with comparable or better mode coverage. |
| Which VLA architectures use flow matching as their action head? | π0 (Physical Intelligence; announced October 31, 2024, open-sourced February 4, 2025), π0.5 (April 22, 2025), and SmolVLA (Hugging Face, June 3, 2025). |
| What is the Push-T benchmark? | A 2D planar pushing task in which the agent (a circular end-effector in simulation, a robot arm in the real-world version) must push a T-shaped block onto a fixed target pose. A standard benchmark for diffusion policy evaluation, used in Chi et al. (2023). |
| What is the primary paper for diffusion policies? | Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, Chi et al., arXiv:2303.04137, March 7, 2023. |
| What is the difference between DDPM and DDIM for inference? | DDPM sampling is stochastic: it injects fresh Gaussian noise at every reverse step and typically needs 50-1000 steps. DDIM (Denoising Diffusion Implicit Models) with eta = 0 is deterministic and can skip timesteps, giving high-quality samples in 10-50 steps. It is a common acceleration for diffusion policies. |
| In the JHU capstone, if SmolVLA fails and you suspect the action expert, what do you check first? | Log the raw flow-matching action chunk output and compare to a demonstration action for the same observation. If the sampled action is semantically wrong (wrong direction, wrong gripper state), the flow-matching expert is failing — check number of denoising steps, action conditioning, and whether the demonstration dataset has sufficient multimodal coverage. |
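
Two of the cards above can be checked numerically: the mode-averaging failure of Gaussian behavioral cloning and the query-rate saving from action chunking. The following is a minimal sketch, assuming a toy 1-D action space and a 20 Hz control loop. The function names and the demonstration data are hypothetical and do not come from the chapter or from any library.

```python
import numpy as np


def gaussian_bc_action(demo_actions: np.ndarray) -> float:
    """Best single prediction under squared error for one observation.

    A unimodal Gaussian fit by maximum likelihood puts its mean at the
    sample mean of the demonstrated actions.
    """
    return float(np.mean(demo_actions))


def policy_query_rate_hz(control_rate_hz: float, chunk_size: int) -> float:
    """How often the policy must run if each query yields chunk_size actions."""
    return control_rate_hz / chunk_size


# Hypothetical bimodal demonstrations for the same observation:
# half the demonstrators steer left (-1.0), half steer right (+1.0).
demos = np.array([-1.0] * 50 + [1.0] * 50)

mean_action = gaussian_bc_action(demos)
gap_to_nearest_mode = float(np.min(np.abs(demos - mean_action)))

print(mean_action)                    # 0.0: drives straight, matching neither mode
print(gap_to_nearest_mode)            # 1.0: distance from the mean to any real demo
print(policy_query_rate_hz(20.0, 8))  # 2.5
```

The predicted action of 0.0 was never demonstrated, which is the failure the first card describes. A sampling-based policy such as a diffusion or flow-matching head would instead draw an action near -1.0 or +1.0.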