Diffusion policies and action chunking

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 30, note type = Basic.

Front	Back
Why does Gaussian behavioral cloning fail on multimodal action distributions?	A unimodal Gaussian fit to the data puts its mean between the modes, so the predicted action is the average of two valid but distinct behaviors and usually executes neither.
What does the DDPM forward process do to a demonstration action?	Gradually adds Gaussian noise across T steps, turning a clean action a_0 into approximately pure Gaussian noise a_T.
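What does the diffusion policy noise-prediction network learn?	To predict the noise epsilon_theta(a_t, t, o) that was mixed into the clean action a_0 to produce the noisy action a_t, conditioned on robot observation o. Up to scale, this is the negative score of the noisy distribution: score(a_t) ≈ -epsilon_theta / sqrt(1 - alpha_bar_t), where alpha_bar_t is the cumulative signal-retention coefficient at step t.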
What is action chunking?	Predicting K future actions at once and executing them open-loop for K steps before re-querying the policy. At a 20 Hz control rate with K=8, this reduces the required inference rate from ~20 Hz to ~2.5 Hz (20/8).
What is the key advantage of flow matching over DDPM for inference?	Flow matching typically needs about 10-25 ODE integration steps versus 50-100 DDPM denoising steps, giving faster inference with comparable or better mode coverage.
Which VLA architectures use flow matching as their action head?	π0 (Physical Intelligence; announced Oct 31, 2024, open-sourced Feb 4, 2025), π0.5 (April 22, 2025), and SmolVLA (Hugging Face, June 3, 2025).
What is the Push-T benchmark?	A 2D manipulation task in which a robot arm must push a T-shaped block into a target configuration. It is a standard benchmark for diffusion policy evaluation, used in Chi et al. (2023).
What is the primary paper for diffusion policies?	Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, Chi et al., arXiv:2303.04137, March 7, 2023.
What is the difference between DDPM and DDIM for inference?	DDPM uses stochastic denoising (it injects fresh noise at each reverse step) and typically needs 50-1000 steps. DDIM (Denoising Diffusion Implicit Models) uses deterministic denoising, which allows high-quality samples in 10-50 steps and makes it a common acceleration for diffusion policies.
In the JHU capstone, if SmolVLA fails and you suspect the action expert, what do you check first?	Log the raw flow-matching action chunk output and compare it to a demonstration action for the same observation. If the sampled action is semantically wrong (wrong direction, wrong gripper state), the flow-matching expert is failing. Then check the number of denoising steps, the action conditioning, and whether the demonstration dataset has sufficient multimodal coverage.