Predict before you read

Before you read: why does a Gaussian action head fail on the task "pick up a bottle from a cluttered table, where it can be grasped from the left or from the right"?

Think about what happens when a unimodal distribution tries to represent two equally valid modes.

From Tokens to Embodied Minds  ·  Chapter 30 of 36

Diffusion policies and action chunking

Why diffusion beat behavioral cloning for manipulation

K: number of future actions predicted at once (action chunking), executed open-loop for K steps before re-querying the policy
DDPM: denoising diffusion probabilistic model, the generative backbone that replaced Gaussian action heads
Flow matching: the continuous-time diffusion variant used by π0, π0.5, and SmolVLA's action expert

Behavioral cloning with a Gaussian action head is the obvious baseline for robot learning from demonstrations: regress from observations to actions using MSE loss. It fails precisely on the tasks that matter: manipulation problems where multiple action sequences are equally valid (grasp left or grasp right, stack blue then red or red then blue). The Gaussian mean falls between the modes and the robot grasps nothing. Diffusion policies (Chi et al., arXiv:2303.04137, March 7, 2023) solve this by representing the action distribution as a learned denoising diffusion model conditioned on the observation. The reverse process starts from Gaussian noise and iteratively denoises to a valid action — sampling from the full, multimodal distribution rather than collapsing it to a mean. Action chunking — predicting K future actions at once and executing them open-loop — makes the inference frequency feasible despite the multi-step denoising. π0 and π0.5 (Physical Intelligence) use flow matching, a continuous-time variant of the same idea, which trains faster and runs in fewer inference steps.

The multimodal failure and the diffusion fix

The behavioral cloning failure on multimodal tasks is not a data problem or a network-capacity problem; it is a distributional mismatch. Standard BC with MSE loss trains the network to minimize the squared error between predicted and demonstrated actions, which is equivalent to maximum-likelihood fitting of a fixed-variance Gaussian over actions given the observation. The optimum of that objective is the conditional mean of the demonstrated actions. When the demonstration data contains two equally valid action modes (e.g., of 50 demonstrations, 25 grasp left and 25 grasp right), that mean lies between the modes. At inference time, the policy predicts this average action, which executes neither grasp.
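The averaging effect can be checked numerically. The sketch below is a toy illustration, not a robot policy: it assumes a one-dimensional "grasp offset" where -1.0 means grasp left and +1.0 means grasp right, and the names `demos` and `mse` are invented for this example.

```python
import numpy as np

# Toy 1-D "grasp offset": -1.0 = grasp from the left, +1.0 = from the right.
# 50 demonstrations split 25/25, as in the text (values are illustrative).
demos = np.array([-1.0] * 25 + [1.0] * 25)


def mse(prediction: float, targets: np.ndarray) -> float:
    """Mean squared error of one constant prediction against all targets."""
    return float(np.mean((targets - prediction) ** 2))


# For a fixed observation, the MSE-optimal prediction is the sample mean.
best_action = float(demos.mean())

# Distance from that prediction to the nearest demonstrated action.
gap_to_nearest_mode = float(np.min(np.abs(demos - best_action)))

print("MSE-optimal action:", best_action)
print("MSE at the average:", mse(best_action, demos))
print("MSE when committing to left:", mse(-1.0, demos))
print("MSE when committing to right:", mse(1.0, demos))
print("distance to nearest real grasp:", gap_to_nearest_mode)
```

The loss actively prefers the averaged action: committing to either real grasp scores a higher MSE than the midpoint that matches no demonstration.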

Diffusion policies (Chi et al., March 7, 2023) replace the Gaussian head with a denoising diffusion probabilistic model (DDPM). The forward process gradually adds Gaussian noise to a demonstration action a_0 across T steps, so that a_T is nearly pure noise. Training fits a noise-prediction network epsilon_theta(a_t, t, o) to predict the noise added at step t, conditioned on the current observation o. At inference, you sample a_T ~ N(0,I) and iteratively denoise with the trained network to recover a valid action a_0. Because the reverse process is a learned sampler over the full action distribution rather than a point estimate, it can represent both modes, sampling each roughly in proportion to its frequency in the demonstrations.
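A minimal sketch of the forward and reverse processes on the same toy two-mode problem follows. It makes several assumptions that are not from the original: the action is one-dimensional, the schedule values are illustrative, the observation o is omitted, and the trained network is replaced by `eps_oracle`, a closed-form noise predictor that is exact only for this toy distribution. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule (illustrative values, not taken from the paper).
T = 50
betas = np.linspace(1e-4, 0.2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def forward_noise(a0, t, eps):
    """Closed-form forward process: a_t = sqrt(abar_t) a_0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bars[t]) * a0 + np.sqrt(1.0 - alpha_bars[t]) * eps


def eps_oracle(a_t, t):
    """Stand-in for epsilon_theta(a_t, t, o): the exact noise predictor when
    a_0 is -1 or +1 with equal probability. A real policy learns this from
    demonstrations and also takes the observation o as input."""
    s, var = np.sqrt(alpha_bars[t]), 1.0 - alpha_bars[t]
    a0_hat = np.tanh(s * a_t / var)  # posterior mean E[a_0 | a_t]
    return (a_t - s * a0_hat) / np.sqrt(var)


def sample_actions(n):
    """DDPM ancestral sampling: start from N(0, 1) and denoise for T steps."""
    a = rng.standard_normal(n)
    for t in reversed(range(T)):
        eps_hat = eps_oracle(a, t)
        mean = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(n) if t > 0 else 0.0  # no noise on the last step
        a = mean + np.sqrt(betas[t]) * noise
    return a


actions = sample_actions(2000)
frac_right = float(np.mean(actions > 0))
frac_on_mode = float(np.mean(np.abs(np.abs(actions) - 1.0) < 0.05))
print("fraction sampled near +1 (right):", frac_right)
print("fraction within 0.05 of a real mode:", frac_on_mode)
```

Under these assumptions, the samples land on one of the two demonstrated grasps instead of the midpoint, split roughly evenly between them. A Gaussian head trained with MSE cannot do this.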

Action chunking — making inference feasible

Diffusion inference requires T denoising steps (typically 50-100 for DDPM, and as few as 10 for DDIM or flow matching). At 10 steps on a modern GPU, this takes ~50ms per inference, which exactly fills the 50ms control period at 20Hz: borderline feasible for manipulation, but with no slack. Action chunking addresses this: instead of predicting a single action, predict a chunk of K future actions (typically K=8 to 16) and execute them open-loop before re-querying the policy. This reduces the required policy inference rate by a factor of K, from 20Hz to 1.25-2.5Hz, well within the latency budget even for DDPM.
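The arithmetic above, and the shape of a chunked control loop, can be sketched as follows. This is an illustration under assumptions: `dummy_policy` is a hypothetical stand-in that returns step indices instead of real actions, and reading the chunk duration as a latency budget assumes inference for the next chunk can overlap with execution of the current one.

```python
def policy_rate_hz(control_hz: float, chunk_size: int) -> float:
    """Policy queries per second when each query yields chunk_size actions."""
    return control_hz / chunk_size


def chunk_duration_ms(control_hz: float, chunk_size: int) -> float:
    """Wall-clock time covered by one chunk of actions."""
    return 1000.0 * chunk_size / control_hz


def run_chunked(policy, n_steps: int, chunk_size: int):
    """Execute n_steps control steps, re-querying the policy once per chunk.
    Returns the executed actions and the number of policy queries."""
    executed, queries = [], 0
    while len(executed) < n_steps:
        chunk = policy(len(executed), chunk_size)  # one (slow) inference call
        queries += 1
        executed.extend(chunk[: n_steps - len(executed)])  # open-loop playback
    return executed, queries


def dummy_policy(step: int, chunk_size: int):
    """Hypothetical stand-in: returns the step indices as 'actions'."""
    return list(range(step, step + chunk_size))


for k in (1, 8, 16):
    print(f"K={k}: {policy_rate_hz(20.0, k)} Hz, {chunk_duration_ms(20.0, k)} ms per chunk")

actions, n_queries = run_chunked(dummy_policy, n_steps=160, chunk_size=8)
print("policy queries for 160 control steps at K=8:", n_queries)
```

With K=8 at a 20Hz control rate, 160 control steps (8 seconds) need 20 policy queries instead of 160.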

The trade-off is open-loop execution: the policy commits to K future actions without feedback. For smooth, predictable tasks (place object in bin), this is fine. For contact-rich, reactive tasks (opening a stuck drawer), errors can accumulate within a chunk. ACT (Action Chunking with Transformers, Zhao et al., 2023) applies the same chunking idea to a transformer policy trained as a conditional VAE rather than a diffusion model. π0 and π0.5 use flow matching, a continuous-time ODE-based variant of diffusion, which requires only 10-25 inference steps vs 50-100 for DDPM, making action chunking less critical while still capturing multimodal distributions.

Flow matching — the π0 variant

Flow matching (Lipman et al., 2022; Liu et al., 2022) defines a probability path from the noise distribution to the data distribution via an ordinary differential equation (ODE) rather than a stochastic process. The model learns to predict the velocity field v_theta(x, t) that transports Gaussian noise toward valid actions. Benefits over DDPM: simpler training objective (conditional flow matching loss), faster inference (10-25 ODE steps vs 50-100 DDPM steps), and better mode coverage in practice. π0 (Physical Intelligence; introduced October 31, 2024 and open-sourced Feb 4, 2025) uses flow matching as the action expert on top of a VLM backbone. π0.5 (arXiv:2504.16054, April 22, 2025) continues with flow matching. SmolVLA (HuggingFace, June 3, 2025) uses a flow-matching action expert adapted for the 450M parameter budget.
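The training target and the ODE sampler can be sketched on the same toy two-mode problem. Everything here is an assumption made for illustration: a straight-line path between noise x0 and data x1, a one-dimensional action, no observation input, and `velocity_oracle`, a closed-form velocity field that is exact only for this toy distribution and stands in for a trained v_theta. It is not the π0 or SmolVLA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def cfm_pair(x0, x1, t):
    """Conditional flow matching training pair on a straight-line path:
    network input x_t = (1 - t) x0 + t x1, regression target x1 - x0."""
    return (1.0 - t) * x0 + t * x1, x1 - x0


def velocity_oracle(x, t):
    """Stand-in for v_theta(x, t): the exact marginal velocity when the data
    are -1 or +1 with equal probability and x0 ~ N(0, 1). A trained action
    expert would approximate this and also condition on the observation."""
    x1_hat = np.tanh(t * x / (1.0 - t) ** 2)  # posterior mean E[x1 | x_t = x]
    return (x1_hat - x) / (1.0 - t)


def sample_actions(n, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x = rng.standard_normal(n)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_oracle(x, k * dt)
    return x


x_t, target = cfm_pair(x0=0.0, x1=1.0, t=0.25)
print("training input and target velocity:", x_t, target)

actions = sample_actions(2000)
frac_right = float(np.mean(actions > 0))
frac_on_mode = float(np.mean(np.abs(np.abs(actions) - 1.0) < 0.05))
print("fraction sampled near +1 (right):", frac_right)
print("fraction within 0.05 of a real mode:", frac_on_mode)
```

In this toy setting, ten deterministic Euler steps are enough to carry Gaussian noise onto the two modes. The only randomness is the initial noise sample, which is one reason ODE-based samplers get away with few steps.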

For the JHU capstone, SmolVLA's action expert is a flow-matching diffusion policy — so understanding this chapter is prerequisite to understanding SmolVLA's failure modes. When SmolVLA fails on a task, the question to ask is: is the failure in the VLM backbone (wrong instruction interpretation), the flow-matching action expert (wrong action distribution), or the controller beneath it (execution failure)? Diffusion policy fluency is what lets you diagnose the middle layer.

DDPM vs flow matching in 2025

DDPM is the pedagogically cleaner starting point — the forward/reverse process is easy to visualize. For production VLA work, prefer flow matching: fewer inference steps (10-25 vs 50-100), simpler loss, better mode coverage. SmolVLA, π0, and π0.5 all use flow matching.

[Figure 30.1: diagram of the forward (noising) chain from demo action a_0 to pure noise a_T, the reverse (denoising) chain from a_T ~ N(0,I) back to a sampled action a_0, and the noise-prediction network epsilon_theta(a_t, t, o) conditioned on observation o (RGB camera + proprioception).]
Figure 30.1. Diffusion policy: forward (noising) and reverse (denoising) processes. Training adds noise to demonstration actions; inference denoises from Gaussian noise to a valid action, conditioned on robot observation. Action chunking (K=8) reduces the required inference rate. Flow matching (π0, SmolVLA) replaces DDPM with an ODE velocity field v_theta(x, t, o), requiring 10-25 steps vs 50-100.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): Describe the DDPM reverse process for a diffusion policy and what it is conditioned on.
Q2 (Conceptual): Why does action chunking reduce the effective inference frequency requirement for diffusion policies?
Q3 (Synthetic): SmolVLA fails on a pick-and-place task in your humanoid capstone. How do you determine whether the failure is in the flow-matching action expert vs the VLM backbone?