Predict before you read

π0.5 generalizes to entirely new homes for 10-15 minute household tasks. What is the main mechanism: more robot data, better architecture, or a different training data mix?

Check what percentage of π0.5's pretraining is robot manipulation data vs other data.

From Tokens to Embodied Minds  ·  Chapter 32 of 36

π0, π0.5, and SmolVLA

The 2025 open-source VLA family

78.3%: SmolVLA real-world task success after pretraining on community LeRobot datasets (Jun 3, 2025)
450M: SmolVLA parameter count; trainable on a single consumer GPU, deployable on a MacBook
97.6%: share of π0.5 pretraining that is NOT mobile manipulation; web data + semantic prediction drives open-world generalization

π0 (Physical Intelligence, February 4, 2025) made two contributions to VLA design. It replaced action tokenization with a flow-matching action expert (a separate denoising network conditioned on the VLM's output), and it demonstrated dexterous manipulation across multiple embodiments: folding laundry, assembling boxes, bus-tray manipulation. The action expert captures multimodal action distributions that discrete tokenization cannot represent. π0.5 (Physical Intelligence, arXiv:2504.16054, April 22, 2025) pushed the frontier further: by co-training on heterogeneous data where 97.6% of pretraining is not mobile manipulation, the model generalizes to entirely new homes for 10-15 minute tasks like 'clean the bedroom.' SmolVLA-450M (Hugging Face, June 3, 2025) takes the opposite path. It is the smallest practical VLA in this family, trainable on a single consumer GPU, and it achieves 78.3% real-world success after pretraining on LeRobot community datasets, with asynchronous inference that gives 30% faster response and 2x task throughput.

π0 — VLM backbone + flow-matching action expert

π0's architecture separates reasoning from action generation. A VLM backbone (based on PaliGemma 3B) processes the image observation and language instruction, producing a context embedding. A separate flow-matching action expert takes this context embedding and denoises a Gaussian noise sample into an action chunk (K=16 future actions). The denoising model is a small diffusion transformer (DiT) conditioned on the VLM context. This separation allows the VLM to be updated with new language data without destabilizing the action expert, and allows the action expert to be fine-tuned on manipulation data without affecting the VLM's language capabilities.

The flow-matching objective (Lipman et al., 2022) trains the action expert to predict the velocity field that transports Gaussian noise toward demonstration actions. At inference, a 10-25 step ODE integration produces the action chunk. This is significantly faster than DDPM (50-100 steps) and produces smoother trajectories. π0's pretrained action expert demonstrates dexterous manipulation — folding clothes, multi-step assembly — that is qualitatively beyond what discrete action tokenization (RT-2 style) can represent.
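The training target and the inference-time integration described above can be sketched in a few lines of NumPy. This is a minimal illustration, not π0's implementation: it assumes a straight-line path from noise (t=0) to data (t=1), plain Euler integration, and a hypothetical callable velocity_fn(x, t, context) standing in for the action expert. The helper point_mass_velocity is also hypothetical and exists only to give the sketch a velocity field with a known answer.

```python
import numpy as np

def flow_matching_loss(velocity_fn, actions, context, rng):
    """Conditional flow-matching loss for one batch of action chunks.

    actions: array of shape (B, K, D) holding demonstration chunks.
    velocity_fn: hypothetical stand-in for the action expert.
    """
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1, 1))   # one time per sample, in [0, 1)
    x_t = (1.0 - t) * noise + t * actions            # point on the noise-to-data line
    target = actions - noise                         # constant velocity along that line
    pred = velocity_fn(x_t, t, context)
    return float(np.mean((pred - target) ** 2))

def sample_chunk(velocity_fn, context, shape, rng, num_steps=10):
    """Integrate dx/dt = v(x, t, context) from t=0 to t=1 with Euler steps."""
    x = rng.standard_normal(shape)                   # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full((shape[0], 1, 1), i * dt)
        x = x + dt * velocity_fn(x, t, context)
    return x

def point_mass_velocity(target):
    """Exact velocity field when every demonstration equals `target` (toy case)."""
    def velocity_fn(x, t, context):
        return (target - x) / (1.0 - t)
    return velocity_fn
```

With the toy field from point_mass_velocity, the loss is zero up to rounding and sample_chunk lands on the target chunk, which is a quick sanity check on the sign and time conventions before swapping in a learned network.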

π0.5 — open-world generalization

π0.5 (arXiv:2504.16054, April 22, 2025) co-trains on five data types: robot trajectory data (2.4% of pretraining), web video data, verbal instruction following, high-level semantic scene prediction, and internet text-image pairs. The 97.6% non-robot data is not incidental — it is the core of the architecture's generalization strategy. By training on diverse semantic tasks, the model develops representations that transfer to new homes without requiring any demonstrations from those environments.
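To make the co-training mix concrete, here is a small sketch of drawing a batch's data sources from a weighted mixture. Only the 2.4% robot share comes from the text above. The even split of the remaining 97.6% across the other four sources is a placeholder assumption for illustration, not the published mixture, and the source names are hypothetical labels.

```python
import random
from collections import Counter

ROBOT_SHARE = 0.024  # from the text: robot trajectory data is 2.4% of pretraining

# Placeholder assumption: the other four sources share the remainder equally.
OTHER_SOURCES = [
    "web_video",
    "verbal_instruction",
    "semantic_prediction",
    "text_image_pairs",
]

MIXTURE = {"robot_trajectories": ROBOT_SHARE}
MIXTURE.update(
    {name: (1.0 - ROBOT_SHARE) / len(OTHER_SOURCES) for name in OTHER_SOURCES}
)

def sample_sources(mixture, batch_size, rng):
    """Pick the data source for each example in a batch, weighted by the mixture."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)

def source_fractions(sources):
    """Empirical share of each source in a sampled batch."""
    counts = Counter(sources)
    return {name: counts[name] / len(sources) for name in counts}
```

Sampling a large batch with sample_sources(MIXTURE, 10_000, random.Random(0)) and passing it to source_fractions shows how rarely a robot trajectory appears in any given batch under this mix.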

The practical result: π0.5 can perform 10-15 minute multi-step household tasks (clean the bedroom, load the dishwasher) in entirely new environments. The model uses a high-level language planner (leveraging the VLM backbone) to decompose the task, and the action expert executes each primitive step. The planner is what generalizes: it reasons about the new environment using web-scale semantic knowledge. Generalization to a new home does not require demonstrations from it; where more precision is needed, the action expert can be fine-tuned relatively quickly on a small number of demonstrations from the new environment. For the JHU capstone, π0.5 represents the capability ceiling for open-world household tasks, but it is currently not open-source.
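The planner-plus-action-expert loop can be sketched as a simple control loop. All five callables here (observe, propose_subtask, generate_chunk, execute, and the None stop signal) are hypothetical interfaces chosen for illustration; the sketch shows the division of labor described above and is not π0.5's actual API.

```python
def run_hierarchical(task, observe, propose_subtask, generate_chunk, execute,
                     max_subtasks=20):
    """Alternate high-level subtask prediction with low-level action chunks.

    propose_subtask(task, obs) -> str or None   (planner; None means "done")
    generate_chunk(obs, subtask) -> list        (action expert)
    execute(chunk) -> None                      (robot interface)
    """
    completed = []
    for _ in range(max_subtasks):
        obs = observe()
        subtask = propose_subtask(task, obs)   # e.g. "pick up the shirt"
        if subtask is None:                    # planner reports the task is finished
            break
        chunk = generate_chunk(obs, subtask)   # actions conditioned on the subtask text
        execute(chunk)
        completed.append(subtask)
    return completed
```

The max_subtasks cap is a design choice for the sketch: a long-horizon loop driven by a learned planner needs some bound so that a planner that never signals completion cannot run forever.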

SmolVLA-450M — the consumer-GPU VLA

SmolVLA-450M (Hugging Face, June 3, 2025) is an efficiency-first VLA: 450M total parameters (SmolVLM-256M vision-language backbone + flow-matching action expert), trainable on a single 12GB consumer GPU (RTX 3080/4070), deployable on a MacBook for inference. Pretrained on community LeRobot datasets from HuggingFace Hub — a diverse set of publicly contributed manipulation demonstrations — SmolVLA achieves 78.3% real-world task success before any task-specific fine-tuning.
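A back-of-the-envelope calculation shows why 450M parameters is plausible on a 12GB card. The sketch below assumes full fine-tuning in fp32 with an Adam-style optimizer (weights, gradients, and two moment buffers per parameter) and ignores activations, which depend on batch size and image resolution. These are assumptions for illustration, not SmolVLA's documented training configuration.

```python
def weights_gb(num_params, bytes_per_param=4):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def training_state_gb(num_params, bytes_per_param=4, optimizer_buffers=2):
    """Weights + gradients + optimizer buffers (Adam keeps two per parameter)."""
    copies = 1 + 1 + optimizer_buffers
    return num_params * bytes_per_param * copies / 1e9

SMOLVLA_PARAMS = 450_000_000  # parameter count quoted in the text

inference_estimate = weights_gb(SMOLVLA_PARAMS)        # about 1.8 GB
training_estimate = training_state_gb(SMOLVLA_PARAMS)  # about 7.2 GB
```

Under these assumptions the persistent training state takes roughly 7.2 GB, leaving the rest of a 12GB card for activations. The same arithmetic for a 3B-parameter backbone gives about 48 GB, which is why the larger models do not fit the consumer-GPU constraint without further tricks.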

SmolVLA's async inference is the key latency innovation. The VLM backbone runs in one thread producing context embeddings, while the flow-matching action expert runs in a parallel thread denoising action chunks. This removes the VLM's per-step latency from the action-execution critical path: the robot executes the current action chunk while the VLM is already processing the next observation. The reported result is 30% faster response time and 2x task throughput versus synchronous inference. For the JHU capstone, SmolVLA is the most likely starting point given the consumer-GPU constraint, and the build for this chapter is the direct precursor to the capstone fine-tuning.
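A toy timing model helps show where the throughput gain comes from. It compares a synchronous loop (infer a chunk, then execute it, repeat) with a pipelined one in which inference for the next chunk overlaps execution of the current chunk. The chunk size and the two latencies in the example are hypothetical numbers picked for illustration; they are not measurements from SmolVLA.

```python
import math

def sync_time(num_actions, chunk_size, infer_s, act_s):
    """Total time when the robot waits for every inference call."""
    chunks = math.ceil(num_actions / chunk_size)
    return chunks * infer_s + num_actions * act_s

def async_time(num_actions, chunk_size, infer_s, act_s):
    """Total time when the next chunk is inferred while the current one executes."""
    total = infer_s                 # the first chunk cannot overlap with anything
    remaining = num_actions
    while remaining > 0:
        n = min(chunk_size, remaining)
        remaining -= n
        exec_s = n * act_s
        if remaining > 0:
            # Next chunk is being inferred in parallel; the slower stage sets the pace.
            total += max(exec_s, infer_s)
        else:
            total += exec_s         # last chunk: nothing left to infer
    return total

# Hypothetical workload: 100 actions, chunks of 16, 0.05 s per action,
# and inference that takes as long as executing one full chunk (0.8 s).
t_sync = sync_time(100, 16, infer_s=0.8, act_s=0.05)
t_async = async_time(100, 16, infer_s=0.8, act_s=0.05)
speedup = t_sync / t_async
```

With these numbers the synchronous loop takes about 10.6 s and the pipelined loop about 5.8 s, a speedup of roughly 1.8x. In this model the gain approaches 2x only when inference and execution times are balanced and the task is long enough to amortize the first inference call; if either stage dominates, overlapping hides less.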

π0.5 vs SmolVLA for the capstone

π0.5 is the capability ceiling — not open-source. SmolVLA is the practical floor — open-source, consumer GPU, community pretrained. Start with SmolVLA; reference π0.5 as the ablation target for what better data coverage achieves.

[Figure: π0 → π0.5 → SmolVLA: The 2025 VLA Family. Three panels: π0 (Feb 4, 2025), π0.5 (Apr 22, 2025), SmolVLA-450M (Jun 3, 2025; open weights: lerobot/smolvla_base).]
Figure 32.1. π0 (Feb 2025) → π0.5 (Apr 2025) → SmolVLA (Jun 2025): the 2025 open-source VLA family. π0 introduced flow-matching action experts; π0.5 achieved open-world generalization via heterogeneous data; SmolVLA-450M delivers 78.3% real-world success on a consumer GPU with async inference. SmolVLA is the JHU capstone starting point.
Retrieve before you continue

Three questions on what you just read

Q1 Factual What are SmolVLA's parameter count, pretraining data, and reported real-world task success rate?
Q2 Conceptual Why does π0.5 co-train on data where 97.6% is not robot manipulation, and how does this enable open-world generalization?
Q3 Synthetic Describe the SmolVLA async inference architecture and calculate the throughput benefit for a task requiring 100 action steps.