π0 (Physical Intelligence; announced October 31, 2024, with open weights released via openpi on February 4, 2025) changed the VLA architecture in two ways: it replaced action tokenization with a flow-matching action expert (a separate denoising network conditioned on the VLM's output), and it demonstrated dexterous manipulation across multiple embodiments, including folding laundry, assembling boxes, and bussing tables. The action expert models continuous, multimodal action distributions that discrete tokenization represents poorly. π0.5 (Physical Intelligence, arXiv:2504.16054, April 22, 2025) pushed the frontier further: by co-training on heterogeneous data in which 97.6% of pretraining is not mobile manipulation, the model generalizes to entirely new homes for 10-15 minute tasks like 'clean the bedroom.' SmolVLA-450M (Hugging Face, June 3, 2025) takes the opposite path: a deliberately small VLA, trainable on a single consumer GPU, pretrained on LeRobot community datasets and reporting 78.3% real-world success, with asynchronous inference that gives 30% faster response and 2x task throughput.
π0 — VLM backbone + flow-matching action expert
π0's architecture separates reasoning from action generation. A VLM backbone (based on PaliGemma 3B) processes the image observations and language instruction, producing context features. A separate flow-matching action expert, a smaller transformer with its own weights, conditions on these features and denoises a Gaussian noise sample into an action chunk of H future actions (the π0 paper uses H=50). In principle, this separation lets the VLM be updated with new language data without destabilizing the action expert, and lets the action expert be fine-tuned on manipulation data with less risk to the VLM's language capabilities.
The flow-matching objective (Lipman et al., 2022) trains the action expert to predict the velocity field that transports Gaussian noise toward demonstration actions. At inference, integrating this field as an ODE over a small number of steps (the π0 paper uses 10 Euler steps) produces the action chunk. That is far fewer network evaluations than the 50-100 steps typical of DDPM-style sampling, and the continuous outputs tend to give smooth trajectories. π0's pretrained action expert demonstrates dexterous manipulation, such as folding clothes and multi-step assembly, that is qualitatively beyond what discrete action tokenization (RT-2 style) has been shown to achieve.
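To make the objective and the sampler concrete, here is a minimal NumPy sketch of conditional flow matching over action chunks. It is not π0's implementation: it assumes the common straight-line path x_t = (1 - t) * noise + t * action (π0's own time convention may differ), and every name in it (flow_matching_loss, sample_action_chunk, make_oracle_velocity) is hypothetical. A closed-form "oracle" velocity field stands in for the trained action expert so the example runs without any training, and the chunk size is a toy value.

```python
import numpy as np


def flow_matching_loss(velocity_fn, actions, context, rng):
    """Conditional flow-matching loss for a batch of action chunks.

    actions has shape (batch, chunk_len, action_dim).
    """
    noise = rng.standard_normal(actions.shape)
    # One interpolation time per batch element, broadcast over the chunk.
    t = rng.uniform(0.0, 1.0, size=(actions.shape[0], 1, 1))
    x_t = (1.0 - t) * noise + t * actions  # point on the straight noise-to-action path
    target_velocity = actions - noise      # time derivative of x_t along that path
    predicted = velocity_fn(x_t, t, context)
    return float(np.mean((predicted - target_velocity) ** 2))


def sample_action_chunk(velocity_fn, context, chunk_shape, num_steps, rng):
    """Integrate dx/dt = v(x, t, context) from t=0 (noise) to t=1 with Euler steps."""
    x = rng.standard_normal(chunk_shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((chunk_shape[0], 1, 1), k * dt)
        x = x + dt * velocity_fn(x, t, context)
    return x


def make_oracle_velocity(target):
    """Exact velocity field when every demonstration equals `target`.

    Toy stand-in for a trained network; it ignores the context argument.
    """
    def velocity_fn(x, t, context):
        return (target - x) / (1.0 - t)
    return velocity_fn


rng = np.random.default_rng(0)
# Toy sizes (assumption): batch of 1, chunk of 4 actions, 2 action dimensions.
target = np.tile(np.linspace(-1.0, 1.0, 4)[None, :, None], (1, 1, 2))
oracle = make_oracle_velocity(target)
chunk = sample_action_chunk(oracle, context=None, chunk_shape=target.shape,
                            num_steps=10, rng=rng)
loss = flow_matching_loss(oracle, target, context=None, rng=rng)
```

With the oracle field, ten Euler steps carry the noise sample onto the target chunk and the loss is zero up to floating-point error. A real action expert would replace the oracle with a network that takes the VLM context as input and is trained by minimizing this loss over demonstration chunks.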
π0.5 — open-world generalization
π0.5 (arXiv:2504.16054, April 22, 2025) co-trains on five data types: mobile-manipulation robot data (2.4% of pretraining), web video data, verbal instruction following, high-level semantic scene prediction, and internet text-image pairs. The remaining 97.6%, everything that is not mobile-manipulation data, is not incidental: it is the core of the generalization strategy. By training on diverse semantic tasks, the model develops representations that transfer to new homes without requiring any demonstrations from those environments.
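A co-training mixture like this is usually realized by sampling each training example's source according to fixed weights. The sketch below illustrates that idea only. The 2.4% weight comes from the text; the even split of the remaining 97.6% across the other four sources is a placeholder assumption, not a figure from the π0.5 paper, and the source names and sample_sources helper are hypothetical.

```python
import numpy as np

# Only the 0.024 weight comes from the text. The other four weights are
# placeholders (an even split of the remaining 97.6%), NOT published numbers.
MIXTURE = {
    "mobile_manipulation": 0.024,
    "web_video": 0.244,
    "verbal_instruction": 0.244,
    "high_level_prediction": 0.244,
    "web_text_image": 0.244,
}


def sample_sources(mixture, num_samples, rng):
    """Draw a data-source name for each training example according to the weights."""
    names = list(mixture)
    probs = np.array([mixture[name] for name in names], dtype=float)
    probs = probs / probs.sum()  # guard against rounding in hand-written weights
    indices = rng.choice(len(names), size=num_samples, p=probs)
    return [names[i] for i in indices]


draws = sample_sources(MIXTURE, 100_000, np.random.default_rng(0))
mobile_fraction = draws.count("mobile_manipulation") / len(draws)
```

Here mobile_fraction comes out close to 0.024, which makes the point of the paragraph tangible: in a batch of a few hundred examples, only a handful come from the target embodiment.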
The practical result: π0.5 can perform 10-15 minute multi-step household tasks (clean the bedroom, load the dishwasher) in entirely new environments. At each stage the model first uses the VLM backbone as a high-level language planner to choose the next subtask, then uses the action expert to execute that subtask as low-level actions. The planner is what carries most of the generalization: it reasons about the new environment using web-scale semantic knowledge. The reported new-home evaluations use no demonstrations from the test environment; fine-tuning the action expert on a small number of local demonstrations can further adapt it, but is not required. For the JHU capstone, π0.5 represents the capability ceiling for open-world household tasks, and it is currently not open-source.
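The plan-then-act loop described above can be sketched as a small control loop. This illustrates the control flow only and is not π0.5's code: run_hierarchical_episode and the toy planner, action generator, and environment are all hypothetical stand-ins for the real model calls.

```python
def run_hierarchical_episode(task, observe, predict_subtask, predict_actions,
                             execute, max_subtasks=20):
    """Alternate between high-level subtask prediction and low-level execution."""
    completed = []
    for _ in range(max_subtasks):
        observation = observe()
        # High-level step: choose the next subtask in language (None means done).
        subtask = predict_subtask(observation, task)
        if subtask is None:
            break
        # Low-level step: turn the subtask into an action chunk and run it.
        action_chunk = predict_actions(observation, subtask)
        execute(action_chunk)
        completed.append(subtask)
    return completed


# Toy stand-ins (assumptions): a "room" is just a list of items left on the floor.
clutter = ["shirt", "pillow", "cup"]


def observe():
    return tuple(clutter)


def toy_predict_subtask(observation, task):
    return f"pick up the {observation[0]}" if observation else None


def toy_predict_actions(observation, subtask):
    # The last word of the subtask names the item to remove.
    return [("remove", subtask.rsplit(" ", 1)[-1])]


def execute(action_chunk):
    for _, item in action_chunk:
        clutter.remove(item)


completed = run_hierarchical_episode(
    "clean the bedroom", observe, toy_predict_subtask, toy_predict_actions, execute
)
```

In the toy run, completed holds one "pick up the ..." subtask per item and clutter ends up empty. In a real system both predict_subtask and predict_actions would be calls into the same VLA, with the subtask text fed back in as conditioning for the action expert.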
SmolVLA-450M — the consumer-GPU VLA
SmolVLA-450M (Hugging Face, June 3, 2025) is an efficiency-first VLA: 450M total parameters (SmolVLM-256M vision-language backbone + flow-matching action expert), trainable on a single 12GB consumer GPU (RTX 3080/4070), and deployable on a MacBook for inference. It is pretrained on community LeRobot datasets from the Hugging Face Hub, a diverse set of publicly contributed manipulation demonstrations. SmolVLA reports 78.3% average real-world task success; in the SmolVLA paper this figure is measured after fine-tuning on the evaluation tasks, so it should not be read as zero-shot performance.
SmolVLA's async inference is the key latency innovation: action-chunk prediction is decoupled from action execution. The policy (VLM backbone plus flow-matching action expert) runs in its own thread or process and produces the next action chunk while the robot is still executing the current one. This removes the model's forward-pass latency from the action-execution critical path, at the cost of predicting from an observation that is slightly stale by the time the chunk runs. Result: 30% faster response time and 2x task throughput vs synchronous inference. For the JHU capstone, SmolVLA is the most likely starting point given the consumer-GPU constraint; the build for this chapter is the direct precursor to the capstone fine-tuning.
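The overlap between prediction and execution can be sketched with a bounded queue and a worker thread from the Python standard library. This is a simplified producer-consumer illustration, not the LeRobot async inference API: run_async, toy_policy, and the integer "actions" are hypothetical, and a real system would also need to handle stale observations and errors raised inside the worker.

```python
import itertools
import queue
import threading


def run_async(policy, get_observation, execute, num_chunks, max_pending=1):
    """Execute action chunks while a worker thread predicts the next one."""
    pending = queue.Queue(maxsize=max_pending)

    def predict_chunks():
        for _ in range(num_chunks):
            # Blocks when the queue is full, so the worker stays at most
            # max_pending finished chunks ahead of execution.
            pending.put(policy(get_observation()))
        pending.put(None)  # sentinel: no more chunks

    worker = threading.Thread(target=predict_chunks, daemon=True)
    worker.start()
    while True:
        chunk = pending.get()
        if chunk is None:
            break
        for action in chunk:
            execute(action)  # runs while the worker computes the next chunk
    worker.join()


# Toy stand-ins (assumptions): actions are integers, each chunk has 3 of them.
counter = itertools.count()


def toy_policy(observation):
    return [next(counter) for _ in range(3)]


executed = []
run_async(toy_policy, get_observation=lambda: None,
          execute=executed.append, num_chunks=3)
```

The single FIFO queue keeps actions in order, so executed ends up as the integers 0 through 8. Raising max_pending lets the policy run further ahead, which hides more latency but means chunks are based on older observations.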
π0.5 is the capability ceiling: not open-source. SmolVLA is the practical floor: open-source, consumer GPU, community pretrained. Start with SmolVLA; use π0.5 as the reference point for what broader data coverage achieves.