From Tokens to Embodied Minds  ·  Drill cards · Chapter 32
π0, π0.5, and SmolVLA

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 32, note type = Basic.
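
As a rough sketch of the file format this import expects, the snippet below serializes two of this chapter's cards as tab-separated text with Python's standard `csv` module. The `cards_to_tsv` helper is hypothetical, and the field order (Front, then Back) is assumed to match the Basic note type.

```python
import csv
import io

# Two cards from this deck, as (front, back) pairs.
CARDS = [
    ("What VLM backbone does π0 use?",
     "PaliGemma 3B (Physical Intelligence, February 4, 2025)."),
    ("What is SmolVLA's parameter count?",
     "450M total: SmolVLM-256M vision-language backbone + flow-matching action expert."),
]


def cards_to_tsv(cards):
    """Serialize (front, back) pairs as tab-separated text, one card per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(cards)
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the TSV; redirect stdout to a file to get something Anki can import.
    print(cards_to_tsv(CARDS), end="")
```

Using `csv.writer` instead of joining strings by hand means a card that happens to contain a tab or a double quote is quoted automatically rather than silently splitting into extra fields.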

Front	Back
What is π0's action generation mechanism?	A flow-matching action expert (small DiT) conditioned on the VLM backbone's context embedding. Denoises Gaussian noise to an action chunk (K=16) in 10-25 ODE steps.
What VLM backbone does π0 use?	PaliGemma 3B (Physical Intelligence, February 4, 2025).
What percentage of π0.5's pretraining data is robot manipulation?	2.4%. The remaining 97.6% is web data, verbal instructions, semantic scene prediction, and text-image pairs (π0.5, arXiv:2504.16054, April 22, 2025).
What is SmolVLA's parameter count?	450M total: SmolVLM-256M vision-language backbone + flow-matching action expert.
What GPU is required to train SmolVLA?	A single 12GB consumer GPU (RTX 3080 / 4070 class). Deployable on a MacBook for inference.
What is SmolVLA's reported real-world task success rate after pretraining?	78.3% on real-world tasks, pretrained on community LeRobot datasets (Hugging Face blog, June 3, 2025).
How does SmolVLA's async inference work?	VLM backbone runs in one thread producing context embeddings; flow-matching action expert runs in a parallel thread denoising action chunks. The robot executes the current chunk while the VLM processes the next observation, overlapping latencies.
What throughput improvement does SmolVLA async inference provide?	30% faster response time and 2x task throughput vs synchronous inference (Hugging Face blog, June 3, 2025).
What is the primary source for SmolVLA?	SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data, Hugging Face blog, June 3, 2025. Model: lerobot/smolvla_base.
For the JHU capstone with a consumer GPU, which VLA is the recommended starting point?	SmolVLA-450M (lerobot/smolvla_base). Fine-tune on 50-200 LeRobot SO-101 episodes. Use async inference for real-time control. Reference GR00T N1.5 as the high-capability alternative if a 24GB+ GPU is available.
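
The first card's "denoise Gaussian noise to an action chunk in a handful of ODE steps" can be illustrated with a toy Euler integrator. This is a minimal sketch, not π0's implementation: `toy_velocity` is a hypothetical stand-in for the learned action expert (it is handed the target chunk directly, which a real model never sees), the straight-line noise-to-data path is an assumed convention, and `ACTION_DIM = 7` is an illustrative choice that does not come from the cards.

```python
import random

CHUNK_LEN = 16   # K = 16 actions per chunk, as on the card
ACTION_DIM = 7   # assumed action dimension; not from the source
NUM_STEPS = 10   # low end of the 10-25 ODE steps on the card


def toy_velocity(actions, t, target):
    """Stand-in for the learned action expert.

    For a straight-line path from noise to one fixed target chunk, the
    velocity field is (target - x) / (1 - t). A real model would predict
    a velocity from the VLM context embedding instead.
    """
    return [[(g - a) / (1.0 - t) for a, g in zip(row, goal_row)]
            for row, goal_row in zip(actions, target)]


def sample_chunk(target, num_steps=NUM_STEPS, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the shape of one action chunk.
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
         for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = toy_velocity(x, t, target)
        x = [[a + dt * b for a, b in zip(row, v_row)]
             for row, v_row in zip(x, v)]
    return x


# A made-up target chunk: every joint ramps by 0.1 per timestep.
target = [[0.1 * k for _ in range(ACTION_DIM)] for k in range(CHUNK_LEN)]
chunk = sample_chunk(target)
max_err = max(abs(a - g)
              for row, goal_row in zip(chunk, target)
              for a, g in zip(row, goal_row))
```

Because the toy field points straight at a single target, the final Euler step lands on it up to floating-point rounding, so `max_err` is tiny. A learned velocity field is only approximate, which is one reason the step count is a tunable trade-off between latency and accuracy.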
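
The overlap described on the async-inference card can be reduced to a producer/consumer pattern: one thread computes the next action chunk while the main loop executes the current one. The sketch below uses only Python's standard `threading` and `queue` modules; `fake_policy`, the sleep-based latencies, and the shortened chunk length are all assumptions for illustration, and none of this is the LeRobot or SmolVLA API.

```python
import queue
import threading
import time

CHUNK_LEN = 4          # shortened from K=16 to keep the demo fast
NUM_CHUNKS = 3
INFER_SECONDS = 0.05   # pretend model latency (assumption)
STEP_SECONDS = 0.01    # pretend per-action execution time (assumption)


def fake_policy(obs_id):
    """Hypothetical stand-in for VLM + action expert: returns one chunk."""
    time.sleep(INFER_SECONDS)
    return [(obs_id, i) for i in range(CHUNK_LEN)]


def execute(chunk, log):
    """Pretend to send each action in the chunk to the robot."""
    for action in chunk:
        time.sleep(STEP_SECONDS)
        log.append(action)


def run_sync():
    """Baseline: infer, then execute, strictly in sequence."""
    log = []
    for obs_id in range(NUM_CHUNKS):
        execute(fake_policy(obs_id), log)
    return log


def run_async():
    """Infer the next chunk in a worker thread while executing the current one."""
    chunks = queue.Queue(maxsize=1)  # at most one chunk waiting ahead

    def worker():
        for obs_id in range(NUM_CHUNKS):
            chunks.put(fake_policy(obs_id))
        chunks.put(None)  # sentinel: no more chunks

    thread = threading.Thread(target=worker)
    thread.start()
    log = []
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        execute(chunk, log)
    thread.join()
    return log


sync_log = run_sync()
async_log = run_async()
```

Both loops execute the same actions in the same order; the asynchronous one should finish sooner in wall-clock terms because inference for chunk n+1 hides behind execution of chunk n. In this toy the worker plans from a fixed observation index, whereas a real controller has to decide how stale an observation it can tolerate when the next chunk is computed early.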