From Tokens to Embodied Minds · Drill cards · Chapter 32
Drills
π0, π0.5, and SmolVLA
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 32, note type = Basic.
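
For readers who would rather regenerate the import file than download it, the following is a minimal sketch of writing Front/Back pairs as tab-separated text that matches the import settings above. The function name `cards_to_tsv` and the two sample cards are illustrative assumptions, not part of the book's tooling.

```python
import csv
import io

def cards_to_tsv(cards):
    """Serialize (front, back) pairs as tab-separated lines, one card per line."""
    buf = io.StringIO()
    # Tab delimiter matches the "field separator = Tab" import setting.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for front, back in cards:
        writer.writerow([front, back])
    return buf.getvalue()

# Illustrative sample cards taken from the table below.
cards = [
    ("What VLM backbone does π0 use?", "PaliGemma 3B"),
    ("What is SmolVLA's parameter count?", "450M total"),
]
tsv_text = cards_to_tsv(cards)
```

Save `tsv_text` to a UTF-8 file before importing so that characters such as π survive the round trip.
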
| Front | Back |
|---|---|
| What is π0's action generation mechanism? | A flow-matching action expert: a separate transformer (about 300M parameters) that attends to the PaliGemma backbone's features. Starting from Gaussian noise, it integrates a learned velocity field into an action chunk (horizon H=50 in the π0 paper) in 10 Euler steps. |
| What VLM backbone does π0 use? | PaliGemma 3B. (π0 is from Physical Intelligence; its open-source release is dated February 4, 2025.) |
| What percentage of π0.5's first-phase training examples come from mobile manipulators doing household tasks? | About 2.4%. The other 97.6% comes from other sources, including other robots, web data, verbal instructions, semantic scene prediction, and text-image pairs (π0.5, arXiv:2504.16054, April 22, 2025). |
| What is SmolVLA's parameter count? | 450M total: a SmolVLM2 vision-language backbone plus a flow-matching action expert. |
| What GPU is required to train SmolVLA? | A single 12GB consumer GPU (RTX 3080 / 4070 class). Deployable on a MacBook for inference. |
| What is SmolVLA's reported real-world task success rate after pretraining? | 78.3% on real-world tasks, pretrained on community LeRobot datasets (Hugging Face blog, June 3, 2025). |
| How does SmolVLA's async inference work? | Action execution is decoupled from perception and action prediction. The robot keeps executing the current action chunk while the policy (VLM backbone plus flow-matching action expert) processes the next observation and predicts the next chunk, so inference latency overlaps with execution instead of stalling the robot. |
| What throughput improvement does SmolVLA async inference provide? | 30% faster response time and 2x task throughput vs synchronous inference (Hugging Face blog, June 3, 2025). |
| What is the primary source for SmolVLA? | SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data, Hugging Face blog, June 3, 2025. Model: lerobot/smolvla_base. |
| For the JHU capstone with a consumer GPU, which VLA is the recommended starting point? | SmolVLA-450M (lerobot/smolvla_base). Fine-tune on 50-200 LeRobot SO-101 episodes. Use async inference for real-time control. Reference GR00T N1.5 as the high-capability alternative if a 24GB+ GPU is available. |
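
The flow-matching cards above describe turning Gaussian noise into an action chunk by integrating an ODE in a small number of steps. The sketch below illustrates that sampling loop with fixed-step Euler integration. It is a toy under stated assumptions: `toy_velocity` is a hypothetical stand-in for the learned action expert (it pulls the sample toward one known target chunk), and the chunk size, action dimension, and target values are made up for illustration.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned action expert.

    For a straight-line path from noise (t=0) to a single known target (t=1),
    the velocity that stays on the path is (target - x) / (1 - t).
    A real model would predict this from the observation with a network.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]

def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate from Gaussian noise to an action chunk with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t takes values 0, dt, ..., 1 - dt, so 1 - t is never zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler update
    return x

# Illustrative sizes: 4 timesteps of a 2-dimensional action, flattened.
chunk_len, action_dim = 4, 2
target = [0.1 * i for i in range(chunk_len * action_dim)]
chunk = sample_action_chunk(target, num_steps=10)
```

With this toy field the final Euler step lands on the target for any step count. A trained action expert only approximates the velocity field, which is why the number of integration steps becomes a trade-off between speed and quality.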