From Tokens to Embodied Minds · Drill cards · Chapter 31
Drills
Vision-Language-Action models — RT-2 and OpenVLA
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 31, note type = Basic.
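If the cards are held as (front, back) pairs, the TSV that the import step expects can be produced with Python's standard `csv` module. This is a minimal sketch: the `cards_to_tsv` helper and the two abbreviated sample cards are hypothetical, not part of the deck's tooling.

```python
import csv
import io

# Hypothetical sample: two cards abbreviated from this deck.
cards = [
    ("What is the primary source for OpenVLA?",
     "Kim et al., arXiv:2406.09246"),
    ("What is OpenVLA-OFT?",
     "Optimized Fine-Tuning variant of OpenVLA"),
]

def cards_to_tsv(cards):
    """Serialize (front, back) pairs as tab-separated lines, one card per line."""
    buf = io.StringIO()
    # csv.writer quotes a field only if it contains a tab, quote, or newline.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for front, back in cards:
        writer.writerow([front, back])
    return buf.getvalue()

tsv_text = cards_to_tsv(cards)
```

Writing `tsv_text` to a `.tsv` file gives one note per line, with field 1 as Front and field 2 as Back.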
| Front | Back |
|---|---|
| How does RT-2 tokenize robot actions? | Each of 7 action dimensions is discretized into 256 bins and encoded as one text token. The 7-token action sequence is generated by the same next-token prediction machinery as language. |
| What is the emergent generalization finding from RT-2? | Approximately 2x improvement on semantic reasoning tasks (new objects, novel instructions) vs RT-1. The gain emerges from co-fine-tuning the VLM on both robot trajectories and web VQA data. |
| What three components does OpenVLA improve over RT-2? | 1. Dual vision encoder: DINOv2 ViT-L/14 + SigLIP So400M, with features fused. 2. Open LLM backbone: Llama 2 7B. 3. Larger training dataset: 970K demos from Open X-Embodiment (vs RT-1's ~130K). |
| What is Open X-Embodiment? | A dataset of more than 1M robot trajectories from 22 robot types across 21 research institutions, aggregated by the RT-X team (2023). OpenVLA trains on a curated subset of 970K demonstrations, and this diversity gives it cross-embodiment generalization. |
| What is OpenVLA-OFT? | Optimized Fine-Tuning variant of OpenVLA: parallel decoding (all 7 action tokens simultaneously), action chunking, and continuous action representation. Reduces inference latency from ~2 seconds to under 200ms. |
| What LoRA rank is recommended for OpenVLA fine-tuning on a 30-demo dataset? | Rank 16-32 applied to the Llama 2 backbone, with DINOv2 and SigLIP frozen. This fits in 24GB VRAM and converges in 5-10 epochs on small datasets. |
| What is the inference latency of base OpenVLA vs OpenVLA-OFT? | Base OpenVLA: ~2 seconds per action (7 sequential tokens). OpenVLA-OFT: under 200ms per action (parallel decoding + action chunking). |
| What is the primary source for OpenVLA? | OpenVLA: An Open-Source Vision-Language-Action Model, Kim et al., arXiv:2406.09246, June 13, 2024. Project: openvla.github.io. |
| On the 29-task benchmark, how does OpenVLA compare to RT-2-X? | OpenVLA (7B) beats RT-2-X (55B) by 16.5 percentage points in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. |
| For the JHU humanoid capstone, when should you use OpenVLA vs SmolVLA? | OpenVLA (7B) if you have a 24 GB+ GPU and want the strongest open VLA baseline. SmolVLA (450M) if you have a 12 GB consumer GPU or need faster inference. Start with OpenVLA zero-shot, then use the gap to target performance to decide whether to fine-tune OpenVLA or switch to SmolVLA to fit the hardware constraint. |
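
The action-tokenization card can be made concrete with a small sketch of uniform binning. It assumes each of the 7 action dimensions has already been normalized to [-1, 1]; the bounds, the function names, and the sample action are illustrative assumptions, and the mapping from bin index to an actual vocabulary token is model-specific and not shown.

```python
N_BINS = 256  # bins per action dimension, as stated on the card

def discretize(action, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map each continuous action value to an integer bin index in [0, n_bins - 1]."""
    bins = []
    for value in action:
        clipped = min(max(value, low), high)        # clamp to the assumed range
        scaled = (clipped - low) / (high - low)     # now in [0, 1]
        bins.append(min(int(scaled * n_bins), n_bins - 1))
    return bins

def undiscretize(bins, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map bin indices back to the center value of each bin."""
    return [low + (b + 0.5) / n_bins * (high - low) for b in bins]

# Hypothetical 7-dimensional action (e.g. position, rotation, gripper).
action = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.9]
bin_ids = discretize(action)       # 7 integers, one per dimension
recovered = undiscretize(bin_ids)  # within half a bin width of the input
```

With 256 bins over a range of width 2, the round-trip error per dimension is at most 1/256, which is the precision given up by treating control as next-token prediction.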