Vision-Language-Action models — RT-2 and OpenVLA

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 31, note type = Basic.

Front	Back
How does RT-2 tokenize robot actions?	Each of 7 action dimensions is discretized into 256 bins and encoded as one text token. The 7-token action sequence is generated by the same next-token prediction machinery as language.
What is the emergent generalization finding from RT-2?	Approximately 2x improvement on semantic reasoning tasks (new objects, novel instructions) vs RT-1 — emergent from co-finetuning the VLM on both robot trajectories and web VQA data.
What three components does OpenVLA improve over RT-2?	1. Dual vision encoder: DINOv2 ViT-B/14 + SigLIP So400M fused. 2. Open LLM backbone: Llama 2 7B. 3. Larger training dataset: 970K demos from Open X-Embodiment (vs RT-1's ~130K).
What is Open X-Embodiment?	A dataset of 970K robot demonstrations from 22 robot types across 21 research institutions, aggregated by the RT-X team (2023). Training on this diverse dataset gives OpenVLA cross-embodiment generalization.
What is OpenVLA-OFT?	Optimized Fine-Tuning variant of OpenVLA: parallel decoding (all 7 action tokens simultaneously), action chunking, and continuous action representation. Reduces inference latency from ~2 seconds to under 200ms.
What LoRA rank is recommended for OpenVLA fine-tuning on a 30-demo dataset?	Rank 16-32 applied to the Llama 2 backbone, with DINOv2 and SigLIP frozen. This fits in 24GB VRAM and converges in 5-10 epochs on small datasets.
What is the inference latency of base OpenVLA vs OpenVLA-OFT?	Base OpenVLA: ~2 seconds per action (7 sequential tokens). OpenVLA-OFT: under 200ms per action (parallel decoding + action chunking).
What is the primary source for OpenVLA?	OpenVLA: An Open-Source Vision-Language-Action Model, Kim et al., arXiv:2406.09246, June 13, 2024. Project: openvla.github.io.
On the 29-task benchmark, how does OpenVLA compare to RT-2-X?	OpenVLA (7B) beats RT-2-X (55B) by 16.5 percentage points absolute task success across 29 tasks on 7 robot platforms.
For the JHU humanoid capstone, when should you use OpenVLA vs SmolVLA?	OpenVLA (7B) if you have a 24GB+ GPU and want the strongest open VLA baseline. SmolVLA (450M) if you have a 12GB consumer GPU or need faster inference. Start with OpenVLA zero-shot, use the gap to decide whether to fine-tune OpenVLA or switch to SmolVLA for the hardware constraint.