Chapter 31 · Vision-Language-Action models — RT-2 and OpenVLA

RT-2 (Google DeepMind, July 28, 2023) answered a question that seems obvious only in retrospect: if VLMs already know that a 'coffee cup' is graspable, that 'fragile' means handle gently, and that 'leftmost' is a spatial instruction, why not teach them robot actions in the same token stream? RT-2 co-finetunes PaLI-X (55B) or PaLM-E (12B) on robot trajectories from the RT-1 dataset alongside web VQA data, representing each action dimension as a text token. The result is an approximately 2x generalization improvement on semantic reasoning tasks, with emergent capabilities such as 'pick up the object that could help with headache' that no robot-only model had shown. OpenVLA (Kim et al., arXiv:2406.09246, June 13, 2024) reproduced and extended RT-2 in open source: 7B parameters (Llama 2 backbone + DINOv2 ViT-L/14 + SigLIP ViT-So400M), trained on 970K demonstrations from the Open X-Embodiment dataset, beating the closed RT-2-X (55B) by 16.5 percentage points of absolute task success across 29 tasks. It is the first reproducible, open-weights VLA and the baseline that later work has measured against.

RT-2 — the VLM-as-policy architecture

RT-2's core architectural decision is action tokenization: each of the 7 action dimensions (end-effector x, y, z, roll, pitch, yaw, gripper) is discretized into 256 bins and encoded as a single text token. A 7-DoF action therefore becomes a sequence of 7 tokens that the VLM emits as its output. The model is co-finetuned on robot trajectory data (input: image + text instruction, output: action token sequence) alongside the original VLM pre-training data (web image-text pairs). This multi-task training preserves the VLM's semantic knowledge while teaching it robot actions. The same weights handle 'describe this image' and 'grasp this object,' which is what enables the emergent generalization.
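The binning step described above can be sketched in a few lines. This is a minimal illustration, not RT-2's implementation: the per-dimension bounds `LOW` and `HIGH` are hypothetical, the bins are assumed to be uniform, and the bin indices are written out as plain integer strings rather than mapped onto a specific tokenizer's vocabulary.

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as described above

def tokenize_action(action, low, high, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin index in [0, n_bins - 1]."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    scaled = (action - low) / (high - low)  # each dimension now lies in [0, 1]
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)

def detokenize_action(bins, low, high, n_bins=N_BINS):
    """Map bin indices back to continuous values at the bin centers."""
    bins = np.asarray(bins, dtype=np.float64)
    return low + (bins + 0.5) / n_bins * (high - low)

# Hypothetical per-dimension bounds: x, y, z (m), roll, pitch, yaw (rad), gripper
LOW = np.array([-0.25, -0.25, -0.25, -np.pi, -np.pi, -np.pi, 0.0])
HIGH = np.array([0.25, 0.25, 0.25, np.pi, np.pi, np.pi, 1.0])

action = np.array([0.10, -0.05, 0.20, 0.0, 0.5, -1.0, 1.0])
bins = tokenize_action(action, LOW, HIGH)
action_text = " ".join(str(b) for b in bins)  # 7 integers, one per action dimension
recovered = detokenize_action(bins, LOW, HIGH)
```

Decoding at bin centers keeps the round-trip error within half a bin width per dimension, which is the discretization cost that the tokenized representation pays.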

The price is discretization error: 256 bins for end-effector position in a workspace of ~50cm gives ~2mm resolution. For coarse manipulation (pick-and-place) this is acceptable; for precision assembly (inserting a USB connector) it is not. RT-2-X (the scaled version, 55B parameters, trained on the Open X-Embodiment dataset) achieves broad generalization, but a model of that size is slow at inference: the RT-2 paper reports control rates of roughly 1-3 Hz for its 55B variant, which is too slow for reactive control.
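The resolution figure above follows from dividing the workspace extent by the bin count. Assuming uniform bins and decoding at bin centers (an assumption of this worked example, not a statement about RT-2's exact scheme), the worst-case round-trip error is half a bin:

```latex
\Delta = \frac{0.5\,\mathrm{m}}{256} \approx 1.95\,\mathrm{mm},
\qquad
\max_{a} \lvert a - \hat{a} \rvert = \frac{\Delta}{2} \approx 0.98\,\mathrm{mm}
```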

OpenVLA — open-source VLA at 7B

OpenVLA (Kim et al., June 13, 2024) retains RT-2's action tokenization scheme but changes three components. First, the vision encoder: instead of PaLI-X's single encoder, OpenVLA fuses DINOv2 ViT-L/14 (dense spatial features) and SigLIP ViT-So400M (language-aligned features), concatenating their outputs before the LLM backbone. Second, the LLM: Llama 2 7B, with fully open weights. Third, the training data: 970K demonstrations from the Open X-Embodiment dataset, which is larger and more diverse than RT-1. The result is a model that beats RT-2-X 55B by 16.5 percentage points on 29 tasks across multiple robot embodiments, despite being 7x smaller and fully reproducible.
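A rough sketch of the fusion step: per-patch features from the two encoders are concatenated along the channel axis and projected to the LLM's embedding width. Everything here is illustrative. The sizes are toy values, random arrays stand in for real encoder outputs, and a single linear layer stands in for whatever learned projector the model actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the real encoder and LLM widths are much larger.
N_PATCHES = 16   # patch tokens per image
D_DINO = 6       # hypothetical DINOv2 feature width
D_SIGLIP = 10    # hypothetical SigLIP feature width
D_LLM = 12       # hypothetical LLM embedding width

def fuse_patch_features(dino_feats, siglip_feats, w_proj, b_proj):
    """Concatenate per-patch features channel-wise, then project to the LLM width."""
    if dino_feats.shape[0] != siglip_feats.shape[0]:
        raise ValueError("both encoders must produce the same number of patches")
    fused = np.concatenate([dino_feats, siglip_feats], axis=-1)  # (N, D_DINO + D_SIGLIP)
    return fused @ w_proj + b_proj                               # (N, D_LLM)

# Random stand-ins for the two encoders' patch features.
dino_feats = rng.normal(size=(N_PATCHES, D_DINO))
siglip_feats = rng.normal(size=(N_PATCHES, D_SIGLIP))
w_proj = rng.normal(size=(D_DINO + D_SIGLIP, D_LLM)) * 0.02
b_proj = np.zeros(D_LLM)

visual_tokens = fuse_patch_features(dino_feats, siglip_feats, w_proj, b_proj)
```

Each row of `visual_tokens` would then be fed to the LLM as one input embedding, ahead of the instruction text.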

OpenVLA-OFT (Optimized Fine-Tuning, Kim et al., 2025 follow-up) addresses the original model's inference speed with parallel decoding (all action tokens decoded in a single forward pass rather than sequentially), action chunking (several future actions predicted per forward pass), and a continuous action representation (replacing discrete bins with an action head trained by L1 regression). The authors report roughly 26x faster action generation than the original OpenVLA, which brings the model into the range needed for high-frequency control on real robots.
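Action chunking is the easiest of these ideas to illustrate in isolation. In the sketch below, the policy is queried only when the buffer of predicted actions is empty, so the number of forward passes drops by a factor of the chunk size. `dummy_policy` is a hypothetical stand-in for a VLA forward pass, and the loop executes each chunk open-loop, which is a simplifying assumption rather than a description of the OFT implementation.

```python
from collections import deque

def run_chunked_control(policy, n_steps, chunk_size):
    """Run n_steps control steps, querying the policy only when the action buffer is empty."""
    buffer = deque()
    executed, n_queries = [], 0
    for step in range(n_steps):
        if not buffer:
            buffer.extend(policy(step, chunk_size))  # one forward pass yields chunk_size actions
            n_queries += 1
        executed.append(buffer.popleft())
    return executed, n_queries

def dummy_policy(step, chunk_size):
    """Hypothetical stand-in for a VLA forward pass: returns placeholder 7-dim actions."""
    return [[0.0] * 7 for _ in range(chunk_size)]

actions, queries = run_chunked_control(dummy_policy, n_steps=20, chunk_size=8)
```

The trade-off is reactivity: within a chunk the robot does not see new observations, so larger chunks amortize inference cost at the price of slower responses to disturbances.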

Open X-Embodiment and why training data matters

The Open X-Embodiment dataset (RT-X team, 2023) aggregated more than one million demonstration episodes from 22 robot types across 21 research institutions, covering diverse objects, tasks, and embodiments. OpenVLA trains on a curated subset of 970K episodes, and this is what gives it cross-embodiment generalization: a policy that has seen grasping on a WidowX, a Franka, and a UR5 learns that 'grasp' is a semantic primitive independent of the specific arm. This is the same insight that scaled language model pretraining: generalization comes from data diversity, not data volume alone.

For fine-tuning OpenVLA on your own tasks, the standard recipe is LoRA on the LLM backbone (typically rank 16-32) while keeping the vision encoders frozen. Berkeley CS 294-277 Robots That Learn (Jitendra Malik, Spring 2026) uses OpenVLA as a central case study. The openvla.github.io project page links to the released code and fine-tuning scripts; check there for current support before targeting SO-100/SO-101 hardware.
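To make the LoRA recipe concrete, here is a minimal NumPy sketch of a single adapted linear layer. The frozen weight `W` is left untouched and only the low-rank factors `A` and `B` would be trained. The layer size, the scaling factor `ALPHA`, and the initialization scale are assumptions chosen for illustration; in practice you would use a library such as PEFT rather than writing this by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

D_OUT, D_IN = 64, 64   # illustrative layer size, far smaller than a real LLM projection
RANK, ALPHA = 16, 32   # RANK is in the range quoted above; ALPHA is a hypothetical choice

W = rng.normal(size=(D_OUT, D_IN))        # frozen pretrained weight
A = rng.normal(size=(RANK, D_IN)) * 0.01  # trainable, small random init
B = np.zeros((D_OUT, RANK))               # trainable, zero init so the adapter starts as a no-op

def lora_forward(x, W, A, B, alpha=ALPHA, rank=RANK):
    """Compute x @ (W + (alpha / rank) * B @ A).T without forming the merged matrix."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

def trainable_fraction(d_out, d_in, rank):
    """Fraction of one linear layer's parameter count that LoRA adds as trainable weights."""
    return rank * (d_in + d_out) / (d_out * d_in)

x = rng.normal(size=(4, D_IN))
y = lora_forward(x, W, A, B)
frac_4096 = trainable_fraction(4096, 4096, 16)  # a 4096 x 4096 projection at rank 16
```

For a 4096 x 4096 projection at rank 16, the adapter adds under 1% of the layer's parameter count, which is why LoRA fits on hardware where full fine-tuning of a 7B model does not.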

Capstone position

OpenVLA is the strongest open baseline to establish on the JHU humanoid before choosing among it, GR00T N1.5, and SmolVLA. The decision matrix: OpenVLA at 7B requires a 24GB+ GPU for fine-tuning; SmolVLA at 450M runs on a consumer 12GB GPU; GR00T N1.5 is the most capable but also the heaviest. Run OpenVLA fine-tuning first on 30 demonstrations: the gap between zero-shot and fine-tuned success will tell you whether the task is within the pretrained distribution or requires more data.
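A back-of-the-envelope calculation shows where the GPU requirements come from. The sketch below counts weight memory only, assuming 2 bytes per parameter (bfloat16). Activations, gradients, optimizer state, and adapter weights come on top and are not modeled here.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (10**9 bytes); bfloat16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

openvla_gb = weight_memory_gb(7e9)    # 7B parameters, as stated above
smolvla_gb = weight_memory_gb(450e6)  # 450M parameters, as stated above
```

With the weights of a 7B model alone taking about 14 GB in bfloat16, a 24GB card leaves limited headroom for everything else, which is consistent with treating 24GB as a floor rather than a comfortable target.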

On emergent VLA capabilities

RT-2's '2x generalization' is real but narrow: it generalizes to new objects within the semantic space of its VLM pretraining. It does not generalize to new physical environments, new task structures, or contact-rich manipulation that was not in the demonstration data. π0.5 is the current frontier on that harder generalization.

Figure 31.1. RT-2 (Google DeepMind, July 2023) vs OpenVLA (Kim et al., June 2024). Both use action tokenization (7 dims × 256 bins → text tokens). OpenVLA's dual vision encoder (DINOv2 + SigLIP) and open Llama 2 7B backbone beat RT-2-X 55B by 16.5 points on 29 tasks at 7x lower parameter count.

Primary source · Build · Capstone ladder

Primary source. OpenVLA: An Open-Source Vision-Language-Action Model, Kim et al., arXiv:2406.09246, June 13, 2024

Build. Run OpenVLA inference (the openvla/openvla-7b checkpoint on Hugging Face, or the code linked from openvla.github.io) on a MuJoCo Franka arm or a LeRobot SO-100. Report the zero-shot success rate on 5 tasks and the per-step inference latency. Then fine-tune on 30 demonstrations using LoRA (rank 16) for 5 epochs and report the change in success rate from zero-shot to fine-tuned.
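One possible way to organize the numbers this exercise asks for is sketched below. The task names and episode outcomes are placeholders for illustration, not measured results, and the report layout is just one reasonable choice.

```python
from statistics import mean

def success_rate(outcomes):
    """Fraction of successful episodes; outcomes is a non-empty list of booleans."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)

def summarize(zero_shot, fine_tuned, latencies_s):
    """Per-task success rates, their delta, and the mean per-step latency in seconds."""
    tasks = {}
    for task in zero_shot:
        zs = success_rate(zero_shot[task])
        ft = success_rate(fine_tuned[task])
        tasks[task] = {"zero_shot": zs, "fine_tuned": ft, "delta": ft - zs}
    return {"tasks": tasks, "mean_latency_s": mean(latencies_s)}

# Placeholder outcomes for illustration only; replace with your own rollouts.
zero_shot = {
    "pick_cube": [True, False, False, False],
    "open_drawer": [False, False, False, False],
}
fine_tuned = {
    "pick_cube": [True, True, True, False],
    "open_drawer": [True, False, True, False],
}
report = summarize(zero_shot, fine_tuned, latencies_s=[0.20, 0.25, 0.15])
```

With only a handful of episodes per task the success rates are noisy, so it is worth reporting the episode counts alongside the deltas.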

Capstone ladder. OpenVLA is the strongest open baseline to establish on the JHU humanoid before deciding among it, GR00T N1.5, and SmolVLA. The fine-tuning experiment on 30 demos gives you empirical evidence for whether the task is within the pretrained distribution, which is the most important signal for choosing among the three VLA options.

Retrieve before you continue

Three questions on what you just read

Q1 Factual How does RT-2 represent a 7-DoF robot action as text tokens?

Q2 Conceptual Why does co-finetuning on both robot data and web VQA data enable emergent generalization in RT-2?

Q3 Synthetic You want to fine-tune OpenVLA on 30 demonstrations of a household task on a 24GB GPU. What is the recommended procedure and expected outcome?