Predict before you read

Before you read — how does RT-2 represent robot actions so that a language model can learn them?

Think about what format language models already know how to generate.

From Tokens to Embodied Minds  ·  Chapter 31 of 36

Vision-Language-Action models — RT-2 and OpenVLA

The architectures that turn VLMs into policies

55B→7B: RT-2-X to OpenVLA, 7x fewer parameters and +16.5 percentage points absolute task success across 29 tasks
970K: demonstrations used to train OpenVLA, drawn from Open X-Embodiment, the largest open robot dataset
~2x: generalization improvement of RT-2 over RT-1 on semantic reasoning tasks, emergent from VLM co-finetuning

RT-2 (Google DeepMind, July 28, 2023) answered a question that seems obvious only in retrospect: if VLMs already know that 'coffee cup' is graspable, that 'fragile' means handle gently, and that 'leftmost' is a spatial instruction, why not teach them robot actions in the same token stream? RT-2 co-finetunes PaLI-X (55B) or PaLM-E (12B) on robot trajectories from the RT-1 dataset alongside web VQA data, representing each action dimension as a text token. The result is an approximately 2x generalization improvement on semantic reasoning tasks, including emergent capabilities such as 'pick up the object that could help with headache' that no robot-only model had shown.

OpenVLA (Kim et al., arXiv:2406.09246, June 13, 2024) reproduced and extended RT-2 in open source: 7B parameters (Llama 2 backbone + DINOv2 ViT-B/14 + SigLIP ViT-So400M), trained on 970K demonstrations from the Open X-Embodiment dataset, beating the closed RT-2-X (55B) by 16.5 percentage points of absolute task success across 29 tasks. It is the first reproducible, open-weights VLA and the baseline that later work has measured against.

RT-2 — the VLM-as-policy architecture

RT-2's core architectural decision is action tokenization: each of the 7 action dimensions (end-effector x, y, z, roll, pitch, yaw, gripper) is discretized into 256 bins and encoded as a single text token. A 7-DoF action becomes a sequence of 7 tokens appended to the VLM's output. The model is then co-finetuned on robot trajectory data (input: image + text instruction, output: action token sequence) alongside the original VLM pre-training data (web image-text pairs). This multi-task training preserves the VLM's semantic knowledge while teaching it robot actions — the same weights handle 'describe this image' and 'grasp this object,' which is what enables the emergent generalization.
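The tokenization step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not RT-2's actual implementation: the per-dimension bounds in `BOUNDS`, the uniform binning, and the decode-to-bin-centre rule are all assumptions made for the example, and the function names are hypothetical.

```python
NUM_BINS = 256  # bins per action dimension, as described in the text


def discretize(value, low, high, num_bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer bin in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper bound would land in bin `num_bins`, so clamp it into the last bin.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(bin_index, low, high, num_bins=NUM_BINS):
    """Map a bin index back to the centre of its interval."""
    width = (high - low) / num_bins
    return low + (bin_index + 0.5) * width


def action_to_tokens(action, bounds):
    """Turn a 7-DoF action into 7 integer tokens, one per dimension."""
    return [discretize(v, lo, hi) for v, (lo, hi) in zip(action, bounds)]


def tokens_to_action(tokens, bounds):
    """Invert action_to_tokens, up to the rounding error of one bin."""
    return [undiscretize(t, lo, hi) for t, (lo, hi) in zip(tokens, bounds)]


# Hypothetical bounds: x, y, z in metres; roll, pitch, yaw in radians; gripper in [0, 1].
BOUNDS = [(-0.25, 0.25)] * 3 + [(-3.14, 3.14)] * 3 + [(0.0, 1.0)]

action = [0.10, -0.05, 0.20, 0.0, 0.5, -1.0, 1.0]
tokens = action_to_tokens(action, BOUNDS)
recovered = tokens_to_action(tokens, BOUNDS)

# The model would emit the tokens as a short text string appended to its output.
print(" ".join(str(t) for t in tokens))
print(recovered)
```

The round trip is lossy by design: `recovered` differs from `action` by at most half a bin width in each dimension, which is the discretization error the next paragraph quantifies.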

The price is discretization error: 256 bins for end-effector position across a workspace of ~50cm give a bin width of about 2mm per axis (500mm / 256 ≈ 1.95mm). For coarse manipulation (pick-and-place) this is acceptable; for precision assembly (inserting a USB connector) it is not. RT-2-X (the scaled version, 55B parameters, trained on the Open X-Embodiment dataset) achieves broad generalization, but at an inference latency measured in seconds, which is too slow for reactive control.
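The resolution figure can be checked with a short calculation. The sketch below assumes uniform bins over a 500mm axis and decoding to bin centres, so the worst-case rounding error is half a bin width; `bin_width_mm` is a hypothetical helper name.

```python
def bin_width_mm(workspace_mm, num_bins=256):
    """Width of one uniform bin along a single axis, in millimetres."""
    return workspace_mm / num_bins


width = bin_width_mm(500.0)       # ~50 cm workspace from the text
max_rounding_error = width / 2.0  # assumes actions decode to bin centres

print(f"bin width: {width:.2f} mm")
print(f"worst-case rounding error: {max_rounding_error:.2f} mm")
```

Doubling the workspace doubles the bin width, so the same 256-bin scheme gets coarser as the robot's reach grows.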

OpenVLA — open-source VLA at 7B

OpenVLA (Kim et al., June 13, 2024) retains RT-2's action tokenization scheme but changes three components. First, the vision encoder: instead of PaLI-X's single encoder, OpenVLA fuses DINOv2 ViT-B/14 (dense spatial features) and SigLIP ViT-So400M (language-aligned features), concatenating their outputs before the LLM backbone. Second, the LLM: Llama 2 7B, with fully open weights. Third, the training data: 970K demonstrations from the Open X-Embodiment dataset, which is larger and more diverse than the RT-1 dataset. The result is a model that beats RT-2-X 55B by 16.5 percentage points on 29 tasks across 7 robot platforms, despite being 7x smaller and fully reproducible.
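The fusion step can be pictured as a per-patch concatenation of the two encoders' feature vectors. The sketch below uses plain lists and toy dimensions; the sizes, the assumption that both encoders produce the same number of patches, and the name `fuse_patch_features` are illustrative choices, not the model's real configuration.

```python
def fuse_patch_features(dino_patches, siglip_patches):
    """Concatenate two encoders' features patch by patch (along the feature axis)."""
    if len(dino_patches) != len(siglip_patches):
        raise ValueError("both encoders must produce the same number of patches")
    return [d + s for d, s in zip(dino_patches, siglip_patches)]


# Toy sizes chosen for illustration only, not the real model dimensions.
NUM_PATCHES, DINO_DIM, SIGLIP_DIM = 4, 3, 2
dino = [[0.1 * (p + 1)] * DINO_DIM for p in range(NUM_PATCHES)]
siglip = [[-0.1 * (p + 1)] * SIGLIP_DIM for p in range(NUM_PATCHES)]

fused = fuse_patch_features(dino, siglip)
print(len(fused), len(fused[0]))  # number of patches, fused feature size
```

Each fused patch keeps the spatial features and the language-aligned features side by side, so whatever maps the result into the LLM's input space can draw on both.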

OpenVLA-OFT (Optimized Fine-Tuning, Kim et al., 2025 follow-up) addresses the original model's inference latency with parallel decoding (all 7 action tokens decoded simultaneously rather than sequentially), action chunking, and a continuous action representation (replacing discrete bins with an action head trained by L1 regression). These changes reduce inference latency from ~2 seconds to under 200ms, making OpenVLA-OFT deployable on real robots at 5Hz.
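A back-of-the-envelope sketch shows why these changes matter for control rate. It assumes that autoregressive decoding costs one forward pass per action token, that parallel decoding costs a single pass, and that every action in a chunk is executed before the next inference call. The chunk size of 8 is a hypothetical value, and both function names are invented for the example.

```python
def forward_passes(num_action_tokens, parallel):
    """Decoder passes needed to produce one action under the stated assumptions."""
    return 1 if parallel else num_action_tokens


def control_rate_hz(latency_s, actions_per_inference=1):
    """Actions delivered per second when each inference call takes latency_s."""
    return actions_per_inference / latency_s


print(forward_passes(7, parallel=False), forward_passes(7, parallel=True))
print(control_rate_hz(2.0))                           # ~2 s latency from the text
print(control_rate_hz(0.2))                           # 200 ms latency from the text
print(control_rate_hz(0.2, actions_per_inference=8))  # hypothetical chunk of 8 actions
```

Under these assumptions, lower latency and larger chunks multiply: the first raises how often the model can be queried, and the second raises how many actions each query yields.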

Open X-Embodiment and why training data matters

The Open X-Embodiment dataset (RT-X team, 2023) aggregated demonstrations from 22 robot types across 21 research institutions; OpenVLA trains on 970K episodes from it, covering diverse objects, tasks, and embodiments. Training on this data is what gives OpenVLA cross-embodiment generalization: a policy that has seen grasping on a WidowX, a Franka, and a UR5 learns that 'grasp' is a semantic primitive independent of the specific arm. This is the same insight that drove the scaling of language model pretraining: generalization comes from data diversity, not data volume alone.

For fine-tuning OpenVLA on your own tasks, the standard recipe is LoRA on the LLM backbone (typically rank 16-32) while keeping the vision encoders frozen. Berkeley CS 294-277 Robots That Learn (Jitendra Malik, Spring 2026) uses OpenVLA as a central case study. The openvla.github.io project page has fine-tuning scripts targeting the SO-100/SO-101 hardware.
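To see why LoRA makes fine-tuning affordable, it helps to count parameters. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors of shapes d_out × r and r × d_in instead. The sketch below applies that count to a single square projection at Llama 2 7B's hidden size of 4096; which layers actually receive adapters is a configuration choice not specified here, and the helper names are hypothetical.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in, d_out, rank):
    """Adapter parameters relative to the frozen d_out x d_in weight matrix."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


HIDDEN = 4096  # hidden size of Llama 2 7B

for rank in (16, 32):  # the rank range quoted in the text
    added = lora_param_count(HIDDEN, HIDDEN, rank)
    share = trainable_fraction(HIDDEN, HIDDEN, rank)
    print(f"rank {rank}: {added} trainable parameters, {share:.2%} of the frozen matrix")
```

Because only the adapters receive gradients and optimizer state, the memory cost of training scales with this small fraction rather than with the full matrix.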

Capstone position

OpenVLA is the strongest open baseline to try on the JHU humanoid before choosing among it, GR00T N1.5, and SmolVLA. The decision matrix: OpenVLA at 7B requires a 24GB+ GPU for fine-tuning; SmolVLA at 450M runs on a consumer 12GB GPU; GR00T N1.5 has the highest capability but is the heaviest. Run OpenVLA fine-tuning first on 30 demonstrations: the gap between zero-shot and fine-tuned performance will tell you whether the task is within the pretrained distribution or requires more data.

On emergent VLA capabilities

RT-2's '2x generalization' is real but narrow: it generalizes to new objects within the semantic space of its VLM pretraining. It does not generalize to new physical environments, new task structures, or contact-rich manipulation that was not in the demonstration data. π0.5 is the current frontier on that harder generalization.

RT-2 vs OpenVLA Architecture
RT-2 (Google DeepMind, Jul 2023): PaLI-X / PaLM-E VLM; single vision encoder (ViT internal); action tokenization, 7 dims × 256 bins → 7 text tokens; co-finetuned on robot demos + web VQA data; 55B parameters (RT-2-X); ~2s inference latency; closed weights; +2x semantic generalization vs RT-1.
OpenVLA (Kim et al., Jun 2024): DINOv2 ViT-B/14 (dense spatial) and SigLIP So400M (language-aligned), fused → Llama 2 7B backbone; same action tokenization as RT-2; 970K demos (Open X-Embodiment); 7B parameters, open weights; LoRA fine-tuning on 30 demos; +16.5% vs RT-2-X 55B (29 tasks).
Winner: OpenVLA. 7x fewer parameters, +16.5 points, open weights, reproducible.
Figure 31.1. RT-2 (Google DeepMind, July 2023) vs OpenVLA (Kim et al., June 2024). Both use action tokenization (7 dims × 256 bins → text tokens). OpenVLA's dual vision encoder (DINOv2 + SigLIP) and open Llama 2 7B backbone beat RT-2-X 55B by 16.5 points on 29 tasks at 7x lower parameter count.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): How does RT-2 represent a 7-DoF robot action as text tokens?
Q2 (Conceptual): Why does co-finetuning on both robot data and web VQA data enable emergent generalization in RT-2?
Q3 (Synthetic): You want to fine-tune OpenVLA on 30 demonstrations of a household task on a 24GB GPU. What is the recommended procedure and expected outcome?