Predict before you read

GR00T uses a 'System 1 / System 2' architecture. Which component runs at 50Hz, and which runs at a slower, deliberative rate?

Think about the dual-process theory of cognition: fast/reflexive vs slow/deliberative.

From Tokens to Embodied Minds  ·  Chapter 33 of 36

NVIDIA GR00T N1 and N1.5 with Isaac Lab

The dual-system humanoid foundation model

47.5%: GR00T N1.5 RoboCasa 30-demo success, up from 17.4% for N1 (NVIDIA Research, Jun 11, 2025)
89.6: RefCOCOg IoU for N1.5, surpassing Qwen2.5-VL-3B (85.2) with better grounding from Eagle 2.5
50Hz: System 1 (DiT action model) execution rate; the System 2 VLM plans, and System 1 executes at reflex speed

GR00T N1 (NVIDIA GTC, March 18, 2025) was introduced as the first open foundation model for generalist humanoid robots. Its central architectural insight is to separate slow deliberative planning from fast reflexive execution. System 2, a VLM, processes language instructions and generates task context at a slow deliberative rate. System 1, a Diffusion Transformer (DiT), executes motor commands at 50Hz using that context. The VLM thinks; the DiT acts.

GR00T N1.5 (NVIDIA Research, June 11, 2025) made three changes: a frozen Eagle 2.5 as the VLM (better vision-language grounding), a simplified MLP adapter with layer normalization (replacing the original cross-attention adapter), and FLARE loss (a trajectory-aware training objective). RoboCasa 30-demo success rose from 17.4% to 47.5%. RefCOCOg IoU rose to 89.6, moving from below Qwen2.5-VL-3B's 85.2 to above it. The full stack (Isaac Sim → Isaac Lab → GR00T-Dreams synthetic data → on-robot deployment) is now open and documented.

System 1 / System 2 — the dual-system architecture

GR00T's System 2 is a Vision-Language Model (Eagle 2.5 in N1.5, frozen) that processes multi-view camera observations and natural language instructions. At each planning timestep, System 2 generates a context embedding that conditions System 1's action generation. The VLM runs at a low frequency (1-5Hz) — it is the deliberative, computationally expensive component that interprets the task. System 1 is a Diffusion Transformer (DiT) that takes System 2's context embedding and the current proprioceptive state and denoises an action chunk at 50Hz. The DiT is the reflexive, real-time component.
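The two-rate structure described above can be sketched as a toy control loop. This is a minimal simulation under stated assumptions, not GR00T code: `slow_planner` and `fast_policy` are hypothetical stand-ins for the VLM and the DiT, the context embedding is reduced to a version counter, and the planner rate is fixed at 5Hz (the top of the 1-5Hz range). The loop is synchronous for clarity; a real deployment would run the planner in a separate thread or process.

```python
from dataclasses import dataclass

CONTROL_HZ = 50  # System 1 rate, from the text
PLANNER_HZ = 5   # assumed: upper end of the 1-5Hz range quoted for System 2


@dataclass
class Context:
    version: int  # stand-in for the context embedding; counts planner updates


def slow_planner(prev: Context) -> Context:
    """Hypothetical System 2 stand-in: returns a refreshed context."""
    return Context(version=prev.version + 1)


def fast_policy(context: Context, tick: int) -> tuple:
    """Hypothetical System 1 stand-in: returns (tick, context version used)."""
    return (tick, context.version)


def run(seconds: float = 1.0) -> list:
    ticks_per_plan = CONTROL_HZ // PLANNER_HZ  # 10 control ticks per planner update
    context = Context(version=0)
    log = []
    for tick in range(int(seconds * CONTROL_HZ)):
        if tick % ticks_per_plan == 0:
            context = slow_planner(context)     # refresh only on planner ticks
        log.append(fast_policy(context, tick))  # act on every control tick
    return log


log = run(1.0)
print(len(log), log[9], log[10])  # 50 (9, 1) (10, 2)
```

Over one simulated second the policy acts 50 times while the context is refreshed only 5 times, so each context is reused for 10 consecutive control ticks.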

The architectural separation is not cosmetic: System 1 must run at 50Hz because humanoid control requires fast feedback. At 50Hz each control step has a budget of 20ms, and VLM inference cannot meet that budget on current hardware (even small 3B VLMs take 50-200ms per forward pass). By separating the components, GR00T allows the VLM to run asynchronously at its own cadence while the DiT executes continuously at the required control frequency. This is loosely analogous to the human motor system: slow cortical planning plus fast spinal cord execution.
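The feasibility argument is simple arithmetic, worked through below using only the figures quoted in the text (50Hz control, 50-200ms per VLM forward pass). The helper name `vlm_budget` is illustrative.

```python
CONTROL_HZ = 50  # System 1 rate, from the text


def vlm_budget(vlm_latency_ms: float, control_hz: int = CONTROL_HZ):
    """Return (control ticks spanned by one VLM pass, max achievable VLM rate in Hz)."""
    tick_ms = 1000 / control_hz            # time available per control step
    ticks_spanned = vlm_latency_ms / tick_ms
    max_vlm_hz = 1000 / vlm_latency_ms     # back-to-back forward passes, no overhead
    return ticks_spanned, max_vlm_hz


for latency_ms in (50, 200):
    ticks, hz = vlm_budget(latency_ms)
    print(f"{latency_ms} ms forward pass spans {ticks:g} control ticks; planner rate <= {hz:g} Hz")
```

A 50ms forward pass spans 2.5 control ticks and caps the planner at 20Hz; a 200ms pass spans 10 ticks and caps it at 5Hz. Either way the VLM cannot produce a fresh output inside a single control step.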

N1 to N1.5 — three improvements

GR00T N1.5 (NVIDIA Research, June 11, 2025) made three specific improvements over N1. First, frozen Eagle 2.5 as the System 2 VLM: Eagle 2.5 provides better vision-language grounding than the N1 VLM, improving spatial understanding and instruction following. Freezing the VLM during post-training prevents catastrophic forgetting of language capabilities while allowing the DiT and adapter to specialize. Second, simplified MLP adapter with layer normalization: replaces N1's cross-attention adapter between System 2 and System 1 with a simpler MLP + LayerNorm, reducing the parameter count and stabilizing training. Third, FLARE loss: a trajectory-aware training objective that penalizes temporal inconsistency in the action sequence, encouraging smoother and more executable trajectories.
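To make "MLP + LayerNorm" concrete, here is a minimal pure-Python sketch of such an adapter. It is an illustration under assumptions, not the N1.5 implementation: the dimensions are toy values, the weights are random, the GELU activation and the placement of LayerNorm before the MLP are guesses, and the learned scale and shift of a full LayerNorm are omitted.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """Tanh approximation of GELU; the activation choice is an assumption."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def make_linear(in_dim, out_dim, rng):
    """Random weights and zero biases standing in for trained parameters."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, weight, bias):
    """Dense layer: one dot product per output row, plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_adapter(vlm_features, fc1, fc2):
    """LayerNorm -> Linear -> GELU -> Linear, mapping VLM features to DiT context."""
    h = layer_norm(vlm_features)
    h = [gelu(v) for v in linear(h, *fc1)]
    return linear(h, *fc2)


rng = random.Random(0)
VLM_DIM, HIDDEN_DIM, DIT_DIM = 8, 16, 4  # toy sizes, not the real model widths
fc1 = make_linear(VLM_DIM, HIDDEN_DIM, rng)
fc2 = make_linear(HIDDEN_DIM, DIT_DIM, rng)
vlm_features = [rng.gauss(0.0, 3.0) for _ in range(VLM_DIM)]
context = mlp_adapter(vlm_features, fc1, fc2)
print(len(context))  # 4
```

The point of the normalization step is that the adapter sees inputs on a consistent scale regardless of the magnitude of the frozen VLM's features, which is one plausible reading of why the text credits it with stabilizing training.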

The result: RoboCasa 30-demo task success rose from 17.4% (N1) to 47.5% (N1.5), a roughly 2.7x improvement that followed three targeted changes to the VLM, the adapter, and the training objective. RefCOCOg grounding IoU improved from below Qwen2.5-VL-3B (85.2) to 89.6, meaning N1.5 localizes referred objects more accurately than Qwen2.5-VL-3B, a competitive VLM. For the JHU humanoid capstone, this grounding improvement is directly useful: 'pick up the red mug on the left' requires precise spatial referencing, not just semantic object recognition.
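The headline figures can be checked with a few lines of arithmetic. The numbers are the ones quoted in the text; the variable names are illustrative.

```python
n1_success, n15_success = 17.4, 47.5   # RoboCasa 30-demo success, percent
qwen_iou, n15_iou = 85.2, 89.6         # RefCOCOg IoU

ratio = n15_success / n1_success        # relative improvement
gain_points = n15_success - n1_success  # absolute improvement, percentage points
iou_gap = n15_iou - qwen_iou            # grounding margin over Qwen2.5-VL-3B

print(f"{ratio:.2f}x, +{gain_points:.1f} points, IoU gap {iou_gap:.1f}")
# 2.73x, +30.1 points, IoU gap 4.4
```

Reporting both the ratio and the absolute gain is useful here: 2.7x sounds large partly because the N1 baseline is low, while the 30.1-point gain shows that more than half of the trials still fail at 47.5%.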

The full GR00T stack

The NVIDIA open stack for GR00T deployment: Isaac Sim (simulation environment, photorealistic) → Isaac Lab (robot learning framework, domain randomization, PPO training) → GR00T-Dreams (synthetic data generation from real seeds, Newton physics engine) → GR00T N1.5 post-training (LoRA fine-tuning of the DiT and adapter on task-specific data) → on-robot deployment. The GitHub repository (NVIDIA/Isaac-GR00T) provides all components. The 1X NEO Gamma deployment by 1X Technologies (NVIDIA-1X, March 18, 2025) is the closest published reference to a production humanoid using this stack.

Post-training GR00T N1.5 follows the same recipe as SmolVLA fine-tuning but at higher capability and cost: 30 demonstrations, LoRA on the DiT and MLP adapter (VLM frozen), 5-10 epochs in Isaac Lab. The DreamGen pipeline can augment 30 real demos to hundreds of synthetic hours in the Newton physics engine; synthetic data of this kind reportedly contributed to the N1.5 benchmark numbers alongside the three changes described above.
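The recipe above can be written down as a small configuration object with a sanity check. This is a hypothetical sketch: `PostTrainConfig`, its field names, and `validate` are invented for illustration and are not part of the NVIDIA/Isaac-GR00T repository. Only the values (30 demos, 5-10 epochs, frozen VLM, LoRA on the DiT and adapter) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostTrainConfig:
    """Hypothetical container for the post-training recipe described in the text."""
    num_demos: int = 30
    epochs: int = 5                                # text suggests 5-10
    freeze_vlm: bool = True                        # Eagle 2.5 stays frozen
    lora_targets: tuple = ("dit", "mlp_adapter")   # modules that receive LoRA


def validate(cfg: PostTrainConfig) -> list:
    """Return a list of deviations from the recipe; an empty list means none."""
    problems = []
    if not cfg.freeze_vlm:
        problems.append("VLM should stay frozen during post-training")
    if not 5 <= cfg.epochs <= 10:
        problems.append("epochs outside the 5-10 range")
    if "vlm" in cfg.lora_targets:
        problems.append("LoRA should not target the VLM")
    return problems


print(validate(PostTrainConfig()))                          # []
print(len(validate(PostTrainConfig(epochs=50, freeze_vlm=False))))  # 2
```

Encoding the recipe as data makes it easy to log alongside each run and to catch accidental departures, such as unfreezing the VLM, before spending GPU hours.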

GR00T vs SmolVLA for the JHU capstone

GR00T N1.5 is the high-capability option for the JHU humanoid: better grounding (RefCOCOg IoU 89.6), higher benchmark performance (47.5% RoboCasa), and full integration with Isaac Lab for sim-to-real. SmolVLA-450M is the consumer-GPU option: smaller, faster at inference, and easier to fine-tune. The decision depends on hardware. GR00T N1.5 requires a 40GB+ GPU for inference (A100, H100), which makes it practical on cloud instances but not on consumer hardware, while SmolVLA runs on a 12GB consumer GPU. For a capstone that must ship on a budget, start with SmolVLA and evaluate GR00T N1.5 on a rented cloud GPU.
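The hardware rule of thumb above reduces to a two-threshold check. The function below is a sketch of that decision only; `pick_model` is a hypothetical helper, and the 40GB and 12GB thresholds are the ones stated in the text rather than measured requirements.

```python
GR00T_MIN_GB = 40    # inference requirement quoted for GR00T N1.5
SMOLVLA_MIN_GB = 12  # consumer GPU size quoted for SmolVLA


def pick_model(gpu_memory_gb: float) -> str:
    """Pick the most capable model that fits in the given GPU memory."""
    if gpu_memory_gb >= GR00T_MIN_GB:
        return "GR00T N1.5"
    if gpu_memory_gb >= SMOLVLA_MIN_GB:
        return "SmolVLA"
    return "neither"


for gb in (80, 24, 8):
    print(gb, "GB ->", pick_model(gb))
# 80 GB -> GR00T N1.5
# 24 GB -> SmolVLA
# 8 GB -> neither
```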

On the 17.4% → 47.5% jump

The roughly 2.7x improvement from N1 to N1.5 is attributed to three targeted changes rather than to a larger model or more compute. It is the kind of result that makes architecture ablations worth reading carefully: in VLA systems, how the VLM is connected to the action model can matter as much as which components are used.

[Figure 33.1 diagram. System 2, VLM Planner (slow): frozen Eagle 2.5 VLM; inputs are multi-view RGB and a language instruction; output is a context embedding at 1-5Hz; RefCOCOg IoU 89.6 (vs Qwen 85.2). System 1, DiT Executor (fast): Diffusion Transformer action model; inputs are proprioceptive state and the System 2 context; output is an action chunk at 50Hz; FLARE loss for temporal consistency; LoRA fine-tunable on 30 demos. The two are connected by the MLP adapter + LayerNorm.]
Figure 33.1. GR00T N1.5 dual-system architecture. System 2 (frozen Eagle 2.5 VLM) processes observations and instructions at 1-5Hz, producing a context embedding via MLP adapter. System 1 (DiT) executes action chunks at 50Hz using that context. Three N1.5 improvements (Eagle 2.5, simplified adapter, FLARE loss) drove RoboCasa success from 17.4% to 47.5%.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What are the three architectural changes in GR00T N1.5 vs N1?
Q2 (Conceptual): Why does GR00T run System 1 (DiT) at 50Hz and System 2 (VLM) at a lower frequency?
Q3 (Synthetic): For the JHU humanoid capstone, when would you choose GR00T N1.5 over SmolVLA, and what is the minimum hardware requirement?