GR00T N1 (NVIDIA GTC, March 18, 2025) was introduced as the first open foundation model for generalist humanoid robots. Its central architectural idea is to separate slow deliberative planning from fast reflexive execution. System 2, a VLM, processes language instructions and generates task context at a low rate. System 1, a Diffusion Transformer (DiT), executes motor commands at 50Hz using that context. The VLM thinks; the DiT acts. GR00T N1.5 (NVIDIA Research, June 11, 2025) made three changes to the model and its training: a frozen Eagle 2.5 as the VLM (better vision-language grounding), a simplified MLP adapter with layer normalization (replacing the original cross-attention adapter), and FLARE loss (a trajectory-aware training objective). RoboCasa 30-demo success rose from 17.4% to 47.5%. RefCOCOg grounding IoU rose to 89.6, moving from below Qwen2.5-VL-3B's 85.2 to above it. The full stack (Isaac Sim → Isaac Lab → GR00T-Dreams synthetic data → on-robot deployment) is now open and documented.
System 1 / System 2 — the dual-system architecture
GR00T's System 2 is a Vision-Language Model (Eagle 2.5 in N1.5, frozen) that processes multi-view camera observations and natural language instructions. At each planning timestep, System 2 generates a context embedding that conditions System 1's action generation. The VLM runs at a low frequency (1-5Hz) — it is the deliberative, computationally expensive component that interprets the task. System 1 is a Diffusion Transformer (DiT) that takes System 2's context embedding and the current proprioceptive state and denoises an action chunk at 50Hz. The DiT is the reflexive, real-time component.
The architectural separation is not cosmetic. System 1 must run at 50Hz because humanoid control requires fast feedback, and a 50Hz loop leaves only 20ms per step. VLM inference cannot meet that budget on current hardware: even small 3B VLMs take 50-200ms per forward pass. By separating the components, GR00T lets the VLM run asynchronously at its own cadence while the DiT executes continuously at the required control frequency. This is loosely analogous to the human motor system: slow cortical planning paired with fast spinal cord execution.
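The two-rate schedule can be illustrated with a small simulation. This is a minimal sketch, not GR00T code: `slow_planner`, `fast_policy`, and `Context` are hypothetical stand-ins for the VLM, the DiT, and the context embedding, and the loop is synchronous for clarity, whereas a real deployment would run the planner asynchronously. The planner rate of 5Hz is taken from the upper end of the 1-5Hz range quoted above.

```python
from __future__ import annotations

from dataclasses import dataclass

CONTROL_HZ = 50  # System 1 rate, from the text
PLANNER_HZ = 5   # System 2 rate, upper end of the 1-5Hz range in the text


@dataclass
class Context:
    """Hypothetical stand-in for the System 2 context embedding."""
    version: int


def slow_planner(version: int) -> Context:
    # Placeholder for one VLM forward pass.
    return Context(version=version)


def fast_policy(context: Context, tick: int) -> tuple[int, int]:
    # Placeholder for the DiT: reports which context each control tick used.
    return (tick, context.version)


def run(seconds: float = 1.0) -> list[tuple[int, int]]:
    ticks_per_plan = CONTROL_HZ // PLANNER_HZ  # 10 control ticks per context
    total_ticks = int(seconds * CONTROL_HZ)
    context = slow_planner(0)
    log = []
    for tick in range(total_ticks):
        # Refresh the context only on planner boundaries; reuse it otherwise.
        if tick > 0 and tick % ticks_per_plan == 0:
            context = slow_planner(tick // ticks_per_plan)
        log.append(fast_policy(context, tick))
    return log


log = run(1.0)
print(len(log), log[9], log[10])  # 50 (9, 0) (10, 1)
```

With these rates, each context embedding is reused for ten consecutive control ticks, which is the practical meaning of "the VLM runs at its own cadence".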
N1 to N1.5 — three improvements
GR00T N1.5 (NVIDIA Research, June 11, 2025) made three specific improvements over N1. First, frozen Eagle 2.5 as the System 2 VLM: Eagle 2.5 provides better vision-language grounding than the N1 VLM, improving spatial understanding and instruction following. Freezing the VLM during post-training prevents catastrophic forgetting of language capabilities while allowing the DiT and adapter to specialize. Second, simplified MLP adapter with layer normalization: replaces N1's cross-attention adapter between System 2 and System 1 with a simpler MLP + LayerNorm, reducing the parameter count and stabilizing training. Third, FLARE loss: a trajectory-aware training objective that penalizes temporal inconsistency in the action sequence, encouraging smoother and more executable trajectories.
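The "MLP + LayerNorm" adapter idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the N1.5 implementation: the dimensions (8 → 16 → 4), the GELU activation, the placement of the normalization before the first linear layer, and the omission of learnable LayerNorm gain and bias are all assumptions made for the example.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    # Normalize a feature vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    # weight has shape [out_dim][in_dim].
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    # tanh approximation of GELU.
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def init_linear(in_dim, out_dim, rng):
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def adapter(vlm_feature, params):
    # LayerNorm -> Linear -> GELU -> Linear (assumed ordering).
    (w1, b1), (w2, b2) = params
    h = layer_norm(vlm_feature)
    h = [gelu(v) for v in linear(h, w1, b1)]
    return linear(h, w2, b2)


rng = random.Random(0)
params = (init_linear(8, 16, rng), init_linear(16, 4, rng))
vlm_feature = [0.5, -1.2, 3.0, 0.0, 2.2, -0.7, 1.1, 0.4]
action_context = adapter(vlm_feature, params)
print(len(action_context))  # 4
```

One reason normalization can stabilize training is visible here: the adapter's output is nearly unchanged if the incoming feature is rescaled or shifted, so the downstream policy is less sensitive to drift in the scale of the VLM features.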
The result: RoboCasa 30-demo task success rose from 17.4% (N1) to 47.5% (N1.5), a 2.7x improvement at the same 30-demo budget of real data, following three changes to the model and its training objective. RefCOCOg grounding IoU reached 89.6, where N1 had scored below Qwen2.5-VL-3B (85.2), meaning N1.5 localizes referred objects more accurately than that baseline VLM. For the JHU humanoid capstone, this grounding improvement is directly useful: 'pick up the red mug on the left' requires precise spatial referencing, not just semantic object recognition.
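The headline figures can be checked with a few lines of arithmetic. The inputs are the numbers quoted above; the variable names are only for illustration.

```python
n1_success = 17.4   # RoboCasa 30-demo success, percent (N1)
n15_success = 47.5  # RoboCasa 30-demo success, percent (N1.5)

relative_gain = n15_success / n1_success  # multiplicative improvement
absolute_gain = n15_success - n1_success  # percentage points

qwen_iou = 85.2  # RefCOCOg IoU, Qwen2.5-VL-3B
n15_iou = 89.6   # RefCOCOg IoU, GR00T N1.5
iou_margin = n15_iou - qwen_iou

print(f"{relative_gain:.2f}x, +{absolute_gain:.1f} points, IoU margin {iou_margin:.1f}")
# 2.73x, +30.1 points, IoU margin 4.4
```

The "2.7x" figure is the ratio 47.5 / 17.4 ≈ 2.73; in absolute terms the gain is about 30 percentage points.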
The full GR00T stack
The NVIDIA open stack for GR00T deployment: Isaac Sim (simulation environment, photorealistic) → Isaac Lab (robot learning framework, domain randomization, PPO training) → GR00T-Dreams (synthetic data generation from real seeds, Newton physics engine) → GR00T N1.5 post-training (LoRA fine-tuning of the DiT and adapter on task-specific data) → on-robot deployment. The GitHub repository (NVIDIA/Isaac-GR00T) provides all components. The 1X NEO Gamma deployment by 1X Technologies (NVIDIA-1X, March 18, 2025) is the closest published reference to a production humanoid using this stack.
Post-training GR00T N1.5 follows the same recipe as SmolVLA fine-tuning but at higher capability and cost: 30 demonstrations, LoRA on the DiT and MLP adapter (VLM frozen), 5-10 epochs in Isaac Lab. The DreamGen pipeline can augment 30 real demos into hundreds of hours of synthetic data in the Newton physics engine, which also contributed to the N1.5 benchmark numbers.
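To see why LoRA keeps post-training cheap, it helps to count parameters. A LoRA update replaces a dense weight update with two low-rank factors, so the trainable count for one layer is rank × (d_in + d_out) instead of d_in × d_out. The sketch below uses hypothetical layer dimensions and rank; they are not GR00T's actual sizes.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> dict:
    """Compare trainable parameters for full fine-tuning vs LoRA on one layer."""
    full = d_in * d_out           # dense weight W (d_out x d_in)
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return {"full": full, "lora": lora, "fraction": lora / full}


# Hypothetical 1024 x 1024 projection with rank 16.
counts = lora_param_counts(d_in=1024, d_out=1024, rank=16)
print(counts)  # {'full': 1048576, 'lora': 32768, 'fraction': 0.03125}
```

In this example the low-rank factors train about 3% of the parameters of the layer they adapt, while the frozen base weights stay untouched.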
GR00T vs SmolVLA for the JHU capstone
GR00T N1.5 is the high-capability option for the JHU humanoid: better grounding (RefCOCOg IoU 89.6), higher benchmark performance (47.5% RoboCasa), and full integration with Isaac Lab for sim-to-real. SmolVLA-450M is the consumer-GPU option: smaller, faster at inference, and easier to fine-tune. The decision depends on hardware: GR00T N1.5 requires a 40GB+ GPU for inference (A100, H100), which makes it accessible in the cloud but not on consumer hardware. SmolVLA runs on a 12GB consumer GPU. For a capstone that must ship on a budget, start with SmolVLA and run GR00T N1.5 on a cloud GPU for evaluation.
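The hardware rule of thumb above can be written down as a small helper. This is only a sketch of the decision as stated in this section; `pick_model` is a hypothetical function, and the 40GB and 12GB thresholds are the figures quoted above rather than measured requirements.

```python
def pick_model(gpu_memory_gb: float) -> str:
    """Return the model this section suggests for a given GPU memory size."""
    if gpu_memory_gb >= 40:
        return "GR00T N1.5"
    if gpu_memory_gb >= 12:
        return "SmolVLA-450M"
    return "neither (below the 12GB figure quoted above)"


for memory in (80, 24, 8):
    print(memory, "GB ->", pick_model(memory))
```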
The 2.7x improvement from N1 to N1.5 came from three targeted changes to the VLM, the adapter, and the training objective, not from a larger budget of real demonstrations. This is the kind of result that makes it worth reading architecture ablations carefully: in VLA systems, a good idea attached at the wrong connection point can perform far worse than the same idea attached at the right one.