Chapter 36 · Capstone — a humanoid home assistant, end to end

This is not a demo. The JHU humanoid home-assistant capstone is six months of work and a body of evidence you ship publicly. The perception layer, the planner, the policy, the simulation tier, the safety stack, and the observability infrastructure are not six separate projects that happen to run on the same robot; they are six subsystems with explicit interfaces that must be designed, integrated, tested, and documented as a system. The capstone pulls together every chapter of Part III: DINOv2 ViT-B/14 + SAM 2 + 3DGS for scene understanding (Ch 26, 27), a LangGraph state machine with MCP home-control tools for planning (Ch 22, 21), SmolVLA-450M or GR00T N1.5 for the manipulation policy (Ch 32, 33), Isaac Lab with Newton physics for simulation (Ch 29), a five-layer safety filter stack (Ch 35), and rerun.io for trajectory observability. The closest published reference architecture is 1X Technologies' NEO Gamma deployment of GR00T N1 (NVIDIA-1X partnership, March 18, 2025).

Perception: DINOv2 + SAM 2 + 3DGS

The perception layer has three components that operate in sequence. First: the 3D Gaussian Splatting scene map (Ch 27), built from a wrist camera and updated online as the robot moves. This gives the robot a 3D geometric understanding of the environment — where objects are in space. Second: DINOv2 ViT-B/14 (Ch 26), running on the live camera stream to compute visual embeddings for detected objects. These embeddings provide object identity (is this the mug I saw yesterday?) and attribute grounding (is this the red mug or the blue one?). Third: SAM 2 (Ch 27) for instance segmentation on camera frames — producing per-object masks that are projected back onto the 3DGS map to maintain labeled 3D object representations.

The interface between perception and planning is a JSON object map, updated at 2-5Hz. For each detected object it contains a 3D bounding box (in the robot's world frame), a DINOv2 embedding (for identity matching), a semantic label (mug, plate, fork, etc.), and the current SAM 2 instance mask. The planner queries this map via MCP tool calls (get_object_position, find_object_by_type). Because every coordinate is expressed in the robot's world frame, the perception system must handle the transform from camera frame to world frame given the robot's current pose.
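A minimal sketch of what one object-map entry and the camera-to-world transform might look like. The field names, the `mask_ref` handle, and the pose values are illustrative assumptions, not a schema defined in this chapter; only `find_object_by_type` is a name taken from the text, and here it is a plain function rather than a real MCP tool.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ObjectEntry:
    """One entry of the object map (hypothetical field names)."""
    object_id: str
    label: str               # semantic label, e.g. "mug"
    bbox_min_world: tuple    # (x, y, z) in metres, robot world frame
    bbox_max_world: tuple    # (x, y, z) in metres, robot world frame
    embedding: list          # DINOv2 embedding, shortened to 3 values here
    mask_ref: str            # assumed handle to the current SAM 2 mask


def camera_to_world(p_cam, R_wc, t_wc):
    """Apply p_world = R_wc @ p_cam + t_wc using plain Python lists."""
    return tuple(
        sum(R_wc[i][j] * p_cam[j] for j in range(3)) + t_wc[i]
        for i in range(3)
    )


def find_object_by_type(object_map, label):
    """Return all entries whose semantic label matches."""
    return [entry for entry in object_map if entry.label == label]


# Assumed camera pose in the world frame: rotated 90 degrees about z,
# translated to (1.0, 0.0, 1.2). These numbers are made up for the example.
R_WC = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T_WC = [1.0, 0.0, 1.2]

# A point 0.5 m along the camera x-axis, expressed in the world frame.
centre_world = camera_to_world((0.5, 0.0, 0.0), R_WC, T_WC)

object_map = [
    ObjectEntry("obj-001", "mug", (0.95, 0.45, 1.15), (1.05, 0.55, 1.25),
                [0.12, -0.40, 0.88], "masks/obj-001"),
]

# The JSON payload the planner would receive (tuples serialize as lists).
payload = json.dumps([asdict(entry) for entry in object_map])
```

In a real system the rotation and translation would come from the robot's kinematics and camera calibration at each frame, and the mask would more likely be stored out of band than inlined in JSON.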

Planning: LangGraph + MCP home-control tools

The planner is a LangGraph state machine (Ch 22) with an LLM (GPT-4o or Claude 3.5 Sonnet) at its core. The LLM receives the user's task instruction and the current object map, and generates a sequence of high-level action primitives: pick_up(mug), place_on(counter), open_drawer(). Each primitive is an MCP tool call (Ch 21) that the planner resolves to a specific policy invocation with concrete object coordinates from the perception map.
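The resolution step described above can be sketched in plain Python. This is not LangGraph or MCP code; `PolicyInvocation`, `resolve`, and the coordinates are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyInvocation:
    """A primitive bound to concrete world-frame coordinates (hypothetical type)."""
    primitive: str
    target_xyz: tuple


# Stand-in for the perception object map: label -> (x, y, z) in metres.
OBJECT_MAP = {
    "mug": (0.42, -0.10, 0.85),
    "counter": (0.60, 0.30, 0.90),
}


def resolve(plan, object_map):
    """Turn (primitive, target_label) pairs into policy invocations.

    Raises KeyError if the planner names an object perception has not seen,
    so a hallucinated target fails before any motion is commanded.
    """
    invocations = []
    for primitive, target in plan:
        if target not in object_map:
            raise KeyError(f"object {target!r} not in perception map")
        invocations.append(PolicyInvocation(primitive, object_map[target]))
    return invocations


plan = [("pick_up", "mug"), ("place_on", "counter")]
invocations = resolve(plan, OBJECT_MAP)
```

Failing loudly on an unknown target is a design choice worth keeping: the LLM can name any object it likes, but only objects present in the map can reach the policy.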

The critical design decision: every irreversible physical action requires an explicit human-confirmation interrupt before execution (Ch 35). The LangGraph state machine has a confirmation state that blocks execution until the user confirms (voice, button, or app). The planner also has a 'safe abort' state that parks the robot arm and drops all pending actions — reachable from any state via the physical interrupt button. Langfuse traces (Ch 23) record all LLM calls with inputs, outputs, and latencies for offline analysis. For the capstone, this is the observability layer above the robot controller.
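The confirmation gate and the safe-abort state can be illustrated with a small hand-rolled state machine. This is a sketch of the control flow only, under stated assumptions: it does not use the LangGraph API, and the set of primitives treated as irreversible (`pour_out`, `cut`) is hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    PLAN = auto()
    AWAIT_CONFIRMATION = auto()
    EXECUTE = auto()
    SAFE_ABORT = auto()
    DONE = auto()


# Assumed classification; the real list depends on the task set.
IRREVERSIBLE = {"pour_out", "cut"}


class Planner:
    def __init__(self, actions):
        self.pending = list(actions)
        self.executed = []
        self.state = State.PLAN

    def step(self, confirmed=False, interrupt=False):
        """Advance one transition; the interrupt wins from any state."""
        if interrupt:
            self.pending.clear()          # drop all pending actions
            self.state = State.SAFE_ABORT
            return self.state
        if self.state in (State.SAFE_ABORT, State.DONE):
            return self.state             # terminal states
        if self.state is State.PLAN:
            if not self.pending:
                self.state = State.DONE
            elif self.pending[0] in IRREVERSIBLE:
                self.state = State.AWAIT_CONFIRMATION
            else:
                self.state = State.EXECUTE
        elif self.state is State.AWAIT_CONFIRMATION:
            if confirmed:                 # blocks until the user confirms
                self.state = State.EXECUTE
        elif self.state is State.EXECUTE:
            self.executed.append(self.pending.pop(0))
            self.state = State.PLAN
        return self.state
```

Checking the interrupt before anything else is what makes the abort state reachable from every other state, including the confirmation wait.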

Policy: SmolVLA or GR00T N1.5 → PD controller

The policy receives from the planner: (1) the task primitive (e.g., 'pick_up'), (2) the target object's 3D position and DINOv2 embedding, and (3) the current camera observation. SmolVLA (Ch 32) or GR00T N1.5 (Ch 33) outputs an action chunk (K=8-16 future joint-position targets) at 5-30Hz. The trajectory interpolator converts this chunk to a per-joint target sequence. Where the policy emits end-effector targets instead of joint positions, the IK solver (Ch 25) converts them to joint angles. The PD/impedance controller (Ch 25) closes the torque loop at 1kHz. The five-layer safety filter (Ch 35) intercepts the controller output before it reaches the motors.
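One way to picture the interpolator and the PD loop is the sketch below. It assumes linear interpolation from the policy rate up to the 1kHz control rate and a per-joint PD law with zero target velocity; the chapter does not specify the interpolation scheme or the gains, so the function names and numbers here are illustrative.

```python
def interpolate_chunk(chunk, policy_hz=30, control_hz=1000):
    """Linearly interpolate a chunk of joint targets up to the control rate.

    chunk: list of K joint-position vectors (radians), spaced 1/policy_hz apart.
    Returns one joint vector per control tick, from the first target to the last.
    """
    if len(chunk) < 2:
        return [list(q) for q in chunk]
    n_ticks = (len(chunk) - 1) * control_hz // policy_hz + 1
    out = []
    for k in range(n_ticks):
        s = k * policy_hz / control_hz      # position measured in policy steps
        i = min(int(s), len(chunk) - 2)     # segment index, clamped at the end
        alpha = s - i                       # fraction of the way through segment i
        out.append([a + alpha * (b - a) for a, b in zip(chunk[i], chunk[i + 1])])
    return out


def pd_torque(q_target, q, qd, kp, kd):
    """PD law for one joint: tau = kp * (q_target - q) - kd * qd (torque in Nm)."""
    return kp * (q_target - q) - kd * qd


# Two-joint example: joint 0 moves from 0 to 1 rad, joint 1 holds at 1 rad.
dense = interpolate_chunk([[0.0, 1.0], [1.0, 1.0]], policy_hz=10, control_hz=1000)
tau = pd_torque(q_target=dense[50][0], q=0.4, qd=0.0, kp=20.0, kd=2.0)
```

A production interpolator would typically also blend across successive chunks and bound velocity and acceleration; linear interpolation is only the simplest scheme that satisfies the rate conversion.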

The interface contract at each boundary is as follows. Policy → controller: joint-position targets in radians, timestamped at 30Hz. Controller → motor: torque commands in Nm, timestamped at 1kHz. Safety filter → motor: same format as the controller output, but with commands zeroed if any limit is triggered. Data-format mismatches at these interfaces are the most common integration failure, so document and validate each interface explicitly before integration testing.
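Validating these contracts can start with typed messages and a few cheap checks. The sketch below is an assumption-laden illustration: the message classes, the ±π plausibility bound, the 10% timing tolerance, and the single torque limit are all hypothetical stand-ins for the real five-layer filter and message definitions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class JointTargets:
    """Policy -> controller message (hypothetical type)."""
    stamp_s: float
    positions_rad: tuple


@dataclass(frozen=True)
class TorqueCommand:
    """Controller -> motor message (hypothetical type)."""
    stamp_s: float
    torques_nm: tuple


def validate_targets(msg, n_joints, limit_rad=math.pi):
    """Return a list of contract violations; an empty list means the message passes."""
    errors = []
    if len(msg.positions_rad) != n_joints:
        errors.append(f"expected {n_joints} joints, got {len(msg.positions_rad)}")
    if any(abs(p) > limit_rad for p in msg.positions_rad):
        errors.append("position outside assumed range; were degrees sent?")
    return errors


def check_rate(stamps, expected_hz, tol=0.1):
    """True if every gap between consecutive stamps is within tol of the period."""
    period = 1.0 / expected_hz
    return all(abs((b - a) - period) <= tol * period
               for a, b in zip(stamps, stamps[1:]))


def safety_filter(cmd, torque_limit_nm):
    """Pass the command through unchanged, or zero it if any joint exceeds the limit."""
    if any(abs(t) > torque_limit_nm for t in cmd.torques_nm):
        return TorqueCommand(cmd.stamp_s, tuple(0.0 for _ in cmd.torques_nm))
    return cmd


# A classic unit mismatch: the sender used degrees, the contract says radians.
problems = validate_targets(JointTargets(0.0, (90.0, 0.0)), n_joints=2)
```

Running checks like these in a replay test over logged messages tends to surface unit and rate mismatches before the subsystems are wired together on hardware.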

Simulation and observability

The simulation tier (Ch 29) supports the capstone in two ways. During development: Isaac Lab with Newton physics is the environment for policy fine-tuning and regression testing — 47 of the 50 canonical test tasks can be evaluated in simulation before touching real hardware. During data augmentation: GR00T-Dreams generates synthetic training data from the 200-episode LeRobot dataset, extending the policy's coverage of household task variants.

Observability: rerun.io for real-time trajectory replay and debugging (every joint position, end-effector pose, camera frame, and policy action is logged at the timestep level), Langfuse for planner LLM traces (every LLM call with prompt, response, latency, and token cost), and an offline regression suite of 50 canonical tasks with a defined pass/fail criterion for each. The 50-task regression suite — not a cherry-picked demo — is the honest measure of capstone completion. Tasks should span: tabletop manipulation (20 tasks), whole-body navigation + manipulation (15 tasks), multi-step sequences (10 tasks), and safety-filter validation (5 tasks).
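The suite composition and its bookkeeping can be sketched as below. The category keys and function names are hypothetical, and the results list is a synthetic placeholder that marks every task as passed purely to exercise the code; it is not measured data.

```python
from collections import defaultdict

# Task counts per category, as listed in the text (keys are shortened names).
SUITE = {
    "tabletop": 20,
    "whole_body": 15,
    "multi_step": 10,
    "safety_filter": 5,
}
TOTAL_TASKS = sum(SUITE.values())  # 50


def summarize(results):
    """results: iterable of (category, passed). Returns {category: (passed, total)}."""
    tally = defaultdict(lambda: [0, 0])
    for category, passed in results:
        tally[category][0] += int(passed)
        tally[category][1] += 1
    return {category: tuple(counts) for category, counts in tally.items()}


def suite_complete(results):
    """True only if every category has run its full task count."""
    summary = summarize(results)
    return all(summary.get(c, (0, 0))[1] == n for c, n in SUITE.items())


# Synthetic placeholder: one passing record per task, for illustration only.
placeholder = [(c, True) for c, n in SUITE.items() for _ in range(n)]
report = summarize(placeholder)
```

Reporting per-category counts rather than a single pass rate keeps a weak category, such as multi-step sequences, from hiding behind a strong one.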

Public output artifacts

The capstone is not complete until these five artifacts exist publicly: (1) GitHub mono-repo (tokens-to-embodied-minds) with one runnable notebook per chapter and the full capstone code; (2) Capstone repo (jhu-humanoid-capstone) with architecture diagram, module-level README, and the 50-task regression suite results; (3) Demo video — 3-5 minutes showing the full stack on at least 5 diverse canonical tasks; (4) LeRobot dataset — 200 episodes uploaded to HF Hub under your username with a complete dataset card; (5) Technical write-up — 5,000 words covering architecture decisions, failure modes, and lessons learned, targeted at a GP audience reading about humanoid robotics diligence. The closest published reference for the architecture: 1X NEO Gamma deployment (NVIDIA-1X, March 18, 2025). The closest published reference for the write-up format: the GR00T N1.5 tech report (NVIDIA Research, June 11, 2025).

On timeline honesty

Six months at 12 hours/week, counted as 24 weeks, is 288 hours (a full 26-week half-year would be 312). That is a realistic minimum for integration, iteration, and the 50-task regression suite. If you are starting from zero hardware: add 4-6 weeks for SO-101 build, calibration, and the first 50 recorded episodes. Budget accordingly.
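The arithmetic behind the budget is short enough to write out. The 288-hour figure corresponds to 24 weeks at 12 hours/week; the sketch also assumes the hardware build proceeds at the same 12 hours/week, which the text does not state.

```python
HOURS_PER_WEEK = 12


def budget_hours(weeks, hours_per_week=HOURS_PER_WEEK):
    """Total hours for a given number of weeks at a fixed weekly commitment."""
    return weeks * hours_per_week


core = budget_hours(24)                               # 288 hours
calendar_half_year = budget_hours(26)                 # 312 hours
hardware_extra = (budget_hours(4), budget_hours(6))   # (48, 72) hours, assumed pace
worst_case = core + hardware_extra[1]                 # 360 hours
```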

Figure 36.1. Full JHU humanoid capstone stack. Six subsystems with explicit interfaces: Perception (DINOv2 + SAM 2 + 3DGS) → LangGraph Planner (MCP tools, LLM, confirmation gate) → Policy (SmolVLA or GR00T N1.5, 5-30Hz) → Controller (IK + PD/impedance, 1kHz) → Safety Filter (5 layers) → Motors. Simulation (Isaac Lab) and observability (rerun.io, Langfuse) support all layers. Five public artifacts constitute completion.

Primary source · Build · Capstone ladder

Primary source. All of the above, integrated. Closest published reference: 1X NEO Gamma deployment of GR00T N1, NVIDIA-1X, March 18, 2025.

Build. Ship the capstone. Public GitHub repo with architecture diagram and chapter notebooks. Public capstone repo (jhu-humanoid-capstone) with 50-task regression suite results. Demo video (3-5 minutes, 5+ canonical tasks). 200-episode LeRobot dataset on HF Hub with dataset card. 5,000-word technical write-up. Timeline: 6 months at 12 hours/week.

Capstone ladder. This is the JHU capstone. The demo video is the last artifact you build, not the first. The 50-task regression suite is the completion criterion. The five public artifacts are the body of evidence that makes this a portfolio piece, not just a project.

Retrieve before you continue

Three questions on what you just read

Q1 (Factual). What are the five public output artifacts of the JHU humanoid capstone?

Q2 (Conceptual). What are the interface contracts at the policy → controller and controller → motor boundaries?

Q3 (Synthetic). The capstone robot picks up the correct object 90% of the time but drops it 40% of the time during transport. Which subsystem do you investigate first?