Predict before you read

Before you read — what is the core representational difference between NeRF and 3D Gaussian Splatting?

Think about implicit vs explicit representations and their consequences for rendering speed.

From Tokens to Embodied Minds  ·  Chapter 27 of 36

3D perception — NeRF, Gaussian Splatting, SAM 2

The perception substrate for home-assistant robots

3DGS (Aug 2023): replaced NeRF for most production uses with real-time rendering and an explicit, editable representation
SAM 2 (Aug 2024): video segmentation with a memory module and per-object grounding across frames
100 phone images are enough to train a 3DGS scene of a kitchen, which is the entry cost for this chapter's build
Maturity ladder

NeRF (Mildenhall et al., arXiv:2003.08934, March 19, 2020) made novel-view synthesis from images real. Three years later, 3D Gaussian Splatting (Kerbl et al., arXiv:2308.04079, Aug 8, 2023) replaced it for most production use cases: real-time rendering at 100+ fps, an explicit 3D representation you can inspect and edit, and training times measured in minutes rather than hours on a modern GPU. SAM 2 (Meta, arXiv:2408.00714, Aug 1, 2024) added the missing object layer: video-consistent segmentation with a memory module that tracks objects across frames. Together, 3DGS and SAM 2 are the perception substrate for any home-assistant robot that needs to understand and remember a 3D scene from RGB cameras. The capstone wiring is direct: the JHU humanoid builds a 3DGS map of the home environment from a wrist camera, SAM 2 provides per-object instance masks on novel-view renders, and the resulting object-grounded 3D map feeds the VLA planner. This chapter teaches you to build that map from scratch.

From NeRF to 3D Gaussian Splatting

NeRF represents a scene as an implicit neural function: given a 3D coordinate (x, y, z) and a viewing direction, a small MLP outputs a density sigma and a color c. To render a pixel, you march a ray through the scene, query the MLP at hundreds of points along the ray, and integrate the density-weighted colors via volume rendering. The whole process is differentiable, so you can optimize the MLP weights against multi-view image supervision using only RGB images with known camera poses. The weakness is speed: a single 800x600 frame has 480,000 rays, and at hundreds of samples per ray that comes to on the order of 100 million MLP queries. NeRF in this original form is too slow for real-time robotics.
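The volume rendering step described above can be sketched for a single ray in plain Python. This is a minimal illustration of the standard front-to-back quadrature, not the reference implementation: the function name `render_ray`, the hand-supplied densities and colors (which an MLP would normally produce), and the value of `SAMPLES_PER_RAY` are all assumptions made for the example.

```python
import math

def render_ray(sigmas, colors, deltas):
    """Composite samples along one ray, front to back.

    sigmas: density at each sample point
    colors: (r, g, b) at each sample point
    deltas: distance from each sample to the next
    Returns the pixel color and the transmittance left after the last sample.
    """
    transmittance = 1.0  # fraction of light still unblocked at this depth
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Query budget for one frame; SAMPLES_PER_RAY is an assumed, illustrative value.
WIDTH, HEIGHT, SAMPLES_PER_RAY = 800, 600, 200
queries_per_frame = WIDTH * HEIGHT * SAMPLES_PER_RAY  # 96,000,000
```

Every sample in the loop stands for one MLP evaluation, which is why the per-frame query count, rather than the compositing arithmetic, dominates rendering cost.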

3D Gaussian Splatting (Kerbl et al., Aug 8, 2023) replaces the implicit MLP with up to millions of explicit 3D Gaussians. Each Gaussian is defined by a position, a 3x3 covariance matrix (controlling shape and orientation, and stored as a per-axis scale plus a rotation quaternion), an opacity, and spherical harmonic coefficients for view-dependent color. Counting scale and rotation separately gives five learned parameters per primitive. Rendering projects ("splats") these Gaussians onto the image plane and blends them in depth order, using a tile-based rasterizer that runs entirely on the GPU and achieves real-time frame rates. Critically, the representation is explicit: you can inspect individual Gaussians, remove them, add new ones from new camera frames (online map updates), and directly compute 3D bounding volumes. For the robotics use case, this is the decisive advantage.
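To make the covariance parameter concrete, here is a small sketch of how a 3x3 covariance can be assembled from a per-axis scale and a rotation quaternion, using the factorization Sigma = R S S^T R^T from the 3DGS paper. The helper names `quat_to_rotmat` and `covariance` are hypothetical, and the code uses plain Python lists for readability rather than the batched GPU tensors a real library would use.

```python
def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    norm = (w * w + x * x + y * y + z * z) ** 0.5
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(scale, quat):
    """Build Sigma = R S S^T R^T, where S = diag(scale)."""
    R = quat_to_rotmat(quat)
    # M = R @ S: column j of R multiplied by scale[j]
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M @ M^T, symmetric and positive semi-definite by construction
    return [
        [sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

# An unrotated Gaussian stretched along z: variances are the squared scales.
sigma = covariance(scale=(1.0, 2.0, 3.0), quat=(1.0, 0.0, 0.0, 0.0))
```

Optimizing scale and rotation instead of the nine matrix entries keeps the covariance valid at every gradient step, since any scale and any quaternion yield a symmetric positive semi-definite matrix.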

SAM 2 — video segmentation with memory

SAM 2 (Meta, arXiv:2408.00714, Aug 1, 2024) extends the original Segment Anything Model to video. The core addition is a memory module: a cross-attention mechanism over a bank of past frame features allows the model to track object instances across frames without requiring re-prompting. A user (or automated system) clicks on an object in frame 1; SAM 2 propagates the mask forward through the video, maintaining identity even through occlusion and re-appearance. The model architecture combines a streaming encoder (for per-frame features), a memory encoder (for past frames), and a mask decoder.
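The idea of attending over a bank of past frame features can be illustrated with a toy example. The sketch below is not SAM 2's implementation: the class `MemoryBank`, its FIFO eviction policy, and the use of a single query vector per frame are simplifying assumptions chosen to show scaled dot-product cross-attention over stored features.

```python
import math
from collections import deque

class MemoryBank:
    """FIFO store of (key, value) feature vectors from recent frames."""

    def __init__(self, max_frames):
        # deque with maxlen drops the oldest entry when a new one is added
        self.entries = deque(maxlen=max_frames)

    def add(self, key, value):
        self.entries.append((key, value))

    def attend(self, query):
        """Scaled dot-product cross-attention of one query over the bank."""
        if not self.entries:
            return list(query)  # no memory yet: pass the features through
        scale = 1.0 / math.sqrt(len(query))
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key, _ in self.entries
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(self.entries[0][1])
        return [
            sum(w * value[d] for w, (_, value) in zip(weights, self.entries))
            for d in range(dim)
        ]
```

A query that resembles a stored key pulls the output toward that entry's value, which is the mechanism that lets a current-frame feature recover the identity of an object seen in earlier frames.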

For a home-assistant robot, SAM 2 provides the object-level grounding that 3DGS lacks: 3DGS gives you the 3D scene geometry, and SAM 2 masks let you determine which Gaussians belong to which object. Run SAM 2 on novel-view renders from a 3DGS scene, render a matching per-Gaussian ID image for each view, and use it to project the per-pixel segment IDs back onto the Gaussian primitives. The result is a semantically labeled 3D scene. The VLA can then be prompted with "grasp the mug" and retrieve the 3D bounding volume of the mug from the map.
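One simple way to carry out the back-projection is a majority vote per Gaussian. The sketch below is an assumption about how this could be done, not a documented pipeline: it supposes each view provides a grid holding the ID of the dominant Gaussian at every pixel (with -1 for empty pixels) and a same-shaped grid of segment IDs that stay consistent across views. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def label_gaussians(gaussian_id_maps, segment_id_maps, background=0):
    """Assign each Gaussian the segment ID it is most often rendered under.

    gaussian_id_maps: per view, a 2D grid of dominant Gaussian IDs (-1 = empty)
    segment_id_maps:  per view, a 2D grid of segment IDs with the same shape
    """
    votes = defaultdict(Counter)
    for gid_map, seg_map in zip(gaussian_id_maps, segment_id_maps):
        for gid_row, seg_row in zip(gid_map, seg_map):
            for gid, seg in zip(gid_row, seg_row):
                if gid >= 0 and seg != background:
                    votes[gid][seg] += 1
    return {gid: counts.most_common(1)[0][0] for gid, counts in votes.items()}

def object_bounds(labels, means):
    """Axis-aligned 3D box (min corner, max corner) per segment ID."""
    points = defaultdict(list)
    for gid, seg in labels.items():
        points[seg].append(means[gid])
    return {
        seg: (tuple(map(min, zip(*pts))), tuple(map(max, zip(*pts))))
        for seg, pts in points.items()
    }

# Two tiny 2x2 views over three Gaussians; Gaussian 1 gets conflicting votes.
gid_maps = [[[0, 1], [2, -1]], [[0, 1], [1, 2]]]
seg_maps = [[[1, 1], [2, 0]], [[1, 2], [1, 2]]]
means = {0: (0.0, 0.0, 0.0), 1: (1.0, 2.0, 3.0), 2: (5.0, 5.0, 5.0)}
labels = label_gaussians(gid_maps, seg_maps)
bounds = object_bounds(labels, means)
```

Voting across views makes the labels tolerant of a few bad masks, and the box is computed from Gaussian centers only, so it slightly underestimates the extent of large, flat primitives.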

Building a scene memory for the humanoid

The build for this chapter is the most direct precursor to the capstone scene memory: capture 100 phone images of a kitchen scene from varied viewpoints, train a 3DGS scene with gsplat (nerfstudio-project/gsplat, a fast open-source 3DGS library), render novel views, and run SAM 2 segmentation on the renders. The resulting per-object 3D map is a concrete artifact: a scene representation your VLA can query.

For the JHU humanoid capstone, this map is updated online as the robot moves: new wrist-camera frames are used to add or refine Gaussian primitives (gsplat supports online densification), and SAM 2 re-segments any new objects that enter the field of view. The full perception pipeline is: 3DGS map (spatial structure) + DINOv2 features per object (visual embedding) + SAM 2 masks (instance identity) + SmolVLA (action policy). Each component from this chapter feeds a downstream component in the capstone.
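As a rough picture of what the planner sees, the scene memory can be thought of as a small keyed store of objects. The sketch below is purely illustrative: `SceneObject`, `SceneMemory`, and their fields are hypothetical names, and how an object receives its text name is left outside the example.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    box_min: tuple  # min corner of the 3D bounding box, in map coordinates
    box_max: tuple  # max corner of the 3D bounding box
    embedding: list = field(default_factory=list)  # e.g. a per-object feature vector

    @property
    def center(self):
        return tuple((lo + hi) / 2 for lo, hi in zip(self.box_min, self.box_max))

class SceneMemory:
    """Object-level index over the labeled 3D map."""

    def __init__(self):
        self._objects = {}

    def upsert(self, obj):
        # Online update: a re-observed object replaces its previous entry.
        self._objects[obj.name] = obj

    def query(self, name):
        # Returns None when the object has not been observed yet.
        return self._objects.get(name)

memory = SceneMemory()
memory.upsert(SceneObject("mug", (0.0, 0.0, 0.0), (1.0, 2.0, 3.0)))
grasp_target = memory.query("mug").center
```

Keeping the store keyed by object lets new wrist-camera observations overwrite stale entries without rebuilding the rest of the map.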

NeRF vs 3DGS in 2026

NeRF is not dead — it remains superior for unbounded outdoor scenes and for cases where the implicit representation's compactness matters. But for indoor manipulation scenes with a known workspace boundary, 3DGS is faster to train, faster to render, and easier to edit. Use 3DGS.

[Figure: 3D perception pipeline for home-assistant robots. RGB images (100 views, wrist camera) → 3D Gaussian Splatting (explicit Gaussians, real-time) → novel-view render (100+ fps, editable) → SAM 2 (video segmentation with memory) → labeled, object-grounded 3D scene map. NeRF (implicit MLP, slow, not real-time) is shown for comparison. Capstone scene memory: 3DGS map + DINOv2 embeddings + SAM 2 masks → VLA queries by object name.]
Figure 27.1 3D perception pipeline. RGB images train a 3DGS scene (Kerbl et al., Aug 2023) that renders novel views in real time. SAM 2 (Meta, Aug 2024) segments objects in those renders using a memory module for video consistency. Per-pixel segment IDs project back onto Gaussian primitives to produce a labeled 3D scene map, the scene memory substrate for the JHU humanoid capstone.
Retrieve before you continue

Three questions on what you just read

Q1 Factual Which parameters define a single 3D Gaussian primitive in 3DGS?
Q2 Conceptual How does SAM 2's memory module enable video object tracking?
Q3 Synthetic How would you build a semantically labeled 3D scene map for the JHU humanoid using 3DGS and SAM 2?