NeRF (Mildenhall et al., arXiv:2003.08934, March 19, 2020) made photorealistic novel-view synthesis from a set of posed images practical. Three years later, 3D Gaussian Splatting (Kerbl et al., arXiv:2308.04079, Aug 8, 2023) replaced it for most production use cases: real-time rendering at 100+ fps, an explicit 3D representation you can inspect and edit, and training times measured in minutes rather than hours on a modern GPU. SAM 2 (Meta, arXiv:2408.00714, Aug 1, 2024) added the missing object layer: video-consistent segmentation with a memory module that tracks objects across frames. Together, 3DGS and SAM 2 form the perception substrate for a home-assistant robot that needs to understand and remember a 3D scene from RGB cameras. The capstone wiring is direct: the JHU humanoid builds a 3DGS map of the home environment from a wrist camera, SAM 2 provides per-object instance masks on novel-view renders, and the resulting object-grounded 3D map feeds the VLA planner. This chapter teaches you to build that map from scratch.
From NeRF to 3D Gaussian Splatting
NeRF represents a scene as an implicit neural function: given a 3D coordinate (x, y, z) and a viewing direction, a small MLP outputs a density sigma and a color c. To render a pixel, you march a ray through the scene, query the MLP at hundreds of points along the ray, and integrate the density-weighted colors via volume rendering. The whole pipeline is differentiable, so you can optimize the MLP weights against multi-view image supervision using only RGB images with known camera poses. The weakness is speed: an 800x600 frame has 480,000 pixels, so at hundreds of samples per ray a single frame requires tens of millions of MLP queries. NeRF is too slow for real-time robotics.
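The volume-rendering step and the query count can be made concrete with a short NumPy sketch. It is an illustration under stated assumptions, not the NeRF reference implementation: the function names `composite_ray` and `queries_per_frame` are made up for this example, the densities and colors are hand-written arrays standing in for MLP outputs, and 128 samples per ray is an assumed figure.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Numerical volume rendering along one ray.

    sigmas: (N,) densities at the sample points (stand-in for MLP output)
    colors: (N, 3) RGB values at the sample points
    deltas: (N,) distances between adjacent samples
    Returns the composited RGB and the per-sample weights.
    """
    # Opacity contributed by each ray segment.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas
    return weights @ colors, weights

def queries_per_frame(width, height, samples_per_ray):
    """MLP evaluations for one frame, assuming one ray per pixel."""
    return width * height * samples_per_ray

# Two empty samples followed by two dense ones: the first dense sample
# (blue) should dominate the pixel color.
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
deltas = np.full(4, 0.1)
rgb, weights = composite_ray(sigmas, colors, deltas)
n_queries = queries_per_frame(800, 600, 128)
```

With the assumed 128 samples per ray, `n_queries` is 61,440,000, which is where the "tens of millions" figure comes from.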
3D Gaussian Splatting (Kerbl et al., Aug 8, 2023) replaces the implicit MLP with millions of explicit 3D Gaussians, each defined by a position, a 3x3 covariance matrix (controlling shape and orientation, stored in practice as a per-axis scale and a rotation quaternion), an opacity, and spherical harmonic coefficients for view-dependent color. Rendering projects ("splats") these Gaussians onto the image plane and alpha-blends them in depth order, using a tile-based rasterizer that runs entirely on the GPU and achieves real-time frame rates. Critically, the representation is explicit: you can inspect individual Gaussians, remove them, add new ones from new camera frames (online map updates), and directly compute 3D bounding volumes. For the robotics use case, this is the decisive advantage.
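The covariance of each Gaussian is easiest to understand through its factorization. The 3DGS paper builds it from a scale vector and a rotation as Sigma = R S S^T R^T, which keeps the matrix symmetric and positive semi-definite during optimization. The NumPy sketch below shows that construction; the helper names `quat_to_rotmat` and `covariance` and the (w, x, y, z) quaternion ordering are choices made for this example.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so R is a true rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(scale, quat):
    """Build Sigma = R S S^T R^T from per-axis scales and a rotation."""
    R = quat_to_rotmat(np.asarray(quat, dtype=float))
    S = np.diag(np.asarray(scale, dtype=float))
    M = R @ S
    return M @ M.T

# Axis-aligned Gaussian, then the same Gaussian rotated 90 degrees about z.
sigma_axis = covariance([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0])
half = np.sqrt(0.5)
sigma_rot = covariance([1.0, 2.0, 3.0], [half, 0.0, 0.0, half])
```

With the identity quaternion and scale (1, 2, 3), `sigma_axis` is diag(1, 4, 9); the 90-degree rotation about z swaps the x and y variances, giving diag(4, 1, 9).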
SAM 2 — video segmentation with memory
SAM 2 (Meta, arXiv:2408.00714, Aug 1, 2024) extends the original Segment Anything Model to video. The core addition is a memory module: a cross-attention mechanism over a bank of past-frame features lets the model track object instances across frames without re-prompting. A user (or automated system) prompts an object with a click, box, or mask on one frame; SAM 2 propagates the mask through the rest of the video, maintaining identity even through occlusion and re-appearance. The architecture combines an image encoder applied frame by frame in a streaming fashion (for per-frame features), a memory encoder and memory bank (for past frames and their predictions), a memory attention module that conditions the current frame on that bank, and a mask decoder.
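The idea of attending over a bank of past-frame features can be illustrated with a toy NumPy sketch. This is not SAM 2's code or API: the class `MemoryBank`, the first-in-first-out eviction policy, the single attention head with no learned projections, and the residual connection are all simplifying assumptions made for this example.

```python
from collections import deque
import numpy as np

class MemoryBank:
    """Toy FIFO bank of past-frame feature tokens (illustrative only)."""

    def __init__(self, max_frames):
        # Oldest frames are dropped automatically once the bank is full.
        self.frames = deque(maxlen=max_frames)

    def add(self, feats):
        """Store one frame's tokens, shape (tokens, dim)."""
        self.frames.append(np.asarray(feats, dtype=float))

    def attend(self, query):
        """Cross-attend current-frame tokens (Q, D) to all stored tokens."""
        query = np.asarray(query, dtype=float)
        if not self.frames:
            return query  # nothing to condition on yet
        memory = np.concatenate(list(self.frames), axis=0)       # (M, D)
        scores = query @ memory.T / np.sqrt(query.shape[-1])     # (Q, M)
        scores = scores - scores.max(axis=-1, keepdims=True)     # stable softmax
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        # Residual update: current features plus a memory-weighted summary.
        return query + weights @ memory

rng = np.random.default_rng(0)
bank = MemoryBank(max_frames=2)
for _ in range(3):                       # third frame evicts the first
    bank.add(rng.normal(size=(4, 8)))
conditioned = bank.attend(rng.normal(size=(5, 8)))
```

The point of the sketch is the data flow: each new frame is interpreted in light of a bounded set of earlier frames, which is what lets a tracker keep an object's identity without a fresh prompt.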
For a home-assistant robot, SAM 2 provides the object-level grounding that 3DGS lacks: 3DGS gives you the 3D scene geometry; SAM 2 tells you which pixels, and after back-projection which Gaussians, belong to which object. Run SAM 2 on novel-view renders from a 3DGS scene, then project the per-pixel segment IDs back onto the Gaussian primitives (by rendering per-Gaussian IDs from the same viewpoints) and you have a semantically labeled 3D scene. The VLA can then be prompted with "grasp the mug" and retrieve the 3D bounding volume of the mug from the map.
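The back-projection step can be sketched as a majority vote. The sketch below assumes two inputs that you would have to produce yourself: for each rendered view, an image of the ID of the Gaussian that dominates each pixel, and an image of object IDs from the segmenter that are consistent across views (for example, because the renders were processed as one video). The function names `label_gaussians` and `object_aabb` are hypothetical, and using Gaussian centers for the bounding box ignores each Gaussian's extent.

```python
import numpy as np

def label_gaussians(gaussian_id_maps, mask_maps, num_gaussians, num_objects):
    """Assign each Gaussian the object ID it is most often rendered under.

    gaussian_id_maps: list of (H, W) int arrays, dominant Gaussian ID per
        pixel, or -1 for background.
    mask_maps: list of (H, W) int arrays, object ID per pixel, or -1 where
        nothing was segmented.
    Returns (num_gaussians,) int labels, -1 for Gaussians with no votes.
    """
    votes = np.zeros((num_gaussians, num_objects), dtype=np.int64)
    for gid, obj in zip(gaussian_id_maps, mask_maps):
        valid = (gid >= 0) & (obj >= 0)
        # np.add.at accumulates correctly when index pairs repeat.
        np.add.at(votes, (gid[valid], obj[valid]), 1)
    labels = votes.argmax(axis=1)
    labels[votes.sum(axis=1) == 0] = -1
    return labels

def object_aabb(means, labels, object_id):
    """Axis-aligned bounding box of the centers labeled object_id."""
    pts = means[labels == object_id]
    if len(pts) == 0:
        return None
    return pts.min(axis=0), pts.max(axis=0)

# Two tiny 2x2 "views" over three Gaussians and two objects.
gid_views = [np.array([[0, 0], [1, -1]]), np.array([[0, 1], [1, 2]])]
obj_views = [np.array([[0, 0], [1, 1]]), np.array([[1, 1], [1, -1]])]
labels = label_gaussians(gid_views, obj_views, num_gaussians=3, num_objects=2)
means = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [5.0, 5.0, 5.0]])
mug_box = object_aabb(means, labels, object_id=1)
```

Voting across several views makes the labels robust to a single bad mask; a Gaussian that is never visible under any mask stays unlabeled rather than being forced into an object.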
Building a scene memory for the humanoid
The build for this chapter is the most direct precursor to the capstone scene memory: capture 100 phone images of a kitchen scene from varied viewpoints, estimate camera poses with structure-from-motion (for example, COLMAP), train a 3DGS scene with gsplat (nerfstudio-project/gsplat, a CUDA-accelerated open-source 3DGS library), render novel views, and run SAM 2 segmentation on the renders. The resulting per-object 3D map is a concrete artifact: a scene representation your VLA can query.
For the JHU humanoid capstone, this map is updated online as the robot moves: new wrist-camera frames are used to add or refine Gaussian primitives (gsplat supports online densification), and SAM 2 re-segments any new objects that enter the field of view. The full perception pipeline is: 3DGS map (spatial structure) + DINOv2 features per object (visual embedding) + SAM 2 masks (instance identity) + SmolVLA (action policy). Each component from this chapter feeds a downstream component in the capstone.
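One way to picture the resulting scene memory is as a small store of per-object records that a planner can query. The sketch below is a hypothetical data structure, not part of gsplat, SAM 2, DINOv2, or SmolVLA: the names `SceneObject` and `SceneMemory` are invented here, and the embeddings are assumed to be precomputed feature vectors (two-dimensional toy values in the example).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneObject:
    name: str
    bbox_min: np.ndarray   # (3,) lower corner of the 3D bounding box
    bbox_max: np.ndarray   # (3,) upper corner of the 3D bounding box
    embedding: np.ndarray  # visual feature vector, assumed precomputed

class SceneMemory:
    """Maps instance IDs to object records; updated as new frames arrive."""

    def __init__(self):
        self.objects = {}

    def upsert(self, object_id, obj):
        """Insert a new object or overwrite an existing one."""
        self.objects[object_id] = obj

    def nearest(self, query_embedding):
        """Return (object_id, SceneObject) with highest cosine similarity."""
        if not self.objects:
            return None
        q = query_embedding / np.linalg.norm(query_embedding)

        def similarity(object_id):
            e = self.objects[object_id].embedding
            return float(e @ q / np.linalg.norm(e))

        best_id = max(self.objects, key=similarity)
        return best_id, self.objects[best_id]

memory = SceneMemory()
memory.upsert(0, SceneObject("mug", np.zeros(3), np.full(3, 0.1),
                             np.array([1.0, 0.0])))
memory.upsert(1, SceneObject("kettle", np.ones(3), np.full(3, 1.3),
                             np.array([0.0, 1.0])))
hit_id, hit = memory.nearest(np.array([0.9, 0.1]))
```

Keying records by instance ID means a re-segmented object overwrites its old entry instead of creating a duplicate, which matters once the map is updated online.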
NeRF is not dead — it remains superior for unbounded outdoor scenes and for cases where the implicit representation's compactness matters. But for indoor manipulation scenes with a known workspace boundary, 3DGS is faster to train, faster to render, and easier to edit. Use 3DGS.