Chapter 14 · The NVL72 Rack — From Sand to Superintelligence

The unit of frontier AI is not a chip. It is not a server. It is a rack.

This is a strange thing to say in 2026. For thirty years, the unit of computing was the server: a single machine, perhaps with several CPUs and a handful of GPUs, with its own power supply, fans, network ports, and operating system. Racks were just convenient ways to stack many servers. The compute happened inside the box, and the network glued boxes together.

That model has broken. In the NVL72 design, the rack itself is the machine. There is no individual server inside it that is meaningful on its own. Every component is part of one cabinet-scale computer.

The rack as a unit

An NVL72 rack contains 18 compute trays, each holding 4 Rubin GPUs and 2 Vera CPUs (typically as two Superchips). It also contains 9 NVLink switch trays — separate units whose only job is to switch GPU-to-GPU traffic — plus power shelves, networking SuperNICs, and management hardware. Add it up and you get 72 Rubin GPUs and 36 Vera CPUs in a single 19-inch rack.

That rack delivers approximately 3.6 exaflops of FP4 inference compute, with about 20.7 TB of HBM4 memory in aggregate (72 GPUs × 288 GB). It draws roughly 190–230 kilowatts depending on whether the rack is tuned for energy efficiency (Max Q) or peak performance (Max P), comparable to the average draw of roughly 150 to 190 US homes at about 1.2 kW each, and dissipates that power as heat that has to leave the cabinet without melting it.
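The rack-level totals follow directly from the per-tray and per-GPU figures quoted above. A minimal sketch of that arithmetic is below; the constant and function names are illustrative, and the per-GPU FP4 figure is only a derived average (rack total divided by 72), not a quoted specification.

```python
# Rack composition and per-GPU memory, as stated in the text.
COMPUTE_TRAYS = 18
GPUS_PER_TRAY = 4
CPUS_PER_TRAY = 2
HBM_PER_GPU_GB = 288          # HBM4 capacity per Rubin GPU
RACK_FP4_EXAFLOPS = 3.6       # approximate rack-level FP4 inference compute


def rack_totals():
    """Return (gpus, cpus, hbm_tb, fp4_pflops_per_gpu) for one rack."""
    gpus = COMPUTE_TRAYS * GPUS_PER_TRAY
    cpus = COMPUTE_TRAYS * CPUS_PER_TRAY
    hbm_tb = gpus * HBM_PER_GPU_GB / 1000             # decimal terabytes
    fp4_pflops_per_gpu = RACK_FP4_EXAFLOPS * 1000 / gpus  # derived average
    return gpus, cpus, hbm_tb, fp4_pflops_per_gpu


gpus, cpus, hbm_tb, fp4_per_gpu = rack_totals()
print(f"{gpus} GPUs, {cpus} CPUs, {hbm_tb:.1f} TB HBM4, "
      f"~{fp4_per_gpu:.0f} PFLOPS FP4 per GPU")
```

Dividing the rack figure evenly gives about 50 petaflops of FP4 per GPU, which is a useful sanity check on the 3.6 exaflop total.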

NVLink 6 — every GPU to every other

The communication fabric inside the rack is NVLink 6, the latest generation of NVIDIA's GPU-to-GPU interconnect. Every Rubin GPU has many NVLink lanes; the lanes from all 72 GPUs are routed through the 9 NVLink switch trays, configured so that any GPU can talk to any other GPU at full bandwidth simultaneously.

The aggregate is staggering: 260 terabytes per second of all-to-all bandwidth across the 72 GPUs. To put that in context, the global internet's aggregate cross-sectional bandwidth is, by some estimates, in the same neighborhood. If those estimates hold, the traffic capacity inside one rack is roughly that of the entire commodity internet.
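To make the aggregate figure more tangible, it can be split evenly across the 72 GPUs. The sketch below does that and estimates an idealized transfer time for a payload; the function names are hypothetical, the 1 GB payload is an arbitrary example, and the timing ignores latency, protocol overhead, and contention.

```python
# Figures from the text.
GPUS_PER_RACK = 72
AGGREGATE_NVLINK_TBPS = 260.0   # TB/s across the whole rack


def per_gpu_bandwidth_tbps(aggregate_tbps=AGGREGATE_NVLINK_TBPS,
                           gpus=GPUS_PER_RACK):
    """Even split of the aggregate bandwidth across all GPUs, in TB/s."""
    return aggregate_tbps / gpus


def transfer_time_ms(payload_gb, bandwidth_tbps):
    """Idealized time in milliseconds to move payload_gb at bandwidth_tbps."""
    seconds = (payload_gb * 1e9) / (bandwidth_tbps * 1e12)
    return seconds * 1e3


bw = per_gpu_bandwidth_tbps()
print(f"per-GPU share: {bw:.2f} TB/s")
print(f"1 GB at that rate: {transfer_time_ms(1.0, bw):.3f} ms")
```

The even split works out to about 3.6 TB/s per GPU, so a gigabyte of activations leaves a GPU in well under a millisecond in the ideal case.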

This bandwidth is what allows the rack to behave as a single GPU. A model whose parameters do not fit in any single GPU can be sharded across all 72; the latency of moving activations between shards is small enough that it does not dominate the training step. This is what permits trillion-parameter mixture-of-experts models to be trained at all, and to be served at usable throughput.
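A rough way to see why sharding is required is to compare a model's weight footprint with the 288 GB of HBM4 on one GPU. The sketch below assumes a one-trillion-parameter model, counts weights only (no optimizer state, activations, or KV cache), and uses hypothetical helper names; real memory budgets are considerably tighter.

```python
HBM_PER_GPU_GB = 288   # from the text
GPUS_PER_RACK = 72     # from the text

# Bytes per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "bf16": 2.0}


def weight_footprint_gb(n_params, fmt):
    """Total weight storage in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9


def fits(n_params, fmt, n_gpus, hbm_per_gpu_gb=HBM_PER_GPU_GB):
    """True if evenly sharded weights fit within each GPU's HBM."""
    return weight_footprint_gb(n_params, fmt) / n_gpus <= hbm_per_gpu_gb


N_PARAMS = 1e12  # assumption: a one-trillion-parameter model
for fmt in BYTES_PER_PARAM:
    total = weight_footprint_gb(N_PARAMS, fmt)
    print(f"{fmt}: {total:.0f} GB total, "
          f"{total / GPUS_PER_RACK:.1f} GB per GPU across the rack, "
          f"fits on one GPU: {fits(N_PARAMS, fmt, 1)}")
```

Even at FP4 the weights alone exceed a single GPU's memory, while an even split across 72 GPUs leaves most of each GPU's HBM free for activations and caches.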

The midplane spine

To carry 260 TB/s of bandwidth between trays without melting, NVIDIA replaced the conventional cable harness with a copper midplane: a large printed circuit board running vertically through the center of the rack, into which compute trays plug from one side and switch trays plug from the other. Signals flow through the midplane's copper traces, not through cables.

The benefits are concrete. Assembly time per tray drops from about 2 hours to about 5 minutes. Reliability rises because there are no cables to be miswired or to come loose. Cost per tray-link drops dramatically. The thermal envelope shrinks because copper traces dissipate less than active cabling. And the rack becomes a serviceable unit: a tray that fails can be slid out and replaced without disturbing anything else.

Liquid cooling, by necessity

Two hundred kilowatts is too much for air. There is no fan large enough, no airflow fast enough, no heat sink dense enough to evacuate that much power from a single rack with air alone. NVL72 is liquid-cooled from the start, and the next-generation Kyber NVL576 racks NVIDIA has previewed push past 600 kW, where liquid is no longer optional but existential.

Coolant — typically a water-glycol mix — enters the rack through manifolds at the top, flows down to cold plates that are bolted directly to each Rubin GPU, picks up heat, and returns to a coolant distribution unit (CDU) at the rack base. From the CDU, hot coolant flows to a building-scale facility cooling loop where the heat is finally rejected — to chillers, to cooling towers, to outside air or, in some new builds, to a dedicated heat-recovery loop that warms nearby buildings.

The cooling loop is an engineering object in its own right. Flow rates, temperatures, pressure drops, leak detection, redundancy: every aspect must be managed to ensure that no Rubin ever exceeds its junction temperature and no failure of cooling can take more than a fraction of a rack offline.
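The flow rate such a loop needs can be estimated from the steady-state heat balance, heat = mass flow × specific heat × temperature rise. The sketch below applies it to a 200 kW rack. The coolant properties (specific heat of about 3800 J/(kg·K), density of about 1.03 kg/L) and the 10 K temperature rise are illustrative assumptions for a water-glycol mix, not figures from NVIDIA, and the function names are hypothetical.

```python
def coolant_mass_flow_kg_s(heat_w, cp_j_per_kg_k, delta_t_k):
    """Mass flow needed to absorb heat_w with a coolant temperature rise of delta_t_k."""
    return heat_w / (cp_j_per_kg_k * delta_t_k)


def coolant_volume_flow_l_min(heat_w, cp_j_per_kg_k, delta_t_k, density_kg_per_l):
    """Same requirement expressed as volumetric flow in litres per minute."""
    mass_flow = coolant_mass_flow_kg_s(heat_w, cp_j_per_kg_k, delta_t_k)
    return mass_flow / density_kg_per_l * 60


RACK_HEAT_W = 200_000     # roughly two hundred kilowatts, from the text
CP_J_PER_KG_K = 3800.0    # assumption: water-glycol mix
DENSITY_KG_PER_L = 1.03   # assumption: water-glycol mix
DELTA_T_K = 10.0          # assumption: inlet-to-outlet temperature rise

flow = coolant_volume_flow_l_min(RACK_HEAT_W, CP_J_PER_KG_K,
                                 DELTA_T_K, DENSITY_KG_PER_L)
print(f"required coolant flow: {flow:.0f} L/min")
```

Under these assumptions the rack needs on the order of 300 litres of coolant per minute; halving the allowed temperature rise doubles the required flow, which is why flow rate and temperature are managed together.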

Eighteen trays. Seventy-two GPUs. Roughly two hundred kilowatts. One copper spine. One liquid loop. The NVL72 rack is, in this sense, the first cabinet-scale supercomputer designed from a clean sheet of paper for AI workloads — and it is the smallest practical unit of the next decade's frontier compute.

Figure 14.1 The NVL72 rack: 18 compute trays connected to a copper midplane spine that carries 260 TB/s of all-to-all NVLink 6 bandwidth, drawing roughly 190–230 kW depending on configuration.

Retrieve before you continue

Three questions on what you just read

Q1 (Factual) What is the NVL72 rack’s total HBM4 memory and its power draw range?

Q2 (Conceptual) Why did NVIDIA replace conventional cable harnesses with a copper midplane, and what are the concrete benefits?

Q3 (Synthetic) What goes wrong if the NVL72 all-to-all bandwidth is optimized in isolation, that is, maximized without considering the training workloads that must run across it?