Predict before you read

Before you read — how much all-to-all NVLink 6 bandwidth does a single NVL72 rack carry between its 72 GPUs?

Pick the order of magnitude. The chapter will tell you whether you were close.

From Sand to Superintelligence  ·  Chapter 14 of 42

The NVL72 Rack

72 GPUs that move data faster than the internet

72 GPUs per NVL72 rack
3.6 EFLOPS FP4 per rack
260 TB/s all-to-all NVLink bandwidth
~190–230 kW rack power (Max Q / Max P)

The unit of frontier AI is not a chip. It is not a server. It is a rack.

This is a strange thing to say in 2026. For thirty years, the unit of computing was the server: a single machine, perhaps with several CPUs and a handful of GPUs, with its own power supply, fans, network ports, and operating system. Racks were just convenient ways to stack many servers. The compute happened inside the box, and the network glued boxes together.

That model has broken. In the NVL72 design, the rack itself is the machine. There is no individual server inside it that is meaningful on its own. Every component is part of one cabinet-scale computer.

The rack as a unit

An NVL72 rack contains 18 compute trays, each holding 4 Rubin GPUs and 2 Vera CPUs (typically as two Superchips). It also contains 9 NVLink switch trays — separate units whose only job is to switch GPU-to-GPU traffic — plus power shelves, networking SuperNICs, and management hardware. Add it up and you get 72 Rubin GPUs and 36 Vera CPUs in a single 19-inch rack.

That rack delivers approximately 3.6 exaflops of FP4 inference compute, with about 20.7 TB of HBM4 memory in aggregate (72 GPUs × 288 GB). It draws roughly 190–230 kilowatts depending on whether the rack is tuned for energy efficiency (Max Q) or peak performance (Max P). At an average household draw of a little over 1 kW, that is comparable to roughly 150 to 200 US homes. All of that power becomes heat that has to leave the cabinet without melting it.
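The rack totals above are simple products of the per-tray and per-GPU figures, and a short Python sketch makes the arithmetic easy to check. The inputs are the numbers quoted in this chapter; the per-GPU compute and power values are derived by even division and are not quoted specifications. The power figure in particular is rack power divided by GPU count, so it includes CPUs, switches, and everything else in the cabinet.

```python
# Figures quoted in the chapter.
COMPUTE_TRAYS = 18
GPUS_PER_TRAY = 4
CPUS_PER_TRAY = 2
HBM_PER_GPU_GB = 288
RACK_FP4_EFLOPS = 3.6
RACK_POWER_KW = (190, 230)  # (Max Q, Max P)

# Derived rack totals.
gpus = COMPUTE_TRAYS * GPUS_PER_TRAY
cpus = COMPUTE_TRAYS * CPUS_PER_TRAY
hbm_total_tb = gpus * HBM_PER_GPU_GB / 1000  # decimal terabytes

# Derived per-GPU averages (even split; not quoted specifications).
fp4_per_gpu_pflops = RACK_FP4_EFLOPS * 1000 / gpus
# Rack power per GPU, including CPUs, switch trays, and other overhead.
power_per_gpu_kw = tuple(round(p / gpus, 2) for p in RACK_POWER_KW)

print(f"{gpus} GPUs, {cpus} CPUs")
print(f"{hbm_total_tb:.1f} TB of HBM4 in aggregate")
print(f"{fp4_per_gpu_pflops:.0f} PFLOPS FP4 per GPU on average")
print(f"{power_per_gpu_kw[0]} to {power_per_gpu_kw[1]} kW of rack power per GPU")
```

The per-GPU power share of roughly 2.6 to 3.2 kW is the number that motivates the cooling design later in the chapter.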

The communication fabric inside the rack is NVLink 6, the latest generation of NVIDIA's GPU-to-GPU interconnect. Every Rubin GPU has many NVLink lanes; the lanes from all 72 GPUs are routed through the 9 NVLink switch trays, configured so that any GPU can talk to any other GPU at full bandwidth simultaneously.

The aggregate is staggering: 260 terabytes per second (about 2 petabits per second) of all-to-all bandwidth across the 72 GPUs. By some estimates, the aggregate cross-sectional bandwidth of the global internet is in the same neighborhood, so a single rack moves data internally at roughly the scale of the entire commodity internet.
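To get a feel for the 260 TB/s figure, it helps to divide it back down to a single GPU and to convert it into the bits-per-second units that network capacity is usually quoted in. The sketch below assumes the aggregate splits evenly across the 72 GPUs; the helper `seconds_to_move` is a hypothetical name introduced here, and it ignores latency, protocol overhead, and contention.

```python
AGGREGATE_TB_PER_S = 260  # all-to-all figure quoted in the chapter
GPUS = 72

# Even share of the fabric per GPU (an assumption, not a quoted spec).
per_gpu_tb_per_s = AGGREGATE_TB_PER_S / GPUS

# Same aggregate in network-style units: 8 bits per byte.
aggregate_tbit_per_s = AGGREGATE_TB_PER_S * 8
aggregate_pbit_per_s = aggregate_tbit_per_s / 1000


def seconds_to_move(gigabytes, tb_per_s=per_gpu_tb_per_s):
    """Idealized transfer time for one GPU sending `gigabytes` at `tb_per_s`."""
    return (gigabytes / 1000) / tb_per_s


print(f"{per_gpu_tb_per_s:.2f} TB/s per GPU")
print(f"{aggregate_pbit_per_s:.2f} Pb/s aggregate")
print(f"{seconds_to_move(288) * 1000:.0f} ms to stream one GPU's full 288 GB of HBM4")
```

Under these idealized assumptions, a GPU could stream its entire memory contents to its peers in well under a tenth of a second.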

This bandwidth is what allows the rack to behave as a single GPU. A model whose parameters do not fit in any single GPU can be sharded across all 72; the latency of moving activations between shards is small enough that it does not dominate the training step. This is what permits trillion-parameter mixture-of-experts models to be trained at all, and to be served at usable throughput.
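A rough sizing exercise shows why sharding is unavoidable at this scale. The sketch below takes a hypothetical model with one trillion parameters and counts the weights alone at three common precisions, then splits them evenly across the rack. The function names and the even split are illustrative assumptions; real training also needs room for gradients, optimizer state, and activations, and mixture-of-experts models are rarely sharded uniformly.

```python
HBM_PER_GPU_GB = 288  # per-GPU HBM4 capacity quoted in the chapter
GPUS_PER_RACK = 72


def weight_footprint_gb(n_params, bytes_per_param):
    """Total size of the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def per_gpu_share_gb(n_params, bytes_per_param, n_gpus=GPUS_PER_RACK):
    """Weights held by each GPU under an idealized even shard."""
    return weight_footprint_gb(n_params, bytes_per_param) / n_gpus


N_PARAMS = 1e12  # hypothetical trillion-parameter model
for name, bytes_per_param in [("FP4", 0.5), ("FP8", 1.0), ("BF16", 2.0)]:
    total = weight_footprint_gb(N_PARAMS, bytes_per_param)
    share = per_gpu_share_gb(N_PARAMS, bytes_per_param)
    fits_one = total <= HBM_PER_GPU_GB
    print(f"{name}: {total:.0f} GB total, {share:.1f} GB per GPU, "
          f"fits on one GPU: {fits_one}")
```

Even at FP4 the weights alone come to 500 GB, more than one GPU's 288 GB, while an even shard across 72 GPUs leaves each holding only a few gigabytes of weights.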

The midplane spine

To carry 260 TB/s of bandwidth between trays without melting, NVIDIA replaced the conventional cable harness with a copper midplane: a large printed circuit board running vertically through the center of the rack, into which compute trays plug from one side and switch trays plug from the other. Signals flow through the midplane's copper traces, not through cables.

The benefits are concrete. Assembly time per tray drops from about 2 hours to about 5 minutes. Reliability rises because there are no cables to be miswired or to come loose. Cost per tray-link drops dramatically. The thermal envelope shrinks because copper traces dissipate less power than active cabling. And the rack becomes a serviceable unit: a failed tray can be slid out and replaced without disturbing anything else.

Liquid cooling, by necessity

Two hundred kilowatts is too much for air. There is no fan large enough, no airflow loud enough, no heat sink dense enough to evacuate that much power from a single rack with air alone. NVL72 is liquid-cooled from the start — and the next-generation Kyber NVL576 racks NVIDIA has previewed push past 600 kW, where liquid is no longer optional but existential.

Coolant — typically a water-glycol mix — enters the rack through manifolds at the top, flows down to cold plates that are bolted directly to each Rubin GPU, picks up heat, and returns to a coolant distribution unit (CDU) at the rack base. From the CDU, hot coolant flows to a building-scale facility cooling loop where the heat is finally rejected — to chillers, to cooling towers, to outside air or, in some new builds, to a dedicated heat-recovery loop that warms nearby buildings.
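The size of the coolant flow follows from the basic heat balance: heat removed equals mass flow times specific heat times the coolant's temperature rise. The sketch below solves that relation for flow. The default fluid properties (specific heat of 3800 J/(kg·K) and density of 1030 kg/m³ for a water-glycol mix) and the 10 K temperature rise are illustrative assumptions, not figures from the chapter; real values depend on the glycol fraction and the facility design.

```python
def coolant_flow_l_per_min(heat_kw, delta_t_k=10.0,
                           cp_j_per_kg_k=3800.0, density_kg_per_m3=1030.0):
    """Volumetric flow needed to carry `heat_kw` at a coolant rise of `delta_t_k`.

    Uses Q = m_dot * cp * delta_T. The default cp and density are assumed
    values for a water-glycol mix, not measured properties.
    """
    mass_flow_kg_s = heat_kw * 1000.0 / (cp_j_per_kg_k * delta_t_k)
    volume_flow_m3_s = mass_flow_kg_s / density_kg_per_m3
    return volume_flow_m3_s * 1000.0 * 60.0  # m^3/s -> L/min


for rack_kw in (190, 230):  # Max Q and Max P figures quoted in the chapter
    flow = coolant_flow_l_per_min(rack_kw)
    print(f"{rack_kw} kW at a 10 K rise: about {flow:.0f} L/min")
```

Under these assumptions the rack needs on the order of 300 liters of coolant per minute, and halving the allowed temperature rise doubles the required flow.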

The cooling loop is an engineering object in its own right. Flow rates, temperatures, pressure drops, leak detection, redundancy: every aspect must be managed to ensure that no Rubin ever exceeds its junction temperature and that no cooling failure can take more than a fraction of a rack offline.

Eighteen trays. Seventy-two GPUs. Roughly two hundred kilowatts. One copper spine. One liquid loop. The NVL72 rack is, in this sense, the first cabinet-scale supercomputer designed from a clean sheet of paper for AI workloads — and it is the smallest practical unit of the next decade's frontier compute.

[Figure: the NVL72 rack, 72 GPUs in one cabinet, every one talking to every other across a copper midplane spine. Labels: NVLink 6 spine, 260 TB/s all-to-all; 18 compute trays, each with 4 Rubin GPUs, 2 Vera CPUs, and 1 BlueField-4 DPU; rack totals of 72 Rubin GPUs, 36 Vera CPUs, 3.6 EFLOPS FP4, ~190–230 kW power, liquid-cooled; footprint of a single 19" rack with a coolant loop.]
Figure 14.1 The NVL72 rack: 18 compute trays connected to a copper midplane spine that carries 260 TB/s of all-to-all NVLink 6 bandwidth, drawing roughly 190–230 kW depending on configuration.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What is the NVL72 rack’s total HBM4 memory and its power draw range?
Q2 (Conceptual): Why did NVIDIA replace conventional cable harnesses with a copper midplane, and what are the concrete benefits?
Q3 (Synthetic): What goes wrong if the NVL72 all-to-all bandwidth is optimized in isolation, that is, maximized without considering the training workloads that must run across it?