Predict before you read

Before you read — how many Rubin GPUs does a single DGX SuperPOD contain?

Pick the order of magnitude. The chapter will tell you whether you were close.

From Sand to Superintelligence  ·  Chapter 16 of 42

The AI Factory

SuperPODs, exaflops, and the new economy of intelligence

1,008 Rubin GPUs per SuperPOD
50.4 EFLOPS FP4 compute
1,046 TB fast HBM4 memory
10× lower inference cost vs Blackwell

The endpoint of the silicon supply chain is not a chip. It is not even a rack. It is something NVIDIA calls an AI factory: a building, or part of a building, dedicated entirely to running AI workloads at industrial scale. Inside, racks are arranged in rows, networking spines connect them, cooling pipes feed them, transformers feed the cooling pipes. The building exists to convert electricity into computed thought.

This is where the journey ends, and where the work begins.

The DGX SuperPOD

The smallest practical "AI factory" unit is the DGX SuperPOD with DGX Vera Rubin NVL72: a configuration of 14 NVL72 racks, totaling 1,008 Rubin GPUs and 504 Vera CPUs, delivering 50.4 exaflops of FP4 inference compute and 1,046 TB of fast HBM4 memory in aggregate.
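The headline figures above can be broken down per rack and per GPU with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section; the constant and function names are illustrative, and the per-rack and per-GPU values are simple divisions of the aggregates, not separately published specifications.

```python
# Figures quoted in the text for one DGX SuperPOD with DGX Vera Rubin NVL72.
RACKS = 14
GPUS_PER_RACK = 72            # the "72" in NVL72
TOTAL_GPUS = 1_008
TOTAL_CPUS = 504
TOTAL_FP4_EFLOPS = 50.4
TOTAL_FAST_MEMORY_TB = 1_046


def superpod_breakdown():
    """Divide the aggregate SuperPOD figures down to rack and GPU level."""
    # Sanity check: 14 racks of 72 GPUs should give the quoted GPU total.
    assert RACKS * GPUS_PER_RACK == TOTAL_GPUS
    return {
        "cpus_per_rack": TOTAL_CPUS // RACKS,
        "gpus_per_cpu": TOTAL_GPUS // TOTAL_CPUS,
        # 1 exaflop = 1,000 petaflops
        "fp4_pflops_per_gpu": TOTAL_FP4_EFLOPS * 1000 / TOTAL_GPUS,
        "fp4_eflops_per_rack": TOTAL_FP4_EFLOPS / RACKS,
        "fast_memory_tb_per_rack": TOTAL_FAST_MEMORY_TB / RACKS,
    }


if __name__ == "__main__":
    for name, value in superpod_breakdown().items():
        print(f"{name}: {value:.2f}")
```

Read this way, the quoted aggregates imply 36 CPUs per rack, two GPUs per CPU, and about 50 petaflops of FP4 compute per GPU.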

That is one cluster. Hyperscalers buy them in dozens. The largest deployments planned for 2026–2027 will exceed 100,000 Rubin-class GPUs in a single coherent system — a scale at which the network connecting them becomes a more important engineering problem than the GPUs themselves.
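To get a feel for that scale, the same per-SuperPOD figures can be extrapolated to a 100,000-GPU deployment. This is a rough sketch that assumes the large system is built from the same 72-GPU racks and that compute scales linearly with GPU count; real deployments may be organized differently, and the function name is hypothetical.

```python
import math

# Per-SuperPOD figures quoted in the text.
GPUS_PER_RACK = 72
GPUS_PER_SUPERPOD = 1_008
FP4_EFLOPS_PER_SUPERPOD = 50.4


def footprint(target_gpus):
    """Estimate racks, SuperPOD-equivalents, and FP4 compute for a GPU count.

    Assumes linear scaling from the quoted SuperPOD figures.
    """
    eflops_per_gpu = FP4_EFLOPS_PER_SUPERPOD / GPUS_PER_SUPERPOD
    return {
        # Round up: a partially filled rack or SuperPOD still has to be built.
        "racks": math.ceil(target_gpus / GPUS_PER_RACK),
        "superpod_equivalents": math.ceil(target_gpus / GPUS_PER_SUPERPOD),
        "fp4_eflops": target_gpus * eflops_per_gpu,
    }


if __name__ == "__main__":
    print(footprint(100_000))
```

Under these assumptions, 100,000 GPUs is on the order of 1,400 racks, or about a hundred SuperPODs' worth of hardware.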

Networking the unspeakable

Inside a single rack, NVLink 6 handles all GPU-to-GPU traffic. Between racks, the network shifts to InfiniBand or Ethernet. NVIDIA's Quantum-X800 InfiniBand and Spectrum-X Ethernet fabrics provide rack-to-rack bandwidth at hundreds of gigabits per second per link. BlueField-4 DPUs handle the protocol offload — encryption, congestion control, telemetry — leaving the Rubin GPUs free to do what they do best.

The networking is not an afterthought. When a multi-trillion-parameter model is trained across thousands of racks, the gradient-synchronization step at every iteration touches every GPU and demands ultra-low-latency, ultra-high-throughput collectives. A poorly tuned network can cut training throughput in half. A well-tuned one delivers near-linear scaling out to tens of thousands of GPUs.
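Why latency and throughput both matter can be seen in the textbook cost model for a ring all-reduce, one common way to synchronize gradients. The sketch below is that generic model, not a description of the collective algorithms NVIDIA's software actually uses. The function names and the `overlap` parameter are illustrative assumptions.

```python
def ring_allreduce_seconds(n_gpus, message_bytes, link_bytes_per_s, latency_s):
    """Textbook alpha-beta cost of a ring all-reduce.

    A ring all-reduce takes 2 * (N - 1) steps; each step moves
    message_bytes / N over one link and pays one per-hop latency.
    """
    if n_gpus < 2:
        return 0.0  # nothing to synchronize
    steps = 2 * (n_gpus - 1)
    bandwidth_term = steps * (message_bytes / n_gpus) / link_bytes_per_s
    latency_term = steps * latency_s
    return bandwidth_term + latency_term


def scaling_efficiency(compute_s, comm_s, overlap=0.0):
    """Fraction of each iteration spent on useful compute.

    `overlap` is the share of communication hidden behind compute
    (0.0 = fully exposed, 1.0 = fully hidden).
    """
    exposed_comm_s = comm_s * (1.0 - overlap)
    return compute_s / (compute_s + exposed_comm_s)
```

In this model the bandwidth term levels off at roughly twice the message size divided by the link speed, while the latency term grows linearly with the number of GPUs, which is why per-hop latency becomes the concern at large scale. If exposed communication takes as long as the compute it follows, `scaling_efficiency` returns 0.5: throughput is cut in half.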

The physical plant

The buildings that house all this look less like data centers and more like factories. An AI factory's electrical service is measured in tens or hundreds of megawatts. Substations and transformers are part of the building's design. Some sites are co-located with dedicated power generation — natural gas plants, hydroelectric dams, nuclear reactors.

The cooling load is comparable. Hundreds of megawatts of GPU power become hundreds of megawatts of heat that has to leave the building. Some of that heat goes to dry coolers on the roof. Some goes to evaporative cooling towers. Some, increasingly, goes to district heating loops that warm office buildings or greenhouses nearby. The water and air infrastructure is no longer a service utility; it is a structural part of the design.
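The size of that plumbing follows from the basic heat-transport relation: heat carried equals mass flow times specific heat times temperature rise. The sketch below assumes plain water as the coolant and treats all electrical power as heat. The 100 MW load and 10 K temperature rise in the usage example are illustrative values, not figures from any specific site.

```python
# Specific heat of liquid water, in joules per kilogram per kelvin.
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0


def coolant_mass_flow_kg_per_s(heat_load_w, delta_t_k,
                               specific_heat=WATER_SPECIFIC_HEAT_J_PER_KG_K):
    """Mass flow needed to carry away `heat_load_w` watts of heat.

    Rearranges Q = m_dot * c_p * delta_T for m_dot, where delta_T is the
    coolant's temperature rise between inlet and outlet.
    """
    if delta_t_k <= 0:
        raise ValueError("temperature rise must be positive")
    return heat_load_w / (specific_heat * delta_t_k)


if __name__ == "__main__":
    # Illustrative: a 100 MW heat load with a 10 K coolant temperature rise.
    flow = coolant_mass_flow_kg_per_s(100e6, 10.0)
    print(f"{flow:.0f} kg/s of water")
```

With those illustrative inputs the answer is a little under 2,400 kilograms of water per second, or roughly 2.4 cubic meters every second.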

Even the floor matters. The reinforced concrete must support tens of thousands of pounds per square meter. The cable trays must accommodate not just signal cabling but liquid coolant manifolds. The construction time for a new AI factory, from groundbreaking to first GPU online, is now measured in years and is a binding constraint on how fast the industry can grow.

The economy of tokens

What does all of this produce? Computed inference. The fundamental output of an AI factory is tokens: pieces of language, or pixels, or video frames, or molecular structures, generated by the models running on the racks. Every interaction with a frontier AI assistant is a small purchase from this output.

The unit economics matter. NVIDIA claims that Rubin cuts inference token cost by roughly 10× compared to Blackwell, and that training a mixture-of-experts model of comparable size takes one quarter as many GPUs. At the scale of a hyperscaler, those factors are decisive. They are why entire generations of GPUs are deployed in such numbers, why old generations are retired aggressively, and why the silicon supply chain we have just walked through is operating at the limit of its capacity.
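A quick way to see why those factors are decisive is to apply them to a baseline. The sketch below takes NVIDIA's claimed ratios at face value. The baseline cost and GPU count in the usage example are made-up placeholders rather than real prices or cluster sizes, and the function name is hypothetical.

```python
import math

# Ratios claimed by NVIDIA, as quoted in the text.
TOKEN_COST_REDUCTION = 10.0   # Rubin vs Blackwell inference token cost
TRAINING_GPU_REDUCTION = 4.0  # GPUs needed for a comparable MoE training run


def rubin_equivalents(blackwell_cost_per_million_tokens, blackwell_training_gpus):
    """Apply the claimed generational ratios to a Blackwell baseline.

    Returns (cost per million tokens, training GPUs needed) on Rubin.
    """
    rubin_cost = blackwell_cost_per_million_tokens / TOKEN_COST_REDUCTION
    # Round up: a fractional GPU still means one more physical GPU.
    rubin_gpus = math.ceil(blackwell_training_gpus / TRAINING_GPU_REDUCTION)
    return rubin_cost, rubin_gpus


if __name__ == "__main__":
    # Placeholder baseline: $2.00 per million tokens, 64,000 training GPUs.
    cost, gpus = rubin_equivalents(2.00, 64_000)
    print(f"${cost:.2f} per million tokens, {gpus:,} GPUs")
```

Multiplied across billions of tokens a day, a tenfold difference per token is the gap between a product that pays for its hardware and one that does not.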

Looking back along the chain

Step backward, then. The Rubin GPU running in a rack in Virginia is, in the strictest sense, a piece of quartzite from a mountain in North Carolina, melted and re-melted, purified to one part contamination per billion, grown into a perfect crystal, sliced into a wafer, polished to atomic flatness, patterned eighty times by light at 13.5 nanometers, etched by plasma and doped by ion beams, wired into a labyrinth of copper, packaged with stacks of memory on a silicon interposer, fused with a CPU into a superchip, slotted into a tray, plugged into a copper midplane, cooled by liquid running through pipes, watched over by an internal RAS engine, networked to a thousand of its siblings, and instructed to compute.

The journey is roughly six months long. It crosses four continents. It involves perhaps eighty suppliers, twenty governments, and several technologies that no single nation can build alone. At every stage, what makes it possible is precision so extreme it borders on the metaphysical: nine nines of purity, single nanometers of flatness, single atomic layers of deposition, single nanometer features printed by single droplets of vaporized tin.

The output, at the end, is an instrument that can answer questions about itself.

[Figure: The DGX SuperPOD. Fourteen racks, fifty exaflops, one logical machine: the smallest unit of frontier AI.]
Figure 16.1. A DGX SuperPOD with 14 NVL72 racks, totaling 1,008 Rubin GPUs and 50.4 exaflops of FP4 inference compute, networked through Quantum-X800 InfiniBand or Spectrum-X Ethernet.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What networking fabrics handle rack-to-rack traffic in a DGX SuperPOD, and what handles protocol offload?
Q2 (Conceptual): Why does the chapter call an AI factory a building that ‘converts electricity into computed thought’ rather than simply a large data center?
Q3 (Synthetic): What goes wrong if a hyperscaler optimizes for GPU cluster scale alone (buying 100,000 Blackwell GPUs instead of waiting for Rubin) and ignores per-generation efficiency differences?