Chapter 21 · The Clock — From Sand to Superintelligence

A heap of logic gates can compute only in the way that a pile of cogs can keep time: both need something to keep them moving in step. In a watch it is a balance wheel and a spring. In a chip it is the clock: a square wave, generated by an oscillator and distributed through a tree of buffers to every flip-flop on the chip, all of which sample their inputs on the same edge of every cycle. The clock is the heartbeat that turns combinational logic into a machine that moves through time.

Why clocks exist

You could, in principle, build a chip without a global clock — an "asynchronous" chip, in which every gate fires whenever its inputs are ready. Some have been built. They are gorgeous. They are also a nightmare to design, debug, and verify. The synchronous, clocked design — every gate, every flip-flop, every register dancing to the same drum — won decades ago, and almost no commercial silicon has shipped without a global clock since.

The reason is correctness. Inside a chip, every signal takes a different amount of time to propagate. A short wire might be fast; a long wire, slow; a gate driving a heavy capacitive load, slower still. If a downstream flip-flop samples its input before all the upstream signals have finished settling, it captures a meaningless intermediate value — a glitch — and the calculation goes wrong.

The clock guarantees correctness by giving every signal in the chip a fixed, generous amount of time — a clock period — to settle, before any flip-flop samples it. The clock period is set, conservatively, by the slowest path between any two flip-flops on the chip. This path is called the critical path, and it is what determines a chip's maximum frequency. Speeding up a chip is, almost always, the art of finding and shortening critical paths.
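As a rough illustration of how the critical path sets the clock, the sketch below takes a handful of made-up register-to-register path delays and derives the shortest safe period and the resulting maximum frequency. The path names, the delay values, and the setup-time margin are all assumptions chosen for illustration, not figures from any real chip.

```python
# Hypothetical flip-flop-to-flip-flop path delays, in picoseconds.
path_delays_ps = {
    "alu_add": 310.0,
    "bypass_mux": 120.0,
    "cache_tag_compare": 395.0,
}

def max_frequency_ghz(delays_ps, setup_ps=20.0):
    """Return (critical path name, minimum period in ps, max frequency in GHz).

    setup_ps is an assumed margin the receiving flip-flop needs before the edge.
    """
    critical = max(delays_ps, key=delays_ps.get)   # slowest path wins
    period_ps = delays_ps[critical] + setup_ps     # everything must settle in time
    return critical, period_ps, 1000.0 / period_ps # 1000 ps per ns, so GHz = 1000 / ps

name, period, fmax = max_frequency_ghz(path_delays_ps)
print(f"critical path: {name}, period {period:.0f} ps, f_max {fmax:.2f} GHz")
```

Shortening any path other than the slowest one leaves the result unchanged, which is why timing work concentrates on the critical path.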

The clock tree

Distributing the clock to every flip-flop (there are millions in a typical CPU, and far more in a large GPU) is harder than it sounds. The signal must arrive at every flip-flop within a few picoseconds of every other; otherwise downstream logic sees inputs from different clock cycles and chaos ensues. The spread in arrival times is called skew, and the requirement is that it stay low.

The trick is the clock tree, which in its textbook form is a balanced binary tree of buffers. The clock generator sits at the root; every level fans out to two more buffers, until the leaves drive the actual flip-flops. The wires are deliberately routed to be the same length on every branch, and the buffers are carefully sized so that each one drives the same load. A modern clock tree is a piece of art, and an expensive one: estimates of its share of a chip's power commonly run from about 10% to a third or more, spent just keeping time.
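To get a feel for the scale, the sketch below counts how many levels a fan-out-of-two tree needs to reach a given number of flip-flops, and computes skew as the gap between the earliest and latest clock arrival. The flip-flop counts and arrival times are invented for illustration, and the model assumes an idealized tree in which each leaf drives a single flip-flop.

```python
def tree_levels(num_flops, fanout=2):
    """Number of buffer levels needed for the leaves to reach num_flops sinks."""
    levels, reach = 0, 1
    while reach < num_flops:
        reach *= fanout   # each level multiplies the number of branches
        levels += 1
    return levels

def skew_ps(arrival_times_ps):
    """Skew is the spread between the earliest and latest clock arrival."""
    return max(arrival_times_ps) - min(arrival_times_ps)

for flops in (1_000_000, 10_000_000):
    print(f"{flops:>12,} flip-flops -> {tree_levels(flops)} levels")

# Hypothetical arrival times at four leaves, in picoseconds after the root edge.
print(f"skew: {skew_ps([502.0, 498.5, 501.0, 499.0])} ps")
```

Because depth grows only with the logarithm of the flip-flop count, ten times as many flip-flops costs only a few more levels; the hard part is keeping every root-to-leaf delay matched.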

The clock frequency, finally, is set by the phase-locked loop (PLL), a feedback circuit that multiplies a slow, accurate reference (typically 100 MHz or less, derived from a quartz crystal; yes, more silicon dioxide) up to the chip's operating frequency. A modern CPU has dozens of PLLs, roughly one per clock domain, and can shift between frequencies in microseconds in response to thermal or workload changes.
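The arithmetic of frequency multiplication is simple enough to write down. The sketch below models an idealized PLL whose output is the reference times an integer multiplier, optionally divided down; the function name and the particular multiplier settings are assumptions for illustration, and lock time, jitter, and the feedback loop itself are ignored.

```python
def pll_output_mhz(reference_mhz, multiplier, divider=1):
    """Output frequency of an idealized PLL: reference * multiplier / divider."""
    if multiplier <= 0 or divider <= 0:
        raise ValueError("multiplier and divider must be positive")
    return reference_mhz * multiplier / divider

REFERENCE_MHZ = 100.0  # the reference frequency quoted in the text

# Hypothetical multiplier settings for an idle, a nominal, and a boosted state.
for multiplier in (8, 36, 50):
    print(f"x{multiplier:<2} -> {pll_output_mhz(REFERENCE_MHZ, multiplier):.0f} MHz")
```

Changing frequency in response to heat or load then amounts to reprogramming the multiplier and waiting for the loop to settle.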

Pipelining

Once you have a clock, an enormous architectural idea becomes possible: pipelining. Conceptually, an instruction's execution has phases — fetch the instruction from memory, decode what it means, execute it on the ALU, write the result back to a register. If you do these one at a time, sequentially, each instruction takes (say) four cycles. Awful.

Instead, divide the chip into four stages, separated by flip-flops. While stage 1 fetches instruction n+3, stage 2 decodes n+2, stage 3 executes n+1, stage 4 writes back n. After the pipeline is full, every cycle, one instruction finishes and one new instruction begins. The latency per instruction is still four cycles, but the throughput is one instruction per cycle.
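The overlap is easiest to see as a table of which instruction occupies which stage on each cycle. The sketch below builds that table for the four stages named above and compares total cycle counts with and without pipelining; it assumes an ideal pipeline with no stalls, and the function names are illustrative.

```python
STAGES = ["fetch", "decode", "execute", "writeback"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """One dict per cycle, mapping stage name to instruction index (or None)."""
    total_cycles = len(stages) + num_instructions - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            instr = cycle - depth  # instruction i enters stage `depth` on cycle i + depth
            row[stage] = instr if 0 <= instr < num_instructions else None
        schedule.append(row)
    return schedule

def cycles_unpipelined(n, stages=STAGES):
    return n * len(stages)        # each instruction runs start to finish alone

def cycles_pipelined(n, stages=STAGES):
    return len(stages) + n - 1    # fill the pipe once, then one finishes per cycle

for cycle, row in enumerate(pipeline_schedule(6)):
    cells = " ".join(f"{'-' if i is None else i:>2}" for i in row.values())
    print(f"cycle {cycle}: {cells}")
print(cycles_unpipelined(6), "cycles unpipelined vs", cycles_pipelined(6), "pipelined")
```

For long instruction streams the fill cost of a few cycles becomes negligible, and the speedup approaches the number of stages.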

This trick is the single most important architectural invention in CPU design. It is also why CPU pipelines grew so deep through the 1990s and early 2000s: the Pentium 4 Prescott (2004) ran a 31-stage pipeline, up from the original P4's twenty stages, and clocked above 3 GHz on a 90 nm process. Modern CPUs are shallower (10-20 stages) because the marginal benefits ran out: every extra stage adds flip-flop overhead and makes each mispredicted branch more expensive. Even so, every modern processor pipelines aggressively.

Pipelining is the assembly line applied to logic. Every instruction goes through the same stages, but at any given moment several instructions (dozens, in a deep pipeline) are partway through, riding the same chip like cars on a Ford line.

Hazards and stalls

Pipelining is not free. If instruction n+1 needs the result of instruction n (a "data hazard"), the pipeline must wait, or arrange to forward the not-yet-written result directly from stage 3's output back to stage 3's input, where instruction n+1 is about to execute. If instruction n is a branch, the chip doesn't know which instruction comes after n until n is executed; modern CPUs make a guess (the branch predictor, accurate >95% of the time) and roll back if they were wrong, throwing away the work already in the pipeline. If instruction n needs to read from memory and the data isn't in the cache, the pipeline must stall for hundreds of cycles waiting for DRAM, and that is much of why we will spend an entire chapter (Ch. 24) on the memory pyramid.
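A back-of-the-envelope model shows how these stalls erode the ideal one instruction per cycle. The sketch below adds, for each kind of event, its frequency per instruction times its penalty in cycles. The 5% misprediction rate follows the accuracy quoted above; everything else (20% of instructions being branches, a flush penalty roughly equal to the pipeline depth, 1% of instructions missing to DRAM at 200 cycles each) is an assumption picked for illustration.

```python
def effective_cpi(events, base_cpi=1.0):
    """Average cycles per instruction.

    events is a list of (occurrences per instruction, penalty in cycles) pairs.
    """
    return base_cpi + sum(rate * penalty for rate, penalty in events)

# Assumed workload: 20% branches, of which 5% are mispredicted.
mispredicts_per_instr = 0.20 * 0.05

# Assumed flush penalty of about one pipeline's worth of cycles.
moderate = effective_cpi([(mispredicts_per_instr, 14)])
deep = effective_cpi([(mispredicts_per_instr, 31)])

# Assumed: 1% of instructions miss all caches and wait 200 cycles for DRAM.
deep_with_misses = effective_cpi([(mispredicts_per_instr, 31), (0.01, 200)])

print(f"14-stage: {moderate:.2f} CPI")
print(f"31-stage: {deep:.2f} CPI")
print(f"31-stage with DRAM misses: {deep_with_misses:.2f} CPI")
```

Under these assumptions the deeper pipeline pays more than twice as much per instruction for mispredictions, and even rare trips to DRAM cost more than both.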

Out-of-order execution

Modern CPUs go a step beyond pipelining: out-of-order execution. Instructions are fetched in order, but the chip looks ahead at dozens of instructions, picks any whose inputs are ready, and executes them in whatever order extracts the most parallelism. They are then "retired" back into program order so that, from the outside, the chip still appears to obey your code one line at a time. The lie is the entire point.
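The core idea, issuing whichever waiting instruction has its inputs ready, fits in a short sketch. The toy scheduler below is an illustration under strong assumptions, not a model of any real core: each register is written at most once (as if renaming had already happened), producers appear before their consumers, and retirement is not modeled. The instruction names, latencies, and the Instr type are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int = 1

def schedule(program, issue_width=1, in_order=False):
    """Return {instruction name: cycle issued}."""
    produced = {i.dest for i in program}  # registers written by the program
    ready_at = {}                         # register -> cycle its value is available
    issued = {}
    pending = list(program)               # kept in program order, oldest first
    cycle = 0
    while pending:
        chosen = []
        for instr in pending:
            if len(chosen) == issue_width:
                break
            ok = all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                     for s in instr.srcs)
            if ok:
                chosen.append(instr)
            elif in_order:
                break                     # an in-order core waits behind the oldest
        for instr in chosen:
            issued[instr.name] = cycle
            ready_at[instr.dest] = cycle + instr.latency
            pending.remove(instr)
        cycle += 1
    return issued

program = [
    Instr("load1", "r1", ("a",), latency=3),
    Instr("add1", "r2", ("r1",)),
    Instr("load2", "r3", ("b",), latency=3),
    Instr("mul", "r4", ("r3", "x")),
    Instr("add2", "r5", ("r2", "r4")),
]
print("in order:    ", schedule(program, in_order=True))
print("out of order:", schedule(program))
```

In the out-of-order run, load2 issues while add1 is still waiting on load1, so the second load's latency overlaps the first and the final add issues several cycles earlier.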

The clock has given us a way to make billions of gates cooperate in time. The pipeline has given us a way to make a single CPU walk and chew gum at once. Now we need to give it a job to do — a sequence of instructions to execute. That sequence is called a program, and the way the chip reads it is so simple it is almost embarrassing.

Figure 21.1 Top: the clock waveform, a square wave at gigahertz frequency. Bottom: a 5-stage pipeline; once filled, one new instruction completes every cycle.

Retrieve before you continue

Three questions on what you just read

Q1 Factual What is the Pentium 4 Prescott’s pipeline depth, and how did that compare to the original Pentium 4?

Q2 Conceptual What is the ‘critical path’, and why does it set a chip’s maximum clock frequency?

Q3 Synthetic What goes wrong if you optimize pipeline depth alone to push clock frequency, without accounting for branch misprediction cost?