From Sand to Superintelligence · Drill cards · Chapter 28
Drills
The GPU's Different Mind
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Sand to Silicon · Ch 28, note type = Basic.
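
For readers who would rather rebuild the import file than download it, the sketch below shows one way to produce tab-separated front/back pairs like the ones the import step above expects. The function name `cards_to_tsv`, the two sample cards, and the output filename in the usage note are illustrative assumptions, not part of the chapter's export.

```python
import csv
import io

# Two sample cards copied from the table on this page; the real deck has ten.
SAMPLE_CARDS = [
    ("How many threads are in one GPU warp?", "32."),
    ("What does SIMT stand for?", "Single instruction, multiple threads."),
]


def cards_to_tsv(cards):
    """Return the cards as tab-separated text, one 'front<TAB>back' row per card."""
    buffer = io.StringIO()
    # delimiter="\t" matches the "field separator = Tab" import setting.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(cards)
    return buffer.getvalue()


if __name__ == "__main__":
    print(cards_to_tsv(SAMPLE_CARDS), end="")
```

To write a file for import, something like `open("ch28.tsv", "w", encoding="utf-8").write(cards_to_tsv(SAMPLE_CARDS))` would do; the filename is arbitrary.
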
| Front | Back |
|---|---|
| How many threads are in one GPU warp? | 32. |
| What does SIMT stand for? | Single instruction, multiple threads — NVIDIA’s execution model where hardware groups threads into warps and issues one instruction to the whole warp. |
| What is warp divergence? | When threads in a warp take different branches, the hardware executes each path serially with the inactive threads masked off. A two-way split roughly halves throughput in that region, and deeper divergence costs more. |
| What do tensor cores do? | Perform a small matrix multiply-and-accumulate as a single instruction; introduced in Volta (2017) and refined through Hopper and Blackwell. |
| What matrix shape does one tensor-core instruction operate on? | A 16×16×16 tile (the figure given in the chapter's stat row). |
| What is occupancy? | The ratio of active warps to maximum possible warps on an SM; high occupancy is needed to hide memory latency by keeping the SM saturated. |
| What is memory coalescing? | Adjacent threads in a warp reading adjacent words, so the hardware can fold many reads into a single memory transaction and fully use available bandwidth. |
| How much HBM bandwidth does a Rubin GPU have? | ~22 TB/s from its HBM4 stacks. |
| What is the maximum number of resident threads on a Hopper-class SM? | 2,048 (64 warps of 32 threads). |
| Which tools does the chapter name for engineering coalesced memory layouts in neural-network kernels? | Triton and FlashAttention. |
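
Several of the cards above reduce to small arithmetic, and a toy model can make them easier to remember. The Python sketch below is not GPU code and not the chapter's own material. It assumes 4-byte words and a 128-byte memory segment per transaction (an illustrative figure; real hardware granularity varies), and the function names are invented for this example.

```python
WARP_SIZE = 32              # threads per warp, from the cards
MAX_THREADS_PER_SM = 2048   # Hopper-class resident-thread limit, from the cards


def occupancy(active_warps, max_threads_per_sm=MAX_THREADS_PER_SM):
    """Ratio of active warps to the maximum warps an SM can hold."""
    max_warps = max_threads_per_sm // WARP_SIZE  # 2048 // 32 = 64
    return active_warps / max_warps


def memory_transactions(byte_addresses, segment_bytes=128):
    """Count the distinct aligned segments one warp's loads touch.

    Assumption: each touched segment costs one transaction. This is a
    simplification used only to contrast coalesced and strided access.
    """
    return len({addr // segment_bytes for addr in byte_addresses})


def divergent_passes(branch_taken):
    """Serialized passes for a two-way branch: 1 if all lanes agree, else 2."""
    return len(set(branch_taken))


if __name__ == "__main__":
    coalesced = [4 * lane for lane in range(WARP_SIZE)]    # adjacent 4-byte words
    strided = [128 * lane for lane in range(WARP_SIZE)]    # one word per segment
    print("occupancy with 32 active warps:", occupancy(32))
    print("coalesced transactions:", memory_transactions(coalesced))
    print("strided transactions:", memory_transactions(strided))
    print("passes, uniform branch:", divergent_passes([True] * WARP_SIZE))
    print("passes, even/odd split:",
          divergent_passes([lane % 2 == 0 for lane in range(WARP_SIZE)]))
```

Under these assumptions the adjacent-word pattern touches 1 segment while the strided pattern touches 32, which is the gap coalescing is meant to close. Likewise, 32 active warps out of a possible 64 gives an occupancy of 0.5.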