From Tokens to Embodied Minds · Drill cards · Chapter 13
Drills
Inside an H100/B200
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 13, note type = Basic.
| Front | Back |
|---|---|
| What is the H100 SXM5 peak BF16 dense compute throughput? | 989 TFLOPS BF16 dense. |
| What is the H100 SXM5 HBM3 memory bandwidth? | 3.35 TB/s. |
| Define arithmetic intensity. | FLOPs executed divided by bytes of memory traffic — the x-axis of the roofline model. |
| What is the roofline ridge point on H100 in FLOPs per byte? | Approximately 295 FLOPs/byte (989 TFLOPS / 3.35 TB/s). |
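| Why is LLM decode memory-bandwidth-bound at small batch sizes? | Each decode step streams all model weights from HBM but performs only about 2 FLOPs per weight per sequence, so arithmetic intensity is roughly 1 FLOP/byte with BF16 weights at batch size 1 (about 2 with 8-bit weights), far below the H100 ridge point of about 295. |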
| How many SMs does the H100 SXM5 have? | 132 streaming multiprocessors. |
| What is the per-SM shared memory size on H100? | Up to 228 KB of shared memory, configured out of 256 KB of combined L1/shared memory per SM. |
| What NVLink bandwidth does the H100 DGX fabric provide? | 900 GB/s total bidirectional bandwidth per GPU, to other GPUs on the same NVSwitch fabric. |
| Why is tensor parallelism typically restricted to within-node? | NVLink (900 GB/s per GPU) makes all-reduce fast enough; cross-node InfiniBand (~400 Gb/s, which is only about 50 GB/s per link) is too slow for tight TP synchronization. |
| What tool gives you measured arithmetic intensity per kernel on NVIDIA GPUs? | Nsight Compute (ncu). |
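
The roofline cards above can be checked with a few lines of arithmetic. The sketch below is illustrative rather than a measurement: the function names are hypothetical, and the decode model assumes 2 FLOPs per weight per sequence, weights read from HBM once per step, and no KV-cache or activation traffic. Under those assumptions, batch-1 decode lands at about 1 FLOP/byte with 2-byte (BF16) weights and about 2 FLOPs/byte with 1-byte weights.

```python
# Roofline arithmetic using the H100 SXM5 figures from the cards.
PEAK_FLOPS = 989e12        # BF16 dense, FLOP/s
HBM_BANDWIDTH = 3.35e12    # HBM3, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the compute and memory roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     bandwidth: float = HBM_BANDWIDTH) -> float:
    """Roofline bound: the lower of the compute roof and bandwidth * intensity."""
    return min(peak_flops, bandwidth * intensity)


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Simplified decode model (an assumption, not a measurement):
    2 FLOPs per weight per sequence, weights read once per step,
    KV-cache and activation traffic ignored."""
    return 2.0 * batch_size / bytes_per_weight


if __name__ == "__main__":
    print(f"ridge point: {ridge_point(PEAK_FLOPS, HBM_BANDWIDTH):.1f} FLOPs/byte")
    for batch in (1, 8, 64, 512):
        ai = decode_intensity(batch)
        bound = attainable_flops(ai)
        print(f"batch {batch:>3}: intensity {ai:6.1f} FLOPs/byte, "
              f"bound {bound / 1e12:7.2f} TFLOPS "
              f"({100 * bound / PEAK_FLOPS:.1f}% of peak)")
```

In this simplified model the bound stays on the memory roof until the batch size pushes intensity past the ridge point, which is why small-batch decode uses only a small fraction of peak compute. Measured intensity from Nsight Compute will differ, since real kernels also move KV-cache and activation data.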