From Tokens to Embodied Minds  ·  Drill cards · Chapter 13

Inside an H100/B200

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.


In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 13, note type = Basic.

Front	Back
What is the H100 SXM5 peak BF16 dense compute throughput?	989 TFLOPS BF16 dense.
What is the H100 SXM5 HBM3 memory bandwidth?	3.35 TB/s.
Define arithmetic intensity.	FLOPs executed divided by bytes of memory traffic; it is the x-axis of the roofline model.
What is the roofline ridge point on H100 in FLOPs per byte?	Approximately 295 FLOPs/byte (989 TFLOPS / 3.35 TB/s).
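Why is LLM decode memory-bandwidth-bound at small batch sizes?	Each decode step reads every model weight from HBM but performs only about 2 FLOPs per weight per sequence, so arithmetic intensity is roughly 1 to 2 FLOPs/byte at batch size 1 (2-byte BF16 or 1-byte FP8 weights), far below H100's ridge point of about 295.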
How many SMs does the H100 SXM5 have?	132 streaming multiprocessors.
What is the per-SM shared memory size on H100?	Up to 228 KB of shared memory, configured out of 256 KB of combined L1/shared memory per SM.
What NVLink bandwidth does the H100 DGX fabric provide?	900 GB/s bidirectional per GPU, between GPUs on the same NVSwitch fabric.
Why is tensor parallelism typically restricted to within a node?	NVLink (900 GB/s) makes the per-layer all-reduce fast enough; cross-node InfiniBand (~400 Gb/s, about 50 GB/s per link) is too slow for tight TP synchronization.
What tool gives you measured arithmetic intensity per kernel on NVIDIA GPUs?	Nsight Compute (ncu).
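
The roofline cards above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the 989 TFLOPS and 3.35 TB/s figures from the cards, and the function names (`ridge_point`, `decode_intensity`, `attainable_flops`) are made up for this example. The decode model is a deliberate simplification that counts weight traffic only and ignores KV-cache and activation reads, so treat its numbers as rough bounds rather than measurements.

```python
# Roofline arithmetic using the figures quoted on the cards.
PEAK_FLOPS = 989e12      # H100 SXM5 BF16 dense, FLOPs/s
HBM_BANDWIDTH = 3.35e12  # H100 SXM5 HBM3, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the compute and memory roofs meet."""
    return peak_flops / bandwidth


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Approximate FLOPs/byte for one decode step.

    Assumption: only weight reads are counted (no KV cache, no activations),
    and each weight contributes one multiply-add (2 FLOPs) per sequence.
    """
    flops_per_weight = 2.0 * batch_size
    return flops_per_weight / bytes_per_weight


def attainable_flops(intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * intensity)."""
    return min(PEAK_FLOPS, HBM_BANDWIDTH * intensity)


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    for batch in (1, 8, 64, 512):
        ai = decode_intensity(batch)  # BF16 weights, 2 bytes each
        bound = "memory" if ai < ridge else "compute"
        tflops = attainable_flops(ai) / 1e12
        print(f"batch {batch:>3}: {ai:.0f} FLOPs/byte, "
              f"at most {tflops:.1f} TFLOPS ({bound}-bound)")
```

Under these assumptions, intensity grows linearly with batch size, which is why batching is the usual lever for moving decode toward the ridge point. Passing `bytes_per_weight=1.0` models 1-byte weights and doubles the intensity at every batch size.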