Every inference cost estimate, every kernel optimization, and every hardware procurement decision for a production LLM stack reduces to one question: is your bottleneck FLOP throughput or memory bandwidth? The H100 SXM5 peaks at 989 TFLOPS BF16 dense and 3.35 TB/s of HBM3 bandwidth, two numbers that meet at a roofline ridge point of roughly 295 FLOPs per byte. Below that ridge you are memory-bound; above it, compute-bound. Almost every serving problem at batch sizes under 32 lives below the ridge. Almost every training problem lives above it. Not knowing which side you are on is not a knowledge gap; it is a money gap.
The memory hierarchy is the architecture: a 256 KB register file per SM, 256 KB of combined L1/shared memory per SM (up to 228 KB of it configurable as shared memory), 50 MB of L2, then HBM at 3.35 TB/s, with each level larger and slower than the one before. Each of the 132 SMs on H100 contains 4 tensor core units that execute matrix multiply-accumulate (MMA) instructions such as the 16×8×16 BF16 shape. NVLink 4 provides 900 GB/s of bidirectional bandwidth between GPUs. The Blackwell B200 roughly doubles most of these numbers and adds FP4 tensor cores; for DealLens running Llama 3.1 70B across two B200s, that is the difference between $0.003 and $0.0015 per 1K output tokens at production throughput.
The roofline model is your first tool
The roofline model (Williams, Waterman, and Patterson, Communications of the ACM, 2009) plots achievable performance (FLOP/s) as a function of arithmetic intensity (FLOPs per byte of memory traffic). Two ceilings bound any kernel: peak compute (989 TFLOPS BF16 on H100) and peak bandwidth × arithmetic intensity (3.35 TB/s × AI). The ridge point is where they meet: 989e12 / 3.35e12 ≈ 295 FLOPs/byte. Decoding with Llama 3 8B at batch size 1 streams roughly 16 GB of BF16 weights per forward pass and performs 2 FLOPs per weight (one multiply, one add). At 2 bytes per weight that is an arithmetic intensity of about 1 FLOP/byte, deep in memory-bound territory. Prefill at batch 128 with a 4096-token prompt reuses each weight across hundreds of thousands of tokens, so its arithmetic intensity runs into the thousands, well above the ridge.
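The roofline arithmetic above is small enough to script. The following is a minimal sketch, not a profiler: the function names are illustrative, the hardware numbers are the H100 figures quoted in the text, and the decode model assumes weight streaming dominates memory traffic (KV-cache and activation traffic are ignored).

```python
# Roofline sketch using the H100 SXM5 figures quoted in the text.
PEAK_FLOPS = 989e12      # BF16 dense, FLOP/s
PEAK_BW = 3.35e12        # HBM3, bytes/s


def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the two ceilings meet."""
    return peak_flops / peak_bw


def attainable_flops(intensity: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline ceiling: min(compute roof, bandwidth roof * intensity)."""
    return min(peak_flops, peak_bw * intensity)


def decode_intensity(bytes_per_weight: float, batch_size: int = 1) -> float:
    """Weight-streaming intensity for decode.

    Each weight is read once per step and contributes one multiply and one
    add per sequence in the batch. ASSUMPTION: KV-cache and activation
    traffic are ignored, so this is an optimistic estimate.
    """
    flops_per_weight = 2 * batch_size
    return flops_per_weight / bytes_per_weight


ridge = ridge_point(PEAK_FLOPS, PEAK_BW)
ai = decode_intensity(bytes_per_weight=2)   # BF16 weights, batch size 1
ceiling = attainable_flops(ai, PEAK_FLOPS, PEAK_BW)

print(f"ridge point:      {ridge:.0f} FLOPs/byte")
print(f"decode intensity: {ai:.1f} FLOPs/byte")
print(f"decode ceiling:   {ceiling / 1e12:.2f} TFLOP/s "
      f"({100 * ceiling / PEAK_FLOPS:.2f}% of peak)")
```

With 2-byte BF16 weights the batch-1 decode intensity works out to about 1 FLOP/byte, so the bandwidth roof caps the kernel at a fraction of a percent of peak compute. Raising `batch_size` moves the point to the right along the same roof until it reaches the ridge.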
The practical consequence: optimize decode by reducing memory traffic (quantization, KV-cache compression, weight sharing), not by adding FLOP throughput. Optimize prefill by maximizing FLOP density (FlashAttention tiling, fused kernels, large batch accumulation). Both optimizations exist, but they target different bottlenecks, and applying one to the wrong phase wastes engineering time. NVIDIA's Nsight Compute (ncu) reports measured arithmetic intensity and a roofline chart per kernel, and Nsight Systems (nsys) shows where time goes across the whole timeline; read both before writing any optimization code.
The Blackwell B200 does not just scale the numbers: FP4 Tensor Cores (introduced in Blackwell) push the peak dense compute ceiling to roughly 9 PFLOPS for FP4 (about 4.5 PFLOPS for FP8), which shifts the ridge point further right. That matters only for compute-bound kernels; for memory-bound decode the extra FLOPs do not help at all, although 4-bit weights do cut the bytes moved. HBM3e at 8 TB/s on the B200 is the spec that actually moves the needle for serving workloads, because it directly raises the memory-bound ceiling.
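To see why bandwidth is the number that matters for decode, here is a rough upper bound on decode throughput when every step has to stream all weights from HBM once. It is a sketch under stated assumptions: the function name is illustrative, the model is the 8B BF16 example from earlier (about 16 GB of weights), and KV-cache reads and kernel overheads are ignored, so real throughput will be lower.

```python
def decode_tokens_per_second(weight_bytes: float, hbm_bytes_per_s: float,
                             batch_size: int = 1) -> float:
    """Upper bound on memory-bound decode throughput.

    One decode step reads every weight once and emits one token per
    sequence in the batch. ASSUMPTION: the bound holds only while the
    kernel stays below the roofline ridge.
    """
    steps_per_second = hbm_bytes_per_s / weight_bytes
    return steps_per_second * batch_size


WEIGHT_BYTES_8B_BF16 = 8e9 * 2   # ~16 GB, as in the text

h100_tps = decode_tokens_per_second(WEIGHT_BYTES_8B_BF16, 3.35e12)
b200_tps = decode_tokens_per_second(WEIGHT_BYTES_8B_BF16, 8e12)

print(f"H100 HBM3  (3.35 TB/s): <= {h100_tps:.0f} tokens/s at batch size 1")
print(f"B200 HBM3e (8 TB/s):    <= {b200_tps:.0f} tokens/s at batch size 1")
print(f"speedup from bandwidth alone: {b200_tps / h100_tps:.2f}x")
```

The bound scales linearly with bandwidth and is independent of peak FLOPs, which is the point of the paragraph above.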
SMs, tensor cores, and the warp execution model
An H100 SM contains 128 CUDA cores (for scalar FP32/INT32 work), 4 fourth-generation tensor core units, a 256 KB register file, and 256 KB of combined L1/shared memory, of which up to 228 KB can be configured as shared memory. A warp is 32 threads executing in lockstep. Hopper's highest-throughput tensor core instructions (wgmma) are issued by a warp group (4 warps = 128 threads); the older warp-level MMA (mma.sync, for example the 16×8×16 shape with BF16 inputs and FP32 accumulation) is still supported. The back-of-envelope check runs from the spec downward: 989 TFLOPS / 132 SMs ≈ 7.5 TFLOPS BF16 per SM, which at boost clocks of 1.8 to 2.0 GHz is roughly 4,000 FLOPs per clock per SM, or about 500 fused multiply-adds per tensor core per clock. A 16×8×16 tile is 2,048 multiply-adds, so each tensor core retires one such tile about every four clocks rather than one per clock. This check tells you whether a kernel is actually using tensor cores (close to spec) or falling back to CUDA cores (about 67 TFLOPS FP32, roughly 15× slower for matmul).
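The per-SM check can be run backward from the spec sheet. The sketch below divides the 989 TFLOPS figure across SMs and tensor cores. The 1.83 GHz clock is an assumption (a tensor-op boost clock, not the 1.98 GHz quoted in the text), and the variable names are illustrative.

```python
SPEC_TFLOPS = 989.0         # H100 SXM5 BF16 dense, from the text
NUM_SMS = 132               # from the text
TENSOR_CORES_PER_SM = 4     # from the text
CLOCK_HZ = 1.83e9           # ASSUMPTION: tensor-op boost clock, not from the text

# Spec throughput attributed to one SM.
per_sm_tflops = SPEC_TFLOPS / NUM_SMS

# FLOPs each SM must retire per clock to hit the spec.
flops_per_clock_per_sm = per_sm_tflops * 1e12 / CLOCK_HZ

# One fused multiply-add counts as 2 FLOPs.
fma_per_clock_per_tc = flops_per_clock_per_sm / TENSOR_CORES_PER_SM / 2

# Multiply-accumulates in one 16x8x16 MMA tile.
MMA_FMAS = 16 * 8 * 16
clocks_per_mma = MMA_FMAS / fma_per_clock_per_tc

print(f"per-SM throughput:      {per_sm_tflops:.2f} TFLOPS")
print(f"FMAs/clock/tensor core: {fma_per_clock_per_tc:.0f}")
print(f"clocks per 16x8x16 MMA: {clocks_per_mma:.1f}")
```

Under that clock assumption the spec implies about 7.5 TFLOPS per SM and close to 512 multiply-adds per tensor core per clock, so a 2,048-MAC tile occupies a tensor core for about four clocks. A profiled matmul that lands far below the per-SM figure is likely not on the tensor core path.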
Shared memory is the key lever for memory-bound kernels: data loaded from HBM into on-chip SRAM (up to 228 KB of shared memory per SM) can be reused many times before eviction. FlashAttention exploits this: it tiles the Q, K, and V matrices into SRAM-resident blocks and computes the softmax incrementally, so each block of scores is consumed on chip and never written back to HBM. A naive attention implementation materializes the N×N attention matrix in HBM, which costs O(N²) extra memory and traffic. FlashAttention cuts the extra memory to O(N) and, as analyzed in the original FlashAttention paper, cuts HBM accesses from Θ(Nd + N²) to Θ(N²d²/M) for head dimension d and SRAM size M. That difference is why FlashAttention wins on memory-bound hardware, not because it does fewer FLOPs.
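The core trick, processing K and V in blocks while keeping a running softmax, can be shown in plain Python. This is a CPU sketch of the online-softmax idea only, not FlashAttention itself: it tiles K/V for one query row at a time, uses Python lists in place of SRAM tiles, and the function names and toy inputs are made up for illustration.

```python
import math


def naive_attention(q, k, v):
    """Reference: builds each full score row, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block=2):
    """Same result, but K/V are consumed block by block with a running max
    and running denominator, so no full score row is ever stored."""
    d_v = len(v[0])
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        m = -math.inf          # running max of scores seen so far
        z = 0.0                # running softmax denominator
        acc = [0.0] * d_v      # running unnormalized output
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            s = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            m_new = max(m, max(s))
            correction = math.exp(m - m_new)   # rescale old state to the new max
            p = [math.exp(sj - m_new) for sj in s]
            z = z * correction + sum(p)
            acc = [a * correction + sum(pj * vj[c] for pj, vj in zip(p, v_blk))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / z for a in acc])
    return out


# Toy inputs: 2 queries, 5 keys/values, head dim 2, value dim 3.
Q = [[1.0, 0.0], [0.5, -1.0]]
K = [[0.2, 0.1], [1.0, -0.5], [-0.3, 0.8], [0.0, 0.0], [2.0, 1.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, -1.0, 0.0],
     [0.5, 0.5, 0.5], [-2.0, 4.0, 1.0]]

ref = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=2)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print(f"max abs difference vs naive: {max_diff:.2e}")
```

The two functions agree to floating-point precision, including when the last block is ragged. On a GPU the per-block state (`m`, `z`, `acc`) lives in registers and shared memory, which is what removes the N×N round trip to HBM.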
NVLink and multi-GPU topology
NVLink 4 (H100) gives 900 GB/s of bidirectional bandwidth per GPU across the NVSwitch fabric of an 8-GPU DGX H100. A PCIe Gen 5 x16 link, the path from each GPU to the host and NICs, is 128 GB/s bidirectional, roughly 7× slower. This bandwidth gap is why tensor parallelism (TP) is usually kept within a node (where NVLink connects) and pipeline parallelism (PP) spans nodes (where InfiniBand or Ethernet connects). With a ring all-reduce, each GPU moves 2(p − 1)/p times the message size for TP degree p, so all-reduce time scales with tokens × model width × bytes per element × 2(p − 1)/p, divided by link bandwidth. For Llama 3 70B with TP=8 across NVLink, that is tolerable. Across InfiniBand at 400 Gb/s (50 GB/s), it is not; use pipeline parallelism across nodes instead.
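A quick estimate makes the NVLink versus InfiniBand gap concrete. The sketch below models only the bandwidth term of a ring all-reduce, where each GPU sends and receives 2(p − 1)/p of the message. The workload (4096 tokens in flight, BF16 activations) is hypothetical, the per-direction NVLink figure is taken as half of the 900 GB/s bidirectional number, and latency terms and NVSwitch-specific collectives are ignored.

```python
def ring_allreduce_seconds(message_bytes: float, tp_degree: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce.

    Each GPU sends and receives 2 * (p - 1) / p of the message.
    ASSUMPTION: per-hop latency is ignored.
    """
    if tp_degree < 2:
        return 0.0
    traffic = 2 * (tp_degree - 1) / tp_degree * message_bytes
    return traffic / link_bytes_per_s


# Hypothetical workload: 4096 tokens in flight, hidden width 8192
# (the Llama 3 70B hidden size), BF16 activations.
TOKENS, HIDDEN, BYTES_PER_ELEM = 4096, 8192, 2
message = TOKENS * HIDDEN * BYTES_PER_ELEM

NVLINK_PER_DIRECTION = 450e9   # ASSUMPTION: half of 900 GB/s bidirectional
INFINIBAND_400G = 50e9         # 400 Gb/s = 50 GB/s

t_nvlink = ring_allreduce_seconds(message, 8, NVLINK_PER_DIRECTION)
t_ib = ring_allreduce_seconds(message, 8, INFINIBAND_400G)

print(f"NVLink 4:   {t_nvlink * 1e3:.3f} ms per all-reduce")
print(f"InfiniBand: {t_ib * 1e3:.3f} ms per all-reduce")
print(f"slowdown:   {t_ib / t_nvlink:.1f}x")
```

A transformer layer under TP performs two such all-reduces in the forward pass, so the per-all-reduce figure is paid twice per layer per step; that multiplier is what makes the slower link untenable for TP.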
For DealLens serving Llama 3.1 70B, a two-GPU NVLink pair with TP=2 halves per-GPU memory pressure from ~140 GB to ~70 GB while keeping all-reduce latency under 1 ms. For the JHU humanoid, the relevant node is a Jetson AGX Orin (275 TOPS INT8, 64 GB unified memory, 204 GB/s bandwidth) — a radically different roofline from H100, where nearly every inference operation is memory-bound and quantization is not optional.
Cost arithmetic for deployment
Back-of-envelope cost models are the difference between an engineering decision and a guess. H100 on-demand on major clouds runs roughly $2–3/GPU/hour in 2026. Serving Llama 3.1 8B with continuous batching at batch size 64 achieves roughly 3,000 output tokens per second on one H100, or about $0.0002 per 1K tokens at $2/hr. At batch size 1 (memory-bound decode) that drops to ~200 tokens/second, or $0.0028 per 1K, 15× worse. Batching is the most important optimization for cost, not quantization.
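The cost arithmetic is worth scripting so the inputs can be swapped quickly. A minimal sketch, using the hourly price and throughput figures quoted above; the function name is illustrative and the model assumes a single fully utilized GPU.

```python
def cost_per_1k_tokens(gpu_dollars_per_hour: float,
                       tokens_per_second: float) -> float:
    """Dollars per 1,000 output tokens for one fully utilized GPU."""
    dollars_per_second = gpu_dollars_per_hour / 3600.0
    seconds_per_1k = 1000.0 / tokens_per_second
    return dollars_per_second * seconds_per_1k


PRICE = 2.0                                    # $/GPU/hour, from the text
batched = cost_per_1k_tokens(PRICE, 3000.0)    # batch size 64
single = cost_per_1k_tokens(PRICE, 200.0)      # batch size 1

print(f"batch 64: ${batched:.5f} per 1K tokens")
print(f"batch 1:  ${single:.5f} per 1K tokens")
print(f"penalty:  {single / batched:.0f}x")
```

Because the price per second is fixed, the cost ratio between two configurations is simply the inverse of their throughput ratio.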
For the JHU humanoid running GR00T N1.5 on a Jetson AGX Orin: the System 1 DiT must execute at 50 Hz (20 ms budget). At INT8, the DiT forward pass takes roughly 8–12 ms on Orin, leaving 8–12 ms for the rest of the control loop. FP16 is too slow. FP4 is not yet supported on Orin. INT8 quantization with TensorRT is the only path — this is a hard engineering constraint, not a preference.
The B200's FP4 Tensor Cores raise peak dense compute to ~9 PFLOPS but do not help serving throughput unless your workload is compute-bound. For memory-bound decode at small batch sizes, the HBM3e bandwidth increase (from 3.35 to 8 TB/s) is worth more than all the FP4 compute.