From Tokens to Embodied Minds · Drill cards · Chapter 14
Drills
FlashAttention and Triton
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 14, note type = Basic.
| Front | Back |
|---|---|
| Why does naive attention require O(N²) HBM memory? | It materializes the full N×N score matrix S = QKᵀ, and the softmax output P, in HBM between the QKᵀ matmul and the PV matmul. |
| What two scalars does online softmax track per row? | m (running max of logits) and l (running sum of exp(logit - m)). |
| How does FlashAttention-2 cut attention's extra HBM memory from O(N²) to O(N)? | It loads tiles of Q, K, V into SRAM, runs online softmax tile by tile, and writes only the output O and the per-row logsumexp back to HBM, never the N×N intermediate. |
| How much on-chip SRAM does each SM have on H100? | 256 KB of combined L1 cache and shared memory, of which up to 228 KB can be configured as shared memory. |
| What is the Triton @triton.jit decorator's primary role? | It marks a Python function as a Triton kernel and JIT-compiles its block-level (tile) code into GPU machine code for the target architecture when the kernel is first launched. |
| What are the two Hopper features FlashAttention-3 exploits that FA2 cannot use? | The Tensor Memory Accelerator (TMA) for asynchronous global-to-shared-memory copies, and asynchronous warpgroup-level MMA (WGMMA) instructions; FA3 uses warp specialization to overlap the data movement with the matmuls. |
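| What speedup does FlashAttention-2 provide over PyTorch naive attention at long sequences? | Roughly 3–10× in the FlashAttention-2 paper's A100 benchmarks (the original FlashAttention reported about 2–4×); the exact gain depends on sequence length, head dimension, masking, and hardware. |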
| What FlashAttention backward-pass trick avoids storing the N×N matrix? | Recomputing the scores S = QKᵀ and probabilities P tile by tile from Q, K, and the saved per-row logsumexp, rather than storing the materialized softmax matrix. |
| Can FlashAttention-3 run on Ampere (A100) GPUs? | No — FA3 requires Hopper-specific hardware (H100 or later). |
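| What does torch.nn.functional.scaled_dot_product_attention dispatch to on Ampere or newer GPUs? | It selects a fused backend automatically. In PyTorch 2.2+ the FlashAttention backend is FlashAttention-2, used when the inputs qualify (for example fp16/bf16 tensors without an arbitrary attention mask), with the memory-efficient and math backends as fallbacks. |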
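
The online-softmax and tiling cards above can be checked by hand with a minimal NumPy sketch. It is not the FlashAttention kernel: it tiles only K and V (the real kernels also tile Q and keep every tile in SRAM), and the function names and the `block` parameter are hypothetical, chosen for illustration. It shows the per-row running max `m`, the running sum `l`, and the rescaling that lets the output accumulate without ever holding the full N×N matrix.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference implementation: materializes the full N x N score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=16):
    """Online-softmax attention over K/V tiles (illustrative sketch).

    Only an (N, block) score tile exists at any time. Returns the output
    and the per-row logsumexp, the two things the cards say are written back.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)          # running row max of the logits
    l = np.zeros(n)                  # running row sum of exp(logit - m)
    o = np.zeros((n, v.shape[1]))    # unnormalized output accumulator
    for start in range(0, k.shape[0], block):
        kj = k[start:start + block]
        vj = v[start:start + block]
        s = (q @ kj.T) * scale                   # score tile, shape (n, <=block)
        m_new = np.maximum(m, s.max(axis=-1))
        alpha = np.exp(m - m_new)                # rescales old sums; 0 on the first tile
        p = np.exp(s - m_new[:, None])
        l = alpha * l + p.sum(axis=-1)
        o = alpha[:, None] * o + p @ vj
        m = m_new
    return o / l[:, None], m + np.log(l)         # normalized output, logsumexp
```

Comparing `tiled_attention(q, k, v)[0]` against `naive_attention(q, k, v)` with `np.allclose` on random inputs is a quick way to confirm that the rescaling by `alpha` is what keeps the tile-by-tile result exact.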