From Tokens to Embodied Minds  ·  Drill cards · Chapter 14
Drills

FlashAttention and Triton

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

10 cards due for review

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 14, note type = Basic.

FrontBack
Why does naive attention require O(N²) HBM memory?It writes the full N×N attention score matrix to HBM between the QK and softmax-AV matmuls.
What two scalars does online softmax track per row?m (running max of logits) and l (running sum of exp(logit - m)).
How does FlashAttention-2 reduce HBM traffic to O(N)?It tiles Q, K, V within SRAM, runs online softmax tile-by-tile, and writes only the output O and logsumexp back to HBM — never the N×N intermediate.
What SRAM size is available per SM on H100?228 KB configurable L1/shared memory.
What is the Triton @triton.jit decorator's primary role?It JIT-compiles a tile-parallel Python function into warp-level GPU instructions for the target architecture.
What are the two Hopper features FlashAttention-3 exploits that FA2 cannot use?Tensor Memory Accelerator (TMA) for async SRAM loads, and warp-group specialization that pipelines async memory transfers with MMA computation.
What speedup does FlashAttention-2 provide over PyTorch naive attention at long sequences?Roughly 2–4× wall-clock, depending on sequence length and hardware.
What FlashAttention backward-pass trick avoids storing the N×N matrix?Recomputing attention scores from the saved logsumexp scalar rather than storing the materialized softmax matrix.
Can FlashAttention-3 run on Ampere (A100) GPUs?No — FA3 requires Hopper-specific hardware (H100 or later).
What does torch.nn.functional.scaled_dot_product_attention use on Ampere+ with CUDA >= 11.6?It automatically dispatches to FlashAttention-2 when available.