What are Q, K, and V in self-attention and what role does each play?	Q (query): what the current token is looking for. K (key): what each token offers as a match signal. V (value): the actual content retrieved if the match is strong. Output is a weighted sum of V vectors weighted by QK similarity.
Why is causal masking implemented by setting upper-triangle entries to -inf before softmax?	exp(-inf) = 0, so those positions contribute zero weight to the softmax output. After softmax, each token only receives information from itself and earlier tokens — enforcing autoregressive causality.
What is grouped-query attention (GQA) and how does it reduce KV-cache memory?	GQA uses G shared K and V projections for H query heads (G << H). Each K/V group serves H/G query heads. KV-cache memory reduces by H/G since there are G instead of H K and V matrices.
What is the computational complexity of full attention in terms of sequence length N?	O(N^2) in both memory and compute — the full N×N score matrix must be computed and stored.
What is sliding-window attention and why is it insufficient for long VC memo analysis?	Sliding-window attention restricts each token to attend to the W nearest neighbors — O(N*W) instead of O(N^2). Insufficient for DealLens because semantic dependencies between deal terms span the full document, not just local context.
What is multi-query attention (MQA) and why was GQA preferred over it?	MQA uses G=1 — all query heads share a single K and V. Maximally efficient but causes measurable quality degradation. GQA at G=8 recovers ~all memory efficiency with ~no quality loss, making MQA obsolete for new models.
Why do multiple attention heads help representational capacity?	Each head independently attends over a different linear projection of the input, learning to capture different relationships (syntactic, semantic, positional). A single head can only attend to one subspace at a time.
During autoregressive decoding, why is the KV-cache necessary?	Without it, computing attention at each new token requires recomputing K and V for all previous tokens — O(N^2) total per generation. The KV-cache stores all past K and V, reducing per-step cost to O(N) at the cost of O(N*d_model*layers) memory.
How does cross-attention differ from self-attention?	In cross-attention, Q comes from one sequence (e.g., action tokens in a DiT) and K, V come from a different sequence (e.g., VLM embeddings). In self-attention, Q, K, and V all come from the same sequence.
What is the shape of the output of multi-head attention given batch B, seq N, heads H, head dim d_k?	Shape (B, N, H*d_k) = (B, N, d_model). Each head produces (B, N, d_k); after concatenation across H heads and an output projection, the shape returns to (B, N, d_model).
