What does RoPE's relative-position property mean for context extension?	RoPE encodes relative position (pos_q - pos_k) in the QK dot product. This makes it possible to extend context by adjusting the base frequency without destroying all learned position relationships — only the resolution of long-range positions changes.
What is the 'lost in the middle' phenomenon in long-context models?	Retrieval accuracy is highest for content at the beginning and end of the context window and lowest for content in the middle. This applies even to models trained at the full context length (Liu et al., 2023, arXiv:2307.03172).
How does position interpolation extend context length and what is its failure mode?	Scale position indices by L_train/L_new to compress them into the trained range. Failure: every RoPE dimension is compressed uniformly, so the high-frequency dimensions lose resolution. Adjacent positions are mapped to nearly identical indices and the model can no longer tell them apart, which degrades fine-grained, short-range position discrimination.
What is YaRN and how does it improve on position interpolation?	YaRN builds on NTK-aware scaling and adjusts RoPE dimensions selectively (the NTK-by-parts scheme): high-frequency dimensions stay near trained resolution, low-frequency dimensions receive most of the scaling, and attention logits are rescaled by a temperature factor. This preserves short-range precision while extending the coarse long-range position range.
Calculate KV-cache memory for Llama 3 70B at 128K context in BF16 with GQA G=8.	2 (K and V) * 80 layers * 8 kv_heads * 128 head_dim * 131072 seq * 2 bytes = 42,949,672,960 bytes, i.e. ~43 GB (40 GiB) per sequence. More than half of an A100 80GB's HBM is consumed by KV-cache at this context length.
What is ring attention and why does it enable long context without memory explosion on a single GPU?	Ring attention shards the sequence across GPUs — each GPU holds a fraction of the sequence. Attention is computed by circulating K and V in a ring. No single GPU needs to hold the full N×N attention matrix or the full KV-cache.
What is the first KV-cache intervention for a deployed 128K-context model that is exceeding memory budget?	INT8 quantization of the KV-cache (at its simplest, torch.quantize_per_tensor on the cache tensors; finer-grained per-channel or per-token scales generally preserve quality better). This halves KV-cache memory relative to BF16, usually with only a small loss in retrieval quality, so try it before considering architectural changes.
Why is long context not a substitute for retrieval in a VC deal analysis system?	A 10K-memo corpus contains far more tokens than any single context window. Retrieval identifies relevant candidates; long context synthesizes the retrieved candidates. Pushing all 10K memos through a 128K-context model, which takes many separate passes, is roughly 1000x more expensive and adds no quality.
At what context length do current production models (Llama 3, Qwen, DeepSeek) claim to work reliably?	128K tokens is the practical reliability ceiling for most 2024-2025 production models. Models claiming 1M+ context exist, but their accuracy degrades measurably on needle-in-a-haystack evals at lengths outside the training distribution.
For the JHU humanoid, is long context relevant to the inference pipeline?	Indirectly. The VLA (SmolVLA or GR00T N1.5) processes short context (a few frames of visual history plus a task instruction — typically under 2K tokens). Long context is relevant if you want to process multi-camera video histories or long-horizon task plans, but not for per-step policy inference.
