Glossary

84 terms — KL, FlashAttention, MoE, FSDP, vLLM, MCP, NeRF, OpenVLA, GR00T. Each entry links to the chapter where the idea first appears.

A

A2A, Ch 21: Open protocol announced by Google on April 9, 2025 for letting agents from different vendors discover and call each other. Complements MCP (which exposes tools) by exposing whole agents as collaborators.
Activation, Ch 03: The output value(s) of a neural network layer after a non-linear transform (e.g. SwiGLU, GELU). Activations dominate memory during training because they must be cached for the backward pass.
Adam / AdamW, Ch 07: Adaptive moment estimation. AdamW decouples weight decay from the gradient update and is the de-facto optimizer for transformer pretraining. Stores two extra state tensors per parameter (m, v), which dominates optimizer-state memory.
Adapter, Ch 07: Parameter-efficient fine-tuning approach where small bottleneck layers are inserted between frozen transformer blocks. LoRA is the most common modern adapter.
Attention, Ch 04: The QKV operation: each token's output is a weighted sum of the value vectors of every token it may attend to, with weights given by softmax(QKᵀ/√d). The single most expensive operation in transformer inference for long contexts.
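A minimal sketch of softmax(QKᵀ/√d)·V in plain Python may make the Attention entry concrete. It assumes no masking and a single head; the function names are illustrative and not taken from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors (no mask)."""
    d = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output row is the weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A zero query gives uniform weights, so the output is simply the mean of the value vectors.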
Autograd, Ch 03: PyTorch's tape-based reverse-mode autodiff. Builds a dynamic computation graph during the forward pass and replays it backward to compute gradients via VJPs.

B

B200, Ch 13: NVIDIA's Blackwell-generation flagship. Roughly 2.5× H100 for FP8 training; ships in 8-GPU HGX/DGX nodes and, as GB200, in 72-GPU NVL72 racks linked by 5th-gen NVLink. The reference platform for 2025-era frontier training.
BF16, Ch 01: 16-bit format with the same exponent range as FP32. Dominant numeric type for transformer training because it preserves dynamic range and avoids most loss-scaling complications.
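One way to see why BF16 keeps FP32's range: a BF16 value is the top 16 bits of the FP32 bit pattern (1 sign, 8 exponent, 7 mantissa bits). The sketch below converts by truncation, which is an assumption made for brevity; real hardware typically rounds to nearest-even.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 encoding (truncation, not rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bf16_bits(b: int) -> float:
    """Re-expand 16 BF16 bits to an FP32 value by zero-filling the low bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

def bf16_round_trip(x: float) -> float:
    return from_bf16_bits(to_bf16_bits(x))
```

Values near 3e38 survive the round trip (FP16 would overflow at 65504), while small differences such as 1.001 versus 1.0 are lost because only 7 mantissa bits remain.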
BPE, Ch 06: Greedy merge algorithm that builds a subword vocabulary from frequency statistics. Modern variants are byte-level (GPT-2/Llama style), giving any-byte coverage at the cost of long sequences for non-Latin scripts.
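The greedy merge can be sketched as a single training step: count adjacent symbol pairs weighted by word frequency, then fuse the most frequent pair everywhere. The toy corpus format (a dict from symbol tuples to counts) and the helper names are assumptions for illustration; production tokenizers work on bytes and add pre-tokenization rules.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to that word's corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged
```

Repeating these two calls until the vocabulary reaches the target size yields the merge table.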

C

Chinchilla scaling, Ch 09: DeepMind's 2022 finding that, for a fixed FLOP budget, model size and training tokens should scale roughly equally, at about 20 tokens per parameter. Corrected the over-large/under-trained Kaplan recipe.
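As a rough worked example of the 20-tokens-per-parameter rule, the sketch below splits a FLOP budget between parameters and tokens. It assumes the common approximation C ≈ 6·N·D for training FLOPs, which is not stated in this entry.

```python
import math

def chinchilla_optimal(flops, tokens_per_param=20.0):
    """Return (params, tokens) with tokens = ratio * params and C = 6 * N * D.

    The 6*N*D cost model is an assumption; it ignores attention FLOPs.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens
```

For a budget of 5.88e23 FLOPs this gives about 70B parameters and 1.4T tokens.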
Cross-entropy, Ch 02: −Σ p log q. Equivalent to compression: the average number of bits a model spends to predict the next token. Pretraining loss is exactly cross-entropy on the next-token distribution.
CUDA, Ch 13: C++ extension and runtime for programming NVIDIA GPUs. Modern ML rarely writes CUDA directly: Triton, CUTLASS, compilers such as torch.compile, and libraries such as cuDNN supply the kernels. Reading CUDA is still essential for systems work.

D

Diffusion policy, Ch 30: Class of imitation-learning policies (Chi et al. 2023) that model the action distribution as the reverse of a noising process. Strong on multimodal demonstrations; foundation for π0/π0.5 and many VLA action heads.
Distillation, Ch 11: Knowledge transfer technique where a smaller model is trained on the soft probability distribution (or hidden states) of a larger one. KL divergence is the canonical objective. Used for Llama 3.2 1B/3B (with Llama 3.1 8B/70B as teachers) and most production small models.
DPO, Ch 16: Rafailov et al. 2023. RLHF reformulated as a closed-form classification loss over preference pairs, eliminating the explicit reward model and PPO loop. Far simpler to train; dominant alignment recipe through 2024.
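The classification loss can be sketched for one preference pair as −log σ(β·[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). The argument names and the default β = 0.1 are illustrative assumptions; inputs are summed sequence log-probabilities.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair of completions."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

When the policy equals the reference the margin is zero and the loss is log 2; raising the chosen completion's log-probability relative to the reference lowers it.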

E

Embodied AI, Ch 25: Subfield studying agents whose policies must close the perception-action loop in physical or simulated environments. Combines vision, control, planning, and increasingly large pretrained models.
Eval, Ch 20: Any quantitative measurement of model quality. Modern evals span capability (MMLU, GSM8K), agentic (SWE-bench, GAIA), safety (HarmBench), and embodied (LIBERO, RoboArena). Building good evals is half of LLM engineering.
Expert Parallelism (EP), Ch 08: Parallelism strategy unique to mixture-of-experts: each GPU holds a subset of experts and routes tokens to whichever device owns the expert they need. Required for DeepSeek-V3-scale MoE training.

F

FlashAttention, Ch 14: Dao et al. 2022/2023. Tiles the QKV computation in SRAM to avoid materializing the N² attention matrix in HBM. Makes long-context training and inference feasible. FlashAttention-3 is the current frontier on Hopper/Blackwell.
FP8, Ch 11: Hopper-and-newer numeric format. E4M3 for weights/activations, E5M2 for gradients. Used in DeepSeek-V3 training and most 2025 frontier runs to roughly halve compute and memory versus BF16.
FSDP, Ch 15: PyTorch's sharded data parallelism. Each rank holds a slice of every parameter, gradient, and optimizer state, all-gathering on demand. The default for non-3D-parallel training up to ~70B parameters.

G

Gaussian Splatting (3DGS), Ch 27: Kerbl et al. SIGGRAPH 2023 (Aug 8, 2023). Replaces NeRF's volumetric MLP with explicit 3D Gaussians, enabling real-time rendering at >100 fps and faster training. Now the dominant 3D scene format for robotics.
GQA, Ch 04: Compromise between multi-head and multi-query attention. Several query heads share one K/V head. Cuts KV-cache memory by 4–8× with negligible quality loss. Used in Llama 2 70B onward and most modern LLMs.
GR00T, Ch 33: Project unveiled at GTC 2024. GR00T N1.5 (June 11, 2025) is the open VLA backbone for humanoid manipulation, trained on a mix of real, synthetic, and human-video data. Designed to run on Jetson Thor.
Gradient Checkpointing, Ch 03: Recompute selected forward activations during the backward pass instead of caching them all. Cuts activation memory from O(N) to roughly O(√N) for N layers at the cost of about one extra forward pass. Standard for large-context training.
GRPO, Ch 16: DeepSeek-Math 2024. Replaces PPO's value function with the mean reward of a group of sampled completions, dramatically reducing memory. Used in DeepSeek-R1 and most reasoning-RL recipes.
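The group baseline can be sketched as follows: sample several completions for one prompt, then score each against the group mean. Dividing by the group standard deviation is an added assumption here (a common normalization), as is the `eps` guard.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

No learned value network is involved; the only state needed is the list of rewards for the group.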

H

H100, Ch 13: Hopper-generation GPU with 80 GB HBM3, FP8 Transformer Engine, and 4th-gen NVLink. The workhorse of 2023–24 frontier training; superseded by B200 but still the global majority of installed capacity.
HBM, Ch 13: Stacked DRAM connected to the GPU through a silicon interposer. H100 ships HBM3 (~3 TB/s); B200 ships HBM3e (~8 TB/s). Memory bandwidth, not FLOPs, is the binding constraint for inference.

I

Imitation learning, Ch 30: Supervised learning on (state, expert action) pairs. Modern variants (diffusion policies, ACT, VLAs) are the dominant recipe for manipulation because RL in the real world is too slow.
Inference, Ch 17: The forward-pass-only deployment phase. Modern LLM inference is bandwidth-bound during decode and FLOP-bound during prefill, which motivates KV-caches, paged attention, and disaggregated serving.
Isaac Lab, Ch 29: Open-source RL framework on top of Isaac Sim, replacing IsaacGym. Runs thousands of parallel envs on a single GPU; used for sim-to-real RL on Unitree, Boston Dynamics, and Figure humanoids.

J

JIT compilation, Ch 14: Compiling code at runtime once shapes/dtypes are known. Triton, torch.compile, and JAX all use JIT to specialize kernels for the actual problem size, often matching and sometimes beating hand-tuned CUDA.

K

KL divergence, Ch 02: Σ p log(p/q). Asymmetric measure of how much information is lost when q approximates p. Appears in RLHF (KL-to-reference penalty), DPO, distillation, and variational inference.
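A small numeric sketch ties this entry to cross-entropy: H(p, q) = H(p) + KL(p‖q), and KL(p‖q) generally differs from KL(q‖p). Natural logs are used, so results are in nats; the function names are illustrative.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum p * log(p / q); terms with p == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

With p = [0.9, 0.1] and q = [0.5, 0.5], the two directions of KL give different values, which is the asymmetry the entry refers to.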
KV-cache, Ch 19: During autoregressive decoding, K and V for each prior token are reused unchanged, so they are cached. KV-cache size, linear in sequence length and in batch size, dominates inference memory for long contexts.
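A back-of-the-envelope sizing formula for the KV-cache: two tensors (K and V) per layer, each holding one head_dim vector per KV head, per token, per sequence in the batch. The configuration in the usage note is hypothetical and not tied to a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    """Total KV-cache size in bytes; bytes_per_elem=2 assumes FP16/BF16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
```

For 32 layers, 8 KV heads, head_dim 128, 8192 tokens, and batch 1, this comes to exactly 1 GiB, and it doubles when either the sequence length or the batch doubles.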

L

LangGraph, Ch 22: LangChain's stateful, graph-structured runtime for multi-step agents. Each node is a function or LLM call; edges are conditional routes. Better than chains for cyclic, retry-heavy, or human-in-the-loop flows.
LeRobot, Ch 34: Open-source library and dataset hub for real-world robot learning. SmolVLA-450M (June 3, 2025) is its flagship VLA, designed to run on a laptop. Hardware: SO-100/101 arms.
llm-d, Ch 18: Red Hat / Google / IBM project (introduced May 20, 2025). Kubernetes-native, vLLM-based disaggregated serving with prefix-aware routing and prefill/decode separation across pods.
LoRA / QLoRA, Ch 07: Hu et al. 2021. Trains two small low-rank matrices ΔW = BA instead of full weights. QLoRA adds 4-bit base weights so 65B models fine-tune on a single 48 GB GPU.
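The ΔW = BA update can be sketched without materializing ΔW: compute W·x with the frozen weights and add B·(A·x). The `scale` argument stands in for the usual alpha/rank factor and, like the helper names, is an assumption for illustration.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))   # A: rank x d_in, B: d_out x rank
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_counts(d_out, d_in, rank):
    """(full-weight parameters, LoRA trainable parameters) for one matrix."""
    return d_out * d_in, rank * (d_in + d_out)
```

For a 4096×4096 projection at rank 8, the trainable count drops from 16,777,216 to 65,536. Initializing B to zeros makes the adapted layer start out identical to the base layer.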

M

MCP, Ch 21: Anthropic's open protocol (released Nov 25, 2024) for exposing tools and data sources to LLMs in a uniform way. The closest thing to USB-C for AI agents; adopted across Claude, OpenAI, and the wider ecosystem.
Memory Wall, Ch 10: Decode-time inference bottleneck: GPU FLOPs grow ~3× per generation, HBM bandwidth grows <2×. Most decode steps are now bandwidth-bound, which is why batching, GQA, MQA, and quantization matter so much.
MoE, Ch 08: Sparse architecture where a router selects k of N expert FFNs per token. Active parameters ≪ total parameters, decoupling capacity from inference cost. Dominant frontier architecture (DeepSeek-V3, Mixtral, GPT-4-class).
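Top-k routing for a single token can be sketched as: pick the k largest router logits, softmax over just those, and combine the chosen experts' outputs. Renormalizing over the selected experts is one convention among several and is an assumption here; the experts are stand-in callables.

```python
import math

def top_k_route(router_logits, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the k selected experts and mix their outputs by gate."""
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out
```

Only k expert functions run per token, which is why active parameters can be far smaller than total parameters.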
MuJoCo, Ch 29: Fast contact-rich rigid-body simulator originally by Emo Todorov; now open-source under DeepMind. Standard for RL research and many sim-to-real pipelines, alongside Isaac Sim and PyBullet.

N

NeRF, Ch 27: Mildenhall et al. 2020. Represents a 3D scene as an MLP from (x,y,z,θ,φ) to (RGB,density), rendered by ray marching. Largely superseded by 3D Gaussian Splatting for real-time use cases.
NVLink, Ch 13: Direct GPU↔GPU links; 900 GB/s on H100, 1.8 TB/s on B200. NVLink Switch fabrics scale this to 72-GPU domains (NVL72), enabling tensor parallelism across many GPUs without crossing PCIe.

O

OpenVLA, Ch 31: Kim et al. June 2024. 7B Llama-2 backbone with SigLIP+DINOv2 vision encoder, fine-tuned on 970k Open X-Embodiment trajectories. The reference open VLA before π0 / GR00T.
Optimizer state, Ch 15: For AdamW, two FP32 buffers (m, v) per parameter at 4 bytes each (8 bytes total), plus 4 bytes for the master FP32 weights. Optimizer state is the largest single contributor to training memory after activations.
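The per-parameter arithmetic can be written out as a small table. The sketch assumes BF16 weights and gradients (2 bytes each) alongside FP32 AdamW state, a common mixed-precision layout; activations are excluded.

```python
def adamw_bytes_per_param(weight_bytes=2, grad_bytes=2):
    """Training-state bytes per parameter under mixed-precision AdamW."""
    return {
        "weights": weight_bytes,   # BF16 working copy (assumed)
        "grads": grad_bytes,       # BF16 gradients (assumed)
        "adam_m": 4,               # FP32 first moment
        "adam_v": 4,               # FP32 second moment
        "master_fp32": 4,          # FP32 master weights
    }

def training_state_bytes(n_params):
    return n_params * sum(adamw_bytes_per_param().values())
```

That is 16 bytes per parameter, so a 7B-parameter model needs about 112 GB before any activation memory.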

P

Paged Attention, Ch 17: vLLM's contribution. Stores KV-cache in fixed-size blocks instead of contiguously, enabling near-zero memory fragmentation, prefix sharing, and continuous batching. The biggest single win in 2023-era LLM serving.
Pipeline Parallelism, Ch 15: Different layer ranges live on different GPUs; micro-batches flow through the pipeline. Used together with tensor parallelism in 3D-parallel training. GPipe and 1F1B are the canonical schedules.
PPO, Ch 16: Schulman et al. 2017. Clipped policy-gradient algorithm; the original RLHF workhorse. Still used in some labs but largely supplanted by DPO (offline) and GRPO (reasoning) for LLM alignment.
Prefill / Decode, Ch 18: Prefill processes the prompt in parallel (FLOP-bound); decode emits one token at a time (bandwidth-bound). Disaggregated serving (llm-d, Mooncake) splits these across separate GPU pools.
Pretraining, Ch 07: The bulk of compute cost in an LLM's life. Modern pipeline: FineWeb-Edu / DCLM / The Stack v2 → BPE tokenization → 1–10T tokens → BF16 + FP8 mixed-precision training over thousands of GPUs.

Q

Quantization, Ch 11: Mapping FP16/BF16 tensors to lower-bit representations (INT8, INT4, FP8). GPTQ, AWQ, and SmoothQuant are the standard post-training methods; W4A16 (4-bit weights, 16-bit activations) is the inference default.
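The simplest instance of this mapping is symmetric absmax INT8 with round-to-nearest, sketched below. It is a baseline for intuition only, not GPTQ, AWQ, or SmoothQuant, and it uses one scale per tensor where real kernels usually use per-channel or per-group scales.

```python
def quantize_int8(xs):
    """Symmetric absmax quantization: returns (int8 codes, float scale)."""
    scale = max(abs(x) for x in xs) / 127.0 or 1.0   # avoid a zero scale
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```

The reconstruction error of each element is bounded by half the scale, which is why a single outlier (a large absmax) hurts every other value in the tensor.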

R

RAG, Ch 20: Pattern where retrieval over an external corpus precedes generation. Modern advanced RAG includes hybrid search, rerankers, query rewriting, and graph-traversal, and is increasingly replaced by long-context + tool use.
Reasoning model, Ch 09: Inference-time-scaling regime kicked off by OpenAI o1 (Sept 2024) and DeepSeek-R1 (Jan 2025). Trained with RL on verifiable rewards; pays more tokens for harder problems.
Red teaming, Ch 24: Structured probing for jailbreaks, prompt injection, harmful outputs, and bias. Modern programs combine human red teamers, automated attacks (PAIR, GCG), and benchmark suites (HarmBench, AgentDojo).
Ring Attention, Ch 10: Liu, Zaharia, Abbeel 2023. KV blocks rotate around a ring of GPUs so each device only ever holds a slice of the sequence. Enables million-token context training without per-GPU memory blowup.
RLHF, Ch 16: Original alignment recipe (Ouyang et al. 2022): SFT → reward model from preference pairs → PPO against that reward. Largely replaced by DPO/GRPO but still the conceptual frame.
RMSNorm, Ch 05: Zhang & Sennrich 2019. Drops the mean-subtraction step from LayerNorm, leaving only RMS scaling. Cheaper and equally effective; standard in Llama, Mistral, and most modern transformer blocks.
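RMS scaling fits in a few lines: divide each element by the root-mean-square of the vector, then apply a learned per-channel gain. The `eps` value and its placement inside the square root are assumptions that vary between implementations.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """y_i = gain_i * x_i / sqrt(mean(x^2) + eps); no mean subtraction."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [g * xi / rms for g, xi in zip(gain, x)]
```

With unit gain, the output has a mean square of approximately 1 regardless of the input scale.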
RoPE, Ch 05: Su et al. 2021. Encodes position by rotating Q and K vectors in pairs by an angle proportional to position. Generalizes better to longer contexts than learned absolute embeddings; YaRN extends it further.
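The rotation can be sketched on a single vector. This version pairs adjacent dimensions (2i, 2i+1) and uses base 10000, both of which are assumptions; some implementations pair dimension i with i + d/2 instead.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):          # i is already 2 * pair_index
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))
```

Because each pair is rotated by an angle proportional to position, the dot product of a rotated query and key depends only on their position difference: dot(rope(q, 5), rope(k, 3)) equals dot(rope(q, 12), rope(k, 10)).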
RT-2, Ch 31: First major VLA: a PaLI-X / PaLM-E backbone fine-tuned to emit discretized action tokens. Established the recipe (co-train on web VQA + robot trajectories) that OpenVLA, π0, and GR00T all follow.

S

SAM 2, Ch 26: Meta FAIR (Aug 1, 2024). Extends SAM to video by adding a memory-conditioned mask decoder. Universal segmentation/tracker that has rapidly become the default front-end for robot perception.
Scaling laws, Ch 09: Empirical observation that loss falls predictably with parameters, data, and compute. Kaplan 2020 and Chinchilla 2022 are the canonical references; inference-time scaling (o1) is the 2024 extension.
Self-attention, Ch 04: The transformer's core operation: each token computes Q, K, V from its own embedding and attends across the whole sequence. Distinct from cross-attention (decoder attending to encoder).
SFT, Ch 16: Phase that follows pretraining: train on curated instruction/response pairs with cross-entropy. Cheap, fast, and the foundation for everything alignment does on top.
Sim-to-Real, Ch 29: Closing the reality gap with domain randomization, system identification, real-data fine-tuning, or learned actuator models. Now the dominant recipe for humanoid locomotion and dexterous manipulation.
SmolVLA, Ch 32: SmolVLA-450M (June 3, 2025). Built on SmolLM2 + SigLIP, trained on community LeRobot data. Designed to run on a single consumer GPU; the most accessible entry point to VLA research.
Speculative decoding, Ch 19: A small draft model proposes k tokens; the big model verifies them in one parallel forward pass. 2–3× speedup on most workloads with no quality loss. Medusa, EAGLE, and lookahead decoding are common variants.
SwiGLU, Ch 05: Shazeer 2020. Replaces the FFN's ReLU/GELU activation with a Swish-gated linear unit, giving a small quality gain at matched compute. Used in Llama, Mistral, Qwen, and effectively every modern open transformer.

T

Tensor Parallelism (TP), Ch 15: Megatron-LM 1D split: each GPU holds part of W and computes its slice of XW, then all-reduces. Standard inside an 8-GPU node; combined with PP and DP for 3D parallelism.
Tokenization, Ch 06: Subword segmentation (BPE, Unigram, byte-level). Choice of tokenizer determines sequence lengths, vocabulary size, and the model's behavior on numbers, code, and non-Latin scripts.
Transformer block, Ch 05: Pre-norm RMSNorm → multi-head attention with RoPE+GQA → residual → RMSNorm → SwiGLU FFN → residual. Effectively unchanged since 2022 across all open frontier models; almost all progress is in scale, data, and post-training.
Triton, Ch 14: OpenAI's tile-based language for writing CUDA-class kernels in Python. Used for FlashAttention, fused MoE, and custom quantization kernels. Compiled JIT through MLIR; the practical alternative to writing CUDA.
TRT-LLM, Ch 17: NVIDIA's highly tuned C++/CUDA inference engine. Faster than vLLM on NVIDIA-only deployments, at the cost of compile times and less flexibility. Often paired with Triton Inference Server.

V

VAE, Ch 30: Kingma & Welling 2013. Probabilistic encoder/decoder pair trained with reconstruction + KL regularization. Backbone of latent diffusion (Stable Diffusion) and many robot world models.
Vector database, Ch 20: Stores high-dimensional embeddings and serves approximate nearest-neighbor queries (HNSW, IVF-PQ). Used as the retrieval back-end of RAG. Examples: Pinecone, Weaviate, pgvector, Qdrant.
VJP, Ch 03: Reverse-mode AD computes ∂L/∂x = vᵀ J for each op without ever materializing J. Every PyTorch operation registers a VJP rule (its `backward`).
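A hand-rolled sketch of two VJP rules shows the pattern: each forward function returns its output together with a closure that maps the upstream vector v to vᵀJ. This is a toy in plain Python, not PyTorch's actual registration mechanism; all names are illustrative.

```python
import math

def tanh_forward(x):
    y = [math.tanh(xi) for xi in x]
    def vjp(v):
        # J = diag(1 - y^2); multiply elementwise instead of building J.
        return [vi * (1.0 - yi * yi) for vi, yi in zip(v, y)]
    return y, vjp

def sum_forward(y):
    def vjp(v):
        # J is a row of ones, so the scalar v is broadcast to every input.
        return [v] * len(y)
    return sum(y), vjp

def grad_sum_tanh(x):
    """Gradient of L = sum(tanh(x)) by chaining VJPs in reverse order."""
    y, vjp_tanh = tanh_forward(x)
    loss, vjp_sum = sum_forward(y)
    return loss, vjp_tanh(vjp_sum(1.0))
```

The backward pass seeds v = 1.0 at the loss and applies the stored closures in reverse, which is the replay step a tape-based autodiff performs.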
VLA, Ch 31: Class of robot foundation models that map (image, language instruction) → action tokens. Started with RT-2 (2023); current open frontier is OpenVLA, π0/π0.5, GR00T N1.5, and SmolVLA.
vLLM, Ch 17: UC Berkeley project that introduced paged attention and popularized continuous batching and prefix caching. The de-facto open-source LLM server; the substrate of llm-d and most cloud inference.

W

Weight Decay, Ch 07: Adds a λ‖θ‖² term to the loss or, equivalently for plain SGD, shrinks θ each step. With Adam the two forms differ, so AdamW applies the shrinkage directly to the weights, decoupled from the gradient update. Standard λ for transformers: ~0.1.
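A scalar sketch of one AdamW step shows where the decoupling happens: the decay multiplies the weight directly and never passes through the m and v moment estimates. The hyperparameter defaults are illustrative assumptions.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar parameter; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * weight_decay * theta            # decoupled decay
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # Adam update
    return theta, m, v
```

With a zero gradient the weight still shrinks by the factor (1 − lr·λ), whereas an L2 term folded into the gradient would be rescaled by Adam's adaptive denominator.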
World Model, Ch 33: Neural simulator that predicts future states given actions. DreamerV3, Genie, GAIA-2, and NVIDIA Cosmos are recent examples. Increasingly central to embodied learning as scarce real data is augmented with imagined rollouts.

Y

YaRN, Ch 10: Peng et al. 2023. NTK-aware RoPE rescaling that extends a model's context window past its training length with a short fine-tune. Used in Mistral, Qwen, and many long-context Llama derivatives.

Z

ZeRO, Ch 15: DeepSpeed's three-stage memory partitioning scheme: shard optimizer states (Z1), gradients (Z2), parameters (Z3). FSDP is essentially ZeRO-3 in PyTorch. Foundational for training models that don't fit on one GPU.
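The effect of each stage on per-GPU memory can be estimated with simple arithmetic. The sketch assumes 2 bytes per parameter for weights, 2 for gradients, and 12 for AdamW state plus master weights, and it ignores activations and communication buffers.

```python
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    """Model-state bytes per GPU for ZeRO stage 0 (none) through 3."""
    weights = 2 * n_params    # BF16/FP16 weights (assumed)
    grads = 2 * n_params      # BF16/FP16 gradients (assumed)
    opt = 12 * n_params       # FP32 m, v, and master weights
    if stage >= 1:
        opt /= n_gpus         # Z1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus       # Z2: also shard gradients
    if stage >= 3:
        weights /= n_gpus     # Z3: also shard parameters
    return weights + grads + opt
```

For 7.5B parameters on 64 GPUs this gives roughly 120 GB with no sharding, about 31.4 GB at stage 1, and about 1.9 GB at stage 3.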
Zero-shot, Ch 12: Asking a model to do a task purely from instructions, without demonstrations. Distinct from few-shot (k examples in the prompt) and fine-tuned. The standard mode for instruction-tuned LLMs.

Π

π0 / π0.5, Ch 32: Pi-zero (Oct 2024) and π0.5 (April 22, 2025, arXiv:2504.16054). Flow-matching action heads on top of a PaliGemma-style VLM; π0.5 generalizes to unseen homes and long-horizon mobile manipulation.