Reference

Glossary

84 terms — KL, FlashAttention, MoE, FSDP, vLLM, MCP, NeRF, OpenVLA, GR00T. Each entry links to the chapter where the idea first appears.

A

A2A · Ch 21
Open protocol announced by Google on April 9, 2025 for letting agents from different vendors discover and call each other. Complements MCP (which exposes tools) by exposing whole agents as collaborators.
Activation · Ch 03
The output value(s) of a neural network layer after a non-linear transform (e.g. SwiGLU, GELU). Activations dominate memory during training because they must be cached for the backward pass.
Adam / AdamW · Ch 07
Adaptive moment estimation. AdamW decouples weight decay from the gradient update and is the de-facto optimizer for transformer pretraining. Stores two extra state tensors per parameter (m, v), which dominates optimizer-state memory.
Adapter · Ch 07
Parameter-efficient fine-tuning approach where small bottleneck layers are inserted between frozen transformer blocks. LoRA is the most common modern adapter.
Attention · Ch 04
The QKV operation: each token's output is a weighted sum of the value vectors of the tokens it attends to (itself included), with weights softmax(QKᵀ/√d), where d is the per-head key dimension. The single most expensive operation in transformer inference for long contexts.
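The formula in this entry can be sketched in a few lines of plain Python. This is a minimal single-head version without masking; the function names and the toy Q, K, V values are illustrative assumptions, not taken from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors (one head, no mask)."""
    d = len(K[0])
    out = []
    for q in Q:
        # One score per key: q . k / sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output row is the weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy inputs (hypothetical): two tokens, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Each output row is a convex combination of the rows of V, pulled toward the value whose key best matches the query.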
Autograd · Ch 03
PyTorch's tape-based reverse-mode autodiff. Builds a dynamic computation graph during the forward pass and replays it backward to compute gradients via VJPs.

B

B200 · Ch 13
NVIDIA's Blackwell-generation flagship. Roughly 2.5× H100 for FP8 training. Ships in 8-GPU HGX/DGX B200 nodes and, paired with Grace CPUs as GB200, in 72-GPU NVL72 racks linked by 5th-gen NVLink. The reference platform for 2025-era frontier training.
BF16 · Ch 01
16-bit format with the same exponent range as FP32. Dominant numeric type for transformer training because it preserves dynamic range and avoids most loss-scaling complications.
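One way to see the trade-off is to emulate BF16 by keeping only the top 16 bits of an FP32 value. The sketch below truncates rather than rounds, which is a simplification (hardware typically rounds to nearest); `to_bf16` is a hypothetical helper name.

```python
import struct

def to_bf16(x):
    """Emulate BF16 by zeroing the low 16 bits of the FP32 encoding.

    Keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits.
    Simplification: truncates instead of rounding to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

big = to_bf16(3.0e38)            # still finite: same exponent range as FP32
fine = to_bf16(1.0 + 2.0 ** -10) # the small increment is lost: only 7 mantissa bits
```

Large magnitudes survive with under 1% relative error, while small relative differences near 1.0 are dropped.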
BPE · Ch 06
Greedy merge algorithm that builds a subword vocabulary from frequency statistics. Modern variants are byte-level (GPT-2/Llama style), giving any-byte coverage at the cost of long sequences for non-Latin scripts.

C

Chinchilla scaling · Ch 09
DeepMind's 2022 finding that, for a fixed FLOP budget, model size and training tokens should scale roughly equally — about 20 tokens per parameter. Corrected the over-large/under-trained Kaplan recipe.
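As a back-of-the-envelope sketch, the 20-tokens-per-parameter rule can be turned into a token and compute budget. The FLOP estimate uses the common approximation C ≈ 6·N·D, which is an assumption added here and not part of the entry above; `chinchilla_budget` is a hypothetical helper.

```python
def chinchilla_budget(n_params, tokens_per_param=20):
    """Return (training tokens, approximate training FLOPs) for a model size.

    Assumes ~20 tokens per parameter and the rule of thumb FLOPs ~= 6 * N * D.
    """
    tokens = tokens_per_param * n_params
    flops = 6 * n_params * tokens
    return tokens, flops

tokens, flops = chinchilla_budget(70 * 10**9)  # a 70B-parameter model
```

For 70B parameters this gives about 1.4T tokens, which matches the scale of the Chinchilla model itself.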
Cross-entropy · Ch 02
−Σ p log q. Equivalent to compression: the average number of bits a model spends to predict the next token. Pretraining loss is exactly cross-entropy on the next-token distribution.
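A small sketch of the compression reading: with a base-2 logarithm, cross-entropy is the number of bits the model q spends per symbol drawn from p. The distributions below are made-up examples.

```python
import math

def cross_entropy_bits(p, q):
    """-sum p_i * log2(q_i); terms with p_i == 0 contribute nothing."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0, 0.0]            # one-hot target: the true next token
q_uniform = [0.25, 0.25, 0.25, 0.25]
q_better = [0.5, 0.25, 0.125, 0.125]

bits_uniform = cross_entropy_bits(p, q_uniform)  # 2.0 bits
bits_better = cross_entropy_bits(p, q_better)    # 1.0 bit
```

Putting more probability on the correct token lowers the bit cost, which is what next-token pretraining optimizes.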
CUDA · Ch 13
C++ extension and runtime for programming NVIDIA GPUs. Modern ML rarely writes CUDA directly: kernels come from Triton, from libraries such as CUTLASS and cuDNN, or from compiler stacks such as torch.compile. Reading CUDA is still essential for systems work.

D

Diffusion policy · Ch 30
Class of imitation-learning policies (Chi et al. 2023) that model the action distribution as the reverse of a noising process. Strong on multimodal demonstrations; foundation for π0/π0.5 and many VLA action heads.
Distillation · Ch 11
Knowledge transfer technique where a smaller model is trained on the soft probability distribution (or hidden states) of a larger one. KL divergence is the canonical objective. Used for Llama 3.2 1B/3B (with Llama 3.1 8B/70B as teachers) and most production small models.
DPO · Ch 16
Rafailov et al. 2023. RLHF reformulated as a closed-form classification loss over preference pairs, eliminating the explicit reward model and PPO loop. Far simpler to train; dominant alignment recipe through 2024.

E

Embodied AI · Ch 25
Subfield studying agents whose policies must close the perception-action loop in physical or simulated environments. Combines vision, control, planning, and increasingly large pretrained models.
Eval · Ch 20
Any quantitative measurement of model quality. Modern evals span capability (MMLU, GSM8K), agentic (SWE-bench, GAIA), safety (HarmBench), and embodied (LIBERO, RoboArena). Building good evals is half of LLM engineering.
Expert Parallelism (EP) · Ch 08
Parallelism strategy unique to mixture-of-experts: each GPU holds a subset of experts and routes tokens to whichever device owns the expert they need. Required for DeepSeek-V3-scale MoE training.

F

FlashAttention · Ch 14
Dao et al. 2022/2023. Tiles the attention computation so that blocks of Q, K, and V are processed in on-chip SRAM and the N×N attention matrix is never materialized in HBM. Makes long-context training and inference feasible. FlashAttention-3 is the Hopper-optimized version.
FP8 · Ch 11
Hopper-and-newer numeric format. E4M3 for weights/activations, E5M2 for gradients. Used in DeepSeek-V3 training and most 2025 frontier runs to roughly halve compute and memory versus BF16.
FSDP · Ch 15
PyTorch's sharded data parallelism. Each rank holds a slice of every parameter, gradient, and optimizer state, all-gathering on demand. The default for non-3D-parallel training up to ~70B parameters.

G

Gaussian Splatting (3DGS) · Ch 27
Kerbl et al. SIGGRAPH 2023 (Aug 8, 2023). Replaces NeRF's volumetric MLP with explicit 3D Gaussians, enabling real-time rendering at >100 fps and faster training. Now the dominant 3D scene format for robotics.
GQA · Ch 04
Compromise between multi-head and multi-query attention. Several query heads share one K/V head. Cuts KV-cache memory by 4–8× with negligible quality loss. Used in Llama 2 70B onward and most modern LLMs.
GR00T · Ch 33
Project unveiled at GTC 2024. GR00T N1.5 (June 11, 2025) is the open VLA backbone for humanoid manipulation, trained on a mix of real, synthetic, and human-video data. Designed to run on Jetson Thor.
Gradient Checkpointing · Ch 03
Recompute selected forward activations during the backward pass instead of caching them all. With a checkpoint every ~√N layers, activation memory drops from O(N) to O(√N) for N layers, at the cost of roughly one extra forward pass. Standard for large-context training.
GRPO · Ch 16
DeepSeek-Math 2024. Replaces PPO's value function with the mean reward of a group of sampled completions, dramatically reducing memory. Used in DeepSeek-R1 and most reasoning-RL recipes.
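A minimal sketch of the group baseline: each completion's advantage is its reward relative to the other completions sampled for the same prompt. Dividing by the group standard deviation follows the usual description of GRPO but should be read as an assumption here; the function name and reward values are illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled completion relative to its own group.

    No learned value function: the baseline is simply the group mean.
    Normalizing by the group std is an assumption of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by a verifiable 0/1 reward.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantages and incorrect ones negative, and the advantages sum to roughly zero within the group.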

H

H100 · Ch 13
Hopper-generation GPU with 80 GB HBM3, FP8 Transformer Engine, and 4th-gen NVLink. The workhorse of 2023–24 frontier training; superseded by B200 but still the global majority of installed capacity.
HBM · Ch 13
Stacked DRAM connected to the GPU through a silicon interposer. H100 ships HBM3 (~3 TB/s); B200 ships HBM3e (~8 TB/s). Memory bandwidth — not FLOPs — is the binding constraint for inference.

I

Imitation learning · Ch 30
Supervised learning on (state, expert action) pairs. Modern variants — diffusion policies, ACT, VLAs — are the dominant recipe for manipulation because RL in the real world is too slow.
Inference · Ch 17
The forward-pass-only deployment phase. Modern LLM inference is bandwidth-bound during decode and FLOP-bound during prefill, which motivates KV-caches, paged attention, and disaggregated serving.
Isaac Lab · Ch 29
Open-source RL framework on top of Isaac Sim, replacing IsaacGym. Runs thousands of parallel envs on a single GPU; used for sim-to-real RL on Unitree, Boston Dynamics, and Figure humanoids.

J

JIT compilation · Ch 14
Compiling code at runtime once shapes/dtypes are known. Triton, torch.compile, and JAX all use JIT to specialize kernels for the actual problem size, often beating hand-tuned CUDA.

K

KL divergence · Ch 02
Σ p log(p/q). Asymmetric measure of how much information is lost when q approximates p. Appears in RLHF (KL-to-reference penalty), DPO, distillation, and variational inference.
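The asymmetry is easy to check numerically. The sketch below uses natural logarithms (nats) and two made-up distributions.

```python
import math

def kl(p, q):
    """KL(p || q) = sum p_i * log(p_i / q_i), in nats; assumes q_i > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl(p, q)   # cost of approximating p with q
reverse = kl(q, p)   # cost of approximating q with p; a different number
```

Both values are positive and unequal, and KL(p || p) is exactly zero.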
KV-cache · Ch 19
During autoregressive decoding, K and V for each prior token are reused unchanged, so they are cached. KV-cache size grows linearly with both sequence length and batch size (it is attention compute, not the cache, that is quadratic in sequence length), and it dominates inference memory for long contexts.
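A rough sizing sketch: the cache holds one K and one V vector per layer, per KV head, per token, per sequence in the batch. The model shape below is a hypothetical GQA configuration chosen for round numbers, not a specific released model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Total KV-cache size in bytes; the leading 2 accounts for K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, BF16 (2 bytes).
one_seq = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1)
gib = one_seq / 2**30
```

With this shape an 8K-token sequence needs 1 GiB of cache; doubling either the sequence length or the batch size doubles it.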

L

LangGraph · Ch 22
LangChain's stateful, graph-structured runtime for multi-step agents. Each node is a function or LLM call; edges are conditional routes. Better than chains for cyclic, retry-heavy, or human-in-the-loop flows.
LeRobot · Ch 34
Open-source library and dataset hub for real-world robot learning. SmolVLA-450M (June 3, 2025) is its flagship VLA, designed to run on a laptop. Hardware: SO-100/101 arms.
llm-d · Ch 18
Red Hat / Google / IBM project (announced at Red Hat Summit in May 2025). Kubernetes-native, vLLM-based disaggregated serving with prefix-aware routing and prefill/decode separation across pods.
LoRA / QLoRA · Ch 07
Hu et al. 2021. Freezes the pretrained weight W and trains two small low-rank matrices B and A, whose product ΔW = BA is added to W. QLoRA additionally stores the frozen base weights in 4 bits, so 65B models fine-tune on a single 48 GB GPU.
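A small sketch of why the low-rank update is cheap, and of the forward pass y = Wx + BAx. The helper names are hypothetical, and `scale` stands in for the usual α/r factor, which is an assumption of this sketch.

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA update."""
    full = d_out * d_in
    lora = r * (d_in + d_out)   # A is r x d_in, B is d_out x r
    return full, lora

def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

full, lora = lora_param_counts(4096, 4096, r=8)

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                 # rank 1: shape 1 x 2
B_zero = [[0.0], [0.0]]          # B starts at zero, so the adapter is a no-op
B_trained = [[1.0], [2.0]]
y0 = lora_forward(W, A, B_zero, [1.0, 2.0])
y1 = lora_forward(W, A, B_trained, [1.0, 2.0])
```

For a 4096×4096 projection at rank 8, the adapter trains 65,536 parameters instead of roughly 16.8 million.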

M

MCP · Ch 21
Anthropic's open protocol (released Nov 25, 2024) for exposing tools and data sources to LLMs in a uniform way. The closest thing to USB-C for AI agents; adopted across Claude, OpenAI, and the wider ecosystem.
Memory Wall · Ch 10
Decode-time inference bottleneck: GPU FLOPs grow ~3× per generation, HBM bandwidth grows <2×. Most decode steps are now bandwidth-bound, which is why batching, GQA, MQA, and quantization matter so much.
MoE · Ch 08
Sparse architecture where a router selects k of N expert FFNs per token. Active parameters ≪ total parameters, decoupling capacity from inference cost. Dominant frontier architecture (DeepSeek-V3, Mixtral, GPT-4-class).
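The routing step can be sketched as a top-k selection over router logits. Renormalizing with a softmax over only the selected experts is one common choice and is an assumption here; real routers differ in details such as load-balancing terms. The function name and logits are illustrative.

```python
import math

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and return {expert_index: gate_weight}.

    Gate weights are a softmax over the selected logits only (an assumption).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

# One token, four experts, k = 2: only experts 1 and 3 run for this token.
gates = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
```

The token's FFN output would then be the gate-weighted sum of the two selected experts' outputs, so only k of N expert FFNs are evaluated.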
MuJoCo · Ch 29
Fast contact-rich rigid-body simulator originally by Emo Todorov; now open-source under DeepMind. Standard for RL research and many sim-to-real pipelines, alongside Isaac Sim and PyBullet.

N

NeRF · Ch 27
Mildenhall et al. 2020. Represents a 3D scene as an MLP from (x,y,z,θ,φ) to (RGB,density), rendered by ray marching. Largely superseded by 3D Gaussian Splatting for real-time use cases.
NVLink · Ch 13
Direct GPU↔GPU links; 900 GB/s on H100, 1.8 TB/s on B200. NVLink Switch fabrics scale this to 72-GPU domains (NVL72), enabling tensor parallelism across many GPUs without crossing PCIe.

O

OpenVLA · Ch 31
Kim et al. June 2024. 7B Llama-2 backbone with SigLIP+DINOv2 vision encoder, fine-tuned on 970k Open X-Embodiment trajectories. The reference open VLA before π0 / GR00T.
Optimizer state · Ch 15
For AdamW, two FP32 buffers (m, v) per parameter at 4 bytes each (8 bytes total), plus 4 bytes for the master FP32 weights in mixed-precision training: 12 bytes per parameter. Optimizer state is the largest single contributor to training memory after activations.
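The per-parameter arithmetic can be written out as a sketch. It assumes the common mixed-precision layout of BF16 weights and gradients alongside FP32 master weights and AdamW moments; other setups will differ, and the helper name is hypothetical.

```python
def adamw_bytes_per_param(weight_bytes=2, grad_bytes=2):
    """Training-state bytes per parameter, excluding activations.

    Assumes BF16 working weights/gradients plus FP32 master weights, m, and v.
    """
    state = {
        "weights": weight_bytes,
        "grads": grad_bytes,
        "master_fp32": 4,
        "adam_m": 4,
        "adam_v": 4,
    }
    state["total"] = sum(state.values())
    return state

per_param = adamw_bytes_per_param()
gb_for_7b = per_param["total"] * 7e9 / 1e9   # rough total for a 7B-parameter model
```

Under these assumptions a 7B model needs about 112 GB of weight, gradient, and optimizer memory before any activations, which is why sharding schemes such as ZeRO and FSDP exist.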

P

Paged Attention · Ch 17
vLLM's contribution. Stores KV-cache in fixed-size blocks instead of contiguously, enabling near-zero memory fragmentation, prefix sharing, and continuous batching. The biggest single win in 2023-era LLM serving.
Pipeline Parallelism · Ch 15
Different layer ranges live on different GPUs; micro-batches flow through the pipeline. Used together with tensor parallelism in 3D-parallel training. GPipe and 1F1B are the canonical schedules.
PPO · Ch 16
Schulman et al. 2017. Clipped policy-gradient algorithm; the original RLHF workhorse. Still used in some labs but largely supplanted by DPO (offline) and GRPO (reasoning) for LLM alignment.
Prefill / Decode · Ch 18
Prefill processes the prompt in parallel (FLOP-bound); decode emits one token at a time (bandwidth-bound). Disaggregated serving (llm-d, Mooncake) splits these across separate GPU pools.
Pretraining · Ch 07
The bulk of compute cost in an LLM's life. Modern pipeline: FineWeb-Edu / DCLM / The Stack v2 → BPE tokenization → 1–10T tokens → BF16 + FP8 mixed-precision training over thousands of GPUs.

Q

Quantization · Ch 11
Mapping FP16/BF16 tensors to lower-bit representations (INT8, INT4, FP8). GPTQ, AWQ, and SmoothQuant are the standard post-training methods; W4A16 (4-bit weights, 16-bit activations) is the inference default.

R

RAG · Ch 20
Pattern where retrieval over an external corpus precedes generation. Modern advanced RAG includes hybrid search, rerankers, query rewriting, and graph-traversal — and is increasingly replaced by long-context + tool use.
Reasoning model · Ch 09
Inference-time-scaling regime kicked off by OpenAI o1 (Sept 2024) and DeepSeek-R1 (Jan 2025). Trained with RL on verifiable rewards; pays more tokens for harder problems.
Red teaming · Ch 24
Structured probing for jailbreaks, prompt injection, harmful outputs, and bias. Modern programs combine human red teamers, automated attacks (PAIR, GCG), and benchmark suites (HarmBench, AgentDojo).
Ring Attention · Ch 10
Liu, Zaharia, Abbeel 2023. KV blocks rotate around a ring of GPUs so each device only ever holds a slice of the sequence. Enables million-token context training without per-GPU memory blowup.
RLHF · Ch 16
Original alignment recipe (Ouyang et al. 2022): SFT → reward model from preference pairs → PPO against that reward. Largely replaced by DPO/GRPO but still the conceptual frame.
RMSNorm · Ch 05
Zhang & Sennrich 2019. Drops the mean-subtraction step from LayerNorm, leaving only RMS scaling. Cheaper and equally effective; standard in Llama, Mistral, and most modern transformer blocks.
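A minimal sketch of the operation on a single vector: divide by the root mean square, then apply a learned per-feature gain. There is no mean subtraction and no bias. The `eps` value is a typical choice, not one prescribed by the entry.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: x / sqrt(mean(x^2) + eps) * gain, with no mean-centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

y = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
y_rms = math.sqrt(sum(v * v for v in y) / len(y))   # close to 1.0
```

Unlike LayerNorm, the output is not zero-mean: both components of `y` stay positive here.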
RoPE · Ch 05
Su et al. 2021. Encodes position by rotating Q and K vectors in pairs by an angle proportional to position. Generalizes better to longer contexts than learned absolute embeddings; YaRN extends it further.
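The key property is that the dot product of a rotated query and key depends only on their relative offset. The sketch below rotates adjacent pairs of features, which is one of two common layouts (some implementations pair the first half with the second half instead); the base of 10000 is the commonly used value and is an assumption here.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of vec by angles proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = dot(rope(q, 5), rope(k, 3))     # offset 2
far = dot(rope(q, 12), rope(k, 10))    # same offset 2, shifted by 7 positions
```

`near` and `far` agree up to floating-point error, so attention scores encode relative rather than absolute position.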
RT-2 · Ch 31
First major VLA: a PaLI-X / PaLM-E backbone fine-tuned to emit discretized action tokens. Established the recipe — co-train on web VQA + robot trajectories — that OpenVLA, π0, and GR00T all follow.

S

SAM 2 · Ch 26
Meta FAIR (Aug 1, 2024). Extends SAM to video by adding a memory-conditioned mask decoder. Universal segmentation/tracker that has rapidly become the default front-end for robot perception.
Scaling laws · Ch 09
Empirical observation that loss falls predictably with parameters, data, and compute. Kaplan 2020 and Chinchilla 2022 are the canonical references; inference-time scaling (o1) is the 2024 extension.
Self-attention · Ch 04
The transformer's core operation: each token computes Q, K, V from its own embedding and attends across the whole sequence. Distinct from cross-attention (decoder attending to encoder).
SFT · Ch 16
Phase that follows pretraining: train on curated instruction/response pairs with cross-entropy. Cheap, fast, and the foundation for everything alignment does on top.
Sim-to-Real · Ch 29
Closing the reality gap with domain randomization, system identification, real-data fine-tuning, or learned actuator models. Now the dominant recipe for humanoid locomotion and dexterous manipulation.
SmolVLA · Ch 32
SmolVLA-450M (June 3, 2025). Built on SmolLM2 + SigLIP, trained on community LeRobot data. Designed to run on a single consumer GPU; the most accessible entry point to VLA research.
Speculative decoding · Ch 19
A small draft model proposes k tokens; the big model verifies them in one parallel forward pass. 2–3× speedup on most workloads with no quality loss. Medusa, EAGLE, and lookahead decoding are common variants.
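The propose-then-verify loop can be sketched for the greedy case, where a drafted token is accepted only if it equals the target model's own argmax. The sampling variant uses a rejection-sampling rule instead and is not shown. The two "models" below are toy stand-in functions, and the verification loop is written sequentially even though a real system scores all drafted positions in one forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative-decoding step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    ctx, proposal = list(prefix), []
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position.
    ctx, accepted = list(prefix), []
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)   # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))   # all k accepted: one bonus token
    return accepted

# Toy models (hypothetical): the target counts upward mod 10; the draft errs after a 3.
target_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

new_tokens = speculative_step([1], draft_next, target_next, k=4)
```

The result matches what the target model alone would have generated greedily, which is why the technique costs no quality; the speedup depends on how often the draft agrees.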
SwiGLU · Ch 05
Shazeer 2020. Replaces FFN GELU with a gated variant; ~+0.5% MMLU at the same compute. Used in Llama, Mistral, Qwen, and effectively every modern open transformer.

T

Tensor Parallelism (TP) · Ch 15
Megatron-LM 1D split: each GPU holds part of W and computes its slice of XW, then all-reduces. Standard inside an 8-GPU node; combined with PP and DP for 3D parallelism.
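The split-then-all-reduce pattern can be simulated on one machine. This sketch shows the row-parallel case: W is split by rows, X by the matching columns, each "GPU" computes a partial product, and the all-reduce is an elementwise sum. The shards are plain Python lists, and the function names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def row_parallel_matmul(X, W, n_shards):
    """Compute X @ W by sharding the inner dimension across n_shards workers."""
    inner = len(W)
    step = inner // n_shards            # assumes inner is divisible by n_shards
    partials = []
    for s in range(n_shards):
        idx = range(s * step, (s + 1) * step)
        X_s = [[row[j] for j in idx] for row in X]   # this worker's columns of X
        W_s = [W[j] for j in idx]                    # this worker's rows of W
        partials.append(matmul(X_s, W_s))
    # All-reduce: sum the partial outputs elementwise.
    return [[sum(p[i][j] for p in partials) for j in range(len(W[0]))]
            for i in range(len(X))]

X = [[1, 2, 3, 4]]
W = [[1, 0], [0, 1], [1, 1], [2, 0]]
sharded = row_parallel_matmul(X, W, n_shards=2)
```

The sharded result equals the unsharded product, while each worker only ever stores half of W.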
Tokenization · Ch 06
Subword segmentation (BPE, Unigram, byte-level). Choice of tokenizer determines sequence lengths, vocabulary size, and the model's behavior on numbers, code, and non-Latin scripts.
Transformer block · Ch 05
Pre-norm RMSNorm → multi-head attention with RoPE+GQA → residual → RMSNorm → SwiGLU FFN → residual. Effectively unchanged since 2022 across all open frontier models; almost all progress is in scale, data, and post-training.
Triton · Ch 14
OpenAI's tile-based language for writing CUDA-class kernels in Python. Used for FlashAttention, fused MoE, and custom quantization kernels. Compiled JIT through MLIR; the practical alternative to writing CUDA.
TRT-LLM · Ch 17
NVIDIA's highly tuned C++/CUDA inference engine. Faster than vLLM on NVIDIA-only deployments, at the cost of compile times and less flexibility. Often paired with Triton Inference Server.

V

VAE · Ch 30
Kingma & Welling 2013. Probabilistic encoder/decoder pair trained with reconstruction + KL regularization. Backbone of latent diffusion (Stable Diffusion) and many robot world models.
Vector database · Ch 20
Stores high-dimensional embeddings and serves approximate nearest-neighbor queries (HNSW, IVF-PQ). Used as the retrieval back-end of RAG. Examples: Pinecone, Weaviate, pgvector, Qdrant.
VJP · Ch 03
Reverse-mode AD computes ∂L/∂x = vᵀ J for each op without ever materializing J. Every PyTorch operation registers a VJP rule (its `backward`).
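A hand-written sketch of the idea for L = Σ (Wx)ᵢ²: each op gets a small backward rule that maps an upstream vector v to vᵀJ, and the rules are chained in reverse order. This is plain Python for illustration and does not use PyTorch; all function names are hypothetical.

```python
def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def linear_vjp(W, v):
    """v^T W: the VJP of h = W x with respect to x, without building the Jacobian."""
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def square_sum_vjp(h, v):
    """VJP of L = sum(h_i^2) with respect to h, for an upstream scalar v."""
    return [v * 2.0 * hi for hi in h]

def grad_x(W, x):
    h = matvec(W, x)               # forward pass; h is cached for the backward pass
    g_h = square_sum_vjp(h, 1.0)   # backward through the sum of squares
    return linear_vjp(W, g_h)      # backward through the linear layer

W = [[1.0, 2.0], [3.0, 4.0]]
g = grad_x(W, [1.0, 1.0])
```

Chaining the two rules gives the full gradient ∂L/∂x in one backward sweep, at a cost comparable to the forward pass.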
VLA · Ch 31
Class of robot foundation models that map (image, language instruction) → action tokens. Started with RT-2 (2023); current open frontier is OpenVLA, π0/π0.5, GR00T N1.5, and SmolVLA.
vLLM · Ch 17
UC Berkeley project that introduced paged attention, continuous batching, and prefix caching. The de-facto open-source LLM server; the substrate of llm-d and most cloud inference.

W

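Weight Decay · Ch 07
Shrinks θ toward zero each step. For plain SGD this is equivalent to adding an L2 penalty (λ/2)‖θ‖² to the loss; for Adam it is not, because the penalty's gradient is rescaled by the adaptive step size. AdamW therefore applies the decay directly to the weights, decoupled from the gradient update. Standard λ for transformers: ~0.1.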
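The difference between decoupled decay and an L2 term folded into the gradient shows up in a single scalar update. The sketch below follows the standard Adam equations with typical default hyperparameters; both step functions are illustrative, not a library API.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """One AdamW update: decay is applied to theta directly, outside the Adam ratio."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * wd * theta                       # decoupled weight decay
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """Adam with the L2 penalty added to the gradient (the coupled variant)."""
    return adamw_step(theta, grad + wd * theta, m, v, t, lr, b1, b2, eps, wd=0.0)

# Zero task gradient, theta = 1.0: only the regularizer acts.
decoupled, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
coupled, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)
```

With decoupled decay the weight shrinks by the factor 1 − lr·λ. In the coupled variant the penalty gradient is normalized by Adam, so the step is about lr regardless of λ.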
World Model · Ch 33
Neural simulator that predicts future states given actions. DreamerV3, Genie, GAIA-2, and NVIDIA Cosmos are recent examples. Increasingly central to embodied learning as scarce real data is augmented with imagined rollouts.

Y

YaRN · Ch 10
Peng et al. 2023. NTK-aware RoPE rescaling that extends a model's context window past its training length with a short fine-tune. Used in Mistral, Qwen, and many long-context Llama derivatives.

Z

ZeRO · Ch 15
DeepSpeed's three-stage memory partitioning scheme: shard optimizer states (Z1), gradients (Z2), parameters (Z3). FSDP is essentially ZeRO-3 in PyTorch. Foundational for training models that don't fit on one GPU.
Zero-shot · Ch 12
Asking a model to do a task purely from instructions, without demonstrations. Distinct from few-shot (k examples in the prompt) and fine-tuned. The standard mode for instruction-tuned LLMs.

Π

π0 / π0.5 · Ch 32
Pi-zero (Oct 2024) and π0.5 (April 22, 2025, arXiv:2504.16054). Flow-matching action heads on top of a PaliGemma-style VLM; π0.5 generalizes to unseen homes and long-horizon mobile manipulation.