Predict before you read

Before you read — what is the primary advantage of ZeRO-3 (FSDP) over standard Data Parallel (DDP)?

Think about what DDP keeps duplicated on every GPU.

From Tokens to Embodied Minds  ·  Chapter 15 of 36
Chapter 15

Distributed training

DP, TP, PP, FSDP, ZeRO — and why ZeRO-3 wins for most teams

parallelism axes: data, tensor, pipeline, and sequence
ZeRO-3
shards parameters + gradients + optimizer state — eliminates optimizer memory duplication
3D
parallelism = TP within node × PP across nodes × DP across replicas
Maturity ladder

A 70B model in BF16 requires 140 GB just for weights. Add AdamW optimizer states (FP32 master weights, m, v: 12 bytes/parameter), gradients (2 bytes/parameter BF16), and activations — and you need roughly 700 GB for a training step at batch size 1. No single GPU has 700 GB. Distributed training is not a performance optimization for large models; it is a correctness requirement. The question is not whether to parallelize but which combination of four axes — data, tensor, pipeline, sequence — produces the highest throughput for your cluster topology. The answer for most teams in 2026 is: ZeRO-3 (PyTorch FSDP) for data parallelism as the outer axis, because it eliminates optimizer state duplication and requires no model surgery. Tensor parallelism within each node (NVLink bandwidth makes it cheap), pipeline parallelism across nodes only when the model is too large even for TP, and sequence parallelism for ultra-long context. This is not a universal answer — DeepSeek-V3 trained with expert parallelism as a fifth axis — but it is the right starting point for a team without a 1,024-GPU cluster and a Megatron-LM specialist.

The four parallelism axes

Data parallelism (DP): replicate the full model on each GPU, partition the minibatch across replicas, all-reduce gradients after each backward pass. DDP (PyTorch's standard DP implementation) keeps full parameter/gradient/optimizer copies on every GPU — the memory wall for large models. FSDP (Fully Sharded Data Parallel, the ZeRO-3 equivalent) shards parameters, gradients, and optimizer states across the DP group. Before each layer's forward, each GPU all-gathers the shard it needs; after the backward, it discards it. The result: per-GPU memory scales with 1/num_dp_gpus instead of full-model.

Tensor parallelism (TP, Megatron-style): shard each matmul across GPUs within a layer. For a linear Y = XW, partition W column-wise across TP GPUs and reduce the outputs. Each TP communication is a small, synchronous all-reduce. At TP=8 with NVLink (900 GB/s), this adds ~0.5 ms per layer — acceptable. At TP=8 across PCIe or InfiniBand, the communication dwarfs the compute. This is why TP degree is always set to fit within a single NVLink domain (typically 8 GPUs on a DGX H100).

Pipeline parallelism (PP): shard layers across GPU groups. GPU 0 runs layers 0–9, GPU 1 runs layers 10–19, etc. The 1F1B (one-forward-one-backward) schedule from Megatron-LM (Narayanan et al., arXiv:2104.04473, 2021) minimizes the pipeline bubble to roughly 1/(num_microbatches) fraction of idle time. PP is the axis you reach for when the model is too large even after TP — typically for models above 100B parameters on a fixed cluster.

ZeRO-3 and FSDP in practice

ZeRO stages 1/2/3 (Rajbhandari et al., arXiv:1910.02054) progressively shard more: Stage 1 shards optimizer states only (4× memory reduction), Stage 2 adds gradient sharding (8× reduction), Stage 3 adds parameter sharding (64× reduction for large models with AdamW). PyTorch FSDP implements ZeRO-3. Zhao et al. (arXiv:2304.11277, April 2023) document the engineering that made FSDP production-ready — specifically, the mixed-precision policy (BF16 compute, FP32 master weights held in the all-reduce buffer) and the prefetching that overlaps all-gathers with compute.

For a 70B model with AdamW: parameters (BF16) 140 GB, FP32 master weights 280 GB, Adam m+v 560 GB, gradients (BF16) 140 GB = 1,120 GB total. ZeRO-3 across 16 H100 GPUs: 1,120 / 16 = 70 GB per GPU — exactly fitting one H100 80GB with 10 GB left for activations. Gradient checkpointing (recomputing activations during backward) gives another ~3× activation memory reduction at the cost of ~30% extra compute. CMU 11-868 Assignment 4 works through this arithmetic explicitly.

The all-gather communication cost of ZeRO-3 is the tradeoff. At DP=16 with 100 Gb/s InfiniBand, the all-gather for one transformer layer's weights (e.g., 1B parameters = 2 GB BF16) takes ~160 ms — unacceptable. In practice, ZeRO-3 is combined with NVLink for TP within nodes (sub-ms communication) and InfiniBand only for the ZeRO gradient all-reduce, which is overlapped with the backward pass. The effective communication overhead in a well-tuned setup is 10–20% of step time.

3D parallelism and expert parallelism

3D parallelism = TP × PP × DP. For a 4-node × 8-GPU H100 cluster (32 GPUs) training Llama 3.1 70B: TP=8 (within node, NVLink), PP=2 (across the two halves of each node pair), DP=2 (two replicas of the TP×PP world). This gives roughly 70 GB per GPU before activations, which is feasible with gradient checkpointing. The Llama 3 herd paper (Meta, arXiv:2407.21783, July 2024) trained 405B with TP=8, PP=16, DP=many — expert parallelism for MoE was not needed since Llama 3 is dense.

For MoE models (DeepSeek-V3, Qwen3-MoE), expert parallelism (EP) is a fifth axis: each expert is assigned to specific GPUs, and routing logic sends each token's computation to the right GPU. DeepSeek-V3 uses EP=64 in a 256-GPU training run. llm-d (Red Hat, Nov 21, 2025) treats EP as a first-class concept in inference as well — prefill pods and decode pods can each have different EP configurations, because the expert utilization pattern differs between the two phases.

Practical choices for fine-tuning at scale

For the JHU capstone — fine-tuning GR00T N1.5 or SmolVLA on custom demonstration data — you are not pretraining. Gradient checkpointing + FSDP on a single 4-GPU A100 node is sufficient for SmolVLA-450M. For GR00T N1.5 (the exact size is not public, but the DiT component appears to be in the 1–3B range), FSDP across 8 GPUs with TP=1 (small enough that TP overhead is not justified) is the right setup. You do not need 3D parallelism unless you are training from scratch.

For DealLens, the relevant training scenario is fine-tuning a Llama 3.1 8B or 70B scoring model on proprietary deal data. 8B fits on a single H100 with QLoRA; 70B requires FSDP across 4 H100s or QLoRA with 4-bit quantization on 2 H100s. The distributed training chapter is thus an indirect dependency for DealLens — understanding it tells you when you need an H100 cluster vs when a single GPU suffices.

Sequence parallelism

Sequence parallelism (Korthikanti et al., 2022) shards the sequence dimension across TP GPUs, distributing LayerNorm and Dropout across devices. At seq=1M tokens, it is the fifth required axis. For anything under 128K, it adds communication overhead without benefit — skip it unless you are training ring-attention-style long-context models.

3D Parallelism: DP × TP × PP Sharding3D Parallelism: Tensor × Pipeline × DataNode 0 (NVLink — TP=4)GPU 0Layers 0–19GPU 1Layers 0–19GPU 2Layers 0–19GPU 3Layers 0–19← Tensor Parallel (column/row matmul shards) →Node 1 (NVLink — TP=4)GPU 4Layers 20–39GPU 5Layers 20–39GPU 6Layers 20–39GPU 7Layers 20–39← Tensor Parallel →PP (activations)Data Parallel (ZeRO-3 / FSDP)Each node pair = one DP replicaGradients all-reduced across replicasParams sharded 1/DP_size per GPUGradients sharded + reduce-scatterOptimizer states sharded (FP32 m, v)70B AdamW: 1120 GB totalZeRO-3 across 16 GPUs: 70 GB/GPUCommunication: all-gather before fwd,reduce-scatter after bwd
Figure 15.13D parallelism for a 70B model on 2 nodes × 4 GPUs. Tensor parallelism (blue) shards matmuls within each NVLink node. Pipeline parallelism (orange) partitions layers across nodes with 1F1B scheduling. ZeRO-3 (right) shards parameters, gradients, and optimizer states across the DP dimension — reducing per-GPU memory from 1,120 GB to 70 GB at DP=16.
Retrieve before you continue

Three questions on what you just read

Q1 Factual What does ZeRO-3 shard across GPUs, and what is the approximate per-GPU memory footprint for a 70B model with AdamW across 16 H100s?
Q2 Conceptual Why is tensor parallelism constrained to within a single NVLink domain, and why does pipeline parallelism span nodes?
Q3 Synthetic For fine-tuning SmolVLA-450M for the JHU humanoid capstone on a single 4-GPU A100 node, what is the simplest sufficient distributed strategy and why?