From Tokens to Embodied Minds · Drill cards · Chapter 08
Drills
Mixture of Experts
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 08, note type = Basic.
| Front | Back |
|---|---|
| What is the core idea of Mixture of Experts (MoE)? | Replace the dense FFN with N expert FFNs and a learned router that activates only k experts per token. Parameter count scales with N; FLOPs per token scale with k, so only about k/N of the expert parameters are used for any one token: high capacity at low compute. |
| What is expert collapse and why is it the primary MoE training failure mode? | The router develops a preference for a small subset of experts. Those experts receive more gradient, improve faster, reinforcing the preference. Most experts become undertrained, wasting parameter capacity. |
| What is the auxiliary load-balancing loss and what is its drawback? | L_aux = alpha * sum_i f_i * P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability assigned to expert i; it penalizes routing imbalance. Drawback: it conflicts with the main language modeling objective during training and fine-tuning, particularly on domain-specific data. |
| What is expert parallelism (EP) in distributed training? | Shard expert weights across GPUs — each GPU holds a subset of experts and processes only the tokens routed to it. Requires all-to-all communication for token dispatch and result gathering. |
| What is DeepSeek-V3's total vs activated parameter count? | 671B total parameters, 37B activated per token (top-8 of 256 fine-grained experts plus shared experts). The activated FLOPs are comparable to a 37B dense model. |
| What is expert capacity and why is it necessary? | Maximum tokens each expert processes per batch = capacity_factor * batch_size/N, with batch_size counted in tokens (multiply by k for top-k routing). Without a capacity cap, a popular expert can receive O(batch_size) tokens, which breaks the fixed-size buffers and even per-device load that parallel computation relies on. Overflow tokens are dropped or passed through unchanged. |
| Why does MoE require loading all expert weights into memory even during inference? | Any token can route to any expert — the router's decision is input-dependent and cannot be precomputed. All expert weight matrices must be memory-resident to allow arbitrary routing. |
| What is the k parameter in top-k routing and what values are typical in production MoE models? | k = number of experts activated per token. Mixtral uses k=2 of 8. DeepSeek-V3 uses k=8 of 256 fine-grained experts plus shared experts. k=2 is not a hard minimum: Switch Transformer routes with k=1. Higher k increases FLOPs per token but lets each token combine more experts. |
| How does expert-choice routing differ from token-choice routing? | Token-choice: each token selects its top-k experts (standard). Expert-choice (Zhou et al., 2022): each expert selects its top-C tokens from the batch. Expert-choice guarantees perfect load balance but drops some tokens that no expert selects. |
| For a 30B-A3B MoE model vs a 7B dense model, what is the inference FLOP comparison? | With 3B active parameters out of 30B total, per-token inference FLOPs are comparable to a 3B dense model, not 30B: a bit under half the per-token compute of the 7B dense model. The 30B-A3B model has ~4x the parameter capacity of the 7B model, but all 30B parameters must still be held in memory. |
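
The routing cards above (top-k selection, the auxiliary load-balancing loss, and expert capacity) can be tied together in a small NumPy sketch. This is an illustration under stated assumptions, not any particular model's implementation: the function name `route_top_k`, its parameters, and the first-come-first-served drop policy are hypothetical. The loss is scaled by the number of experts N, following the Switch Transformer convention, so that it equals alpha when the router's average probabilities are uniform; the card's formula omits that constant factor.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the expert dimension.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_top_k(logits, k, capacity_factor=1.25, alpha=0.01):
    """Token-choice top-k routing (illustrative sketch).

    logits: array of shape (num_tokens, num_experts) from the router.
    Returns (top_idx, kept, aux_loss, capacity).
    """
    num_tokens, num_experts = logits.shape
    probs = softmax(logits)

    # Each token picks its k highest-probability experts.
    top_idx = np.argsort(-probs, axis=-1)[:, :k]

    # Dispatch mask: mask[t, i] = 1 if token t is routed to expert i.
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)

    # f_i: fraction of routing assignments that go to expert i.
    f = mask.sum(axis=0) / (num_tokens * k)
    # P_i: mean router probability assigned to expert i.
    P = probs.mean(axis=0)
    # Assumption: scaled by num_experts (Switch Transformer convention).
    aux_loss = alpha * num_experts * np.sum(f * P)

    # Per-expert capacity; tokens beyond it are dropped (assumed policy:
    # earlier tokens in the batch win).
    capacity = int(np.ceil(capacity_factor * k * num_tokens / num_experts))
    position = np.cumsum(mask, axis=0) * mask  # 1-based slot within each expert
    kept = mask * (position <= capacity)

    return top_idx, kept, aux_loss, capacity

# Collapsed router: every token strongly prefers expert 0.
collapsed = np.zeros((8, 4))
collapsed[:, 0] = 10.0
_, kept, aux, cap = route_top_k(collapsed, k=1, capacity_factor=1.0)
print(cap, int(kept[:, 0].sum()), aux)
```

In the collapsed example, expert 0 is offered all 8 tokens but keeps only `capacity` of them, and the auxiliary loss approaches alpha * N, its maximum, which is the signal that pushes the router back toward balance.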