Mixture of Experts

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 08, note type = Basic.

Front	Back
What is the core idea of Mixture of Experts (MoE)?	Replace the dense FFN with N expert FFNs and a learned router that activates only k experts per token. Parameter count scales with N; FLOPs per token scale with k, so each token uses only a k/N fraction of the expert parameters. The result is high capacity at low compute.
What is expert collapse and why is it the primary MoE training failure mode?	The router develops a preference for a small subset of experts. Those experts receive more tokens and therefore more gradient updates, improve faster, and become even more attractive to the router: a self-reinforcing loop. Most experts end up undertrained, wasting parameter capacity.
What is the auxiliary load-balancing loss and what is its drawback?	L_aux = alpha * sum_i f_i * P_i penalizes routing imbalance, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability assigned to expert i. Drawback: it conflicts with the main language modeling objective during training and fine-tuning, particularly on domain-specific data.
What is expert parallelism (EP) in distributed training?	Shard expert weights across GPUs — each GPU holds a subset of experts and processes only the tokens routed to it. Requires all-to-all communication for token dispatch and result gathering.
What is DeepSeek-V3's total vs activated parameter count?	671B total parameters, 37B activated per token (top-8 of 256 fine-grained routed experts plus 1 always-active shared expert). Per-token FLOPs are comparable to those of a 37B dense model.
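What is expert capacity and why is it necessary?	Maximum number of tokens each expert processes per batch = capacity_factor * batch_size/N, where batch_size counts the tokens in the batch. Without a capacity cap, a popular expert can receive O(batch_size) tokens, so per-expert buffers have no fixed size and the work is unbalanced across devices, which is infeasible for parallel computation. Overflow tokens are dropped or passed through unchanged via the residual connection.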
Why does MoE require loading all expert weights into memory even during inference?	Any token can route to any expert — the router's decision is input-dependent and cannot be precomputed. All expert weight matrices must be memory-resident to allow arbitrary routing.
What is the k parameter in top-k routing and what values are typical in production MoE models?	k = number of experts activated per token. Mixtral uses k=2 of 8. DeepSeek-V3 uses k=8 of 256 fine-grained experts plus 1 shared expert. k=2 is not a hard minimum: Switch Transformer routes with k=1. Higher k increases FLOPs per token but lets each token combine the outputs of more experts.
How does expert-choice routing differ from token-choice routing?	Token-choice: each token selects its top-k experts (standard). Expert-choice (Zhou et al., 2022): each expert selects its top-C tokens from the batch. Expert-choice guarantees perfect load balance but drops some tokens that no expert selects.
For a 30B-A3B MoE model vs a 7B dense model, what is the inference FLOP comparison?	With 3B active parameters out of 30B total, per-token inference FLOPs are comparable to a 3B dense model, not a 30B one, and so are less than half those of the 7B dense model. The 30B-A3B model still holds ~4x the parameters of the 7B model (10x those of a dense 3B model).