A study path · 36 chapters · three parts · ~5 hour read

From Tokens to Embodied Minds

A graduate-level path from the linear algebra inside a transformer to the dual-system humanoid foundation models that turn tokens into actions. Part I is what a frontier model is. Part II is how it ships. Part III is how it gets a body. No filler. No motivational language. Where the field is hype, this guide says so.

36 chapters · 3 parts · ~12 hr/wk for 26 weeks


Part I Tokens → models

Ch 01 Linear algebra you actually use
Ch 02 Probability, entropy, KL
Ch 03 Backprop, autograd, and the chain rule on GPUs
Ch 04 Attention from scratch
Ch 05 The transformer block, end to end
Ch 06 Tokenization is a design decision
Ch 07 Pretraining at scale
Ch 08 Mixture of Experts
Ch 09 Scaling laws — Kaplan, Chinchilla, and after
Ch 10 Long context — beyond 128K
Ch 11 Quantization, distillation, pruning
Ch 12 How to read a frontier paper in 30 minutes

Part II Models → systems

Ch 13 Inside an H100/B200
Ch 14 FlashAttention and Triton
Ch 15 Distributed training
Ch 16 RLHF, DPO, GRPO, and reasoning RL
Ch 17 vLLM, TensorRT-LLM, SGLang
Ch 18 llm-d and disaggregated inference
Ch 19 KV-cache, speculative decoding, Medusa
Ch 20 Advanced RAG and evals
Ch 21 MCP and A2A — the agent protocols
Ch 22 LangGraph and orchestration that survives
Ch 23 Observability — traces, evals, regression
Ch 24 Red-teaming, jailbreaks, prompt injection

Part III Models → bodies

Ch 25 Classical robotics in one chapter
Ch 26 Computer vision foundations — CS231n, redux
Ch 27 3D perception — NeRF, Gaussian Splatting, SAM 2
Ch 28 Reinforcement learning, refreshed
Ch 29 Sim-to-real and the Isaac stack
Ch 30 Diffusion policies and action chunking
Ch 31 Vision-Language-Action models — RT-2 and OpenVLA
Ch 32 π0, π0.5, and SmolVLA
Ch 33 NVIDIA GR00T N1 and N1.5 with Isaac Lab
Ch 34 Hugging Face LeRobot and the open robotics stack
Ch 35 Safety and alignment for embodied AI
Ch 36 Capstone — a humanoid home assistant, end to end

§ Part I

Foundations of Modern AI

Twelve chapters on what a frontier transformer actually is — the linear algebra it spends its time on, the loss it minimizes, and the architectural choices that survived 2017–2026.

Linear algebra you actually use

The five operations that consume every forward pass are not interesting for their math — they are interesting because their numerical behavior at BF16/FP8 is what bites in practice.

Probability, entropy, KL

Cross-entropy is the only loss that survives at scale because next-token prediction is maximum-likelihood compression. KL divergence is the leash that shows up unchanged in RLHF, DPO, and distillation.
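The relationship between the two quantities named above can be checked numerically. The following is a minimal sketch in plain Python, assuming two small hand-picked distributions p and q (illustrative values, not taken from the chapter). It shows the identity H(p, q) = H(p) + KL(p || q): with the data distribution p fixed, lowering cross-entropy is the same as lowering the KL term.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q): expected code length under q when data follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q): extra nats paid for modelling p with q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical "data" and "model" distributions over a 3-token vocabulary.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
print(f"H(p,q) = {cross_entropy(p, q):.4f}, H(p) + KL = {entropy(p) + kl_divergence(p, q):.4f}")
print(f"difference = {gap:.2e}")
```

The same kl_divergence term, computed between a policy and a frozen reference model, is the "leash" the blurb refers to.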

Backprop, autograd, and the chain rule on GPUs

PyTorch's autograd is a tape-based dynamic graph, not a static DAG. Most production training failures root-cause to autograd misuse, not model architecture.

Attention from scratch

Self-attention is a soft, content-addressable lookup. The four production variants exist because of KV-cache economics, not modelling power — and the math behind each choice is simpler than most tutorials admit.
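The "soft lookup" reading can be made concrete with a single query. This is a minimal sketch in plain Python (no batching, no masking, no multi-head split), with hand-picked keys and values that are purely illustrative. A query that closely matches one key returns almost exactly that key's value, which is what a dictionary lookup would do, except every entry contributes a little.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out_dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
    return output, weights

# Illustrative toy data: the query points in the same direction as the first key.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [10.0, 0.0]

output, weights = attend(query, keys, values)
print("weights:", [round(w, 4) for w in weights])
print("output: ", [round(o, 4) for o in output])
```

Shrinking the query's magnitude flattens the weights, and the lookup degrades smoothly into an average over all values.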

The transformer block, end to end

Once you internalize the modern transformer block — pre-norm, RMSNorm, GQA, SwiGLU, RoPE, no biases — every frontier architecture paper becomes a 10-minute read.

Tokenization is a design decision

Tokenizer choice silently sets a ceiling on multilingual quality, math accuracy, and code generation — and the tokenizer is usually the last thing engineers audit when a model underperforms.

Pretraining at scale

Pretraining is 90% data engineering. The four-stage pipeline — collect, filter, dedupe, curriculum — determines model quality as much as architecture does, and the field learned this the hard way with Chinchilla.

Mixture of Experts

MoE decouples total parameters from per-token compute: DeepSeek-V3 has 671B parameters but activates only 37B per token. The hard part is not the router; it is balancing load across 256 routed experts without auxiliary losses that destabilize training.
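To see why load balancing is the hard part, it helps to look at how little the router itself does. The sketch below is a generic top-k softmax router in plain Python, not the routing scheme of any specific model; the expert count, k, and the random gate logits are all illustrative assumptions. It routes a batch of tokens and counts how many land on each expert.

```python
import math
import random

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def expert_load(batch_logits, k, num_experts):
    """Count how many token slots each expert receives for a batch."""
    load = [0] * num_experts
    for logits in batch_logits:
        for expert in top_k_route(logits, k):
            load[expert] += 1
    return load

# Hypothetical setup: 8 experts, top-2 routing, 1000 tokens with random gate logits.
NUM_EXPERTS, K, NUM_TOKENS = 8, 2, 1000
rng = random.Random(0)
batch = [[rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)] for _ in range(NUM_TOKENS)]

load = expert_load(batch, K, NUM_EXPERTS)
print("tokens per expert:", load)
print("max/mean load ratio:", max(load) / (sum(load) / NUM_EXPERTS))
```

With random logits the load is roughly even. A learned router has no such guarantee, and the max/mean ratio is the number a balancing mechanism has to keep near 1.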

Scaling laws — Kaplan, Chinchilla, and after

Scaling laws are planning tools, not prediction tools. Kaplan said make models bigger. Chinchilla corrected that: scale parameters and training tokens in equal proportion, roughly 20 tokens per parameter. The reasoning era broke both by adding inference-time compute as a third scaling axis.
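Used as a planning tool, the Chinchilla result reduces to a few lines of arithmetic. The sketch below assumes two widely used rules of thumb that are approximations, not exact laws: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of about 20 tokens per parameter. The function name and the default ratio are illustrative choices.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into parameters N and tokens D.

    Assumes C = 6 * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget equal to 6 * 70e9 params * 1.4e12 tokens, i.e. a Chinchilla-sized run.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = chinchilla_optimal(budget)
print(f"params ≈ {n_params / 1e9:.1f}B, tokens ≈ {n_tokens / 1e12:.2f}T")
```

Changing tokens_per_param is the simplest way to explore deliberately over-trained small models, where inference cost matters more than training-compute optimality.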

Long context — beyond 128K

Every 1M-token model that ships still degrades at needle-in-a-haystack on real documents outside its training distribution. Long context without retrieval discipline is theatre.

Quantization, distillation, pruning

INT4 weights with FP16 activations is the inference sweet spot. FP8 training is the 2024-2025 Hopper/Blackwell standard. Structured pruning works; unstructured pruning does not on real hardware.
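What "INT4 weights" means in practice is easiest to see on one group of weights. This is a minimal sketch of symmetric round-to-nearest quantization in plain Python, with one scale per group. It is the simplest scheme, not the method of any particular library, and the weights are made-up values. Production schemes add calibration and error compensation on top of this idea.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest INT4: integer codes in [-8, 7] plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights; activations stay in higher precision."""
    return [c * scale for c in codes]

# Illustrative weight group (real groups are typically 32 to 128 values).
group = [0.12, -0.50, 0.33, 0.07, -0.21]
codes, scale = quantize_int4(group)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(group, restored))

print("codes:", codes)
print("max abs error:", max_err, "(bounded by scale / 2 =", scale / 2, ")")
```

The error bound of scale / 2 per weight is why a single outlier in a group hurts: it inflates the scale for every other weight in that group.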

How to read a frontier paper in 30 minutes

A mechanical six-step procedure turns arXiv triage from passive reading into active evidence extraction. Build a paper-graph, not a paper-pile.

§ Part II

LLM Systems Engineering

Twelve chapters on shipping models in production — H100/B200, FlashAttention, FSDP/ZeRO, RLHF/DPO/GRPO, vLLM/TensorRT-LLM/SGLang, llm-d, MCP, A2A, LangGraph, observability, red-teaming.

Inside an H100/B200

The roofline model is not optional. Know whether your kernel is compute-bound or memory-bound before you write a single line of optimization code.
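The compute-bound versus memory-bound question is a one-line comparison once arithmetic intensity is known. The sketch below is plain Python. The two hardware constants are approximate public spec-sheet figures for an H100 SXM (dense BF16 throughput and HBM bandwidth) and should be treated as placeholder assumptions. The matmul traffic model also assumes each operand is read or written exactly once, which real kernels only approach.

```python
def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix touched once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def classify(intensity, peak_flops, mem_bandwidth):
    """Return the roofline regime and the attainable FLOP/s ceiling."""
    ridge = peak_flops / mem_bandwidth
    attainable = min(peak_flops, intensity * mem_bandwidth)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    return regime, attainable

# Assumed, approximate H100 SXM figures; substitute your own hardware numbers.
PEAK_BF16_FLOPS = 989e12   # FLOP/s
HBM_BANDWIDTH = 3.35e12    # bytes/s

for name, shape in {"decode (batch 1)": (1, 4096, 4096),
                    "large training matmul": (8192, 8192, 8192)}.items():
    ai = matmul_intensity(*shape)
    regime, ceiling = classify(ai, PEAK_BF16_FLOPS, HBM_BANDWIDTH)
    print(f"{name}: intensity {ai:.1f} FLOP/byte, {regime}, ceiling {ceiling / 1e12:.1f} TFLOP/s")
```

Under these assumptions a batch-1 decode matmul sits far below the ridge point, so batching or reducing bytes moved pays off there and faster math does not.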

FlashAttention and Triton

Naive attention materializes an N×N matrix in HBM. FlashAttention-2 never does — and that difference is the reason your long-context model fits in memory at all.
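The reason the N×N matrix never has to exist is the online softmax trick: a running maximum and running sum let each block of scores be folded in and then discarded. The sketch below shows that idea for one query row with scalar values, in plain Python. It illustrates the numerical identity only. It is not the FlashAttention-2 kernel, and the block size and test data are arbitrary assumptions.

```python
import math

def attention_row_naive(scores, values):
    """Softmax-weighted sum with the whole score row held in memory."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)

def attention_row_tiled(scores, values, block=2):
    """Same result, visiting scores one block at a time (online softmax)."""
    running_max = -math.inf
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old maximum (exp(-inf) is 0.0).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc *= correction
        for s, v in zip(s_blk, v_blk):
            e = math.exp(s - new_max)
            running_sum += e
            acc += e * v
        running_max = new_max
    return acc / running_sum

# Arbitrary illustrative scores and values for a single query row.
scores = [0.3, 2.1, -1.0, 4.2, 0.0, 3.3, -2.5]
values = [1.0, -2.0, 0.5, 3.0, 7.0, -1.5, 2.0]

naive = attention_row_naive(scores, values)
tiled = attention_row_tiled(scores, values, block=3)
print("naive:", naive, "tiled:", tiled)
```

Only the three running quantities survive between blocks, so memory for this row is constant in sequence length. The real kernel applies the same recurrence to tiles held in on-chip SRAM.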

Distributed training

Training a 70B model on 8 GPUs requires four distinct parallelism strategies applied simultaneously. Get the sharding wrong and you lose 40% of your compute to communication.

RLHF, DPO, GRPO, and reasoning RL

RLHF put RL in the loop. DPO removed the reward model. GRPO brought RL back — with verifiable rewards. Three rounds of the same argument, each sharper.
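Two of the three rounds fit in a few lines each. The sketch below, in plain Python, gives the DPO loss for a single preference pair (sequence log-probabilities under the policy and a frozen reference model) and the group-normalized advantage used in GRPO. The inputs are made-up numbers, beta = 0.1 is an arbitrary default, and using the population standard deviation with a small epsilon is an assumption made here for simplicity.

```python
import math
import statistics

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    if margin < -30.0:
        return -margin  # softplus(-margin) is ~ -margin here; avoids overflow in exp
    return math.log1p(math.exp(-margin))

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Illustrative numbers only.
print("DPO loss, policy == reference:", dpo_loss(-5.0, -6.0, -5.0, -6.0))
print("DPO loss, policy prefers chosen:", dpo_loss(-4.0, -8.0, -5.0, -6.0, beta=1.0))

# Verifiable rewards: 1.0 if the sampled answer checks out, else 0.0.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print("GRPO advantages:", advantages)
```

Neither function needs a learned reward model or a value network: DPO replaces the reward with a log-ratio against the reference, and GRPO replaces the critic with the group mean.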

vLLM, TensorRT-LLM, SGLang

Three serving frameworks, three architectural bets. vLLM's PagedAttention wins on generality, SGLang's RadixAttention wins on agentic prefix reuse, and TensorRT-LLM wins on raw NVIDIA throughput.

llm-d and disaggregated inference

Prefill and decode have different hardware profiles. Running them on the same GPU wastes resources. llm-d separates them into distinct Kubernetes pods — and the throughput benefit is immediate.

KV-cache, speculative decoding, Medusa

KV-cache makes serving possible. Prefix caching makes it cheap. Speculative decoding makes it fast. These three techniques stack independently and compound.
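The economics behind these techniques start with how many bytes one token costs in the cache. This is a minimal sizing sketch in plain Python. The configuration (80 layers, head dimension 128, 64 query heads, 16-bit cache entries) is an illustrative 70B-class assumption, not a figure from this guide, and the helper name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV-cache per token: one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed illustrative configuration.
LAYERS, HEAD_DIM, CONTEXT = 80, 128, 32_768
kv_heads_by_variant = {"MHA (64 KV heads)": 64, "GQA (8 KV heads)": 8, "MQA (1 KV head)": 1}

for name, kv_heads in kv_heads_by_variant.items():
    per_token = kv_cache_bytes_per_token(LAYERS, kv_heads, HEAD_DIM)
    per_sequence_gib = per_token * CONTEXT / 2**30
    print(f"{name}: {per_token / 1024:.0f} KiB/token, {per_sequence_gib:.2f} GiB at {CONTEXT} tokens")
```

Under these assumptions, moving from 64 to 8 KV heads cuts the per-sequence cache by 8×, which translates directly into more concurrent sequences per GPU. Prefix caching and speculative decoding then operate on top of whatever cache footprint remains.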

Advanced RAG and evals

Your RAG demo works because your eval questions were written by looking at the documents. Production breaks it. This chapter is the gap between demo and deployment.

MCP and A2A — the agent protocols

MCP is the USB-C of tool use. A2A is the protocol above it — how agents from different organizations discover and delegate to each other. Both are replacing bespoke OpenAPI wrappers.

LangGraph and orchestration that survives

Most agent frameworks are wrappers around a loop. LangGraph is a typed state machine with checkpointing, time-travel debugging, and explicit interrupt semantics. That is the difference between a demo and a production system.

Observability — traces, evals, regression

You cannot ship what you cannot measure. Trace-level inspection, labeled eval sets, and a regression gate on every prompt change are the difference between shipping confidently and deploying by prayer.

Red-teaming, jailbreaks, prompt injection

Your RAG pipeline is a vulnerability surface. Your MCP server is an attack vector. Your humanoid is a physical threat model. This chapter is the adversarial layer most teams skip until it is too late.

§ Part III

Embodied Intelligence

Twelve chapters from classical robotics to the dual-system humanoid foundation models — kinematics, CV, 3D perception, RL, sim-to-real, diffusion policies, RT-2, OpenVLA, π0/π0.5, SmolVLA, GR00T N1.5, LeRobot, embodied safety, capstone.

Classical robotics in one chapter

Every learned policy on a real robot is still riding on PD controllers and trajectory trackers. This chapter gives you the classical substrate you need to debug anything in the humanoid stack.
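The controller underneath a learned policy is small enough to read in full. The sketch below simulates a PD position controller on a 1-D point mass in plain Python, using semi-implicit Euler integration. The gains, mass, and time step are illustrative assumptions (kd = 2·sqrt(kp·mass) gives critical damping), and a real joint adds gravity, friction, and torque limits that this toy omits.

```python
def simulate_pd(target, kp=100.0, kd=20.0, mass=1.0, dt=0.001, steps=5000):
    """Drive a point mass from rest at 0 toward `target` with a PD law."""
    pos, vel = 0.0, 0.0
    trajectory = []
    for _ in range(steps):
        force = kp * (target - pos) - kd * vel  # proportional + derivative terms
        vel += (force / mass) * dt              # semi-implicit Euler: velocity first
        pos += vel * dt
        trajectory.append(pos)
    return pos, vel, trajectory

final_pos, final_vel, trajectory = simulate_pd(target=1.0)
print("final position:", final_pos)
print("final velocity:", final_vel)
print("peak position (overshoot check):", max(trajectory))
```

In a learned-policy stack, the policy typically supplies the target (a joint position or end-effector pose) at a low rate while a loop like this tracks it at a much higher rate, which is why tracking bugs often look like policy bugs.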

Computer vision foundations — CS231n, redux

ResNet and ViT are inside almost every robotics perception stack. CLIP, SigLIP, and DINOv2 are the vision encoders inside RT-2, OpenVLA, GR00T, and π0. This chapter is not a survey — it is the exact subset you need for embodied AI.

3D perception — NeRF, Gaussian Splatting, SAM 2

3D Gaussian Splatting gives your robot an explicit, editable 3D scene representation that is optimized from RGB images and rendered in real time. SAM 2 gives per-object video segmentation with a memory module. Together they are the scene-understanding layer for the JHU humanoid.

Reinforcement learning, refreshed

You need enough RL fluency to read a 2025 robot-RL paper and identify what they actually changed — not to be an RL researcher. This chapter covers exactly that subset: MDPs, PPO, SAC, and the three places modern robot RL diverges from the textbook.

Sim-to-real and the Isaac stack

Sim-to-real is the central practical problem of robot learning. Isaac Lab plus the Newton physics engine, co-developed by NVIDIA, Google DeepMind, and Disney Research, is the open simulation stack. NVIDIA reports that its Isaac GR00T synthetic-data blueprint generated the equivalent of 6,500 hours of demonstration data in 11 hours from a small real seed. This chapter is the simulation tier of the JHU humanoid capstone.

Diffusion policies and action chunking

Diffusion policies represent the action distribution as a denoising diffusion model conditioned on observations — capturing multimodal action distributions that Gaussian policies smear. Action chunking makes inference frequency feasible. π0 and π0.5 use flow matching on the same idea.
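Why action chunking makes inference frequency feasible shows up in a simple control loop. In the sketch below, a stand-in policy returns a chunk of future actions per call and the robot executes the first few before asking again. The dummy policy, the 16-step horizon, and the step counts are illustrative assumptions; a real policy would be a diffusion or flow-matching model conditioned on camera observations.

```python
def dummy_policy(obs, horizon=16):
    """Stand-in for an expensive policy: returns `horizon` future actions."""
    return [obs + i for i in range(horizon)]

def run_with_chunking(policy, num_steps, execute_per_call):
    """Execute `num_steps` control steps, re-planning every `execute_per_call` steps."""
    if execute_per_call < 1:
        raise ValueError("execute_per_call must be at least 1")
    executed, policy_calls = [], 0
    obs = 0  # the step index doubles as a toy observation
    while len(executed) < num_steps:
        chunk = policy(obs)
        policy_calls += 1
        for action in chunk[:execute_per_call]:
            if len(executed) == num_steps:
                break
            executed.append(action)
            obs += 1
    return executed, policy_calls

# 100 control steps: re-plan every step versus every 8 steps.
_, calls_every_step = run_with_chunking(dummy_policy, 100, execute_per_call=1)
actions, calls_chunked = run_with_chunking(dummy_policy, 100, execute_per_call=8)
print("policy calls without chunking:", calls_every_step)
print("policy calls with 8-step execution:", calls_chunked)
```

Executing fewer actions than the predicted horizon is a common compromise: the tail of each chunk is discarded so the policy can react to new observations, at the cost of more frequent inference.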

Vision-Language-Action models — RT-2 and OpenVLA

RT-2 proved that co-finetuning a VLM on robot trajectories yields emergent semantic generalization. OpenVLA reproduced the recipe and beat it with an open-source 7B-parameter model. This chapter covers the VLA architecture that all subsequent models (π0, SmolVLA, GR00T) build on.

π0, π0.5, and SmolVLA

π0 introduced flow-matching action experts on top of VLMs. π0.5 generalized to entirely new homes for 10-15 minute tasks. SmolVLA-450M runs on a MacBook and achieves 78.3% real-world success on community LeRobot datasets. This is the 2025 frontier for the JHU capstone.

NVIDIA GR00T N1 and N1.5 with Isaac Lab

GR00T N1 introduced the open humanoid foundation model with a System 1 / System 2 dual architecture. N1.5 (June 2025) raised RoboCasa success from 17.4% to 47.5% with a simplified adapter, FLARE loss, and DreamGen synthetic data. This is the high-capability candidate for the JHU humanoid.

Hugging Face LeRobot and the open robotics stack

LeRobot is the PyTorch of robotics: standardized datasets, a unified policy API (ACT, Diffusion Policy, SmolVLA, OpenVLA), and the SO-101 affordable arm at $110-150 BOM. The lerobot-record and lerobot-train CLIs are the production loop. This is the fastest path to real VLA data.

Safety and alignment for embodied AI

Embodied AI has three threat surfaces that text AI never faced: physical irreversibility, partial observability of human intent, and visual prompt injection. Hard-coded safety filters, not aligned policies, are the engineering reality in 2026.

Capstone — a humanoid home assistant, end to end

The JHU humanoid home-assistant capstone: DINOv2 + SAM 2 + 3DGS scene memory, LangGraph planner with MCP tools, SmolVLA or GR00T N1.5 policy, Isaac Lab simulation, a five-layer safety stack, and rerun.io observability. Not a demo — six months of work.

A note before you begin

This is not an introduction to AI. It assumes you already build with RAG, LangChain/LangGraph, LoRA/QLoRA, basic agents, and vector databases — and that you want to stop being a wrapper-engineer and start being someone who can read a CS336 lecture, a GR00T N1.5 model card, and a π0.5 paper without bluffing. Read the parts in order for the full ladder, or jump to whichever part is currently blocking you.
