Chapter 28 · Reinforcement learning, refreshed — From Tokens to Embodied Minds

You do not need to be an RL researcher to work on robots in 2026. You need enough fluency to read a robot RL paper and identify what it actually changed. That gap is smaller than you think: Markov decision processes, policy gradients, PPO (Schulman et al., arXiv:1707.06347, July 20, 2017), and SAC (Haarnoja et al., arXiv:1801.01290, Jan 4, 2018) cover most of the algorithmic territory. The 2025 papers changed reward shaping, curriculum design, and real-to-sim bootstrapping, not the core algorithms. Berkeley CS 285 (Sergey Levine, Fall 2024) is the deepest treatment; OpenAI Spinning Up (2018) is the fastest entry. This chapter extracts the subset that is directly relevant to the humanoid capstone and to reading GR00T N1.5 and π0.5 without confusion.

The MDP and policy gradients

A Markov decision process is the tuple (S, A, P, R, gamma): state space S, action space A, transition dynamics P(s'|s,a), reward function R(s,a,s'), and discount factor gamma. A policy pi(a|s) maps states to action distributions. The goal is to find the policy that maximizes expected discounted return. Policy gradient methods (REINFORCE, actor-critic) optimize this directly by estimating the gradient of expected return with respect to policy parameters — the score function estimator. The variance of this estimator is the central problem: too high, and training is unstable.
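To make the score function estimator concrete, here is a minimal sketch in plain Python. It assumes a deliberately simplified policy: a softmax over a small set of discrete actions whose logits do not depend on the state. The function names (discounted_returns, softmax, reinforce_gradient) are illustrative, not taken from any library. For a softmax policy, the gradient of log pi(a) with respect to the logits is one_hot(a) minus the action probabilities, and REINFORCE weights that score by the discounted return from that step onward.

```python
import math


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, actions, rewards, gamma):
    """Score-function estimate of the return gradient w.r.t. the logits.

    Assumption: the policy is a state-independent softmax, so
    d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    returns = discounted_returns(rewards, gamma)
    grad = [0.0] * len(logits)
    for a_t, g_t in zip(actions, returns):
        for k in range(len(logits)):
            score = (1.0 if k == a_t else 0.0) - probs[k]
            grad[k] += g_t * score
    return grad
```

Because each term is multiplied by a raw return, a single lucky or unlucky episode can swing the estimate widely. That is the variance problem described above, and it is why practical methods subtract a baseline or use a learned critic.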

PPO (Proximal Policy Optimization, Schulman et al., July 2017) stabilizes policy-gradient training by limiting the size of each update with a clipped objective: the ratio of new-to-old policy probabilities (r_t = pi_new(a|s) / pi_old(a|s)) is clipped to [1-epsilon, 1+epsilon], and the objective takes the minimum of the clipped and unclipped ratio, each multiplied by the advantage estimate. This removes the incentive for large policy updates that could destabilize training. The key insight is that you want to improve the policy, but not so much that the old data (collected under pi_old) becomes a bad estimator of the new policy's value. PPO is on-policy: you collect data, update the policy on that batch for a few epochs, discard the data, and repeat.
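The clipped surrogate can be written as L_CLIP = E[ min(r_t * A_t, clip(r_t, 1-epsilon, 1+epsilon) * A_t) ]. The sketch below evaluates it for a small batch in plain Python. The function name ppo_clip_objective and the default epsilon=0.2 are assumptions chosen for illustration (0.2 is a commonly used setting); a real implementation would compute this on tensors and differentiate through it.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t = pi_new / pi_old
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum makes the objective a pessimistic bound:
        # moving the ratio outside the clip range earns no extra credit.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, pushing the ratio above 1+epsilon stops increasing the objective; with a negative advantage, pushing it below 1-epsilon stops increasing it too. In both cases the gradient vanishes once the new policy has moved far enough from pi_old.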

SAC and the off-policy alternative

SAC (Soft Actor-Critic, Haarnoja et al., Jan 2018) adds entropy regularization to the objective: maximize expected return plus alpha * H(pi), where H(pi) is the entropy of the policy and alpha is a temperature parameter. This encourages exploration and discourages premature collapse to a near-deterministic policy when several distinct actions are good. SAC is off-policy: it uses a replay buffer, so data from many past policy versions can be reused, which makes it far more sample-efficient than PPO. The trade-off is complexity: tuning alpha (or using automatic entropy tuning) and stabilizing the twin Q-networks add engineering surface area.
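Two pieces of that description are easy to show in a few lines: the replay buffer that makes SAC off-policy, and the entropy-regularized ("soft") Bellman target that the twin Q-networks regress toward. The sketch below is a simplified illustration in plain Python, not a full SAC implementation; the names ReplayBuffer and soft_q_target are hypothetical, and the Q-values and log-probability are passed in as plain numbers rather than produced by networks.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from everything stored,
        # regardless of which past policy version collected it.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_q_target(reward, done, gamma, alpha, q1_next, q2_next, logp_next):
    """y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s')).

    The min over the twin critics counters overestimation; the
    -alpha * log pi term is the entropy bonus.
    """
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * soft_value
```

A larger alpha makes low-probability (high-entropy) actions more valuable in the target, which is the mechanism behind the exploration behavior described above.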

For robot learning, PPO dominates locomotion: Isaac Lab runs many simulated environments in parallel, so samples are cheap and stability matters more than sample efficiency. SAC dominates real-robot manipulation, where the simulator is imperfect or unavailable and every rollout costs real-world time. OpenVLA and SmolVLA fine-tuning often starts from behavioral cloning (imitation learning) rather than RL, then adds online RL (SAC or PPO) for refinement, a pattern that appears in π0.5 (Physical Intelligence, arXiv:2504.16054, April 22, 2025).

What 2025 robot RL papers actually changed

Reading 2025 robot RL papers, the contributions cluster in three places that are not the core algorithm. First, reward shaping: dense reward signals that decompose the task (proximity reward + contact reward + grasp success reward) replace sparse binary rewards, dramatically accelerating learning. Second, curriculum design: starting in simulation with easy task variants (short reach distance, large target) and progressively hardening them — a pattern used by NVIDIA's GR00T training pipeline and Agility Robotics' Digit locomotion work. Third, real-to-sim bootstrapping: collecting a small real dataset, building a calibrated simulation from it, training extensively in sim, and transferring back — the loop that GR00T-Dreams closes by generating 6,500 hours of synthetic data in 11 hours from real-seed demonstrations.
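The first two items can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the reward weights, the distance units, the promotion threshold, and the growth factors are assumptions, not values from any of the papers or pipelines named above.

```python
from dataclasses import dataclass


def sparse_reward(grasp_succeeded):
    """Binary signal: nothing to learn from until the first success."""
    return 1.0 if grasp_succeeded else 0.0


def shaped_reward(distance_to_target, in_contact, grasp_succeeded,
                  w_proximity=1.0, w_contact=0.5, w_grasp=10.0):
    """Dense reward = proximity term + contact term + grasp-success term.

    The weights are illustrative assumptions and would need tuning.
    """
    proximity = -w_proximity * distance_to_target  # closer is better
    contact = w_contact if in_contact else 0.0
    grasp = w_grasp if grasp_succeeded else 0.0
    return proximity + contact + grasp


@dataclass
class ReachCurriculum:
    """Start with a short reach and a large target, then harden both."""
    reach_distance: float = 0.10  # metres (assumed unit)
    target_radius: float = 0.08   # metres (assumed unit)
    max_reach: float = 0.60
    min_radius: float = 0.02
    promote_at: float = 0.8       # success rate required to advance

    def update(self, success_rate):
        """Advance difficulty one step if the policy is succeeding often enough."""
        if success_rate >= self.promote_at:
            self.reach_distance = min(self.max_reach, self.reach_distance * 1.25)
            self.target_radius = max(self.min_radius, self.target_radius * 0.8)
        return self.reach_distance, self.target_radius
```

When reading a paper, the equivalents of these weights and thresholds are usually in an appendix table. They are worth checking because they can change results as much as the choice of algorithm.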

Berkeley CS 285 (Sergey Levine, Fall 2024) is the authoritative course. Sutton and Barto's Reinforcement Learning: An Introduction (MIT Press, 2nd ed., 2018) is the textbook. The specific sources to read for robot RL in 2025 are the Isaac Lab documentation (NVIDIA), the GR00T N1 and N1.5 tech reports, and the π0.5 paper, all of which assume PPO/SAC fluency.

Capstone connections

For the JHU humanoid capstone, RL is used in two places. First, sim-to-real training in Isaac Lab (Chapter 29): PPO trains the low-level locomotion and manipulation controller across a distribution of randomized physics parameters. Second, post-training of SmolVLA or GR00T N1.5: if behavioral cloning from 200 demonstration episodes is insufficient for a household task, adding online RL (SAC or GRPO) can close the gap. The RL fluency you build here is the prerequisite for both.
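Training "across a distribution of randomized physics parameters" means drawing a fresh set of parameters for each episode or environment instance. The sketch below shows the idea in plain Python. It does not use the Isaac Lab API; the parameter names and ranges in PHYSICS_RANGES are hypothetical placeholders, and a real setup would pass the sampled values to the simulator's own randomization hooks.

```python
import random

# Hypothetical parameter names and (low, high) ranges, for illustration only.
PHYSICS_RANGES = {
    "ground_friction": (0.4, 1.2),
    "link_mass_scale": (0.8, 1.2),
    "motor_strength_scale": (0.9, 1.1),
    "action_latency_s": (0.0, 0.02),
}


def sample_physics(rng, ranges=PHYSICS_RANGES):
    """Draw one set of physics parameters, uniformly within each range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


def sample_batch(num_envs, seed=0):
    """One independent parameter set per parallel environment."""
    rng = random.Random(seed)
    return [sample_physics(rng) for _ in range(num_envs)]
```

A policy that performs well across the whole batch cannot rely on any single friction or mass value, which is the property sim-to-real transfer depends on.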

On model-based RL

Model-based RL (Dreamer, MBPO) is not covered here because the 2025 humanoid stack uses model-free RL with a simulator rather than learned world models. The simulator is the model. Check back when GR00T-Dreams matures.

Figure 28.1. PPO vs SAC in robot RL. PPO dominates simulation-based locomotion training (Isaac Lab); SAC dominates real-robot manipulation. The 2025 literature's contributions are primarily in reward shaping, curriculum design, and real-to-sim bootstrapping, not in the core algorithms.

Retrieve before you continue

Three questions on what you just read

Q1 (Factual) Write the PPO clipped objective and explain why the clip is necessary.

Q2 (Conceptual) Why does SAC outperform PPO on sample efficiency, and when does that advantage matter for robotics?

Q3 (Synthetic) A 2025 robot RL paper reports a 3x improvement in task success. What three non-algorithmic changes should you check before attributing it to the RL algorithm?