Chapter 34 · Hugging Face LeRobot and the open robotics stack

LeRobot (Hugging Face) is to robotics what PyTorch is to deep learning: a standardized stack that lets one person go from zero to a fine-tuned VLA policy on real hardware in a weekend. It has three components: standardized datasets on the Hugging Face Hub (every dataset a versioned, queryable, parquet-backed object), a unified policy API (ACT, Diffusion Policy, π0, SmolVLA, and OpenVLA all interchangeable via the same CLI), and the SO-100/SO-101 arm (fully open hardware spec, $110-150 bill of materials, servo-driven, six degrees of freedom). The lerobot-record and lerobot-train commands are the production loop. This is the open-source counterweight to the closed industrial stacks (Isaac Lab for simulation, GR00T for the policy), and the fastest path to collecting real manipulation data, training a VLA, and deploying it on hardware you built yourself. For the JHU humanoid capstone, LeRobot is the data infrastructure regardless of which policy you use.

The LeRobot dataset format

A LeRobot dataset is a Hugging Face Hub repository with three parts: parquet files, one per episode (observation frames, action sequences, task metadata); a dataset card with the hardware and task description; and a standardized schema that specifies observation keys (observation.images.top, observation.images.wrist, observation.state) and action keys (action, as joint positions or end-effector deltas). The schema is flexible enough to accommodate different hardware (SO-101, WidowX, Franka, custom arms) while enforcing enough structure that datasets from different sources can be combined.
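To make the schema concrete, here is a minimal sketch in plain Python that checks whether one recorded frame carries the keys named above. The helper missing_keys and the sample frame are hypothetical illustrations, not part of the LeRobot library; camera names such as top and wrist depend on your setup, so treat the key list as an assumption to adjust.

```python
# Hypothetical helper, not a LeRobot API. Key names are taken from the text above.
REQUIRED_KEYS = (
    "observation.images.top",
    "observation.images.wrist",
    "observation.state",
    "action",
)


def missing_keys(frame: dict, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent from one frame, in schema order."""
    return [key for key in required if key not in frame]


# A made-up frame from a six-joint arm whose wrist camera was not recorded.
example_frame = {
    "observation.images.top": "frame_000.png",
    "observation.state": [0.0, 0.1, -0.2, 0.3, 0.0, 1.0],
    "action": [0.0, 0.1, -0.2, 0.3, 0.0, 1.0],
}

print(missing_keys(example_frame))  # ['observation.images.wrist']
```

Running a check like this on the first frame of every episode before upload catches a disconnected camera early, when re-recording is still cheap.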

SmolVLA's community pretraining is the proof: Hugging Face aggregated hundreds of contributor datasets from HF Hub, all in the LeRobot format, into a single pretraining corpus. A researcher in Tokyo who uploads a 30-episode SO-101 plate-stacking dataset is adding to the pool that a community-pretrained model like SmolVLA can draw on. The versioning (the Hub keeps each upload as a commit in the dataset repository's history, and updates can be tagged with version numbers) makes this reproducible: by pinning a revision, you can always trace which exact dataset version was used in any training run.

The unified policy API

LeRobot's policy API makes ACT, Diffusion Policy, SmolVLA, and OpenVLA interchangeable: lerobot-train --policy.type=smolvla vs lerobot-train --policy.type=act vs lerobot-train --policy.type=diffusion. The same dataset, the same evaluation loop, the same CLI. This is significant for the capstone: you can run a principled ablation over policies on identical data (50 episodes, 10 evaluation trials) without writing any training code. The comparison that matters for the capstone decision is ACT (deterministic transformer, fast inference) vs Diffusion Policy (multimodal, slower inference) vs SmolVLA (VLM-conditioned, language-instructable). SmolVLA will typically win on tasks that require language instructions or generalization to new objects; ACT will often win on precision motor tasks with consistent demonstrations.
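The ablation described above amounts to running the same command three times with one flag changed. The sketch below only builds the command lines and prints them; it does not launch training. It assumes the dotted flag form (--policy.type=..., --dataset.repo_id=..., --output_dir=...); flag names change between LeRobot releases, so confirm them with lerobot-train --help. The dataset name is a placeholder, and train_command and ablation_commands are hypothetical helpers.

```python
import shlex

# Policies compared in the ablation; names assumed to match the CLI's policy types.
POLICIES = ("act", "diffusion", "smolvla")


def train_command(policy: str, dataset_repo_id: str,
                  output_root: str = "outputs/train") -> list[str]:
    """Build the argument list for one training run; nothing is executed here."""
    return [
        "lerobot-train",
        f"--policy.type={policy}",
        f"--dataset.repo_id={dataset_repo_id}",
        f"--output_dir={output_root}/{policy}",  # one directory per policy
    ]


def ablation_commands(dataset_repo_id: str) -> list[str]:
    """One shell-quoted command line per policy, all on the same dataset."""
    return [shlex.join(train_command(p, dataset_repo_id)) for p in POLICIES]


# Placeholder repository id; replace with your own dataset.
for line in ablation_commands("your-username/so101_sort_blocks"):
    print(line)
```

Keeping the dataset id in one variable and varying only the policy is what makes the comparison fair: any difference in success rate then comes from the policy, not from the data.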

The SO-100/SO-101 arm is the reference hardware: six servo-driven joints, open Fusion 360 CAD files, 3D-printable components, and a $110-150 bill of materials. The teleoperation controller (a leader arm or a SpaceMouse) connects via USB; lerobot-record handles the data collection, time-synchronization, and HF Hub upload. Building one takes 8-12 hours; kits are also sold by several vendors in the LeRobot community.

Data discipline — the underrated part

The most important thing LeRobot teaches is data discipline. The temptation when recording manipulation demos is to collect 'enough' episodes without measuring quality. LeRobot's dataset card forces you to document: hardware, task description, success definition, failure modes, and per-episode success labels. The per-episode success labels are critical: training a policy on flawed demonstrations (failed attempts, or ones where the human recovered from a mistake) is a common source of policy confusion. The lerobot-record CLI includes a real-time visualization that lets you label each episode before uploading.
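As a sketch of what per-episode success labels buy you, the snippet below keeps a small label record per episode and derives the list of episodes that are safe to train on. EpisodeLabel and both functions are hypothetical; LeRobot does not prescribe this structure, and the labels here are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeLabel:
    """Hypothetical per-episode annotation kept alongside the dataset card."""
    episode_index: int
    success: bool
    note: str = ""


def successful_episodes(labels: list[EpisodeLabel]) -> list[int]:
    """Indices of episodes marked successful, in recording order."""
    return [label.episode_index for label in labels if label.success]


def success_fraction(labels: list[EpisodeLabel]) -> float:
    """Share of recorded episodes that were labeled successful."""
    if not labels:
        raise ValueError("no episodes labeled")
    return len(successful_episodes(labels)) / len(labels)


# Made-up labels for a four-episode recording session.
labels = [
    EpisodeLabel(0, True),
    EpisodeLabel(1, False, "dropped block, regrasped"),
    EpisodeLabel(2, True),
    EpisodeLabel(3, True),
]

keep = successful_episodes(labels)
print(keep)                      # [0, 2, 3]
print(success_fraction(labels))  # 0.75
```

The kept indices can then be handed to whatever episode-selection option your dataset loader offers, and the success fraction belongs in the dataset card so that readers can see how much was filtered out.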

For the JHU humanoid capstone, the LeRobot dataset of 200 SO-101 manipulation episodes is one of the public output artifacts. It should be uploaded to HF Hub under your username, documented with a detailed dataset card, and verified against the LeRobot schema validator before use in training. This dataset is both a training artifact and a public portfolio piece — the data quality discipline is visible to anyone who downloads and examines it.

On SO-101 vs a humanoid platform

The SO-101 is a tabletop arm, not a humanoid. For the full JHU capstone (mobile humanoid), you will need to adapt the LeRobot data collection pipeline to your specific hardware. The dataset format and training CLI transfer directly; the hardware interface requires writing a custom LeRobot robot class.
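A custom robot class is mostly a thin adapter: connect, read an observation, send an action, disconnect. The sketch below shows that shape in plain Python with a fake arm, so it runs without hardware. RobotInterface and FakeHumanoidArm are hypothetical names; the actual LeRobot base class has its own method names and additional requirements (calibration and feature descriptions, for example), so consult the LeRobot documentation before porting this.

```python
from typing import Protocol


class RobotInterface(Protocol):
    """Hypothetical minimal contract; the real LeRobot base class has more methods."""

    def connect(self) -> None: ...
    def get_observation(self) -> dict[str, object]: ...
    def send_action(self, action: dict[str, float]) -> dict[str, float]: ...
    def disconnect(self) -> None: ...


class FakeHumanoidArm:
    """Stand-in for real hardware: stores joint targets instead of driving motors."""

    JOINTS = ("shoulder_pan", "shoulder_lift", "elbow", "wrist")
    LIMIT = 1.0  # assumed symmetric joint limit in radians, for illustration only

    def __init__(self) -> None:
        self.connected = False
        self.positions = {name: 0.0 for name in self.JOINTS}

    def connect(self) -> None:
        self.connected = True

    def get_observation(self) -> dict[str, object]:
        if not self.connected:
            raise RuntimeError("robot is not connected")
        # Same key as the dataset schema, so recorded frames line up with training.
        return {"observation.state": [self.positions[j] for j in self.JOINTS]}

    def send_action(self, action: dict[str, float]) -> dict[str, float]:
        if not self.connected:
            raise RuntimeError("robot is not connected")
        # Clamp to joint limits and return what was actually commanded,
        # since that is the value worth storing as the recorded action.
        applied = {j: max(-self.LIMIT, min(self.LIMIT, action[j])) for j in self.JOINTS}
        self.positions.update(applied)
        return applied

    def disconnect(self) -> None:
        self.connected = False


robot: RobotInterface = FakeHumanoidArm()
robot.connect()
applied = robot.send_action(
    {"shoulder_pan": 0.2, "shoulder_lift": 1.7, "elbow": -0.4, "wrist": 0.0}
)
state = robot.get_observation()["observation.state"]
robot.disconnect()

print(applied["shoulder_lift"])  # 1.0, clamped from 1.7
print(state)                     # [0.2, 1.0, -0.4, 0.0]
```

Writing the adapter against a fake first lets you test the recording pipeline end to end before any motor moves.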

Figure 34.1 The LeRobot production loop: SO-101 hardware → lerobot-record (teleoperate + label) → HF Hub (versioned parquet dataset) → lerobot-train (ACT, Diffusion Policy, or SmolVLA via the --policy.type flag). All three policies train and evaluate on the same dataset with the same CLI. SmolVLA wins on language-instructed generalization; ACT wins on precision motor tasks.

Primary source · Build · Capstone ladder

Primary source. LeRobot documentation, Hugging Face (current); SmolVLA tutorial, huggingface.co/docs/lerobot/smolvla

Build. Build or buy an SO-101 arm. Record a 50-episode 'sort blocks by color' dataset with lerobot-record and upload it to HF Hub under your username. Train ACT, Diffusion Policy, and SmolVLA on the same dataset using lerobot-train. Evaluate each on 10 test episodes. Report success rate, training time, and GPU memory for each policy.

Capstone ladder. This is the production data stack for the JHU humanoid. The LeRobot 200-episode SO-101 dataset is one of the public output artifacts of the capstone. Even if your final hardware is a full humanoid, the data discipline (schema, success labels, HF Hub upload) transfers. The unified policy API lets you ablate ACT vs Diffusion Policy vs SmolVLA on identical data without writing training code.
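When you report success rates for the Build exercise, keep in mind that 10 evaluation episodes per policy is a small sample, so a gap of one or two successes may be noise. A minimal sketch, using the standard Wilson score interval, shows how wide the uncertainty is. The success counts below are made up for illustration and are not measured results; wilson_interval is a hypothetical helper name.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Made-up counts out of 10 evaluation episodes; not real results.
results = {"act": 8, "diffusion": 7, "smolvla": 6}

for policy, wins in results.items():
    low, high = wilson_interval(wins, 10)
    print(f"{policy:10s} {wins}/10  95% interval: {low:.2f} to {high:.2f}")
```

If the intervals for two policies overlap heavily, report the result as inconclusive or run more evaluation episodes rather than declaring a winner.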

Retrieve before you continue

Three questions on what you just read

Q1 Factual What are the three standard observation keys in the LeRobot dataset schema?

Q2 Conceptual Why does data standardization on HF Hub enable community pretraining for SmolVLA?

Q3 Synthetic You train ACT, Diffusion Policy, and SmolVLA on the same 50-episode SO-101 dataset. SmolVLA has the lowest success rate after 5 epochs of training. What are the two most likely causes?