Predict before you read

SmolVLA's community pretraining on LeRobot Hub datasets achieved 78.3% real-world success. What property of the LeRobot dataset format makes this possible?

Think about what makes it possible to aggregate datasets from hundreds of different contributors into a single pretraining corpus.

From Tokens to Embodied Minds  ·  Chapter 34 of 36

Hugging Face LeRobot and the open robotics stack

The community-grade equivalent of Isaac Lab

$110–150: SO-101 arm BOM. Fully open hardware spec, the lowest-cost path to real robot manipulation data.
lerobot-train: CLI command. Unified training loop for ACT, Diffusion Policy, SmolVLA, OpenVLA on any LeRobot dataset.
HF Hub: every LeRobot dataset is versioned, queryable, and parquet-backed, the data discipline that makes community pretraining possible.

LeRobot (Hugging Face) is to robotics what PyTorch is to deep learning: a standardized stack that lets one person go from zero to a fine-tuned VLA policy on real hardware in a weekend. It has three components: standardized datasets on the Hugging Face Hub (every dataset a versioned, queryable, parquet-backed object), a unified policy API (ACT, Diffusion Policy, π0, SmolVLA, and OpenVLA, all interchangeable via the same CLI), and the SO-100/SO-101 arm (fully open hardware spec, $110–150 bill of materials, servo-driven, six degrees of freedom). The lerobot-record and lerobot-train commands form the production loop.

This is the open-source counterweight to the closed industrial stacks (Isaac Lab for simulation, GR00T for the policy), and the fastest path for one person to collect real manipulation data, train a VLA, and deploy it on hardware they built. For the JHU humanoid capstone, LeRobot is the data infrastructure regardless of which policy you use.

The LeRobot dataset format

A LeRobot dataset is a Hugging Face Hub repository with three parts: one parquet file per episode (observation frames, action sequences, task metadata), a dataset card describing the hardware and task, and a standardized schema. The schema specifies the observation keys (observation.images.top, observation.images.wrist, observation.state) and the action key (action, stored as joint positions or end-effector deltas). It is flexible enough to accommodate different hardware (SO-101, WidowX, Franka, custom arms) while enforcing enough structure that datasets from different sources can be used together.
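The compatibility idea above can be sketched as a small key check. This is a minimal illustration, not LeRobot's own validator: the function name check_features, the shape tuples, and the rule that at least one camera stream must exist are all assumptions made for the example.

```python
# Sketch only: a hand-rolled compatibility check, not LeRobot's validator.
# Key names come from the chapter; shapes and rules are illustrative.
REQUIRED_KEYS = {"observation.state", "action"}
CAMERA_PREFIX = "observation.images."


def check_features(features: dict[str, tuple[int, ...]]) -> list[str]:
    """Return human-readable problems; an empty list means compatible."""
    problems = []
    for key in sorted(REQUIRED_KEYS - features.keys()):
        problems.append(f"missing required key: {key}")
    if not any(key.startswith(CAMERA_PREFIX) for key in features):
        problems.append("no camera stream under observation.images.*")
    action_shape = features.get("action")
    if action_shape is not None and len(action_shape) != 1:
        problems.append("action must be a flat vector")
    return problems


# A six-joint arm with two cameras, as described in the chapter.
so101 = {
    "observation.images.top": (480, 640, 3),
    "observation.images.wrist": (480, 640, 3),
    "observation.state": (6,),
    "action": (6,),
}
# A contributor dataset that forgot to log proprioceptive state.
broken = {"observation.images.top": (480, 640, 3), "action": (7,)}

print(check_features(so101))
print(check_features(broken))
```

Note that the two example datasets have different action dimensions (6 and 7) and the check does not object: the structure is shared, the hardware-specific sizes are not.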

SmolVLA's community pretraining is the proof: Hugging Face aggregated hundreds of contributor datasets from HF Hub, all in the LeRobot format, into a single pretraining corpus. A researcher in Tokyo who uploads a 30-episode SO-101 plate-stacking dataset contributes to SmolVLA's pretrained capabilities globally. The versioning (every dataset is immutable once uploaded, with semantic versioning for updates) makes this reproducible — you can always trace which exact dataset version was used in any training run.
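The reproducibility argument can be made concrete with a pinned manifest. The sketch below is an assumption about how one might track dataset versions, not how Hugging Face built the SmolVLA corpus; DatasetRef, build_corpus, and the repository names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRef:
    repo_id: str       # hypothetical Hub repository name
    revision: str      # pinned tag or commit hash, never a moving branch
    num_episodes: int


def build_corpus(refs: list[DatasetRef]) -> tuple[list[DatasetRef], int]:
    """Deduplicate by repo_id and refuse unpinned or conflicting revisions."""
    pinned: dict[str, DatasetRef] = {}
    for ref in refs:
        if ref.revision in {"", "main"}:
            raise ValueError(f"{ref.repo_id}: pin a tag or commit, not a branch")
        seen = pinned.setdefault(ref.repo_id, ref)
        if seen.revision != ref.revision:
            raise ValueError(f"{ref.repo_id}: conflicting revisions")
    corpus = sorted(pinned.values(), key=lambda r: r.repo_id)
    return corpus, sum(r.num_episodes for r in corpus)


refs = [
    DatasetRef("tokyo-lab/so101_plate_stacking", "v1.0", 30),
    DatasetRef("alice/so101_pick_place", "v2.1", 50),
    DatasetRef("tokyo-lab/so101_plate_stacking", "v1.0", 30),  # duplicate entry
]
corpus, total_episodes = build_corpus(refs)
for ref in corpus:
    print(f"{ref.repo_id}@{ref.revision}: {ref.num_episodes} episodes")
print("total episodes:", total_episodes)
```

Storing this manifest alongside a training run is what lets you answer, later, exactly which data a checkpoint saw.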

The unified policy API

LeRobot's policy API makes ACT, Diffusion Policy, SmolVLA, and OpenVLA interchangeable: lerobot-train --policy.type=smolvla vs lerobot-train --policy.type=act vs lerobot-train --policy.type=diffusion (recent releases spell the selector as a dotted --policy.type option; check lerobot-train --help for your installed version). The dataset, the evaluation loop, and the CLI stay the same. This matters for the capstone: you can run a principled ablation over policies on identical data (50 episodes, 10 evaluation trials) without writing any training code. The comparison that matters for the capstone decision is ACT (deterministic transformer, fast inference) vs Diffusion Policy (multimodal, slower inference) vs SmolVLA (VLM-conditioned, language-instructable). SmolVLA will typically win on tasks that require language instructions or generalization to new objects; ACT will often win on precision motor tasks with consistent demonstrations.
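The ablation described above can be planned in a few lines. This sketch only builds the command strings and does not run training. The flag spellings (--policy.type, --dataset.repo_id, --output_dir) are assumed from recent LeRobot releases and should be checked against lerobot-train --help; the dataset name is hypothetical. The second half computes a Wilson score interval to show how much uncertainty 10 evaluation trials leave.

```python
import math
import shlex

POLICIES = ["act", "diffusion", "smolvla"]
DATASET = "your-username/so101_capstone"  # hypothetical repo id


def train_command(policy: str) -> list[str]:
    """Build one training command; flag spellings are assumptions."""
    return [
        "lerobot-train",
        f"--policy.type={policy}",
        f"--dataset.repo_id={DATASET}",
        f"--output_dir=outputs/train/{policy}",
    ]


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


for policy in POLICIES:
    print(shlex.join(train_command(policy)))

low, high = wilson_interval(successes=7, trials=10)
print(f"7/10 successes: 95% interval from {low:.2f} to {high:.2f}")
```

With 7 successes in 10 trials the interval spans roughly 0.40 to 0.89, so a one- or two-trial gap between policies is not strong evidence on its own. More evaluation trials, or repeated seeds, tighten the comparison.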

The SO-100/SO-101 arm hardware: six servo-driven joints, open Fusion 360 CAD files, 3D-printable components, and a $110–150 bill of materials. The teleoperation controller (a leader arm or a SpaceMouse) connects via USB; lerobot-record handles data collection, time synchronization, and HF Hub upload. Building one takes 8–12 hours; prebuilt kits are also available from several vendors in the LeRobot community.

Data discipline — the underrated part

The most important thing LeRobot teaches is data discipline. The temptation when recording manipulation demos is to collect 'enough' episodes without measuring quality. LeRobot's dataset card forces you to document: hardware, task description, success definition, failure modes, and per-episode success labels. The per-episode success labels are critical — training a policy on failed demonstrations (where the human recovered from a mistake) is a common source of policy confusion. The lerobot-record CLI includes a real-time visualization that lets you label each episode before uploading.
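Per-episode success labels are only useful if they gate what reaches training. A minimal sketch of that gate follows; the EpisodeLabel record and split_by_success helper are hypothetical and stand in for whatever label storage your dataset uses.

```python
from dataclasses import dataclass


@dataclass
class EpisodeLabel:
    index: int
    success: bool
    note: str = ""  # free-text failure mode, mirrored in the dataset card


def split_by_success(labels: list[EpisodeLabel]) -> tuple[list[int], list[int]]:
    """Return (episode indices to train on, episode indices to hold out)."""
    keep = [e.index for e in labels if e.success]
    drop = [e.index for e in labels if not e.success]
    return keep, drop


labels = [
    EpisodeLabel(0, True),
    EpisodeLabel(1, False, "dropped plate, operator re-grasped"),
    EpisodeLabel(2, True),
    EpisodeLabel(3, True),
]
keep, drop = split_by_success(labels)
print("train on:", keep)
print("held out:", drop)
print(f"labelled success rate: {len(keep) / len(labels):.0%}")
```

Held-out failures are still worth keeping in the repository: the notes document failure modes, and the episodes can be revisited if you later want recovery behavior in the policy.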

For the JHU humanoid capstone, the LeRobot dataset of 200 SO-101 manipulation episodes is one of the public output artifacts. It should be uploaded to HF Hub under your username, documented with a detailed dataset card, and verified against the LeRobot schema validator before use in training. This dataset is both a training artifact and a public portfolio piece — the data quality discipline is visible to anyone who downloads and examines it.

On SO-101 vs a humanoid platform

The SO-101 is a tabletop arm, not a humanoid. For the full JHU capstone (mobile humanoid), you will need to adapt the LeRobot data collection pipeline to your specific hardware. The dataset format and training CLI transfer directly; the hardware interface requires writing a custom LeRobot robot class.
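The shape of such a hardware interface can be sketched without importing LeRobot. The class below is a standalone illustration under stated assumptions: RobotInterface is a hypothetical stand-in for LeRobot's real base class, the method names are illustrative and should be checked against the installed library, and the joint names and limits are invented.

```python
from abc import ABC, abstractmethod


class RobotInterface(ABC):
    """Hypothetical stand-in for the base class a custom robot would subclass."""

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def get_observation(self) -> dict[str, list[float]]: ...

    @abstractmethod
    def send_action(self, action: dict[str, float]) -> dict[str, float]: ...

    @abstractmethod
    def disconnect(self) -> None: ...


class SimulatedHumanoidArm(RobotInterface):
    # Invented joint names and limits in radians, for illustration only.
    JOINT_LIMITS = {
        "shoulder_pitch": (-1.5, 1.5),
        "elbow": (0.0, 2.5),
        "wrist_roll": (-3.0, 3.0),
    }

    def __init__(self) -> None:
        self.connected = False
        self.positions = {name: 0.0 for name in self.JOINT_LIMITS}

    def connect(self) -> None:
        self.connected = True  # real hardware would open a bus here

    def get_observation(self) -> dict[str, list[float]]:
        # Joint order follows JOINT_LIMITS so the state vector is stable.
        return {"observation.state": [self.positions[n] for n in self.JOINT_LIMITS]}

    def send_action(self, action: dict[str, float]) -> dict[str, float]:
        if not self.connected:
            raise RuntimeError("connect() must be called first")
        sent = {}
        for name, target in action.items():
            low, high = self.JOINT_LIMITS[name]
            sent[name] = min(max(target, low), high)  # clamp to joint limits
        self.positions.update(sent)
        return sent  # the action actually applied, which is what gets logged

    def disconnect(self) -> None:
        self.connected = False


robot = SimulatedHumanoidArm()
robot.connect()
sent = robot.send_action({"elbow": 9.0, "wrist_roll": -1.0})
observation = robot.get_observation()
robot.disconnect()
print(sent, observation)
```

Returning the clamped action rather than the requested one is a deliberate choice in this sketch: the dataset should record what the hardware did, not what the operator asked for.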

[Figure 34.1 diagram, "LeRobot Production Loop": SO-101 arm ($110–150 BOM, open hardware) → lerobot-record (teleoperate and label, 50–200 episodes) → HF Hub dataset (parquet, versioned, standard schema, community aggregation) → lerobot-train (interchangeable policies, same CLI, same eval loop). Policy comparison on identical 50-episode data: ACT (fast inference, precision tasks, no language grounding); Diffusion Policy (multimodal distribution, slower inference, no language grounding); SmolVLA (language-instructed, 78.3% pretrained, consumer GPU).]
Figure 34.1. The LeRobot production loop: SO-101 hardware → lerobot-record (teleoperate + label) → HF Hub (versioned parquet dataset) → lerobot-train (ACT, Diffusion Policy, or SmolVLA via --policy flag). All three policies train and evaluate on the same dataset with the same CLI. SmolVLA wins on language-instructed generalization; ACT wins on precision motor tasks.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What are the three standard observation keys in the LeRobot dataset schema?
Q2 (Conceptual): Why does data standardization on HF Hub enable community pretraining for SmolVLA?
Q3 (Synthetic): You train ACT, Diffusion Policy, and SmolVLA on the same 50-episode SO-101 dataset. SmolVLA has the lowest success rate after 5 epochs of training. What are the two most likely causes?