Chapter 26 · Computer vision foundations — CS231n, redux

ResNet (He et al., arXiv:1512.03385, Dec 10, 2015) and ViT (Dosovitskiy et al., arXiv:2010.11929, Oct 22, 2020) are still inside almost every robotics perception stack. The year is 2026 and this is not nostalgia — it is architecture path dependence. The vision encoders inside RT-2, OpenVLA, GR00T N1.5, and π0 all trace back to one of these two families. What changed is the pretraining: CLIP (Radford et al., arXiv:2103.00020, Feb 26, 2021), SigLIP (Zhai et al., arXiv:2303.15343, March 27, 2023), and DINOv2 (Oquab et al., arXiv:2304.07193, April 14, 2023) turned the architecture into a foundation model. This chapter is not a survey of computer vision. It is the exact subset of Stanford CS231n that matters for embodied AI in 2026: skip connections and patch embeddings, contrastive and self-supervised pretraining, and the specific fusion choices that make OpenVLA beat RT-2-X 55B by 16.5 percentage points with a 7× smaller model.

ResNet and ViT: two architectures that survived

ResNet (He et al., Dec 2015) made very deep networks trainable with a single idea: skip connections that add the input of a block directly to its output. The paper frames the obstacle as a degradation problem, where deeper plain networks reach higher training error than shallower ones. The identity path lets gradients flow unobstructed through hundreds of layers, so very deep networks train stably. With the residual formulation F(x) + x, a block only needs to learn the residual F(x) rather than the full mapping, which is a simpler optimization problem. ResNets became the backbone of choice for convolutional feature extraction and remain embedded in many robotics pipelines via ImageNet-pretrained weights.
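To make the gradient argument concrete, here is a minimal sketch, not the ResNet architecture itself. It assumes a toy stack of scalar layers where F(x) = w * x, so the gradient through each layer is w for a plain layer and 1 + w for a residual one. The function names and the value w = 0.01 are illustrative assumptions.

```python
def plain_chain_gradient(w: float, depth: int) -> float:
    """d(output)/d(input) for `depth` stacked plain layers y = w * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= w  # each plain layer scales the gradient by w
    return grad


def residual_chain_gradient(w: float, depth: int) -> float:
    """Same stack, but each layer computes y = x + F(x) with F(x) = w * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w  # the identity path contributes the constant 1
    return grad


# Toy setting (assumed): small per-layer weight, 50 layers.
plain = plain_chain_gradient(0.01, 50)
residual = residual_chain_gradient(0.01, 50)
print(f"plain: {plain:.3e}  residual: {residual:.3f}")
```

In this toy setting the plain-stack gradient collapses toward zero while the residual-stack gradient stays of order one. Real networks are not scalar products, so treat this only as intuition for why the identity path helps.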

ViT (Dosovitskiy et al., Oct 2020) replaced convolutions with a transformer applied to a sequence of fixed-size image patches. An image is split into a grid of non-overlapping P x P-pixel patches; each patch is flattened, linearly projected to a token embedding, and processed by a standard transformer encoder. The key insight: self-attention can relate any two patches within a single layer, whereas convolutions build up long-range spatial dependencies only through many stacked layers. ViT-B/16 uses 16x16-pixel patches; ViT-B/14 (used by DINOv2) uses 14x14-pixel patches, which gives a finer token grid at the same input resolution. At scale and with the right pretraining, ViT backbones generally outperform ResNets on the benchmarks that matter for robotics.
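The patch arithmetic is easy to check by hand. The sketch below assumes a square 224-pixel input (a common choice, not something this chapter fixes) and uses a toy 4x4 grid of integers in place of a real image. `num_patches` and `patchify` are illustrative helper names, and the linear projection to token embeddings is left out.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side


def patchify(image, patch_size):
    """Split an H x W grid (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Assumed 224-pixel input: /16 gives a 14x14 grid, /14 gives a 16x16 grid.
tokens_p16 = num_patches(224, 16)
tokens_p14 = num_patches(224, 14)

# Toy 4x4 "image" whose pixel values are 0..15, split into 2x2 patches.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, 2)
print(tokens_p16, tokens_p14, toy_patches[0])
```

The smaller patch size yields more tokens for the same image, which is the "finer spatial resolution" referred to above. It also makes the sequence longer, so attention costs more.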

CLIP, SigLIP, DINOv2 — three pretraining philosophies

CLIP (Radford et al., Feb 2021) trains a vision encoder and text encoder jointly with contrastive loss on 400M image-text pairs. The result: a vision model whose feature space is aligned with language, enabling zero-shot classification and language-conditioned retrieval. The limitation: the contrastive objective optimizes for global image-text alignment, not for dense spatial features. CLIP features are excellent for semantic grounding ("find the mug") but weak for fine-grained localization ("find the handle of the mug").
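The contrastive objective can be sketched in a few lines of plain Python. This is a simplified illustration, not CLIP's training code. It assumes a fixed temperature of 0.07 (CLIP learns this value during training), tiny hand-written embeddings, and hypothetical helper names. Each image must pick out its own caption among all captions in the batch, and vice versa.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one row, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the match."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row is softmax-normalized over all texts in the batch.
    i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = clip_style_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
mismatched = clip_style_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(f"aligned: {aligned:.6f}  mismatched: {mismatched:.3f}")
```

Note that the loss only ever sees one embedding per image. Nothing in it rewards getting individual patches right, which is the source of the localization weakness described above.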

SigLIP (Zhai et al., March 2023) replaces the softmax contrastive loss with a sigmoid loss applied to individual image-text pairs, eliminating the normalization across the batch that makes large-batch CLIP training expensive. SigLIP So400m (a shape-optimized ViT of roughly 400M parameters trained with the SigLIP loss) achieves better image-language alignment than CLIP ViT-L at lower training cost. It is the language-aligned encoder in OpenVLA and π0.

DINOv2 (Oquab et al., April 2023) takes the opposite approach: no language, pure self-supervised learning on curated image data (142M images, LVD-142M). The pretraining objective combines self-distillation with masked image modeling: a student ViT learns to match the outputs of a teacher ViT, and the teacher is an exponential moving average of the student. The result is a dense, spatially precise feature extractor with no language alignment: ideal for object localization and depth estimation, weak for instruction following.
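Returning to the SigLIP objective, the sigmoid loss can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The temperature 10 and bias -10 are the initial values reported in the SigLIP paper, held fixed here although they are learned in practice. The embeddings are toy values and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def siglip_style_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss: every (image, text) pair is a binary decision."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            label = 1.0 if i == j else -1.0  # +1 for matches, -1 otherwise
            logit = temperature * dot(img, txt) + bias
            total += log_sigmoid(label * logit)  # no softmax over the batch
    return -total / len(imgs)


aligned = siglip_style_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
mismatched = siglip_style_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(f"aligned: {aligned:.4f}  mismatched: {mismatched:.4f}")
```

Each term depends on a single pair, so no row or column of the similarity matrix has to be normalized as a whole. That independence is what removes the batch-wide normalization step.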

Why OpenVLA fuses both

OpenVLA (Kim et al., arXiv:2406.09246, June 13, 2024) concatenates DINOv2 and SigLIP So400m patch features channel-wise, then projects the fused tokens into the input space of its Llama 2 7B backbone. (The released model inherits the larger DINOv2 ViT-L/14 from its Prismatic VLM base, not the ViT-B/14 used in this chapter's build.) The intuition is direct: DINOv2 provides dense spatial precision (object boundaries, depth cues, patch-level geometry) while SigLIP provides language alignment (the phrase "grasp the red cup" maps onto the correct region). OpenVLA reports a 16.5 percentage point absolute task-success improvement over the closed RT-2-X (55B parameters, proprietary) with only 7B parameters, and the fused encoder is one of the design choices behind that result. GR00T N1.5 (NVIDIA Research, June 11, 2025) uses Eagle 2.5 as its frozen VLM backbone. The architecture differs, but the principle is the same: a strong language-aligned vision encoder is non-negotiable.
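A channel-wise fusion of two patch-feature streams can be sketched as below. This is an illustration of the idea, not OpenVLA's code. The dimensions are toy values chosen for readability, the single linear layer stands in for whatever projector the real model uses, and the function names are hypothetical. The one hard requirement is that both encoders produce the same patch grid, which holds when they share a patch size and input resolution.

```python
def fuse_patch_features(dino_patches, siglip_patches):
    """Concatenate per-patch feature vectors from two encoders, channel-wise."""
    if len(dino_patches) != len(siglip_patches):
        raise ValueError("both encoders must produce the same number of patches")
    return [list(d) + list(s) for d, s in zip(dino_patches, siglip_patches)]


def project(tokens, weight):
    """Linear map token @ weight; weight has shape (fused_dim, model_dim)."""
    model_dim = len(weight[0])
    return [
        [sum(t * weight[k][m] for k, t in enumerate(token)) for m in range(model_dim)]
        for token in tokens
    ]


# Toy sizes (assumed): 4 patches, 3-dim "DINOv2" and 4-dim "SigLIP" features,
# projected to a 5-dim stand-in for the language model's embedding width.
NUM_PATCHES, DINO_DIM, SIGLIP_DIM, MODEL_DIM = 4, 3, 4, 5
dino = [[1.0] * DINO_DIM for _ in range(NUM_PATCHES)]
siglip = [[2.0] * SIGLIP_DIM for _ in range(NUM_PATCHES)]

fused = fuse_patch_features(dino, siglip)
weight = [[0.1] * MODEL_DIM for _ in range(DINO_DIM + SIGLIP_DIM)]
visual_tokens = project(fused, weight)
print(len(fused), len(fused[0]), len(visual_tokens[0]))
```

Concatenating along the channel axis keeps the token count unchanged, so the language model sees one visual token per patch with both kinds of information in it. Concatenating along the sequence axis would double the token count instead.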

Stanford CS231n (Spring 2024) is still the cleanest pedagogical treatment of the material up through CLIP. For DINOv2 and SigLIP, go directly to the papers — both are short and well-written. The build for this chapter (fine-tune DINOv2 ViT-B/14 on a 5-class household-object dataset, compare linear probe vs full finetune) will make the feature quality concrete.
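Before running the build, it helps to see how different the two fine-tuning regimes are in size. The sketch below counts trainable parameters under stated assumptions: a 768-dimensional ViT-B feature vector, 5 classes, and a backbone of roughly 86M parameters (an approximate figure for ViT-B, not an exact count for any particular checkpoint). The helper name is illustrative.

```python
def linear_probe_params(feature_dim: int, num_classes: int) -> int:
    """Weights plus biases of a single linear classification head."""
    return feature_dim * num_classes + num_classes


VIT_B_FEATURE_DIM = 768          # ViT-B embedding width
APPROX_VIT_B_PARAMS = 86_000_000  # approximate backbone size (assumption)
NUM_CLASSES = 5                  # the household-object classes in the build

# Linear probe: backbone frozen, only the head is trained.
probe_trainable = linear_probe_params(VIT_B_FEATURE_DIM, NUM_CLASSES)
# Full fine-tune: backbone and head are both trained.
full_trainable = APPROX_VIT_B_PARAMS + probe_trainable

print(f"linear probe: {probe_trainable:,} trainable parameters")
print(f"full finetune: ~{full_trainable:,} trainable parameters")
```

The probe trains a few thousand parameters and only needs forward passes through the backbone, so it is cheap enough to run first as a baseline. The full fine-tune updates every weight and is where the GPU hours go.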

Capstone wiring

DINOv2 ViT-B/14 is the perception backbone for object grounding in the JHU humanoid capstone. It feeds the SmolVLA or GR00T N1.5 policy as the visual token stream. The gap you measure in the build between the linear probe and the full fine-tune (typically 10 to 15 percentage points) is the concrete justification for why full fine-tuning on your household-object dataset is worth the GPU hours. Recognition errors made here propagate to the policy, so accuracy recovered at this stage should translate into fewer task failures downstream, though not necessarily one for one.

On CLIP for robotics

CLIP's global contrastive objective means it struggles with fine-grained spatial queries like 'the hinge of the cabinet.' For manipulation, DINOv2's dense features matter more than CLIP's language alignment — which is why the fusion in OpenVLA is not redundant.

Figure 26.1 Vision encoder landscape for embodied AI. ViT (Dosovitskiy et al., Oct 2020) is the architectural root. CLIP, SigLIP, and DINOv2 are three pretraining philosophies built on ViT. OpenVLA fuses DINOv2 (dense spatial) and SigLIP (language-aligned) before a Llama 2 backbone, yielding +16.5 percentage points of task success over RT-2-X at 7× fewer parameters.

Retrieve before you continue

Three questions on what you just read

Q1 (Factual) What pretraining objective does DINOv2 use, and how does it differ from CLIP?

Q2 (Conceptual) Why does OpenVLA fuse DINOv2 and SigLIP rather than using one or the other?

Q3 (Synthetic) If you are building the JHU humanoid perception stack on a single consumer GPU, which vision encoder do you start with and what fine-tuning strategy?