From Tokens to Embodied Minds · Drill cards · Chapter 26
Drills
Computer vision foundations — CS231n, redux
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 26, note type = Basic.
| Front | Back |
|---|---|
| What architectural innovation did ResNet introduce? | Skip connections: the input x is added directly to the block output F(x), giving H(x) = F(x) + x, so the block only has to learn the residual F(x) = H(x) - x. The identity path gives gradients a direct route through deep stacks, which mitigates vanishing gradients. |
| How does ViT represent an image? | The image is split into fixed-size patches (e.g., 16x16 or 14x14 pixels), each linearly projected to a token embedding, then processed by a standard transformer encoder. |
| What is CLIP's pretraining objective? | Contrastive loss on 400M image-text pairs: embeddings of matched pairs are pulled together, mismatched pairs are pushed apart. This aligns vision and language feature spaces. |
| What does SigLIP change vs CLIP? | SigLIP replaces the softmax contrastive loss with a sigmoid loss that treats each image-text pair as an independent binary classification. This removes the batch-wide softmax normalization, so training works well at smaller batch sizes and costs less. |
| What pretraining data and objective does DINOv2 use? | 142M curated images (LVD-142M), self-supervised learning with masked image modeling and self-distillation from a teacher ViT. No language data. |
| Why is DINOv2 preferred over CLIP for object localization in manipulation? | DINOv2's dense self-supervised features provide patch-level spatial precision for object boundaries and depth cues. CLIP's global contrastive objective optimizes for image-level semantics, not fine-grained spatial queries. |
| Which vision encoders does OpenVLA fuse? | DINOv2 ViT-L/14 (dense spatial) and SigLIP ViT-So400M/14 (language-aligned). Their patch features are concatenated channel-wise and projected into the input space of the Llama 2 7B backbone. |
| What is the parameter count of OpenVLA vs RT-2-X, and how does task success compare? | OpenVLA: 7B parameters. RT-2-X: 55B parameters. OpenVLA beats RT-2-X by 16.5 percentage points absolute task success across 29 tasks. |
| Which vision encoder does GR00T N1.5 use? | Eagle 2.5 as the frozen VLM backbone (NVIDIA Research, June 11, 2025). Eagle 2.5 is a vision-language model rather than a standalone vision encoder; it supplies language-aligned visual features, the role SigLIP plays in OpenVLA. |
| What is a linear probe and when does it fail? | A linear probe trains only a linear classifier on top of frozen pretrained features. It fails (relative to a full fine-tune) when the pretrained feature space does not linearly separate the target classes, which is common when domain shift is large (e.g., web-pretrained features on robot-camera household objects). |
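
The CLIP and SigLIP cards above differ only in how the loss normalizes over a batch, which is easier to retain once both are seen side by side. The sketch below is a minimal pure-Python illustration, not either paper's implementation. It assumes `sim` is an n x n matrix of already temperature-scaled image-text similarities with matched pairs on the diagonal; the function names and the plain `bias` argument are hypothetical (SigLIP learns its bias during training).

```python
import math


def clip_softmax_loss(sim):
    """Symmetric softmax contrastive loss over an n x n similarity matrix.

    Each matched pair competes against every other item in the batch,
    so the loss needs a normalization over a full row and a full column.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # image -> text: log-sum-exp over all texts for image i
        row_lse = math.log(sum(math.exp(sim[i][j]) for j in range(n)))
        # text -> image: log-sum-exp over all images for text i
        col_lse = math.log(sum(math.exp(sim[j][i]) for j in range(n)))
        total += (row_lse - sim[i][i]) + (col_lse - sim[i][i])
    return total / (2 * n)


def siglip_sigmoid_loss(sim, bias=0.0):
    """Pairwise sigmoid loss: every cell is an independent binary decision.

    No normalization across the batch is required; each (i, j) cell only
    needs its own label (+1 on the diagonal, -1 elsewhere).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            z = label * (sim[i][j] + bias)
            # -log(sigmoid(z)) written as log(1 + exp(-z))
            total += math.log1p(math.exp(-z))
    return total / n


# Toy batch of two pairs: aligned (high diagonal) vs. swapped (high off-diagonal).
aligned = [[5.0, -5.0], [-5.0, 5.0]]
swapped = [[-5.0, 5.0], [5.0, -5.0]]
print(clip_softmax_loss(aligned), clip_softmax_loss(swapped))
print(siglip_sigmoid_loss(aligned), siglip_sigmoid_loss(swapped))
```

Both losses fall when the diagonal dominates. The difference is structural: the softmax version needs a whole row and column before it can score a single pair, while the sigmoid version scores each cell on its own.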