What architectural innovation did ResNet introduce?	Skip (residual) connections: the input x is added directly to the block output, y = F(x) + x, so the block learns the residual F(x) = H(x) - x rather than the full mapping H(x). The identity path eases optimization of very deep stacks and mitigates vanishing gradients.
How does ViT represent an image?	The image is split into fixed-size patches (e.g., 16x16 or 14x14 pixels), each linearly projected to a token embedding, then processed by a standard transformer encoder.
What is CLIP's pretraining objective?	Contrastive loss on 400M image-text pairs: embeddings of matched pairs are pulled together, mismatched pairs are pushed apart. This aligns vision and language feature spaces.
What does SigLIP change vs CLIP?	SigLIP replaces the softmax contrastive loss with a sigmoid loss applied independently to each image-text pair. This removes the batch-wide softmax normalization, so the loss depends less on very large batches and training cost drops.
What pretraining data and objective does DINOv2 use?	142M curated images (LVD-142M), self-supervised learning with masked image modeling and self-distillation from a teacher ViT. No language data.
Why is DINOv2 preferred over CLIP for object localization in manipulation?	DINOv2's dense self-supervised features provide patch-level spatial precision for object boundaries and depth cues. CLIP's global contrastive objective optimizes for image-level semantics, not fine-grained spatial queries.
Which vision encoders does OpenVLA fuse?	DINOv2 ViT-L/14 (dense spatial features) and SigLIP ViT-So400M/14 (language-aligned features). Their patch features are concatenated channel-wise and projected into the input space of the Llama 2 7B backbone.
What is the parameter count of OpenVLA vs RT-2-X, and how does task success compare?	OpenVLA: 7B parameters. RT-2-X: 55B parameters. OpenVLA outperforms RT-2-X by 16.5 percentage points in absolute task success rate across 29 tasks, with roughly 7x fewer parameters.
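Which vision encoder does GR00T N1.5 use?	Eagle 2.5, used as the frozen VLM backbone (NVIDIA Research, June 11, 2025). Eagle 2.5 is a full vision-language model rather than a standalone vision encoder; in GR00T N1.5 it supplies the language-aligned visual features, a role comparable to SigLIP's in OpenVLA.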
What is a linear probe and when does it fail?	A linear probe trains only a linear classifier on top of frozen pretrained features. It underperforms full fine-tuning when the pretrained feature space does not linearly separate the target classes, which is common under large domain shift (e.g., web-pretrained features applied to household objects seen through a robot camera).
