Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Mock Paper 6 — Advanced SSL + 3DGS Math + VLMs + Modern Architectures
Duration: 180 min • Max marks: 100
Section A — Short Answer (1-2 marks each, 20 marks)
- 1. DINO EMA update θ_t ← λθ_t + (1−λ)θ_s. If λ = 0.99 and the student changes by Δθ in one step, by how much does the teacher change? (A numerical sketch appears after this section.) [2 m]
- 2. Per-Gaussian parameter count in 3DGS if the SH degree is 0 (no view-dependent colour)? [1 m]
- 3. The CLIP image-text contrastive matrix is N×N. Write the loss and explain why it reduces to softmax cross-entropy. (A loss sketch appears after this section.) [2 m]
- 4. What is KV caching in Transformer inference? [1 m]
- 5. PaliGemma uses 256 patches of 14×14. What is the image input size, and how does the linear projection work? [2 m]
- 6. What is OpenVLA's action discretisation scheme? [1 m]
- 7. Why does SigLIP's sigmoid loss enable larger batches than CLIP's softmax? [2 m]
- 8. In V-JEPA, what does masking spatio-temporal regions force the model to learn? [1 m]
- 9. Why does Stable Diffusion's U-Net operate in the VAE latent space rather than pixel space? [2 m]
- 10. What is classifier-free guidance in conditional diffusion? [1 m]
- 11. In ViT LayerNorm: along which dimension does it normalise, and where are γ, β applied? [2 m]
- 12. What is stochastic depth in ViT training? [1 m]
- 13. ViT-B/16 has 86 M params; ResNet-50 has 25 M; yet ResNet-50 outperforms ViT-B on small datasets. Why? [2 m]
- 14. What is the 'registers' trick in DINOv2? [1 m]
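Illustrative sketch for Question A.1 (not a model answer): a minimal numpy check of the EMA teacher update under the stated λ = 0.99. The parameter vector size and the step Δθ = 0.1 are illustrative assumptions, not part of the question.

```python
import numpy as np

# DINO-style EMA update: theta_t <- lam * theta_t + (1 - lam) * theta_s
lam = 0.99
theta_t = np.zeros(4)          # teacher params (illustrative values)
theta_s = np.zeros(4)          # student params, initially equal to the teacher

delta = 0.1                    # student moves by Delta_theta in one step (assumed value)
theta_s = theta_s + delta

theta_t_new = lam * theta_t + (1 - lam) * theta_s
print(theta_t_new - theta_t)   # teacher moves by (1 - lam) * Delta_theta = 0.001
```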
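Illustrative sketch for Question A.3 (not a model answer): a hedged numpy version of a symmetric image-text contrastive loss over an N×N similarity matrix, where row i's only positive is column i. The batch size, embedding dimension, and temperature below are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over an N x N cosine-similarity matrix."""
    # L2-normalise so dot products are cosine similarities
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature           # (N, N)

    # Row i's positive is column i: softmax cross-entropy with labels = arange(N)
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)              # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -log_p[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))          # image->text + text->image

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```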
Section B — Conceptual / Explanation (4-6 marks each, 40 marks)
- 1. SimCLR's key empirical findings — name and justify four. [5 m]
- 2. Derive why DINO's centering and sharpening together prevent both mode collapse and uniform collapse. [5 m]
- 3. MAE in detail: (a) masking strategy, (b) asymmetric encoder-decoder, (c) why a 75% mask ratio, (d) why pixel reconstruction rather than feature reconstruction. [6 m]
- 4. 3DGS splat-and-render math: how does a 3D Gaussian project to 2D, and how does alpha compositing produce a pixel colour? (A projection-and-compositing sketch appears after this section.) [5 m]
- 5. Compare CLIP, DINO, MAE, and JEPA on this specific question: which features transfer best for object detection on a new domain (medical imaging) with only 1000 labelled examples? [4 m]
- 6. Four failure modes of CLIP and an architectural or training fix for each. [5 m]
- 7. Mixture of Experts in vision Transformers — how does it work, and why does it scale better than dense models? [5 m]
- 8. Sapiens (Meta, ECCV 2024) — what makes it a 'foundation model for humans'? Outline its tasks and training. [5 m]
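Illustrative sketch for Question B.4 (not a model answer): a hedged numpy sketch of the two steps the question names, projecting a 3D covariance to a 2D screen-space covariance via the camera Jacobian (Σ' = J W Σ Wᵀ Jᵀ), and front-to-back alpha compositing of the splats covering a pixel. The matrix values, focal lengths, and splat order are illustrative assumptions.

```python
import numpy as np

# --- (1) Project a 3D Gaussian's covariance into 2D screen space ---
Sigma = np.diag([0.04, 0.01, 0.02])     # 3D covariance (illustrative)
W = np.eye(3)                           # world->camera rotation (assumed identity)
fx, fy = 500.0, 500.0                   # focal lengths in pixels (assumed)
x, y, z = 0.2, -0.1, 2.0                # Gaussian mean in camera coordinates

# Jacobian of the perspective projection (u, v) = (fx*x/z, fy*y/z)
J = np.array([[fx / z, 0.0, -fx * x / z**2],
              [0.0, fy / z, -fy * y / z**2]])
Sigma_2d = J @ W @ Sigma @ W.T @ J.T    # 2x2 screen-space covariance
print(Sigma_2d)

# --- (2) Front-to-back alpha compositing of splats covering one pixel ---
# Each splat contributes alpha_i = opacity_i * exp(-0.5 * d^T Sigma_2d^-1 d);
# here the alphas are just assumed values ordered by depth.
splats = [(np.array([1.0, 0.0, 0.0]), 0.6),   # (colour, alpha), nearest first
          (np.array([0.0, 1.0, 0.0]), 0.5),
          (np.array([0.0, 0.0, 1.0]), 0.9)]
colour, transmittance = np.zeros(3), 1.0
for c_i, alpha_i in splats:
    colour += transmittance * alpha_i * c_i
    transmittance *= (1.0 - alpha_i)
print(colour)   # C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)
```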
Section C — Long Form (10 marks each, 40 marks)
- 1. Attention mechanics. (a) Forward-pass equations for multi-head self-attention on input X of shape (B, N, D) with H heads, d_k = D/H. (b) Numerical example: B=1, N=4, D=8, H=2, d_k=4 with W_Q = W_K = W_V = I; X has 4 rows: [1,0,1,0,1,0,1,0], [0,1,0,1,0,1,0,1], [1,1,1,1,0,0,0,0], [0,0,0,0,1,1,1,1]. Compute Q·Kᵀ and the attention pattern after softmax for HEAD 1 (first 4 dims). (c) Semantic relationships observed. (A numpy sketch of part (b) appears after this section.) [10 m]
- 2. 3DGS parameter accounting. (a) Memory for a 1.5 M-Gaussian scene in float32. (b) The pruning step of Adaptive Density Control. (c) Four compression strategies for shipping to a mobile device. (An arithmetic sketch of part (a) appears after this section.) [10 m]
- 3. Build a VLM from scratch for a mobile-app photo-QA scenario with a < 1 s/query budget. (a) Choose the vision encoder, LLM, and connector; justify each. (b) Compare with a Gemma 4 native-multimodal architecture. (c) Diagnose and mitigate hallucinations. [10 m]
- 4. Video search system: a natural-language query retrieves clips from 100,000 hours of footage. (a) Indexing + search architecture. (b) Temporal queries like 'person opens door, then leaves'. (c) Evaluation metrics. [10 m]
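Illustrative sketch for Question C.1(b) (not a model answer): a hedged numpy script that reproduces the stated setup (identity projections, H = 2 heads, d_k = 4) and prints Q·Kᵀ and the softmax attention pattern for head 1. The 1/√d_k scaling follows the standard formulation and is an assumption about the intended convention.

```python
import numpy as np

X = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 1, 0, 1],
              [1, 1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)    # (N=4, D=8)

H, d_k = 2, 4
# W_Q = W_K = W_V = I, so Q = K = V = X; head 1 sees the first d_k dims
Q1 = K1 = X[:, :d_k]                                      # (4, 4)

scores = Q1 @ K1.T / np.sqrt(d_k)                         # scaled dot-product
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = attn / attn.sum(axis=1, keepdims=True)             # row-wise softmax

print(Q1 @ K1.T)        # raw Q.K^T for head 1
print(attn.round(3))    # note: the all-zero head-1 row attends uniformly
```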
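Illustrative sketch for Question C.2(a) (not a model answer): hedged arithmetic for the float32 memory of a 1.5 M-Gaussian scene. The per-Gaussian breakdown assumes the standard 3DGS parameterisation with degree-3 spherical harmonics (position 3, scale 3, rotation quaternion 4, opacity 1, SH colour 48 = 59 floats); a different SH degree changes the count accordingly.

```python
# Per-Gaussian floats, assuming degree-3 SH colour (3 channels x 16 coeffs = 48)
position, scale, rotation, opacity, sh_colour = 3, 3, 4, 1, 48
floats_per_gaussian = position + scale + rotation + opacity + sh_colour   # 59

n_gaussians = 1_500_000
bytes_total = n_gaussians * floats_per_gaussian * 4      # float32 = 4 bytes
print(floats_per_gaussian)                                # 59
print(bytes_total / 1024**2, "MiB")                       # ~337.6 MiB
```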