
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Formulas & Diagrams

High-ROI section — formulas improve marks, diagrams improve recall.

Formulas

Convolution output size
Spatial output dim for a conv with kernel F, padding P, stride S on an input of width W.
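The standard relation is O = (W − F + 2P)/S + 1. A minimal sanity check in Python (assumes floor division, as in most frameworks):

```python
def conv_out(W: int, F: int, P: int, S: int) -> int:
    """Spatial output size of a conv layer: floor((W - F + 2P) / S) + 1."""
    return (W - F + 2 * P) // S + 1

# ResNet stem: 224x224 input, 7x7 kernel, padding 3, stride 2 -> 112
print(conv_out(224, 7, 3, 2))  # 112
```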
Intersection-over-Union (IoU)
Box overlap metric. Detection TP threshold typically 0.5.
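For reference, the standard definition:

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
```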
GIoU loss
GIoU ∈ [−1, 1]; the loss 1 − GIoU gives non-zero gradient even when boxes don't overlap (C = smallest enclosing box).
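The standard form, with C the smallest box enclosing both A and B:

```latex
\mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|},
\qquad
\mathcal{L}_{\mathrm{GIoU}} = 1 - \mathrm{GIoU}
```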
Dice coefficient
Segmentation overlap. Note denominator is SUM, not union.
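For reference:

```latex
\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}
```

Compare IoU's denominator |A ∪ B|: Dice's sum double-counts the intersection, which is why Dice ≥ IoU.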
Focal loss
γ ≈ 2. Down-weights well-classified examples to combat foreground/background imbalance.
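The standard form (with the usual class-balancing weight α_t):

```latex
\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)
```

At γ = 0 this reduces to weighted cross-entropy; the (1 − p_t)^γ factor shrinks the loss of easy examples.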
Scaled dot-product attention
Core of every Transformer. Dividing by √dₖ keeps logit variance near 1, preventing softmax saturation.
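For reference:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```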
Multi-head attention
H parallel attentions in lower-dim subspaces; learn diverse relationships.
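The standard form, with per-head projections W_i^Q, W_i^K, W_i^V and output projection W^O:

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})
```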
InfoNCE / NT-Xent
Contrastive loss. Cross-entropy with positive logit in numerator.
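The NT-Xent form for a positive pair (i, j) in a batch of N images (2N augmented views), with sim(·,·) cosine similarity and temperature τ:

```latex
\ell_{i,j} = -\log
\frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
     {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\, \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)}
```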
SigLIP pairwise sigmoid loss
Independent binary classification per pair; scales without batch synchronization.
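As in the SigLIP paper (learnable temperature t and bias b; x_i, y_j are normalized image/text embeddings):

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N}
\log \sigma\!\big(z_{ij}\,(t\, \mathbf{x}_i \cdot \mathbf{y}_j + b)\big),
\qquad
z_{ij} = \begin{cases} +1 & i = j \\ -1 & i \neq j \end{cases}
```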
PSNR
R = max pixel value. Higher = better. ≥ 30 dB → good reconstruction.
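For reference:

```latex
\mathrm{PSNR} = 10 \log_{10}\!\frac{R^{2}}{\mathrm{MSE}}
= 20 \log_{10}\!\frac{R}{\sqrt{\mathrm{MSE}}}
```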
RoPE pair rotation
Rotates (Q,K) by angle m·θᵢ. Dot product depends only on relative position m − n.
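The per-pair rotation (base 10000 is the common default for θᵢ):

```latex
\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}
\mapsto
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix},
\qquad
\theta_i = 10000^{-2i/d}
```

Since rotations compose, ⟨R_m q, R_n k⟩ = ⟨q, R_{n−m} k⟩: the dot product depends only on the offset.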
RMSNorm
No mean subtraction; cheaper than LayerNorm; rescales onto a sphere.
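For reference, with learnable gain g:

```latex
\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot g,
\qquad
\mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^{2} + \epsilon}
```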
LayerNorm
Per-token across feature dim. Independent of batch size.
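For reference, with μ and σ² computed over the feature dimension of each token:

```latex
\mathrm{LN}(x) = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} \odot \gamma + \beta
```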
3DGS alpha compositing
Front-to-back blending of sorted 2D-projected Gaussians.
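The standard compositing sum over depth-sorted Gaussians (nearest first):

```latex
C = \sum_{i=1}^{N} c_i\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)
```

The product term is the accumulated transmittance: light surviving the i−1 Gaussians in front.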
3DGS covariance parameterization
Decomposition guarantees PSD; R from unit quaternion, S from log-space scales.
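For reference:

```latex
\Sigma = R\, S\, S^{\top} R^{\top}
```

Any matrix of the form M Mᵀ is positive semi-definite, so optimizing R and S freely can never produce an invalid covariance.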
SMPL forward model
Template + shape blendshapes + pose blendshapes + linear blend skinning. β ∈ ℝ¹⁰, θ ∈ ℝ⁷².
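The standard SMPL formulation (W = linear blend skinning with skinning weights 𝒲, J(β) the joint regressor):

```latex
T(\beta, \theta) = \bar{T} + B_S(\beta) + B_P(\theta),
\qquad
M(\beta, \theta) = W\!\big(T(\beta, \theta),\, J(\beta),\, \theta,\, \mathcal{W}\big)
```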
PCK@α
Pose correctness. PCKh@0.5: d_ref = head bone length.
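For reference, over N keypoints with predicted-to-GT distance dᵢ:

```latex
\mathrm{PCK}@\alpha = \frac{1}{N} \sum_{i=1}^{N}
\mathbb{1}\!\left[\frac{d_i}{d_{\mathrm{ref}}} \le \alpha\right]
```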
YOLO loss (sum of 5)
Center, size (√ to balance small/large), object conf, no-obj conf (λ=0.5), class. λ_coord = 5.
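The full five-term sum from the YOLOv1 paper (𝟙ᵢⱼ selects the responsible box j in cell i):

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}} \left[\big(\sqrt{w_i} - \sqrt{\hat{w}_i}\big)^2
  + \big(\sqrt{h_i} - \sqrt{\hat{h}_i}\big)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \big(C_i - \hat{C}_i\big)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
   \mathbb{1}_{ij}^{\text{noobj}} \big(C_i - \hat{C}_i\big)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
  \sum_{c \in \text{classes}} \big(p_i(c) - \hat{p}_i(c)\big)^2
\end{aligned}
```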
Triplet loss
Anchor / Positive / Negative with margin α. Used in metric learning.
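For reference, with d(·,·) the embedding distance:

```latex
\mathcal{L} = \max\big(0,\; d(a, p) - d(a, n) + \alpha\big)
```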

Diagrams

R-CNN → Fast → Faster R-CNN evolution
Side-by-side block diagrams: per-region CNN forward (R-CNN); single CNN forward + RoI pooling (Fast); shared backbone + RPN + RoI head (Faster). Annotate the bottleneck in each.
[ diagram placeholder ]
YOLO grid output tensor
S × S grid overlaid on an image; each cell predicts B boxes (x,y,w,h,conf) + C class probs → S × S × (B·5 + C) tensor.
[ diagram placeholder ]
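The tensor shape above is easy to sanity-check; a quick sketch using the YOLOv1 PASCAL VOC defaults (S=7, B=2, C=20):

```python
def yolo_output_shape(S: int, B: int, C: int) -> tuple:
    """Each of the S*S cells predicts B boxes (x, y, w, h, conf) plus C class probs."""
    return (S, S, B * 5 + C)

# YOLOv1 on PASCAL VOC -> (7, 7, 30), i.e. 1470 output values
print(yolo_output_shape(7, 2, 20))  # (7, 7, 30)
```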
FCN-32s vs FCN-8s skip fusion
Encoder downsamples 32×; decoder upsamples. FCN-8s adds pool3 + pool4 skip connections fused with deep features.
[ diagram placeholder ]
U-Net symmetric encoder/decoder
Contracting path on left, expanding path on right; horizontal concat skip connections at every resolution.
[ diagram placeholder ]
OpenPose: heatmaps + PAFs
Network outputs K keypoint heatmaps and 2L PAF channels (x,y components per limb). Bipartite matching groups keypoints.
[ diagram placeholder ]
SMPL pipeline
Template mesh → shape blendshapes (β) → pose blendshapes + skinning (θ) → posed 3D mesh (6890 vertices).
[ diagram placeholder ]
PointNet architecture
Shared MLP per point (N × D → N × F) → symmetric max-pool over N → global feature → final MLP. T-Net for input + feature transform.
[ diagram placeholder ]
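The key property of the max-pool step is permutation invariance: reordering the point set cannot change the global feature. A minimal NumPy sketch (the N × F matrix stands in for post-MLP per-point features):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 64))   # N x F per-point features (after the shared MLP)
global_feat = points.max(axis=0)       # symmetric max-pool over the N points

shuffled = rng.permutation(points, axis=0)          # reorder the point set
assert np.allclose(global_feat, shuffled.max(axis=0))  # global feature unchanged
```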
3D Gaussian Splatting pipeline
Images → COLMAP → sparse point cloud + camera poses → init Gaussians → render → image-space loss → backprop → adaptive density control (clone/split/prune).
[ diagram placeholder ]
Transformer block (PreNorm)
x → LN → MHA → +residual → LN → FFN(D → 4D → D) → +residual. Modern variant; unbroken residual stream.
[ diagram placeholder ]
ViT pipeline
Image → P×P patches → linear projection → +[CLS] → +position embeddings → L encoder layers → CLS output → MLP head → class logits.
[ diagram placeholder ]
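Token-count arithmetic for the pipeline above, assuming the image size divides evenly by the patch size:

```python
def vit_tokens(img: int, patch: int) -> int:
    """Tokens entering the encoder: (img/patch)^2 patches + 1 [CLS] token."""
    assert img % patch == 0
    return (img // patch) ** 2 + 1

# ViT-B/16 at 224x224: 14x14 = 196 patches + [CLS] -> 197 tokens
print(vit_tokens(224, 16))  # 197
```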
Swin shifted-window attention
M×M local windows; alternating layers shift window grid by M/2 so tokens at window edges land in window interior next layer.
[ diagram placeholder ]
DINO self-distillation
Student + EMA teacher with identical architecture; multi-crop (2 global → teacher; 6-10 local → student). Centering + sharpening on teacher output.
[ diagram placeholder ]
MAE asymmetric encoder/decoder
75% patches masked; encoder sees only visible 25%; small decoder reconstructs masked pixels using mask tokens at the right positions.
[ diagram placeholder ]
PaliGemma three pillars
SigLIP (frozen) → linear connector (random init) → Gemma decoder. Prefix-LM mask: image+prompt bidirectional, suffix causal.
[ diagram placeholder ]
SlowFast dual pathway
Slow pathway: low fps, high channels (semantics). Fast pathway: high fps, low channels (motion). Lateral connections fuse.
[ diagram placeholder ]