
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Formulas & Diagrams

High-ROI section — formulas improve marks, diagrams improve recall.

Formulas

Convolution output size
Spatial output dim for a conv with kernel F, padding P, stride S on an input of width W.
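The standard relation is O = (W − F + 2P)/S + 1. A minimal sanity check in Python (assumes floor division, as in most frameworks):

```python
def conv_out(W: int, F: int, P: int, S: int) -> int:
    """Spatial output size of a conv layer: floor((W - F + 2P) / S) + 1."""
    return (W - F + 2 * P) // S + 1

# ResNet stem: 224x224 input, 7x7 kernel, padding 3, stride 2 -> 112
print(conv_out(224, 7, 3, 2))  # 112
```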
Intersection-over-Union (IoU)
Box overlap metric. Detection TP threshold typically 0.5.
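For reference, the standard definition:

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
```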
GIoU loss
GIoU ∈ [−1, 1]; the loss 1 − GIoU gives non-zero gradient even when boxes don't overlap (C = smallest enclosing box).
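The standard form, with C the smallest box enclosing both A and B:

```latex
\mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|},
\qquad
\mathcal{L}_{\mathrm{GIoU}} = 1 - \mathrm{GIoU}
```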
Dice coefficient
Segmentation overlap. Note denominator is SUM, not union.
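For reference:

```latex
\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}
```

Compare IoU's denominator |A ∪ B|: Dice's sum double-counts the intersection, which is why Dice ≥ IoU.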
Focal loss
γ ≈ 2. Down-weights well-classified examples to combat foreground/background imbalance.
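The standard form (with the usual class-balancing weight α_t):

```latex
\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)
```

At γ = 0 this reduces to weighted cross-entropy; the (1 − p_t)^γ factor shrinks the loss of easy examples.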
Scaled dot-product attention
Core of every Transformer. Dividing by √dₖ keeps logit variance near 1, preventing softmax saturation.
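For reference:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```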
Multi-head attention
H parallel attentions in lower-dim subspaces; learn diverse relationships.
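The standard form, with per-head projections W_i^Q, W_i^K, W_i^V and output projection W^O:

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})
```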
InfoNCE / NT-Xent
Contrastive loss. Cross-entropy with positive logit in numerator.
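The NT-Xent form for a positive pair (i, j) in a batch of N images (2N augmented views), with sim(·,·) cosine similarity and temperature τ:

```latex
\ell_{i,j} = -\log
\frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
     {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\, \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)}
```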
SigLIP pairwise sigmoid loss
Independent binary classification per pair; scales without batch synchronization.
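As in the SigLIP paper (learnable temperature t and bias b; x_i, y_j are normalized image/text embeddings):

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N}
\log \sigma\!\big(z_{ij}\,(t\, \mathbf{x}_i \cdot \mathbf{y}_j + b)\big),
\qquad
z_{ij} = \begin{cases} +1 & i = j \\ -1 & i \neq j \end{cases}
```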
PSNR
R = max pixel value. Higher = better. ≥ 30 dB → good reconstruction.
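For reference:

```latex
\mathrm{PSNR} = 10 \log_{10}\!\frac{R^{2}}{\mathrm{MSE}}
= 20 \log_{10}\!\frac{R}{\sqrt{\mathrm{MSE}}}
```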
RoPE pair rotation
Rotates (Q,K) by angle m·θᵢ. Dot product depends only on relative position m − n.
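The per-pair rotation (base 10000 is the common default for θᵢ):

```latex
\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}
\mapsto
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix},
\qquad
\theta_i = 10000^{-2i/d}
```

Since rotations compose, ⟨R_m q, R_n k⟩ = ⟨q, R_{n−m} k⟩: the dot product depends only on the offset.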
RMSNorm
No mean subtraction; cheaper than LayerNorm; rescales onto a sphere.
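For reference, with learnable gain g:

```latex
\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot g,
\qquad
\mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^{2} + \epsilon}
```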
LayerNorm
Per-token across feature dim. Independent of batch size.
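For reference, with μ and σ² computed over the feature dimension of each token:

```latex
\mathrm{LN}(x) = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} \odot \gamma + \beta
```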
3DGS alpha compositing
Front-to-back blending of sorted 2D-projected Gaussians.
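The standard compositing sum over depth-sorted Gaussians (nearest first):

```latex
C = \sum_{i=1}^{N} c_i\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)
```

The product term is the accumulated transmittance: light surviving the i−1 Gaussians in front.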
3DGS covariance parameterization
Decomposition guarantees PSD; R from unit quaternion, S from log-space scales.
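For reference:

```latex
\Sigma = R\, S\, S^{\top} R^{\top}
```

Any matrix of the form M Mᵀ is positive semi-definite, so optimizing R and S freely can never produce an invalid covariance.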
SMPL forward model
Template + shape blendshapes + pose blendshapes + linear blend skinning. β ∈ ℝ¹⁰, θ ∈ ℝ⁷².
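The standard SMPL formulation (W = linear blend skinning with skinning weights 𝒲, J(β) the joint regressor):

```latex
T(\beta, \theta) = \bar{T} + B_S(\beta) + B_P(\theta),
\qquad
M(\beta, \theta) = W\!\big(T(\beta, \theta),\, J(\beta),\, \theta,\, \mathcal{W}\big)
```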
PCK@α
Pose correctness. PCKh@0.5: d_ref = head bone length.
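For reference, over N keypoints with predicted-to-GT distance dᵢ:

```latex
\mathrm{PCK}@\alpha = \frac{1}{N} \sum_{i=1}^{N}
\mathbb{1}\!\left[\frac{d_i}{d_{\mathrm{ref}}} \le \alpha\right]
```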
YOLO loss (sum of 5)
Center, size (√ to balance small/large), object conf, no-obj conf (λ=0.5), class. λ_coord = 5.
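The full five-term sum from the YOLOv1 paper (𝟙ᵢⱼ selects the responsible box j in cell i):

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}} \left[\big(\sqrt{w_i} - \sqrt{\hat{w}_i}\big)^2
  + \big(\sqrt{h_i} - \sqrt{\hat{h}_i}\big)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \big(C_i - \hat{C}_i\big)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
   \mathbb{1}_{ij}^{\text{noobj}} \big(C_i - \hat{C}_i\big)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
  \sum_{c \in \text{classes}} \big(p_i(c) - \hat{p}_i(c)\big)^2
\end{aligned}
```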
Triplet loss
Anchor / Positive / Negative with margin α. Used in metric learning.
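For reference, with d(·,·) the embedding distance:

```latex
\mathcal{L} = \max\big(0,\; d(a, p) - d(a, n) + \alpha\big)
```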

Diagrams

R-CNN → Fast → Faster R-CNN evolution
Side-by-side block diagrams: per-region CNN forward (R-CNN); single CNN forward + RoI pooling (Fast); shared backbone + RPN + RoI head (Faster). Annotate the bottleneck in each.
[ diagram placeholder ]
YOLO grid output tensor
S × S grid overlaid on an image; each cell predicts B boxes (x,y,w,h,conf) + C class probs → S × S × (B·5 + C) tensor.
[ diagram placeholder ]
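The tensor shape above is easy to sanity-check; a quick sketch using the YOLOv1 PASCAL VOC defaults (S=7, B=2, C=20):

```python
def yolo_output_shape(S: int, B: int, C: int) -> tuple:
    """Each of the S*S cells predicts B boxes (x, y, w, h, conf) plus C class probs."""
    return (S, S, B * 5 + C)

# YOLOv1 on PASCAL VOC -> (7, 7, 30), i.e. 1470 output values
print(yolo_output_shape(7, 2, 20))  # (7, 7, 30)
```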
FCN-32s vs FCN-8s skip fusion
Encoder downsamples 32×; decoder upsamples. FCN-8s adds pool3 + pool4 skip connections fused with deep features.
[ diagram placeholder ]
U-Net symmetric encoder/decoder
Contracting path on left, expanding path on right; horizontal concat skip connections at every resolution.
[ diagram placeholder ]
OpenPose: heatmaps + PAFs
Network outputs K keypoint heatmaps and 2L PAF channels (x,y components per limb). Bipartite matching groups keypoints.
[ diagram placeholder ]
SMPL pipeline
Template mesh → shape blendshapes (β) → pose blendshapes + skinning (θ) → posed 3D mesh (6890 vertices).
[ diagram placeholder ]
PointNet architecture
Shared MLP per point (N × D → N × F) → symmetric max-pool over N → global feature → final MLP. T-Net for input + feature transform.
[ diagram placeholder ]
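The key property of the max-pool step is permutation invariance: reordering the point set cannot change the global feature. A minimal NumPy sketch (the N × F matrix stands in for post-MLP per-point features):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 64))   # N x F per-point features (after the shared MLP)
global_feat = points.max(axis=0)       # symmetric max-pool over the N points

shuffled = rng.permutation(points, axis=0)          # reorder the point set
assert np.allclose(global_feat, shuffled.max(axis=0))  # global feature unchanged
```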
3D Gaussian Splatting pipeline
Images → COLMAP → sparse point cloud + camera poses → init Gaussians → render → image-space loss → backprop → adaptive density control (clone/split/prune).
[ diagram placeholder ]
Transformer block (PreNorm)
x → LN → MHA → +residual → LN → FFN(D → 4D → D) → +residual. Modern variant; unbroken residual stream.
[ diagram placeholder ]
ViT pipeline
Image → P×P patches → linear projection → +[CLS] → +position embeddings → L encoder layers → CLS output → MLP head → class logits.
[ diagram placeholder ]
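Token-count arithmetic for the pipeline above, assuming the image size divides evenly by the patch size:

```python
def vit_tokens(img: int, patch: int) -> int:
    """Tokens entering the encoder: (img/patch)^2 patches + 1 [CLS] token."""
    assert img % patch == 0
    return (img // patch) ** 2 + 1

# ViT-B/16 at 224x224: 14x14 = 196 patches + [CLS] -> 197 tokens
print(vit_tokens(224, 16))  # 197
```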
Swin shifted-window attention
M×M local windows; alternating layers shift window grid by M/2 so tokens at window edges land in window interior next layer.
[ diagram placeholder ]
DINO self-distillation
Student + EMA teacher with identical architecture; multi-crop (2 global → teacher; 6-10 local → student). Centering + sharpening on teacher output.
[ diagram placeholder ]
MAE asymmetric encoder/decoder
75% patches masked; encoder sees only visible 25%; small decoder reconstructs masked pixels using mask tokens at the right positions.
[ diagram placeholder ]
PaliGemma three pillars
SigLIP (frozen) → linear connector (random init) → Gemma decoder. Prefix-LM mask: image+prompt bidirectional, suffix causal.
[ diagram placeholder ]
SlowFast dual pathway
Slow pathway: low fps, high channels (semantics). Fast pathway: high fps, low channels (motion). Lateral connections fuse.
[ diagram placeholder ]