Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Definitions

Unit 1 — Introduction & Foundations

Foundations of Computer Vision — Marr, Three Rs, Gestalt, Why CV is Hard
Computer vision
The science of extracting all possible information about a visual scene from images — answering What/Where/Who/When/Why/How/How-many.
Marr's definition of vision
(1982) 'To know what is where, by looking.' The two-word mission statement of the field.
Three Rs (Malik)
Reorganisation (group pixels), Recognition (label them), Reconstruction (measure geometry). Modern CV systems use all three.
Semantic gap
The conceptual distance between low-level pixel intensities and high-level semantic concepts (objects, intent, action). Bridging it is the central technical challenge.
Intra-class variation
The amount of visual variability within a single semantic class (e.g., 'chair' includes throne, beanbag, office chair). Often larger than inter-class variation — which is why pure appearance-matching fails.
Affordance (Gibson)
A possibility for action that an object offers an agent (a chair affords sitting, a handle affords grasping). Provides a functional definition of object category that is more robust than appearance.
Gestalt principles
Pre-attentive grouping rules formulated by 1920s German psychologists: Proximity, Similarity, Closure, Continuation, Common Fate, Figure-Ground, Symmetry. Motivate classical segmentation and modern attention.
Inverse problem of vision
Image formation is many-to-one (3D scene → 2D image), so inversion (2D → 3D + identity) is one-to-many and underdetermined. Vision must use priors to choose a plausible interpretation.
The Bitter Lesson (Sutton)
Empirical observation that, across decades of AI research, methods that leverage raw computation eventually beat methods that bake in domain knowledge. Drives the 'just scale it up' approach in modern CV.
Cambrian explosion connection
Vision drove evolutionary divergence ~540 Myr ago. Hard evidence that visual perception is computationally expensive — evolution would not have selected for it if it were cheap.
Summer Vision Project (1966)
MIT memo by Seymour Papert proposing CV could be largely solved in one summer. The iconic underestimation of the field.
Inattentional blindness
Failure to notice obvious stimuli when attention is occupied elsewhere (Simons & Chabris's 'invisible gorilla'). A reminder that biological vision is selective, not a faithful camera.

Unit 2 — Digital Image Processing Recap

DIP — Filters, Histograms, Fourier/DCT, Morphology, Geometric Ops, Hough, Templates
Sampling vs quantisation
Sampling = discretising spatial position (which pixels exist). Quantisation = discretising intensity (how many gray levels per pixel).
Spatial domain
Operations on pixel values directly. Three flavours: Point→Point, Neighbourhood→Point, Global→Point.
Transform domain
Operations after transforming to another basis (Fourier/DCT/wavelet); useful for convolution, compression, periodic-noise removal.
Convolution theorem
Convolution in space ↔ multiplication in frequency. Justifies FFT-based filtering for large kernels.
Separable filter
A 2D filter whose kernel factors as an outer product $K = u v^\top$ (rank 1), so it can be applied as two 1D passes. Gaussian is separable; Laplacian is not.
Bilateral filter
Edge-preserving smoothing. Weights = spatial Gaussian × range (intensity) Gaussian. Across an edge, intensity differs → range weight collapses → no blur across edge.
Otsu's method
Automatic threshold for bimodal histograms. Picks $T$ that maximises between-class variance (equivalently, minimises within-class variance). Found by an exhaustive sweep over all intensity levels.
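The sweep can be sketched in a few lines. This is a minimal illustration, assuming a plain Python list `hist` of per-level pixel counts; the function name `otsu_threshold` and the convention that class 0 holds intensities at or below the threshold are choices made here, not something the notes specify.

```python
def otsu_threshold(hist):
    """Return the threshold T that maximises between-class variance.

    hist[i] is the number of pixels with intensity i.
    Class 0 = intensities <= T, class 1 = intensities > T (assumed convention).
    """
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0    # pixel count in class 0
    sum0 = 0  # intensity sum in class 0
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to between-class variance (the 1/total^2 factor is constant).
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Toy histogram: a dark cluster at 40 and 60, a bright spike at 200.
hist = [0] * 256
hist[40], hist[60], hist[200] = 80, 20, 100
threshold = otsu_threshold(hist)
```

Ties are broken by taking the lowest threshold, so on this toy histogram the result sits at the top of the dark cluster.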
Morphological erosion / dilation
Erosion $A \ominus B$: SE $B$ fits inside $A$ → MIN filter, shrinks. Dilation $A \oplus B$: SE $B$ touches $A$ → MAX filter, grows. Duals.
Opening / Closing
Opening = erode then dilate (kills noise dots, preserves shape). Closing = dilate then erode (fills small holes). Both idempotent.
Hit-or-Miss transform (HAM)
Match a foreground pattern AND its background simultaneously. Used to locate isolated points, line endpoints, T-junctions, corners.
Affine vs projective
Affine (6 DoF) preserves parallel lines. Projective/homography (8 DoF) preserves only straight lines — allows perspective.
Forward vs inverse warping
Forward: iterate over source pixels, push to T(x,y) — produces holes + overlaps. Inverse: iterate over destination, pull from T⁻¹(x', y') — every dest filled exactly once + needs interpolation.
Hough transform
Voting in parameter space. Each image edge pixel votes for all lines through it; peaks in accumulator = detected lines. Polar form avoids the vertical-line infinity problem.
Template matching
Slide a template over the image; score similarity (SSD, NCC, correlation coefficient). Fast but NOT rotation/scale invariant — motivates feature descriptors.
Histogram equalisation
Map intensity via CDF: $s = (L-1)\,\mathrm{CDF}(r)$ for $L$ gray levels. Output histogram is approximately uniform; contrast maximised; can amplify noise in flat regions.
DCT vs DFT
DCT is real-valued (cosines only) and mirror-extends → no boundary discontinuity → better energy compaction. JPEG uses DCT; spectrum analysis uses DFT.

Unit 3 — Machine Learning Recap

ML — Logistic, NN+Backprop, Ensembles, Density, RNN, Metrics, kNN, Regression, PCA/SVD, Clustering
Sigmoid / softmax
Squash to $(0, 1)$ / probability simplex. Sigmoid for binary, softmax for $K$ classes.
Cross-entropy loss
Information-theoretic distance between distributions. Categorical: $L = -\sum_k y_k \log \hat{y}_k$. Gradient on logits $= \hat{y} - y$, which is clean and stable.
Backpropagation
Chain-rule traversal of a computation graph from loss back to parameters. Each node multiplies the local Jacobian.
Vanishing / exploding gradients
Gradient norm decays to 0 (or blows up) as it propagates back through many layers / time steps. Mitigations: careful init, ReLU/skip connections (feed-forward nets), LSTM/gradient clipping (recurrent nets).
Batch Normalisation
Normalise activations per mini-batch to zero-mean unit-variance; then learnable scale $\gamma$ and shift $\beta$. Faster training, less init-sensitive, slight regularisation. At inference: use running mean/var, NOT batch stats.
Dropout
Stochastically zero a fraction $p$ of activations during training. Acts as an ensemble + regulariser. Disable (or rescale) at inference.
Bagging vs Boosting
Bagging: parallel, bootstrap samples, reduces variance (Random Forest). Boosting: sequential, reweight errors, reduces bias (AdaBoost; Viola-Jones face detection).
Gaussian Mixture Model
$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Trained by EM. Soft clustering with covariance-shaped components.
EM algorithm
E-step: compute responsibilities given current parameters. M-step: re-estimate parameters by weighted MLE. Monotonically improves log-likelihood; converges to a local maximum.
LSTM
Gated recurrent cell (not to be confused with the GRU) with three gates (forget, input, output) and an additive cell-state path that prevents vanishing gradients (constant error carousel).
Precision / Recall / F1
P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R) (harmonic mean). Use precision when FP costly; recall when FN costly.
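As a quick worked example of these three formulas, here is a small sketch; the helper name `precision_recall_f1` and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# 8 true positives, 2 false alarms, 8 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here precision is 0.8 and recall is 0.5; the harmonic mean lands closer to the weaker of the two, which is the point of F1.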
AP / mAP
Average Precision = area under the PR curve for one class. mAP = mean over classes (or queries).
ROC vs PR curve
ROC: TPR vs FPR. PR: Precision vs Recall. Prefer PR for imbalanced data with rare positives.
kNN
Lazy learner; predict by majority vote among k nearest training points. Small k → variance; large k → bias. k usually odd.
Normal equations
Closed-form least squares: $\hat{w} = (X^\top X)^{-1} X^\top y$. Fails when $X^\top X$ is singular (collinearity) or too large to invert.
L1 vs L2 regularisation
L1 = sum of |w|; drives weights to exactly 0 → feature selection. L2 = sum of w²; shrinks smoothly, no exact zeros.
PCA
Find orthogonal directions of max variance. Steps: centre → covariance → eigendecompose → keep top-$k$. Eigenvalue $\lambda_i$ = variance along eigenvector $v_i$.
SVD
$X = U \Sigma V^\top$. Right singular vectors (columns of $V$) are the PCA components of centred $X$. Numerically stable. Best rank-$k$ approximation in Frobenius norm (Eckart–Young).
LoRA
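Low-Rank Adaptation: $W = W_0 + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$; only $A$ and $B$ are trained. Cheap fine-tuning of huge models. Same spirit as SVD.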
k-means
Hard-assignment clustering. Init → assign → update → repeat. Issues: spherical bias, init-sensitive, outlier-sensitive. Fixes: k-means++, GMM, k-medoids.
Hierarchical clustering
Agglomerative (bottom-up) or divisive (top-down). Produces a dendrogram; cut at different heights for different cluster counts. Linkage: single / complete / average / Ward.

Unit 4 — Convolutional Neural Networks (CNNs)

CNN Concepts + Architectures — LeNet to EfficientNet, BN, 1×1, Dilation, Receptive Field
Convolution layer
Stack of learnable filters slid across input. Weight-shared across space → translation equivariance. Params $k^2 C_{in} C_{out}$ (plus $C_{out}$ biases), independent of input size $H \times W$.
Receptive field
Set of input pixels that influence one output unit. Grows with depth (and stride). $L$ stacked $k \times k$ convs at stride 1 → RF $= 1 + L(k-1)$, e.g. three $3 \times 3$ convs → $7 \times 7$.
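The growth with depth and stride follows a simple recurrence: each layer adds `(kernel - 1) * jump` to the receptive field, where `jump` is the product of all earlier strides. A minimal sketch, assuming layers are given as hypothetical `(kernel, stride)` pairs and ignoring dilation:

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a stack of conv/pool layers.

    layers: list of (kernel, stride) pairs, ordered from input to output.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])    # three 3x3 convs, stride 1
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])  # conv, 2x2 pool, conv
```

Three stacked 3×3 convolutions see the same 7×7 window as a single 7×7 filter, with fewer parameters; the pooled stack shows how a stride multiplies the contribution of every later layer.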
Pooling (max / avg / GAP)
Downsample with NO parameters. Max routes argmax (translation-robust). Avg distributes equally. GAP collapses $H \times W \times C \to 1 \times 1 \times C$, replaces giant FC.
Same / Valid padding
Same: $p = (k-1)/2$ (odd $k$) → output size equals input. Valid: $p = 0$ → output shrinks by $k - 1$. (Both at stride 1.)
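Both padding modes are special cases of the usual output-size formula, out = floor((n + 2p - d(k-1) - 1) / s) + 1. A small sketch, with the function and parameter names chosen here for illustration:

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


same = conv_output_size(32, kernel=3, padding=1)      # 'same' padding, p = (k-1)/2
valid = conv_output_size(32, kernel=3, padding=0)     # 'valid' padding, shrinks by k-1
dilated = conv_output_size(32, kernel=3, dilation=2)  # effective kernel is 5
stem = conv_output_size(224, kernel=7, stride=2, padding=3)
```

The last call is the familiar 7×7 stride-2 stem on a 224-pixel input, which halves the resolution.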
1×1 convolution
Per-pixel MLP across channels. Four uses: channel reduction (bottleneck), non-linearity injection, cross-channel mixing, cheap. Used in Inception, ResNet, MobileNet, SENet.
Dilated (atrous) convolution
Inserts zeros between kernel taps → effective kernel $k_{\text{eff}} = k + (k-1)(r-1)$ for dilation rate $r$. Enlarges RF without parameters or stride. Used in DeepLab, WaveNet. Gridding artefact at large $r$.
Depthwise-separable convolution
Depthwise ($k \times k$ per channel) + pointwise ($1 \times 1$). Roughly $8$ to $9\times$ cheaper than standard conv at $k = 3$ (cost ratio $1/C_{out} + 1/k^2$). Foundation of MobileNet, Xception.
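The saving is easy to check by counting weights. The sketch below ignores biases and uses illustrative channel counts (256 in, 256 out) that are not taken from the notes.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 conv mixes channels
    return depthwise + pointwise


std = standard_conv_params(3, 256, 256)
sep = depthwise_separable_params(3, 256, 256)
ratio = std / sep
```

The ratio approaches k² = 9 as the output channel count grows, since the pointwise term dominates the separable cost.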
Batch Normalisation (CNN)
Normalise per channel across $(N, H, W)$; learnable $\gamma, \beta$ per channel. Inference uses running mean/var (not batch). Faster training, less init-sensitive.
Translation equivariance vs invariance
Equivariance $f(T(x)) = T(f(x))$ vs invariance $f(T(x)) = f(x)$. Conv layers are equivariant; after global pool + FC, network is invariant. CNNs are NOT rotation equivariant.
LeNet
(LeCun, 1989/1998) First successful CNN. ~60k params. Two conv-pool blocks then FC. Template every later CNN follows.
AlexNet
(Krizhevsky et al., 2012) 60M params, ReLU + Dropout + augmentation, trained on 2 GPUs. Won ILSVRC-2012 by roughly 10 percentage points in top-5 error, starting the deep-learning era.
VGG
(Simonyan & Zisserman, 2014) Only $3 \times 3$ convs, deep + simple. VGG-19: 143.7M params, mostly in FC layers. Top-1 74.2%.
Inception / GoogLeNet
(Szegedy et al., 2014) Parallel branches at multiple kernel sizes with 1×1 bottlenecks placed BEFORE expensive 3×3/5×5. Inception-v3: 27.2M params, top-1 77.3%.
ResNet
(He et al., 2015) $y = F(x) + x$ residual connection. Gradient $\partial y / \partial x = \partial F / \partial x + I$ ⇒ no vanishing. Enables 100+ layers. ResNet-50: 25.6M params, 76.1% top-1. The workhorse backbone.
DenseNet
(Huang et al., 2016) $x_\ell = H_\ell([x_0, x_1, \dots, x_{\ell-1}])$: concatenation (not addition). Strong gradient flow, feature reuse.
SENet (Squeeze-Excitation)
(Hu et al., 2017) Channel attention: GAP → 2-layer MLP → sigmoid scale → multiply. +1% top-1 at near-zero cost.
MobileNet
(Howard et al., 2017) Depthwise-separable convs throughout. ~4.2M params, ~70% top-1. Designed for phones / edge.
EfficientNet
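(Tan & Le, 2019) Compound scaling: depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$, subject to $\alpha \beta^2 \gamma^2 \approx 2$. B0: 5.3M params, top-1 77.7%.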
C3D / I3D / SlowFast
Video CNN family. C3D: 3D conv from scratch. I3D: inflate 2D pretrained weights along time. SlowFast: slow (semantics, low fps high channel) + fast (motion, high fps low channel) pathways.

Unit 5 — Object Detection

Object Detection — R-CNN family, YOLO, NMS, mAP
Bounding box (modal vs amodal)
Modal: covers only the visible portion of the object; standard in PASCAL VOC, COCO, KITTI. Amodal: covers the full extent including occluded parts; used in specialised benchmarks.
Anchor
A predefined box prior at fixed scale and aspect ratio; predictions are regression offsets from it. Faster R-CNN uses $k = 9$ anchors per location (3 scales × 3 ratios).
Selective Search
Classical bottom-up segmentation (Uijlings et al., IJCV 2013). Starts from oversegmentation, greedily merges similar regions using colour + texture + size + fill similarity. Produces ~2000 class-agnostic proposals per image. Workhorse of R-CNN and Fast R-CNN; not learned.
RoI Pool
Project the proposal to the feature map (quantising to integer cells), divide into a fixed grid (typically $7 \times 7$), max-pool per cell. Differentiable, but the two roundings cause sub-pixel misalignment in the input image, which is fatal for masks.
RPN (Region Proposal Network)
Faster R-CNN's learned, backbone-sharing proposal generator. $3 \times 3$ conv → two heads (objectness + box regression). Translation-invariant: same anchor set at every spatial location.
FPN (Feature Pyramid Network)
Multi-scale feature representation with a top-down pathway and lateral connections (Lin et al., CVPR 2017). High-resolution shallow features + low-resolution deep features fused via lateral $1 \times 1$ convs. Lets one detector head handle small + large objects together. Standard in most modern detectors.
GIoU
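Generalised IoU. Subtracts from IoU the penalty $|C \setminus (A \cup B)| / |C|$, where $C$ is the smallest enclosing box: $\mathrm{GIoU} = \mathrm{IoU} - |C \setminus (A \cup B)| / |C|$. Non-zero gradient even when boxes don't overlap. Bounded in $[-1, 1]$.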
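A small sketch of the computation for axis-aligned boxes, assuming the `(x1, y1, x2, y2)` corner format with positive area; the function name `giou` is illustrative.

```python
def giou(a, b):
    """Generalised IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both a and b.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    iou = inter / union
    return iou - (enclose - union) / enclose


identical = giou((0, 0, 1, 1), (0, 0, 1, 1))
disjoint = giou((0, 0, 1, 1), (2, 0, 3, 1))  # IoU is 0, but GIoU is still informative
```

For the disjoint pair the plain IoU is 0 regardless of distance, while the GIoU value becomes more negative as the boxes move apart, which is what gives a usable gradient.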
Focal Loss
Cross-entropy multiplied by $(1 - p_t)^\gamma$ with $\gamma = 2$ (Lin et al., ICCV 2017). Crushes the gradient contribution of easy-classified examples. Essential for single-stage detectors where background anchors vastly outnumber foreground.
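The effect of the modulating factor is easiest to see numerically. The sketch below implements the binary form FL = -(1 - p_t)^γ log(p_t), omitting the optional class-balancing weight; the function name and the example probabilities are illustrative.

```python
import math


def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy_ce = focal_loss(0.9, 1, gamma=0.0)  # gamma = 0 reduces to plain cross-entropy
easy_fl = focal_loss(0.9, 1)             # down-weighted by (0.1)^2
hard_ce = focal_loss(0.1, 1, gamma=0.0)
hard_fl = focal_loss(0.1, 1)             # down-weighted only by (0.9)^2
```

With γ = 2 the well-classified example keeps about 1% of its cross-entropy loss while the hard example keeps about 81%, so hard examples dominate the total.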
Non-Maximum Suppression
Per-class procedure: sort detections by score; keep the top one; suppress all with IoU > $\tau$; repeat. $\tau = 0.5$ typical. Fails in dense crowds where legitimate overlapping objects get suppressed.
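The greedy loop can be sketched as follows for a single class, assuming boxes in `(x1, y1, x2, y2)` format; `iou` and `nms` are illustrative helper names, and the 0.5 threshold is just the typical default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes, best first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with IoU of about 0.68 and is suppressed; the third is far away and survives.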
Soft-NMS
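Variant that DECAYS suppressed scores instead of zeroing them. With $M$ the current top box, Linear: $s_i \leftarrow s_i\,(1 - \mathrm{IoU}(M, b_i))$ when $\mathrm{IoU}(M, b_i) \ge \tau$. Gaussian: $s_i \leftarrow s_i \exp(-\mathrm{IoU}(M, b_i)^2 / \sigma)$. Lets nearby legitimate objects coexist.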
mAP
Mean of per-class Average Precision (area under PR curve). VOC uses 11-point interpolation at IoU = 0.5. COCO averages AP over IoU thresholds $0.50{:}0.05{:}0.95$ (101-point interp per threshold). COCO numbers are smaller because the metric is stricter.
Smooth-L1 loss
Quadratic near zero (smooth gradient at the minimum), linear elsewhere (robust to outliers). Used for box regression in Fast/Faster R-CNN: $0.5x^2$ if $|x| < 1$ else $|x| - 0.5$.
Overfeat
Sermanet et al., ICLR 2014 (ILSVRC 2013 detection winner). Sliding-window classification + box regression, made practical by converting FC layers into convs so the whole image is processed in one forward pass.

Unit 6 — Dense Prediction: Segmentation + Depth

Dense Prediction — Segmentation & Monocular Depth
Semantic / Instance / Panoptic segmentation
Per-pixel class label / class + instance ID for things only / class + instance ID for everything (things and stuff). Panoptic is the most complete.
Things vs Stuff
Things: countable objects with distinct instances (person, car, animal). Stuff: amorphous, uncountable regions identified by texture/material (sky, road, water). Panoptic mixes both correctly.
FCN
Fully Convolutional Network (Long et al., CVPR 2015). Replace the classifier's FC with conv → per-pixel class map. Encoder downsamples; decoder upsamples; arbitrary input size.
Transposed convolution
Learnable upsampling: each input pixel scales the filter at the output position; overlapping contributions sum. Equivalent matrix view: if the forward conv is $y = Cx$, the transposed conv applies $C^\top$. NOT 'deconvolution'.
U-Net
Symmetric encoder-decoder with CONCAT skip connections at every resolution. Decoder gets deep semantic + shallow local features. Originally MICCAI 2015 medical imaging; now ubiquitous (Stable Diffusion's denoiser is a U-Net).
Dilated/Atrous convolution
Conv with dilation rate $r$ (gaps of $r - 1$ between kernel taps); multiplicatively expands receptive field without parameter growth or resolution loss. DeepLab's foundation. Risk: gridding artifacts at high rates.
ASPP (Atrous Spatial Pyramid Pooling)
Parallel branches of atrous convs at multiple rates (e.g., 6, 12, 18) for multi-scale context. Used in DeepLab v3.
Mask R-CNN
Faster R-CNN + a third FCN head producing binary mask per RoI per class. ICCV 2017, 41 k+ citations. Loss = $L_{cls} + L_{box} + L_{mask}$.
RoI Align
Bilinear interpolation at exact float coordinates with 4 sample points per bin — NO rounding. Replaces RoI Pool's quantisation; critical for pixel-precise masks.
PointRend
Adaptive boundary refinement (Kirillov et al., CVPR 2020). Coarse mask → identify uncertain pixels (prob ≈ 0.5) → point-MLP refinement on high-res features → iterate. Sharper boundaries at modest extra cost.
Dice coefficient
$\mathrm{Dice} = \frac{2|A \cap B|}{|A| + |B|}$. Same ranking as IoU. Denominator is SUM, not union. Dice slightly favours small objects.
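The "same ranking" claim follows from the monotone relation Dice = 2·IoU / (1 + IoU). A minimal sketch on masks represented as sets of pixel indices (an illustrative representation, assuming both masks are non-empty):

```python
def dice_and_iou(pred, gt):
    """Dice coefficient and IoU for two non-empty masks given as sets of pixel indices."""
    pred, gt = set(pred), set(gt)
    inter = len(pred & gt)
    dice = 2 * inter / (len(pred) + len(gt))  # denominator is the SUM of sizes
    iou = inter / len(pred | gt)              # denominator is the UNION
    return dice, iou


dice, iou = dice_and_iou({0, 1, 2, 3}, {2, 3, 4, 5})
```

Here two of four pixels overlap, so Dice is 0.5 while IoU is 1/3; Dice is never smaller than IoU for the same pair of masks.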
mIoU
Mean IoU across classes. The standard segmentation metric — robust to class imbalance, unlike pixel accuracy.
MiDaS
Relative monocular depth (Ranftl et al., 2019+). Scale-and-shift-invariant L1 loss lets training combine heterogeneous depth sources (stereo, SfM, synthetic, web video).
ZoeDepth
MiDaS-style relative depth pretraining + metric depth fine-tuning on KITTI/NYU. Zero-shot transfer to metric depth (Bhat et al., 2023).
Focal Loss
$\mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log p_t$ with $\gamma = 2$. Modulating factor crushes easy-classified examples; rebalances dense detection / segmentation. RetinaNet, ICCV 2017.

Unit 7 — Pose Estimation

Pose Estimation — Heatmaps, CPM, OpenPose, SMPL
Skeleton / keypoint representation
Fixed list of $K$ body keypoints, each $(x, y)$ or $(x, y, z)$. COCO uses 17. Simplest and most common pose representation.
DensePose
Per-pixel mapping from image to canonical 2D body surface — UV-mapping a 3D body to image pixels. Dense correspondence.
SMPL
Skinned Multi-Person Linear Model (Loper et al., 2015). Template + shape PCA + pose (24 joints × 3 axis-angle) → 6,890-vertex 3D mesh via linear blend skinning.
Human Mesh Recovery (HMR)
Regress SMPL parameters from a monocular image (Kanazawa et al., CVPR 2018). Trained with 2D reprojection loss to GT keypoints + optional 3D supervision + adversarial pose-plausibility loss.
Heatmap regression
Predict a 2D Gaussian belief map per joint instead of regressing $(x, y)$ directly. Per-pixel MSE against a Gaussian-centred target. argmax (or softargmax) at inference.
Convolutional Pose Machine (CPM)
Multi-stage pose architecture (Wei et al., CVPR 2016). Each stage refines belief maps using image features + previous stage's belief maps. Intermediate supervision at every stage combats vanishing gradients.
Part Affinity Field (PAF)
2D vector field per limb type. At each pixel along a limb, stores a unit vector along the limb direction; zero elsewhere. Used to group keypoints via line-integral scoring.
PCK / PCKh
Percentage of Correct Keypoints. A keypoint is correct if its distance to GT is below a threshold normalised by body size. PCKh@0.5 uses 0.5 × head bone length; PCK@0.2 uses 0.2 × torso diameter.
Bottom-up pose
Detect all keypoints in the image first (one pass), then group into individuals via association cues (PAFs + Hungarian matching). OpenPose paradigm. Runtime constant in number of people.
Top-down pose
Detect person bounding boxes first, then run single-person pose estimation per box. Mask R-CNN keypoints / AlphaPose. Higher per-person accuracy; runtime $O(N)$ in the number of people $N$; brittle on detection misses.
Sapiens
Meta's ECCV 2024 foundation model for humans. ViT backbone pretrained on 300M+ human-centric images; lightweight task heads for pose, segmentation, depth, normals. The 'foundation model + task heads' paradigm applied to human understanding.

Unit 8 — 3D Data (PointNet, DGCNN, MeshCNN)

3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN
Symmetric function
A function satisfying $f(x_1, \dots, x_n) = f(x_{\pi(1)}, \dots, x_{\pi(n)})$ for every permutation $\pi$. Permutation-invariant by definition.
Voxelization / occupancy grid
Discretising a point cloud onto a regular 3D grid; each voxel marks occupied/empty (or stores density). Enables 3D convolution but at cubic memory cost.
PointNet
Shared per-point MLP $h$ + symmetric max-pool over points + final MLP $\gamma$: $f(\{x_i\}) = \gamma(\max_i h(x_i))$. Universal approximator for continuous symmetric set functions (Qi et al., 2017).
Critical points
The subset of input points whose per-point features actually survive PointNet's max-pool. They determine the global feature; perturbing other points has no effect.
T-Net
A mini-PointNet inside PointNet that predicts a learned alignment matrix ($3 \times 3$ for input, $64 \times 64$ for features) and applies it before subsequent layers.
PointNet++
Hierarchical PointNet — sample anchors via Farthest Point Sampling, group neighbours via ball query, apply PointNet locally, then stack. Adds local context that vanilla PointNet lacks.
Farthest Point Sampling (FPS)
Greedy subsampling: pick first point arbitrarily, then iteratively add the point with maximum minimum-distance to the chosen set. Yields evenly-spaced anchors.
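The greedy loop can be sketched in pure Python as below. The helper names, the `start` argument for the arbitrary first pick, and the toy point set are all illustrative; a real implementation would be vectorised.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, m, start=0):
    """Return indices of m points chosen greedily by farthest-point sampling."""
    chosen = [start]
    # min_d2[i] = squared distance from point i to its nearest chosen point.
    min_d2 = [squared_distance(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=min_d2.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d2 = squared_distance(p, points[nxt])
            if d2 < min_d2[i]:
                min_d2[i] = d2
    return chosen


# Two tight clusters near x = 0 and x = 10, plus one point in between.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 0.0), (10.1, 0.0)]
anchors = farthest_point_sampling(points, m=3)
```

Starting from the first point, the sampler jumps to the far cluster and then to the midpoint, skipping the near-duplicates, which is the even-coverage behaviour PointNet++ relies on.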
EdgeConv (DGCNN)
Per-edge feature $e_{ij} = h_\Theta(x_i, x_j - x_i)$ over a kNN graph, max-aggregated. The graph is dynamic: rebuilt in feature space at each layer.
Dynamic graph
The kNN edges of DGCNN, recomputed in the current feature space at every layer (early layers: geometric neighbours; late layers: semantic neighbours).
MeshCNN
Edge-centric mesh operator with a 5-D intrinsic edge feature and a 4-edge fixed neighbourhood; conv uses symmetric input combinations; pooling = task-driven edge collapse.
Dihedral angle
The angle between the two triangular faces meeting at a mesh edge. One of MeshCNN's five intrinsic features.

Unit 9 — NeRF & 3D Gaussian Splatting

NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering
Novel view synthesis
Given photos of a scene, render the scene from a new camera pose not in the original set.
Explicit vs implicit 3D representation
Explicit = enumerate primitives directly (points, mesh, voxels, Gaussians). Implicit = encode as a function (SDF, NeRF MLP). 3DGS is explicit-but-fuzzy: explicit primitives that are 3D Gaussians, not hard points.
Signed Distance Function (SDF)
Implicit representation returning the signed distance to the nearest surface; surface = zero-level set.
NeRF
Neural Radiance Field: MLP $F_\Theta : (x, y, z, \theta, \phi) \mapsto (r, g, b, \sigma)$; render by volumetric integration along camera rays.
COLMAP
Open-source Structure-from-Motion + multi-view-stereo pipeline. For 3DGS: provides camera intrinsics, extrinsics (poses), and a sparse point cloud for initialisation. Pose estimation in 3DGS is classical, not learned.
Spherical harmonics (SH)
Orthonormal angular basis on the unit sphere. Degree $\ell$ has $2\ell + 1$ functions, so $(L+1)^2$ in total up to degree $L$. 3DGS uses $L = 3$ → 16 per channel → 48 for RGB. Captures view-dependent appearance (specular highlights).
Adaptive Density Control (ADC)
Three operations during optimisation: Clone (small Gaussian, high pos gradient), Split (large Gaussian, high pos gradient → two smaller, scale / 1.6), Prune (low-opacity Gaussian).
Differentiable rasterisation
Tile-based GPU rasteriser whose every step (sort, project, composite) is differentiable, so pixel gradients flow back to each Gaussian's parameters (position $\mu$, covariance $\Sigma$, opacity $\alpha$, SH coefficients).
Transmittance
$T_i = \prod_{j < i} (1 - \alpha_j)$: probability that light has passed through everything in front of Gaussian $i$ without being absorbed.
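Transmittance is what weights each contribution during front-to-back compositing: the pixel value is the sum of c_i · α_i · T_i over depth-sorted contributions, with T_i the product of (1 - α_j) over everything in front. A minimal sketch with scalar colours; the function name and the toy values are illustrative.

```python
def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted contributions.

    Returns the pixel value and the transmittance seen by each contribution.
    """
    transmittance = 1.0  # nothing in front of the first contribution
    pixel = 0.0
    seen = []
    for color, alpha in zip(colors, alphas):
        seen.append(transmittance)
        pixel += color * alpha * transmittance
        transmittance *= 1.0 - alpha  # light remaining after this contribution
    return pixel, seen


pixel, transmittances = composite(colors=[1.0, 0.0, 1.0], alphas=[0.5, 0.5, 1.0])
```

Each half-opaque layer halves the light reaching what lies behind it, so the fully opaque third layer contributes only a quarter of its colour.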
PSNR
Peak Signal-to-Noise Ratio = $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ where $\mathrm{MAX} = 255$ for 8-bit. Higher better; unbounded; ~30 dB good for 8-bit.
SSIM
Structural Similarity Index, in $[-1, 1]$ (in practice $[0, 1]$ for natural images), higher better. Sliding-kernel comparison capturing luminance, contrast, structure.
LPIPS
Learned Perceptual Image Patch Similarity. Distance in the feature space of a pretrained AlexNet/VGG; lower better; captures human perception.

Unit 10 — Attention & Transformers

Attention Mechanism & Transformer Architecture
Seq2Seq bottleneck
Pre-attention encoder-decoder RNNs compressed the entire source into a single fixed hidden vector; performance collapsed on long inputs.
Bahdanau attention
Per-decoder-step weighted sum over encoder hidden states; weights computed by an additive MLP score; learns alignment as a byproduct.
Q / K / V
Query / Key / Value — three learned projections of the input. Self-attention: all from same sequence. Cross-attention: Q from decoder, K, V from encoder.
Scaled dot-product attention
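$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right) V$. The $1/\sqrt{d_k}$ keeps softmax in its useful (non-saturated) regime.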
Multi-head attention
$h$ parallel attention heads, each with $d_k = d_{model} / h$; concatenate outputs and project with $W^O$. Same total params as single-head; heads can specialise.
Causal mask (look-ahead mask)
Upper-triangular mask of $-\infty$; added to attention scores pre-softmax; prevents the decoder from attending to future tokens.
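Scaled dot-product attention with an optional causal mask can be sketched in pure Python as follows. This is a single-head, unbatched illustration on lists of lists, with names chosen here; it computes softmax(QKᵀ / √d_k) V row by row and sets future scores to negative infinity before the softmax.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
masked = attention(Q, K, V, causal=True)
unmasked = attention(Q, K, V)
```

Under the causal mask the first position can only see itself, so its output is exactly the first value row; without the mask it becomes a blend of both rows.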
Cross-attention
Q from decoder's current state, K, V from encoder's output. Equivalent to Bahdanau attention in Q-K-V form.
Sinusoidal positional encoding
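Vaswani's $\sin$/$\cos$ pairs, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$, at exponentially decreasing frequencies; allows linear expression of relative position; generalises to unseen lengths.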
Pre-Norm vs Post-Norm
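Post-Norm (Vaswani 2017): $x \leftarrow \mathrm{LN}(x + \mathrm{Sublayer}(x))$; needs warmup. Pre-Norm (modern): $x \leftarrow x + \mathrm{Sublayer}(\mathrm{LN}(x))$; stable for deep stacks.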
Teacher forcing / student forcing
Training the decoder with ground-truth previous tokens (teacher) vs predicted previous tokens (student). Inference is always student-forcing.
Beam search
Maintain top-$B$ partial sequences at each step; expand and score; keep top-$B$ again. Trades compute for quality; a small beam width $B$ (single digits) is typical.
Soft vs hard attention
Soft: continuous weighted average, differentiable. Hard: discrete sampling of one position, requires REINFORCE.

Unit 11 — Vision Transformers (ViT)

ViT Pipeline, Scaling, and Swin
Patch embedding
Linear projection of flattened $P \times P$ patches to $D$-dim token embeddings; equivalently a Conv2d with kernel $P$ and stride $P$.
[CLS] token
Learnable summary token prepended to the patch sequence; its final hidden state is passed through a linear head for classification. Alternative: global-average-pool patch tokens (~equivalent accuracy).
Inductive bias
Architectural priors. CNNs have locality + translation equivariance baked in. ViTs have neither — they must learn them from data. Small data: bias helps. Large data: bias limits.
Pre-Norm
LayerNorm placed *before* each sublayer (MSA or MLP), with the residual added *after*. $x \leftarrow x + \mathrm{Sublayer}(\mathrm{LN}(x))$. Stable for deep stacks; used by ViT.
ViT-B/16
Vision Transformer Base with $16 \times 16$ patches: $L = 12$ layers, $D = 768$, MLP $3072$, $h = 12$ heads, ~86M params.
Attention distance
Average spatial distance over which a head attends. CNNs: small in early layers, grows with depth. ViTs: spans both small and large distances even in layer 1.
CKA (Centered Kernel Alignment)
Similarity metric for comparing representations across layers/models. Used by Raghu et al. to show ViT and CNN representations differ qualitatively.
JFT-300M
Google's internal 300M-image dataset; ViT's pretraining target where it overtakes ResNet by a wide margin.
Swin Transformer
Shifted-Window Transformer (Liu et al., ICCV 2021). Window self-attention ($O(M^2 N)$ per layer for window size $M$, i.e. linear in token count $N$) + shifted windows in alternate blocks (cross-window communication) + hierarchical patch merging (4-stage pyramid).
W-MSA / SW-MSA
Window multi-head self-attention with aligned windows / with shifted windows. Swin alternates these between blocks.
Patch merging (Swin)
$2 \times 2$ group of patches concatenated and projected $4C \to 2C$; halves spatial dims and doubles channels, like strided conv in CNNs.
PE interpolation
Bilinearly interpolating learned 1D position embeddings to a new sequence length when fine-tuning at higher resolution; no new parameters.

Unit 12 — SSL: Contrastive (SimCLR, MoCo, BYOL, CLIP)

Contrastive SSL — SimCLR / MoCo / BYOL / CLIP
Self-supervised learning (SSL)
Supervision derived from structure within the data itself; no human labels. Four families: old-school pretext, contrastive, language-image contrastive, generative.
Positive / negative pair
Positive: two augmented views of the same image. Negative: views from different images. Contrastive methods pull positives together and push negatives apart.
InfoNCE / NT-Xent
Contrastive loss: softmax cross-entropy with a positive logit and many negative logits, scaled by temperature $\tau$. SimCLR's specific form is NT-Xent.
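For a single anchor the loss is -log( exp(s⁺/τ) / Σ exp(s/τ) ), where the sum runs over the positive and all negatives. A minimal sketch on precomputed similarities; the function name, the default τ = 0.1, and the toy similarity values are illustrative assumptions.

```python
import math


def info_nce(sim_pos, sim_negs, tau=0.1):
    """InfoNCE loss for one anchor, given its similarity to one positive and N negatives."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)
    # Numerically stable log-sum-exp over positive + negatives.
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


uninformative = info_nce(0.5, [0.5, 0.5, 0.5])  # positive indistinguishable from negatives
well_separated = info_nce(1.0, [-1.0, -1.0, -1.0])
```

When the positive looks no different from the three negatives the loss equals log 4, the cost of a uniform guess over four candidates; once the positive is well separated the loss falls to nearly zero.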
Projection head $g_\phi$
Small MLP between the encoder and the contrastive loss. Discarded at downstream. Lets the encoder $f_\theta$ preserve broad features while $g_\phi$ enforces invariances.
Momentum encoder
Slowly-updated copy of the online encoder via EMA $\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$, $m \approx 0.99$ to $0.999$. MoCo's key encoder; BYOL/DINO's target encoder.
Memory queue (MoCo)
FIFO buffer of $K$ past key embeddings. Provides many negatives without large batch size. Updated each step by enqueuing the current batch's keys and dequeuing the oldest.
Predictor (BYOL)
Extra MLP on the online branch only; this asymmetry prevents collapse. Online = $q_\theta(g_\theta(f_\theta(x)))$; target = $g_\xi(f_\xi(x'))$ (no predictor).
Stop-gradient
Operator that blocks gradients during backprop. BYOL/DINO/JEPA all stop-grad on the target branch — the target is updated via EMA, not gradients.
Sinkhorn-Knopp algorithm (SwAV)
Iterative row/column normalisation that produces an equipartitioned soft cluster assignment. Prevents collapse to one prototype.
WIT (WebImageText)
CLIP's 400M-pair dataset; 500k queries × up to 20k pairs/query; scraped from the public internet.
Zero-shot classification (CLIP)
Embed class names as text prompts, embed the image, take the argmax cosine similarity. No labelled examples of target classes are seen.
DeViSE
Deep Visual-Semantic Embedding (Frome et al., NeurIPS 2013) — the 2013 precursor of CLIP; introduced image-text joint embedding 8 years earlier, at smaller scale.
Winoground
CVPR 2022 benchmark probing CLIP's compositional reasoning; pairs differ only in word order with matched swapped images. CLIP performs near chance.

Unit 13 — SSL: DINO, MAE, JEPA

DINO, MAE, JEPA — Modern SSL Beyond Contrastive
Self-distillation
Student trained to match teacher's output distribution; teacher and student share architecture; teacher updated via EMA of student. Cross-entropy loss. No negatives.
EMA teacher update
$\theta_t \leftarrow \lambda\,\theta_t + (1 - \lambda)\,\theta_s$ with cosine schedule $\lambda: 0.996 \to 1$. Teacher is a slowly-updated, smoothed version of the student.
Centering (DINO)
Subtract a running-mean bias from teacher logits before softmax. Prevents collapse to a single-dimension-dominated output. "Bias term added to logits."
Sharpening (DINO)
Apply a very low temperature ($\tau_t \approx 0.04$) to teacher logits before softmax. Produces a peaky, confident target. Prevents collapse to uniform output.
Multi-crop
DINO's augmentation strategy: 2 global views (>50% area, 224 px) fed to both teacher and student + 6–10 local views (<50%, 96 px) fed to student only. Forces local-to-global consistency.
$[\text{CLS}]$ attention as emergent segmentation
In DINO-pretrained ViT, $[\text{CLS}]$'s attention over patches concentrates on the salient object, producing usable segmentation-like maps with zero segmentation supervision.
Registers (DINOv2)
Extra learnable tokens prepended to the sequence with no positional embedding; absorb global scratchpad activity so real patch tokens have cleaner attention maps.
Masked Autoencoder (MAE)
Patchify image → randomly mask 75% → deep encoder on visible 25% only → light decoder reconstructs masked-patch pixels via MSE. Asymmetric architecture; encoder kept, decoder discarded.
Mask ratio (MAE)
Fraction of patches masked. 75% in MAE vs 15% in BERT — images have more spatial redundancy, so higher masking is required to prevent texture-copying shortcut.
JEPA (Joint-Embedding Predictive Architecture)
LeCun's program: predict target *representations* (from a separate EMA-updated target encoder) rather than pixels. Context encoder + target encoder + predictor; L2 in feature space with stop-gradient on target.
I-JEPA / V-JEPA / VL-JEPA
Image JEPA (Assran et al., CVPR 2023) / Video JEPA (2024) / Vision-Language JEPA (2025). Same recipe, different modalities.
Stop-gradient
Operator that blocks gradients during backprop. DINO/JEPA/BYOL all stop-grad the target branch — the target is updated via EMA, not gradients.

Unit 14 — Transformer Advances (ViT-5 era)

Modern Transformer Upgrades
Residual stream
The unbroken identity path in a Pre-Norm Transformer; gradients flow directly through it from output back to input.
Pre-Norm
$x \leftarrow x + \mathrm{Sublayer}(\mathrm{LN}(x))$. LayerNorm placed before the sublayer; residual stream is never normalised. Modern default.
Post-Norm
$x \leftarrow \mathrm{LN}(x + \mathrm{Sublayer}(x))$. 2017 original; needs careful warmup at depth.
RMSNorm
Drop-the-mean variant of LayerNorm: $\mathrm{RMSNorm}(x) = \gamma \odot x / \sqrt{\tfrac{1}{d}\sum_i x_i^2 + \epsilon}$. No mean subtraction, no bias $\beta$. Cheaper and as effective.
LayerScale
Per-channel learnable diagonal $\mathrm{diag}(\lambda)$ on each sublayer's residual contribution, initialised near zero (e.g. $10^{-4}$). Makes deep nets train as near-identity at init.
QK-Norm
Apply LayerNorm (or RMSNorm) to $Q$ and $K$ separately before the attention dot product. Prevents softmax saturation at long sequences / large head dims.
Registers
Extra learnable tokens prepended to the input with no positional encoding; act as global scratchpad. Without them, the model corrupts uninformative patches. Discarded at output.
Flash Attention
Exact, IO-aware attention algorithm. Tiles $Q, K, V$ into SRAM-sized blocks; computes streaming online softmax; never materialises the $N \times N$ matrix in HBM. Memory $O(N)$. 2–4× faster.
Online softmax
Stream the softmax: keep running max $m$, denominator $\ell$, output $O$; rescale $\ell$ and $O$ when $m$ grows. Mathematically equivalent to standard softmax, computable tile-by-tile.
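A small sketch of the streaming update for one query row, assuming scalar values per key (head dim 1) to keep it short. The function names and the tile size are illustrative; a real kernel applies the same recurrence to blocks of rows on the GPU.

```python
import math

def attention_row_online(scores, values, tile=2):
    """One attention output, computed tile by tile with a running max."""
    m = float("-inf")   # running max of the scores seen so far
    l = 0.0             # running softmax denominator
    o = 0.0             # running un-normalised output
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        scale = math.exp(m - m_new)          # rescale old sums when the max grows
        p = [math.exp(s - m_new) for s in s_tile]
        l = l * scale + sum(p)
        o = o * scale + sum(pi * vi for pi, vi in zip(p, v_tile))
        m = m_new
    return o / l

def attention_row_reference(scores, values):
    """Standard softmax-weighted sum over the full row at once."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    return sum(pi * vi for pi, vi in zip(p, values)) / sum(p)

scores = [0.5, 3.0, -1.0, 7.0, 2.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attention_row_online(scores, values), attention_row_reference(scores, values))
```

Both calls print the same number up to floating-point rounding, even though the streaming version never holds more than one tile of scores at a time.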
RoPE (Rotary Position Embedding)
Rotate pairs of dimensions of $Q$ and $K$ by an angle proportional to position. The attention dot product between positions $m$ and $n$ then depends only on the relative position $m - n$.
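The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of a vector, with the commonly used frequency base of 10000 taken as an assumption; the helper names are illustrative.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.3]
# Same offset (3) at two different absolute positions.
print(dot(rope(q, 5), rope(k, 2)), dot(rope(q, 13), rope(k, 10)))
```

The two printed scores match up to rounding because rotating the query by position 5 and the key by position 2 is equivalent to a single rotation by their difference.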
KV-cache
Cache of $K$ and $V$ for past tokens during autoregressive generation. Per-step attention compute $O(N)$; total generation $O(N^2)$. Memory grows linearly with sequence length $N$.
MHA / GQA / MQA
Multi-Head Attention: $H$ separate $K,V$ heads. Grouped-Query: $G$ shared $K,V$ groups ($1 < G < H$). Multi-Query: $G = 1$ (one $K,V$ for all heads). KV cache ratio: $H : G : 1$.
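To make the cache saving concrete, here is a back-of-the-envelope calculation. The model shape (32 layers, 32 query heads, head dim 128, fp16, 4096 tokens) is a hypothetical configuration chosen for round numbers, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V (hence the factor 2) for one sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

cfg = dict(layers=32, head_dim=128, seq_len=4096)   # hypothetical model shape
mha = kv_cache_bytes(kv_heads=32, **cfg)   # one K,V head per query head
gqa = kv_cache_bytes(kv_heads=8, **cfg)    # 8 shared K,V groups
mqa = kv_cache_bytes(kv_heads=1, **cfg)    # a single K,V head

print(mha // 2**20, gqa // 2**20, mqa // 2**20)   # 2048 512 64 (MiB)
```

The cache size depends only on the number of K,V heads, so the query-head count can stay at 32 in all three cases.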

Unit 15 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)

VLM Architecture — Encoders, Connectors, Positional Encoding
Modality gap
Text is discrete (vocab → lookup); images are continuous (pixels → encoder). A VLM aligns them into a shared latent space.
Three-pillar blueprint
Vision Encoder → Connector → LLM Backbone. Visual + text tokens concatenated, vanilla self-attention handles cross-modal reasoning.
Connector / adapter
Small (linear or MLP) projection from vision-encoder output dim to LLM token dim. In stitched VLMs, the only randomly initialised component.
SigLIP
Sigmoid CLIP. Pairwise binary cross-entropy with logits $z_{ij} = t\,x_i^\top y_j + b$ (learnable temperature $t$ and bias $b$). Scales without batch-wide softmax sync.
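A plain-Python sketch of the pairwise sigmoid loss, assuming L2-normalised embeddings given as lists. Here `t` and `b` stand for the learnable temperature and bias; the default values 10 and -10 are assumed initial values, and `siglip_loss` is an illustrative name.

```python
import math

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Sum of pairwise sigmoid losses, averaged over the images in the batch."""
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sum(x * y for x, y in zip(img[i], txt[j])) + b
            label = 1.0 if i == j else -1.0   # +1 for matched pairs, -1 otherwise
            # -log(sigmoid(z)) equals log(1 + exp(-z)), with z = label * logit
            total += math.log1p(math.exp(-label * logit))
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(siglip_loss(aligned, aligned), siglip_loss(aligned, swapped))
```

Every pair contributes an independent binary term, so no row-wise or column-wise normalisation across the batch is needed.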
Prefix-LM mask
Bidirectional attention over image + prompt; causal over answer. Loss computed only on answer tokens.
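The mask can be written out explicitly. In this sketch `prefix_len` counts image plus prompt tokens and `suffix_len` counts answer tokens; both function names are illustrative.

```python
def prefix_lm_mask(prefix_len, suffix_len):
    """mask[i][j] is True when token i may attend to token j."""
    n = prefix_len + suffix_len
    # Everyone sees the whole prefix; beyond it, attention is causal.
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]

def loss_mask(prefix_len, suffix_len):
    """Only answer (suffix) tokens contribute to the loss."""
    return [False] * prefix_len + [True] * suffix_len

for row in prefix_lm_mask(3, 2):
    print("".join("x" if allowed else "." for allowed in row))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

The top-left block is fully dense (bidirectional), while the answer rows form the usual lower-triangular causal pattern.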
Location tokens
1024 extended-vocab tokens encoding normalised bounding-box coordinates. Detection output is pure text.
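As a hedged illustration of how a box could become text, the sketch below quantises pixel coordinates into 1024 bins. The token spelling `<locNNNN>` and the `y_min, x_min, y_max, x_max` ordering are assumptions made here for the example, and `box_to_loc_tokens` is a hypothetical helper.

```python
def box_to_loc_tokens(box, width, height, bins=1024):
    """box = (x_min, y_min, x_max, y_max) in pixels -> four location tokens."""
    x0, y0, x1, y1 = box

    def quantise(value, size):
        # Normalise to [0, 1), then map to an integer bin index.
        return min(bins - 1, int(value / size * bins))

    # Assumed ordering: y_min, x_min, y_max, x_max.
    coords = ((y0, height), (x0, width), (y1, height), (x1, width))
    return [f"<loc{quantise(v, s):04d}>" for v, s in coords]

print(" ".join(box_to_loc_tokens((112, 56, 224, 168), width=448, height=448)))
# <loc0128> <loc0256> <loc0384> <loc0512>
```

Because the coordinates are normalised before binning, the same four tokens describe the same relative box at any input resolution.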
Segmentation codewords
128 extended-vocab tokens from a learned VQ-VAE codebook. Segmentation mask = bbox + codeword sequence decoded to pixels.
Dynamic resolution (Qwen2-VL)
Process at native aspect ratio + resolution; tile-count clamped by the `min_pixels` / `max_pixels` bounds. Fixes resolution bottleneck + aspect distortion.
2D-RoPE
Split head dim in halves; rotate first half by row, second half by column. Attention depends on 2D relative displacement.
M-RoPE
Multimodal RoPE (Qwen2-VL): head dim split in thirds for $(t, h, w)$ rotations (temporal, height, width). Static images: $t$ is constant. Text tokens: all three = token index.
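One way to picture the three position components is to list them for a short text + image + text sequence. This is a sketch under assumptions: the image block is offset by the index where it starts, and the following text resumes from the largest position used so far plus one. The function name is hypothetical.

```python
def mrope_position_ids(num_text_before, grid_h, grid_w, num_text_after):
    """Return one (t, h, w) position triple per token."""
    ids = [(i, i, i) for i in range(num_text_before)]   # text: all three equal
    start = num_text_before
    for r in range(grid_h):
        for c in range(grid_w):
            # Static image: temporal id fixed, height/width ids follow the grid.
            ids.append((start, start + r, start + c))
    nxt = (max(max(p) for p in ids) + 1) if ids else 0   # assumed resume rule
    ids += [(nxt + i,) * 3 for i in range(num_text_after)]
    return ids

for triple in mrope_position_ids(2, 2, 3, 1):
    print(triple)
```

For pure text the three components coincide, so the scheme reduces to ordinary 1D RoPE on text-only inputs.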
Native multimodal (Gemma 4)
Vision and language share Transformer blocks from early layers; no connector. Modalities meet inside the Transformer with shared weights.
VLA (Vision-Language-Action)
VLM + action de-tokeniser. Continuous actions discretised into 256 bins per dimension; treated as just another token type.
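The discretisation and its inverse can be sketched as follows, assuming uniform bins over a known per-dimension range `[low, high]`. The function names are illustrative, and real systems may choose the range from data statistics instead.

```python
def action_to_bins(action, low, high, bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)   # clip to the valid range
        tokens.append(min(bins - 1, int((a - lo) / (hi - lo) * bins)))
    return tokens

def bins_to_action(tokens, low, high, bins=256):
    """De-tokeniser: map each bin index back to its bin centre."""
    return [lo + (t + 0.5) / bins * (hi - lo) for t, lo, hi in zip(tokens, low, high)]

low, high = [-1.0] * 3, [1.0] * 3
tokens = action_to_bins([0.0, -1.0, 1.0], low, high)
print(tokens)                             # [128, 0, 255]
print(bins_to_action(tokens, low, high))
```

The round trip loses at most half a bin width per dimension, which is the price of treating actions as ordinary vocabulary tokens.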
OpenVLA
DINOv2 + SigLIP image encoders + MLP connector + Llama 2 7B + action de-tokeniser. Trained on 970k robot trajectories. ~7B params.

Unit 16 — Video Understanding

Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer
Action classification
One label per trimmed clip — "is this dancing?" Datasets: UCF, HMDB, Kinetics.
Temporal action localisation
Return start/end intervals of actions in an untrimmed video. "Dance: 12.4s–18.7s".
Spatio-temporal action localisation
Bounding box AND time interval per action. AVA is canonical.
Kinetics
Carreira & Zisserman, CVPR 2017. 400/600/700 action classes. The ImageNet for videos — pretraining target for nearly every modern video model.
AVA
Atomic Visual Actions (Gu et al., CVPR 2018). Spatio-temporally localised: box + time interval + label across pose, person-object, person-person categories.
Optical flow
Per-pixel 2D vector describing motion between consecutive frames. Explicit motion signal that RGB-only models miss.
3D convolution (Conv3D)
Convolution with a $k_t \times k_h \times k_w$ kernel sliding over a $T \times H \times W$ video volume. Captures spatio-temporal neighbours.
I3D (Inflated 3D ConvNet)
Carreira & Zisserman, CVPR 2017. Take a 2D ImageNet CNN, inflate each $N \times N$ filter to $N \times N \times N$ by replicating along time and dividing by $N$. Fine-tune on Kinetics.
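The reason for dividing by the temporal extent shows up on a static clip: the inflated filter then responds exactly like the original 2D filter. The sketch below checks this for a single filter position using nested lists; the helper names and the example kernel are illustrative.

```python
def inflate_2d_kernel(kernel_2d, t):
    """Repeat an h x w kernel t times along time and divide by t."""
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]

def response_2d(frame, kernel):
    """Filter response at one position (patch and kernel have equal size)."""
    return sum(f * k for f_row, k_row in zip(frame, kernel)
               for f, k in zip(f_row, k_row))

def response_3d(clip, kernel_3d):
    """Sum of per-frame responses over the temporal extent of the kernel."""
    return sum(response_2d(frame, k) for frame, k in zip(clip, kernel_3d))

kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # hypothetical 2D filter
frame = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
static_clip = [frame] * 3                        # same frame repeated 3 times

print(response_2d(frame, kernel), response_3d(static_clip, inflate_2d_kernel(kernel, 3)))
```

Both values agree up to rounding, so the inflated network starts fine-tuning from the behaviour the 2D network already had on still images.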
Two-Stream networks
Simonyan & Zisserman, NeurIPS 2014. Parallel spatial (RGB) + temporal (optical flow, $2L$ channels for $L$ stacked flow frames) CNNs, late fusion.
LRCN
Long-term Recurrent Convolutional Networks (Donahue et al., CVPR 2015). Per-frame CNN encoder → LSTM. Good for variable-length outputs (captions).
SlowFast
Feichtenhofer et al., CVPR 2019. Slow pathway (low fps, high channels — semantics) + Fast pathway (high fps, low channels — motion) with lateral fusion.
ViViT
Video Vision Transformer (Arnab et al., ICCV 2021). Two token-extraction strategies: per-frame patches or 3D *tubelets* ($t \times h \times w$). Explores attention factorisations.
Tubelet embedding
ViViT's strategy of linearly projecting 3D spatio-temporal cubes (e.g., $2 \times 16 \times 16$) into tokens, so spatio-temporal information is encoded from the start.
Divided space-time attention
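TimeSformer's winning factorisation: each block does temporal attention (per spatial location across time) then spatial attention (per frame). Each token attends to $T + S$ others instead of $T \cdot S$ (for $T$ frames and $S$ patches per frame), so cost grows far more slowly than joint space-time attention.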
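A quick count of query-key pairs shows the saving. The clip shape (8 frames, 196 patches per frame) is an assumed example, the classification token is ignored, and `attention_pairs` is an illustrative helper.

```python
def attention_pairs(frames, patches_per_frame):
    """Count query-key pairs for joint vs divided space-time attention."""
    n = frames * patches_per_frame
    joint = n * n                                  # every token attends to every token
    divided = n * (frames + patches_per_frame)     # temporal pass + spatial pass
    return joint, divided

joint, divided = attention_pairs(frames=8, patches_per_frame=196)
print(joint, divided, round(joint / divided, 2))
```

For this shape, joint attention needs roughly 7.7 times as many pairs, and the gap widens as the number of frames grows.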
Dense Event Captioning
Krishna et al., ICCV 2017. Given a long video, output a list of events with time intervals AND captions; multiple overlapping events allowed.