Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits
Revision Notes — Unit 8

Contrastive SSL — SimCLR / MoCo / BYOL / CLIP


Intuition

Every method before this needed *labels* — annotated boxes, pixel masks, class names. Then somebody asked: *what if the supervision could come from the data itself?* You hand the network an image with no label, but you create a task that the image can answer with its own structure — *"these two random crops came from the same image"*. Solve enough of these synthetic puzzles and the resulting features generalise. This is self-supervised learning (SSL), and over five years it has become the dominant pretraining recipe in vision. CLIP, DINO, MAE — every modern vision encoder is SSL-pretrained.

Explanation

Where the supervision comes from — the structural signals SSL exploits. *Language:* grammar (predict next word), fill-in-the-blanks (BERT), sentence ordering. *Images:* the old-school *pretext tasks* — colorisation (predict the colour version of a grayscale image), jigsaw puzzles (predict the original arrangement of 9 shuffled patches), neighbourhood proximity (predict spatial relationship between two patches). These produced OK features but the link from "solving jigsaws" to "classifying golden retrievers" was indirect. What replaced them is a more direct approach: contrastive learning.

The four-family taxonomy — memorise. *(1)* Old-school SSL — jigsaw, colorisation, autoencoders. *(2)* Contrastive — SimCLR, MoCo, BYOL, SwAV (and DINO from the next unit). *(3)* Language-image contrastive — CLIP, SigLIP. *(4)* Generative — masked autoencoders (MAE).

The Gelato Bet — a piece of trivia worth knowing. Alyosha Efros bet that by a deadline, a single self-supervised model would match supervised ImageNet pretraining on a comprehensive benchmark. The bet was won — SSL caught up around 2020–2021. If an exam question references "the Gelato Bet," it's about SSL catching up to supervised pretraining.

The contrastive recipe — the one-line principle. For each image, generate two augmented *views*. The two views are a positive pair (same image). All other images in the batch are negatives. Train the network so positive pairs have similar embeddings, and positive-vs-negative pairs have different embeddings. *That's it.* What changes between methods is what augmentations, where negatives come from, and how the loss is structured.

SimCLR — the simplest contrastive framework (Chen et al., ICML 2020). Four components — name them. *(1)* Data augmentation pipeline generates positive pairs. *(2)* **Encoder $f_\theta$** — a ResNet — produces representations $h = f_\theta(x)$. *(3)* **Projection head $g_\phi$** — a small 2-layer MLP — maps $h$ to a contrastive space $z = g_\phi(h)$. *(4)* Contrastive loss — NT-Xent / InfoNCE. **The contrastive loss is computed on $z$, not $h$.** After pretraining, $g_\phi$ is *discarded* and only $f_\theta$ is used downstream. Memorise: projection head present at training, thrown away at downstream.

SimCLR's aggressive augmentations. The headline finding was that aggressive augmentation is essential. Random cropping + random colour jitter are the dominant two; both must be strong, and the combination matters more than either individually. Each augmentation isolates a kind of invariance — crops teach scale/translation invariance, colour jitter teaches colour invariance; together they prevent the network from shortcutting via easy cues.

The NT-Xent loss (= InfoNCE). Given a batch of $N$ images and 2 augmentations → $2N$ views. Cosine similarity $s_{ij} = z_i^\top z_j / (\|z_i\|\,\|z_j\|)$. For one positive pair $(i, j)$ (the $j$ that was augmented from the same source as $i$): $\ell_{i,j} = -\log \frac{\exp(s_{ij}/\tau)}{\sum_{k \neq i} \exp(s_{ik}/\tau)}$. The denominator runs over all $2N - 1$ other views. Temperature $\tau$ (typically $0.1$–$0.5$) controls softmax sharpness. The connection — exam gold: NT-Xent is *softmax cross-entropy*, with logits = similarity scores and the "true label" = the index of the positive partner. Total loss averaged over all $2N$ positive-pair directions.
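
A minimal NT-Xent sketch in PyTorch (an assumed framework here; the batch layout — first $N$ rows are view 1, last $N$ are view 2 — is an illustrative convention, not SimCLR's exact code):

```python
import torch
import torch.nn.functional as F

def nt_xent(z, temperature=0.1):
    """NT-Xent over 2N projections z, where rows 0..N-1 are view 1 of each
    image and rows N..2N-1 are view 2 (so row i's positive is row i±N)."""
    z = F.normalize(z, dim=1)                  # unit norm -> dot product = cosine
    sim = z @ z.T / temperature                # (2N, 2N) similarity logits
    sim.fill_diagonal_(float('-inf'))          # a view is never its own negative
    n = z.shape[0] // 2
    # "true label" of row i = index of its positive partner
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)       # literally softmax cross-entropy
```

The last line makes the exam point concrete: once the logits are similarities and the label is the positive's index, NT-Xent *is* `F.cross_entropy`.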

What SimCLR taught the field. *(a)* Bigger batches are dramatically better — the denominator covers more negatives; SimCLR scaled to batches of 4096–8192. *(b)* The projection head matters — putting the loss on $z$ rather than directly on $h$ gives 5–10% better downstream accuracy. *(c)* Aggressive augmentations are essential — especially the crop + colour-jitter pair. But batch size 8192 means SimCLR only runs on TPU pods — smaller labs can't replicate. The next paper solved that.

MoCo — Momentum Contrast (He et al., CVPR 2020). Keeps the contrastive recipe but decouples negatives from batch size. *Idea 1 — a memory queue.* Maintain a large FIFO queue of past representations. New batch's negatives come from this queue, not from the current batch. Huge queue → many negatives without huge batch. *Idea 2 — a momentum encoder.* Maintain a separate slowly-updated EMA of the online encoder: $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$ with $m \approx 0.999$. The momentum encoder produces all *keys* (positive + negatives in the queue). Because $\theta_k$ changes slowly, all queue entries look like they came from approximately the same encoder — *consistency*.

MoCo loop. For each batch: query $q = f_{\theta_q}(x^q)$; positive key $k_+ = f_{\theta_k}(x^k)$ ($x^q, x^k$ two views of the same image); negative keys = the queue. InfoNCE: $q$ tries to match $k_+$ against the queue's negatives. Backprop only updates $\theta_q$; update $\theta_k$ by EMA of $\theta_q$; enqueue the current batch's keys, dequeue the oldest. Trade-off to memorise: SimCLR needs huge batches because every batch is the negative pool. MoCo decouples — small batches, large queue. (DINO's teacher EMA, next unit, comes directly from this.)
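
A minimal sketch of one MoCo step, assuming `f_q` / `f_k` are the query/key encoders and `queue` a `(K, d)` tensor of past keys; names and the queue bookkeeping are simplified placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # EMA: theta_k <- m * theta_k + (1 - m) * theta_q (key encoder never backpropped)
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

def moco_step(f_q, f_k, queue, x_q, x_k, tau=0.07):
    q = F.normalize(f_q(x_q), dim=1)               # queries: gradients flow here
    with torch.no_grad():
        k = F.normalize(f_k(x_k), dim=1)           # positive keys: no gradients
    l_pos = (q * k).sum(dim=1, keepdim=True)       # (N, 1) positive logit
    l_neg = q @ queue.T                            # (N, K) negatives from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(len(q), dtype=torch.long) # positive sits at column 0
    loss = F.cross_entropy(logits, labels)
    return loss, k   # caller: backprop, EMA-update f_k, enqueue k, dequeue oldest
```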

BYOL — no negatives at all (Grill et al., NeurIPS 2020). "Bootstrap Your Own Latent" asked the most heretical question of the contrastive era: *what if we drop the negatives entirely?* The contrastive intuition said this would collapse — the network would output the same vector for every image. BYOL showed this isn't true if the architecture is right.

BYOL setup. *Online network:* encoder $f_\theta$ + projector $g_\theta$ + predictor $q_\theta$. *Target network:* encoder $f_\xi$ + projector $g_\xi$ (no predictor); $\xi$ is an EMA of $\theta$. Loss: $\mathcal{L} = \|\bar{q}_\theta(z_\theta) - \bar{z}'_\xi\|_2^2$ — the MSE between L2-normalised prediction and target, symmetrised over the two views. Three critical elements: *(a)* predictor on the online branch ONLY — asymmetry; *(b)* stop-gradient on the target — no backprop through $\xi$; *(c)* EMA update of $\xi$. Why doesn't it collapse? The predictor breaks symmetry (only the online side can adapt to match the target → the system can't trivially go to a constant); the EMA delay means the target is always slightly behind the online → chasing a moving target prevents lock-up. Empirically, it just works — and it led directly to DINO.
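
A sketch of the BYOL objective, assuming `online` = encoder + projector, `predictor` the extra MLP, and `target` the EMA copy (updated elsewhere, as in MoCo's `momentum_update`); `torch.no_grad()` plays the stop-gradient role:

```python
import torch
import torch.nn.functional as F

def byol_loss(online, predictor, target, v1, v2):
    # Online branch: encoder+projector, then predictor (asymmetry lives here).
    p1 = F.normalize(predictor(online(v1)), dim=1)
    p2 = F.normalize(predictor(online(v2)), dim=1)
    with torch.no_grad():                      # stop-gradient: target gets no grads
        z1 = F.normalize(target(v1), dim=1)    # target branch has NO predictor
        z2 = F.normalize(target(v2), dim=1)
    # Symmetrised MSE of unit vectors (equivalently 2 - 2*cosine similarity).
    return ((p1 - z2) ** 2).sum(-1).mean() + ((p2 - z1) ** 2).sum(-1).mean()
```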

SwAV — contrastive clustering (Caron et al., NeurIPS 2020). "Swapping Assignments between Views." Learn $K$ prototype vectors $\{c_1, \dots, c_K\}$ (a codebook; $K$ on the order of a few thousand). For each augmented view $x_t$: compute $z_t = f_\theta(x_t)$; compute a soft cluster assignment $q_t$ via the Sinkhorn-Knopp algorithm (enforces that across a batch each prototype is used approximately equally — prevents collapse to one prototype). The *swap*: from view $s$, predict $q_t$ (the other view's cluster assignment) from $z_s$, and vice versa. Computationally cheap (no pairwise similarities), and Sinkhorn-Knopp gives an elegant anti-collapse equipartition.
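
A minimal Sinkhorn-Knopp sketch for the assignment step, assuming `scores` holds the `(B, K)` view-prototype dot products; `eps` and the iteration count are illustrative defaults in the spirit of the SwAV code:

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn (B, K) view-prototype scores into soft assignments whose columns
    are equipartitioned: each prototype receives ~equal total mass."""
    Q = torch.exp(scores / eps).T              # (K, B) unnormalised transport plan
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # rows: equal mass per prototype
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # cols: one unit per sample
    return (Q * B).T                           # (B, K) soft assignments q_t
```

Alternating the row and column normalisations is exactly the equipartition constraint described above — it is what stops every view from mapping to a single prototype.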

The contrastive family in one table:

| Method | Negatives? | Key mechanism |
| --- | --- | --- |
| SimCLR (2020) | Yes — in-batch | Aggressive augmentation + projection head |
| MoCo (2020) | Yes — queue | Queue + momentum encoder |
| BYOL (2020) | None | Predictor + EMA + stop-grad |
| SwAV (2020) | Implicit (cluster prototypes) | Online clustering, Sinkhorn |
| DINO (2021) | None | Self-distillation + multi-crop (next unit) |

CLIP — when language joined the party (Radford et al., 2021). Everything above used images only. CLIP uses paired text as the supervision signal. Three motivations for natural-language supervision. *(a)* Category labels don't scale — ImageNet has 22k classes; beyond 10k, getting good per-class labels is infeasible. There's just no way to enumerate all visual concepts. *(b)* Limited descriptive potential — "white shirt" vs "blue shirt" — separate classes? You'd need 10k just for clothing colours. *(c)* Not compositional — "laptop on top of a table" combines two objects and a spatial relationship; no class for every composition. Natural language has none of these limitations, and the internet has billions of image-text pairs for free.

A piece of history — DeViSE (Frome et al., NeurIPS 2013). Deep Visual-Semantic Embedding — the earliest attempt to learn a joint visual-textual embedding space. Pre-dated CLIP by 8 years; lacked the scale. The idea of image-text joint embedding is from 2013; CLIP just made it work at scale.

CLIP architecture. Two encoders, one shared embedding space. **Image encoder $f_I$** — typically a ResNet or ViT — outputs an image embedding $v$. **Text encoder $f_T$** — a Transformer over text tokens — outputs a text embedding $t$. Both project to a shared $d$-dimensional space via learned projections $W_I, W_T$; both are L2-normalised.

The CLIP loss — symmetric InfoNCE (memorise the pseudocode). Compute features $v_i = f_I(x_i)$ and $t_j = f_T(y_j)$; project and L2-normalise to get $\hat{v}_i, \hat{t}_j$; logits $L_{ij} = \hat{v}_i^\top \hat{t}_j \cdot e^{\tau}$ for learned temperature $\tau$. The $N \times N$ matrix has *positives on the diagonal* (image $i$ ↔ text $i$). Cross-entropy along rows (image→text retrieval) + cross-entropy along columns (text→image retrieval), averaged: $\mathcal{L} = \tfrac{1}{2}(\mathcal{L}_{\text{i}\to\text{t}} + \mathcal{L}_{\text{t}\to\text{i}})$. SigLIP (later) replaces this softmax with independent sigmoids; CLIP is the softmax-InfoNCE original.
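
A PyTorch rendering in the spirit of the paper's numpy pseudocode (the names `W_i`, `W_t`, `log_tau` are placeholders for the learned projections and temperature):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_feats, txt_feats, W_i, W_t, log_tau):
    v = F.normalize(img_feats @ W_i, dim=1)    # (N, d) projected + L2-normalised
    t = F.normalize(txt_feats @ W_t, dim=1)    # (N, d)
    logits = v @ t.T * log_tau.exp()           # (N, N); positives on the diagonal
    labels = torch.arange(len(v))              # image i pairs with text i
    loss_i2t = F.cross_entropy(logits, labels)    # CE along rows: image -> text
    loss_t2i = F.cross_entropy(logits.T, labels)  # CE along columns: text -> image
    return (loss_i2t + loss_t2i) / 2
```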

The secret sauce — data, not architecture. CLIP's secret was WIT (WebImageText): 400 million pairs scraped from the public internet, constructed using 500k search queries with up to 20k pairs per query for class balance. Word count comparable to GPT-2's training corpus. *CLIP's architectural contribution was modest; its data contribution was unprecedented.* Exam line: *why does CLIP work?* → *scale of natural-language supervision, 400M pairs*.

Zero-shot classification — the magic trick. For a new image classification task: *(1)* embed each class name as text — *"a photo of a {label}"* — via the text encoder; *(2)* embed the input image; *(3)* compute cosine similarity image-vs-each-text; *(4)* predict argmax. No training, no fine-tuning, no labelled images of the new task. CLIP achieves ~76% top-1 zero-shot on ImageNet, competitive with fully-supervised ResNet-50.
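
A zero-shot sketch, assuming the open-source `clip` package (`pip install git+https://github.com/openai/CLIP.git`) and a PIL image `img`; the prompt set and model choice are illustrative:

```python
import torch
import clip

model, preprocess = clip.load("ViT-B/32")
labels = ["goldfish", "tiger shark", "golden retriever"]
text = clip.tokenize([f"a photo of a {c}" for c in labels])

with torch.no_grad():
    t = model.encode_text(text)                            # (K, d): one per class
    v = model.encode_image(preprocess(img).unsqueeze(0))   # (1, d)
    t = t / t.norm(dim=-1, keepdim=True)                   # L2-normalise both
    v = v / v.norm(dim=-1, keepdim=True)
    pred = labels[(v @ t.T).argmax().item()]               # argmax cosine -> class
```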

CLIP's limitations — the "Why?" slide. *(a)* Not fine-grained — colours, exact part attributes are lost. *(b)* Not compositional — *"person on horse"* vs *"horse on person"* score similarly because CLIP matches global representations, not relational structure. *(c)* Matches global representations — insufficient for spatial relationships. *(d)* Needs hard negatives for compositional reasoning. Two papers diagnose: Winoground (Thrush et al., CVPR 2022) — CLIP performs near chance on captions that differ only in word order with matched swapped images. CLIP Association Bias (Yamada et al., EMNLP 2023, *"When are Lemons Purple?"*) — CLIP associates concepts in unexpected ways via textual co-occurrence (asked about "lemons," sometimes associates purple). These limitations motivate SigLIP, BLIP, and the MLLM line.

CLIP's impact. Its visual encoder is one of the most-used vision encoders today. LAION-5B — open-source 5-billion-pair replication of WIT. BLIP — CLIP-style ideas with added captioning. InstructBLIP — instruction-tuned, uses CLIP's visual encoder + an LLM. Stable Diffusion — uses CLIP's text encoder to condition image generation; *every Stable Diffusion image you see was generated by a model that uses CLIP inside*. SigLIP — directly succeeds CLIP, swaps softmax for sigmoid.

Definitions

  • **Self-supervised learning (SSL)** — supervision derived from structure within the data itself; no human labels. Four families: old-school pretext, contrastive, language-image contrastive, generative.
  • **Positive / negative pair** — positive: two augmented views of the same image. Negative: views from different images. Contrastive methods pull positives together and push negatives apart.
  • **InfoNCE / NT-Xent** — contrastive loss: softmax cross-entropy with a positive logit and many negative logits, scaled by temperature $\tau$. SimCLR's specific form is NT-Xent.
  • **Projection head $g_\phi$** — small MLP between the encoder and the contrastive loss. Discarded at downstream. Lets $h$ preserve broad features while $z$ enforces invariances.
  • **Momentum encoder** — slowly-updated copy of the online encoder via EMA $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, $m \approx 0.999$. MoCo's key encoder; BYOL/DINO's target encoder.
  • **Memory queue (MoCo)** — FIFO buffer of ~65k past key embeddings. Provides many negatives without large batch size. Updated each step by enqueuing the current batch's keys and dequeuing the oldest.
  • **Predictor (BYOL)** — extra MLP on the online branch only; the asymmetry that prevents collapse. Online = $f_\theta + g_\theta + q_\theta$; target = $f_\xi + g_\xi$ (no predictor).
  • **Stop-gradient** — operator that blocks gradients during backprop. BYOL/DINO/JEPA all stop-grad on the target branch — the target is updated via EMA, not gradients.
  • **Sinkhorn-Knopp algorithm (SwAV)** — iterative row/column normalisation that produces an equipartitioned soft cluster assignment. Prevents collapse to one prototype.
  • **WIT (WebImageText)** — CLIP's 400M-pair dataset; 500k queries × up to 20k pairs/query; scraped from the public internet.
  • **Zero-shot classification (CLIP)** — embed class names as text prompts, embed the image, take the argmax cosine similarity. No labelled examples of target classes are seen.
  • **DeViSE** — Deep Visual-Semantic Embedding (Frome et al., NeurIPS 2013) — the 2013 precursor of CLIP; introduced image-text joint embedding 8 years earlier, at smaller scale.
  • **Winoground** — CVPR 2022 benchmark probing CLIP's compositional reasoning; pairs differ only in word order with matched swapped images. CLIP performs near chance.

Formulas
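
Collected in one place (symbols as defined in the Explanation above):

  • NT-Xent (SimCLR): $\ell_{i,j} = -\log \frac{\exp(s_{ij}/\tau)}{\sum_{k \neq i} \exp(s_{ik}/\tau)}$, where $s_{ij} = z_i^\top z_j / (\|z_i\|\,\|z_j\|)$.
  • Momentum update (MoCo key encoder; BYOL/DINO targets): $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, $m \approx 0.999$.
  • BYOL loss: $\mathcal{L} = \|\bar{q}_\theta(z_\theta) - \bar{z}'_\xi\|_2^2$ (bars denote L2-normalisation; symmetrised over the two views).
  • CLIP loss: $\mathcal{L} = \tfrac{1}{2}\big(\mathrm{CE}_{\text{rows}}(L) + \mathrm{CE}_{\text{cols}}(L)\big)$ with $L_{ij} = \hat{v}_i^\top \hat{t}_j \cdot e^{\tau}$.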

Derivations

NT-Xent ≡ softmax cross-entropy. With one positive logit $s_+/\tau$ and $2N-2$ negative logits $s_k/\tau$, softmax cross-entropy for the positive class is $-\log \frac{e^{s_+/\tau}}{e^{s_+/\tau} + \sum_k e^{s_k/\tau}}$. NT-Xent is exactly this with logits = similarities scaled by $1/\tau$ and the positive picked by augmentation pairing. *Same equation, two names.*

**Why a projection head improves downstream.** The contrastive loss is *aggressive* — it forces $z_1 \approx z_2$ for augmentation pairs, which removes exactly the information the augmentations vary (colour, position). If you put the loss directly on $h$, the encoder has to throw away colour and position too. Inserting $g_\phi$ between $f_\theta$ and the loss lets $h$ preserve broad image information; $z = g_\phi(h)$ learns a contrastive-specific subspace where invariances are enforced. *Discard $g_\phi$ at downstream → recover the full $h$.* Empirical gain: 5–10% on ImageNet linear probe.

Why MoCo's momentum encoder gives queue consistency. Without EMA, the queue would hold keys produced by many *different* parameter snapshots over thousands of past iterations — inconsistent neighbours in feature space, training noise dominates. With $m = 0.999$, $\theta_k$ changes by ~0.1% per step; over the lifetime of a 65k-entry queue $\theta_k$ accumulates substantial total drift, but the change is *smooth* — adjacent queue entries come from nearly-identical $\theta_k$. Smoothness, not slowness per se, is what matters.

Why BYOL's stop-gradient is essential. Without stop-gradient, gradients flow symmetrically through both branches; the system can trivially minimise the loss by making both networks output the same constant — *representation collapse*. The stop-gradient + EMA forces the target to be a (delayed) function of the online, not jointly optimised with it; the asymmetric predictor then has to *learn* to match, not just copy.

The CLIP logits matrix and its symmetric CE. $L$ is $N \times N$ — entry $L_{ij}$ is the (temperature-scaled) cosine similarity between image $i$ and text $j$. *Row $i$*: softmax over $j$ → "which text caption matches image $i$?" — image-to-text. *Column $j$*: softmax over $i$ → "which image matches text $j$?" — text-to-image. Both cross-entropies are minimised when the *diagonal* is the argmax in every row and column. Averaging gives the symmetric loss.

Examples

  • SimCLR augmentation ablation. Random crop + colour jitter + Gaussian blur gives the best representation. Ablating *colour jitter* hurts more than ablating any other single augmentation — the model otherwise shortcuts on consistent colour statistics between the two views.
  • SimCLR batch-size dependency. Batch 256 → ~64% ImageNet linear probe. Batch 4096 → ~70%. (The oft-quoted ~76% figure is the wider ResNet-50 ×4 encoder.) The InfoNCE denominator's size is the bottleneck.
  • MoCo at batch 256. With a 65,536-entry queue and momentum $m = 0.999$, MoCo at batch 256 matches SimCLR at batch 4096 — *order-of-magnitude memory savings*.
  • BYOL collapse without stop-gradient. Run BYOL with grad flowing through the target branch → after a few hundred steps, both networks output the same constant vector, loss = 0, useless features.
  • Zero-shot ImageNet with CLIP. Build the prompt set "a photo of a goldfish", "a photo of a tiger shark", …, "a photo of a yellow lady's slipper" (1000 prompts). Encode all. Embed an image. Cosine similarity → argmax. 76.2% top-1 zero-shot, matching a fully-supervised ResNet-50.
  • CLIP retrieval matrix. Batch of 32 images and their captions. Compute the 32×32 similarity matrix. Count the rows whose argmax falls on the diagonal and divide by 32 → batch-level recall@1 (see the sketch after this list).
  • Winoground failure. "A dog on a horse" vs "a horse on a dog" — two captions, two correct matched images. CLIP scores both with near-identical cosines; accuracy at distinguishing them is barely above chance.
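
A sketch of the recall@1 computation from the retrieval example above (a row or column scores as correct when its diagonal entry is the argmax):

```python
import torch

def recall_at_1(logits):
    """logits: (N, N) image-text similarity matrix, positives on the diagonal."""
    idx = torch.arange(len(logits))
    i2t = (logits.argmax(dim=1) == idx).float().mean()  # image -> text, over rows
    t2i = (logits.argmax(dim=0) == idx).float().mean()  # text -> image, over columns
    return i2t, t2i
```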

Diagrams

  • SimCLR pipeline. Image → two augmentations → two views → shared encoder $f_\theta$ → projection $g_\phi$ → NT-Xent against the other views.
  • MoCo pipeline. Query encoder + slowly-updated momentum key encoder + memory queue of past keys. Query matches its positive key against the queue's negatives. Backprop updates the query encoder only; the key encoder updates by EMA; the queue updates FIFO (enqueue new keys, dequeue oldest).
  • BYOL pipeline. Online (encoder + projector + predictor) on top branch; target (encoder + projector, no predictor) on bottom branch with stop-gradient; $q_\theta(z_\theta)$ predicts $z'_\xi$.
  • SwAV with prototypes. Two views → encoder → features $z_1, z_2$; each compared against $K$ prototype vectors; Sinkhorn-Knopp gives equipartitioned soft assignments $q_1, q_2$; cross-entropy swaps: predict $q_2$ from $z_1$ and $q_1$ from $z_2$.
  • CLIP training matrix. $N \times N$ image-text cosine matrix; diagonal positives; rows = image-to-text CE; columns = text-to-image CE; loss = average.
  • CLIP zero-shot inference. $K$ class names → $K$ text embeddings; one image → one image embedding; $K$ cosine similarities → argmax → predicted class.

Edge cases

  • SimCLR with batch 256 underperforms supervised. Needs batch 4k+ for enough negatives — the denominator is the bottleneck.
  • BYOL without stop-gradient collapses to constant. Both networks output the same vector; loss = 0; features useless. Confirms the stop-gradient is the load-bearing element.
  • CLIP fine-grained classification (dog breeds). Zero-shot drops sharply on Stanford Dogs vs ImageNet — global representations don't separate "golden retriever" from "yellow Labrador".
  • CLIP compositional reasoning fails on Winoground; on relational scenes ("laptop *on* a table" vs "laptop *under* a table"), accuracy is near chance.
  • MoCo queue staleness. If $m$ is too low (e.g. $m = 0.9$), queue keys come from rapidly-changing encoders → inconsistent → poor features. $m = 0.999$ is the standard sweet spot.
  • SwAV cluster collapse without Sinkhorn. Without the equipartition constraint, all views map to a single prototype; Sinkhorn is the load-bearing piece.
  • CLIP prompt engineering. "a photo of a {}" works; the bare class name "{}" works less well; multi-prompt ensembles ("a photo of a {}", "a sketch of a {}", "an image of a {}", ...) give 1–3% extra zero-shot accuracy. CLIP is sensitive to text prompts.

Common mistakes

  • Stating SimCLR needs no negatives — it does (BYOL is the one without).
  • Keeping the projection head for downstream — the opposite is correct: discard $g_\phi$, keep only the encoder $f_\theta$.
  • Treating CLIP's loss as one-directional CE — it's symmetric over rows AND columns.
  • Calling CLIP zero-shot *"few-shot"* — no labelled examples of the target classes are seen.
  • Saying MoCo's momentum encoder is *trained by backprop* — it's not; only by EMA of the online encoder.
  • Confusing BYOL's *predictor* with MoCo's *momentum encoder* — different mechanisms (asymmetric MLP on online vs slow EMA on the key side).
  • Stating *"CLIP was a new architecture"* — its architecture was modest; the contribution was the 400M-pair WIT data.
  • Writing the InfoNCE denominator with only negatives — it includes the positive too (the positive logit appears in both numerator and denominator).

Shortcuts

  • Four ingredients of SimCLR: *augmentation → encoder $f_\theta$ → projection $g_\phi$ → NT-Xent*. Discard $g_\phi$ at downstream.
  • MoCo memorisation: queue + momentum encoder. Decouples #negatives from batch size.
  • BYOL memorisation: predictor + stop-gradient + EMA. No negatives. Three required for no collapse.
  • SwAV memorisation: prototypes + Sinkhorn-Knopp equipartition + swap cluster assignments.
  • NT-Xent = softmax cross-entropy with similarity logits and the positive index as the "label".
  • **CLIP loss = ½ (row CE + column CE)**, symmetric.
  • Zero-shot template: *"a photo of a {class}"* + cosine + argmax. Memorise.
  • WIT scale: 400M pairs, 500k queries, 20k pairs/query.

Proofs / Algorithms

NT-Xent equals softmax cross-entropy. Define logits $u_k = s_{ik}/\tau$. The softmax-CE for the positive at index $j$ is $-\log\big(e^{u_j} / \sum_{k \neq i} e^{u_k}\big)$. This is exactly NT-Xent's $\ell_{i,j}$.

BYOL's anti-collapse intuition. With stop-grad and EMA, the target is a delayed function of the online; the predictor must learn an explicit map online → target. If both branches output the same constant, the predictor sees no signal — but minimising the loss over varied augmentations requires the online representation to stay informative; under suitable initialisation + augmentation, the dynamics converge to a non-trivial fixed point. (Formal analyses by Tian et al. show that the predictor's eigenstructure prevents collapse.)

Why bigger SimCLR batches help. InfoNCE is a variational lower bound on mutual information; the bound becomes tight only as the number of negatives grows. Concretely, $\log N$ enters as a cap on the MI estimate — InfoNCE can never certify more than $\log N$ nats — so increasing $N$ (negatives) raises the achievable bound. Empirically, performance saturates around batch 4–8k for SimCLR; MoCo reaches the same with a ~65k-key queue.