Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits
Revision Notes — Unit 8

Contrastive SSL — SimCLR / MoCo / BYOL / CLIP


Intuition

Every method before this needed *labels* — annotated boxes, pixel masks, class names. Then somebody asked: *what if the supervision could come from the data itself?* You hand the network an image with no label, but you create a task that the image can answer with its own structure — *"these two random crops came from the same image"*. Solve enough of these synthetic puzzles and the resulting features generalise. This is self-supervised learning (SSL), and over five years it has become the dominant pretraining recipe in vision. CLIP, DINO, MAE — every modern vision encoder is SSL-pretrained.

Explanation

Where the supervision comes from — the structural signals SSL exploits. *Language:* grammar (predict next word), fill-in-the-blanks (BERT), sentence ordering. *Images:* the old-school *pretext tasks* — colorisation (predict the colour version of a grayscale image), jigsaw puzzles (predict the original arrangement of 9 shuffled patches), neighbourhood proximity (predict spatial relationship between two patches). These produced OK features but the link from "solving jigsaws" to "classifying golden retrievers" was indirect. What replaced them is a more direct approach: contrastive learning.

The four-family taxonomy — memorise. *(1)* Old-school SSL — jigsaw, colorisation, autoencoders. *(2)* Contrastive — SimCLR, MoCo, BYOL, SwAV (and DINO from the next unit). *(3)* Language-image contrastive — CLIP, SigLIP. *(4)* Generative — masked autoencoders (MAE).

The Gelato Bet — a piece of trivia worth knowing. Alyosha Efros bet that by a deadline, a single self-supervised model would match supervised ImageNet pretraining on a comprehensive benchmark. The bet was won — SSL caught up around 2020–2021. If an exam question references "the Gelato Bet," it's about SSL catching up to supervised pretraining.

The contrastive recipe — the one-line principle. For each image, generate two augmented *views*. The two views are a positive pair (same image). All other images in the batch are negatives. Train the network so positive pairs have similar embeddings, and positive-vs-negative pairs have different embeddings. *That's it.* What changes between methods is what augmentations, where negatives come from, and how the loss is structured.

SimCLR — the simplest contrastive framework (Chen et al., ICML 2020). Four components — name them. *(1)* Data augmentation pipeline generates positive pairs. *(2)* **Encoder $f_\theta$** — a ResNet — produces representations $h = f_\theta(x)$. *(3)* **Projection head $g_\phi$** — a small 2-layer MLP — maps $h$ to a contrastive space $z = g_\phi(h)$. *(4)* Contrastive loss — NT-Xent / InfoNCE. **The contrastive loss is computed on $z$, not $h$.** After pretraining, $g_\phi$ is *discarded* and only $f_\theta$ is used downstream. Memorise: projection head present at training, thrown away at downstream.

SimCLR's aggressive augmentations. The headline finding was that aggressive augmentation is essential. Random cropping + random colour jitter are the dominant two; both must be strong, and the combination matters more than either individually. Each augmentation isolates a kind of invariance — crops teach scale/translation invariance, colour jitter teaches colour invariance; together they prevent the network from shortcutting via easy cues.

The NT-Xent loss (= InfoNCE). Given a batch of $N$ images and 2 augmentations → $2N$ views. Cosine similarity $s_{ij} = z_i^\top z_j / (\|z_i\|\,\|z_j\|)$. For one positive pair $(i, j)$ (the $j$ that was augmented from the same source as $i$): $\ell_{i,j} = -\log \frac{\exp(s_{ij}/\tau)}{\sum_{k \neq i} \exp(s_{ik}/\tau)}$. The denominator runs over all $2N - 1$ other views. Temperature $\tau$ (typically $0.1$–$0.5$) controls softmax sharpness. The connection — exam gold: NT-Xent is *softmax cross-entropy*, with logits = similarity scores and the "true label" = the index of the positive partner. Total loss averaged over all $2N$ positive-pair directions.
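
A minimal NT-Xent sketch in PyTorch (an assumed framework here; the batch layout — first $N$ rows are view 1, last $N$ are view 2 — is an illustrative convention, not SimCLR's exact code):

```python
import torch
import torch.nn.functional as F

def nt_xent(z, temperature=0.1):
    """NT-Xent over 2N projections z, where rows 0..N-1 are view 1 of each
    image and rows N..2N-1 are view 2 (so row i's positive is row i±N)."""
    z = F.normalize(z, dim=1)                  # unit norm -> dot product = cosine
    sim = z @ z.T / temperature                # (2N, 2N) similarity logits
    sim.fill_diagonal_(float('-inf'))          # a view is never its own negative
    n = z.shape[0] // 2
    # "true label" of row i = index of its positive partner
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)       # literally softmax cross-entropy
```

The last line makes the exam point concrete: once the logits are similarities and the label is the positive's index, NT-Xent *is* `F.cross_entropy`.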

What SimCLR taught the field. *(a)* Bigger batches are dramatically better — the denominator covers more negatives; SimCLR scaled to batches of 4096–8192. *(b)* The projection head matters — putting the loss on $z$ rather than directly on $h$ gives 5–10% better downstream accuracy. *(c)* Aggressive augmentations are essential — especially the crop + colour-jitter pair. But batch size 8192 means SimCLR only runs on TPU pods — smaller labs can't replicate. The next paper solved that.

MoCo — Momentum Contrast (He et al., CVPR 2020). Keeps the contrastive recipe but decouples negatives from batch size. *Idea 1 — a memory queue.* Maintain a large FIFO queue of past representations. New batch's negatives come from this queue, not from the current batch. Huge queue → many negatives without huge batch. *Idea 2 — a momentum encoder.* Maintain a separate slowly-updated EMA of the online encoder: $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$ with $m \approx 0.999$. The momentum encoder produces all *keys* (positive + negatives in the queue). Because $\theta_k$ changes slowly, all queue entries look like they came from approximately the same encoder — *consistency*.

MoCo loop. For each batch: query $q = f_{\theta_q}(x^q)$; positive key $k_+ = f_{\theta_k}(x^k)$ ($x^q, x^k$ two views of the same image); negative keys = the queue. InfoNCE: $q$ tries to match $k_+$ against the queue's negatives. Backprop only updates $\theta_q$; update $\theta_k$ by EMA of $\theta_q$; enqueue the current batch's keys, dequeue the oldest. Trade-off to memorise: SimCLR needs huge batches because every batch is the negative pool. MoCo decouples — small batches, large queue. (DINO's teacher EMA, next unit, comes directly from this.)
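
A minimal sketch of one MoCo step, assuming `f_q` / `f_k` are the query/key encoders and `queue` a `(K, d)` tensor of past keys; names and the queue bookkeeping are simplified placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # EMA: theta_k <- m * theta_k + (1 - m) * theta_q (key encoder never backpropped)
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

def moco_step(f_q, f_k, queue, x_q, x_k, tau=0.07):
    q = F.normalize(f_q(x_q), dim=1)               # queries: gradients flow here
    with torch.no_grad():
        k = F.normalize(f_k(x_k), dim=1)           # positive keys: no gradients
    l_pos = (q * k).sum(dim=1, keepdim=True)       # (N, 1) positive logit
    l_neg = q @ queue.T                            # (N, K) negatives from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(len(q), dtype=torch.long) # positive sits at column 0
    loss = F.cross_entropy(logits, labels)
    return loss, k   # caller: backprop, EMA-update f_k, enqueue k, dequeue oldest
```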

BYOL — no negatives at all (Grill et al., NeurIPS 2020). "Bootstrap Your Own Latent" asked the most heretical question of the contrastive era: *what if we drop the negatives entirely?* The contrastive intuition said this would collapse — the network would output the same vector for every image. BYOL showed this isn't true if the architecture is right.

BYOL setup. *Online network:* encoder $f_\theta$ + projector $g_\theta$ + predictor $q_\theta$. *Target network:* encoder $f_\xi$ + projector $g_\xi$ (no predictor); $\xi$ is an EMA of $\theta$. Loss: $\mathcal{L} = \|\bar{q}_\theta(z_\theta) - \bar{z}'_\xi\|_2^2$ — the MSE between L2-normalised prediction and target, symmetrised over the two views. Three critical elements: *(a)* predictor on the online branch ONLY — asymmetry; *(b)* stop-gradient on the target — no backprop through $\xi$; *(c)* EMA update of $\xi$. Why doesn't it collapse? The predictor breaks symmetry (only the online side can adapt to match the target → the system can't trivially go to a constant); the EMA delay means the target is always slightly behind the online → chasing a moving target prevents lock-up. Empirically, it just works — and it led directly to DINO.
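
A sketch of the BYOL objective, assuming `online` = encoder + projector, `predictor` the extra MLP, and `target` the EMA copy (updated elsewhere, as in MoCo's `momentum_update`); `torch.no_grad()` plays the stop-gradient role:

```python
import torch
import torch.nn.functional as F

def byol_loss(online, predictor, target, v1, v2):
    # Online branch: encoder+projector, then predictor (asymmetry lives here).
    p1 = F.normalize(predictor(online(v1)), dim=1)
    p2 = F.normalize(predictor(online(v2)), dim=1)
    with torch.no_grad():                      # stop-gradient: target gets no grads
        z1 = F.normalize(target(v1), dim=1)    # target branch has NO predictor
        z2 = F.normalize(target(v2), dim=1)
    # Symmetrised MSE of unit vectors (equivalently 2 - 2*cosine similarity).
    return ((p1 - z2) ** 2).sum(-1).mean() + ((p2 - z1) ** 2).sum(-1).mean()
```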

SwAV — contrastive clustering (Caron et al., NeurIPS 2020). "Swapping Assignments between Views." Learn $K$ prototype vectors $\{c_1, \dots, c_K\}$ (a codebook; $K$ on the order of a few thousand). For each augmented view $x_t$: compute $z_t = f_\theta(x_t)$; compute a soft cluster assignment $q_t$ via the Sinkhorn-Knopp algorithm (enforces that across a batch each prototype is used approximately equally — prevents collapse to one prototype). The *swap*: from view $s$, predict $q_t$ (the other view's cluster assignment) from $z_s$, and vice versa. Computationally cheap (no pairwise similarities), and Sinkhorn-Knopp gives an elegant anti-collapse equipartition.
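
A minimal Sinkhorn-Knopp sketch for the assignment step, assuming `scores` holds the `(B, K)` view-prototype dot products; `eps` and the iteration count are illustrative defaults in the spirit of the SwAV code:

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn (B, K) view-prototype scores into soft assignments whose columns
    are equipartitioned: each prototype receives ~equal total mass."""
    Q = torch.exp(scores / eps).T              # (K, B) unnormalised transport plan
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # rows: equal mass per prototype
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # cols: one unit per sample
    return (Q * B).T                           # (B, K) soft assignments q_t
```

Alternating the row and column normalisations is exactly the equipartition constraint described above — it is what stops every view from mapping to a single prototype.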

The contrastive family in one table:

| Method | Negatives? | Key mechanism |
| --- | --- | --- |
| SimCLR (2020) | Yes — in-batch | Aggressive augmentation + projection head |
| MoCo (2020) | Yes — queue | Queue + momentum encoder |
| BYOL (2020) | None | Predictor + EMA + stop-grad |
| SwAV (2020) | Implicit (cluster prototypes) | Online clustering, Sinkhorn |
| DINO (2021) | None | Self-distillation + multi-crop (next unit) |

CLIP — when language joined the party (Radford et al., 2021). Everything above used images only. CLIP uses paired text as the supervision signal. Three motivations for natural-language supervision. *(a)* Category labels don't scale — ImageNet has 22k classes; beyond 10k, getting good per-class labels is infeasible. There's just no way to enumerate all visual concepts. *(b)* Limited descriptive potential — "white shirt" vs "blue shirt" — separate classes? You'd need 10k just for clothing colours. *(c)* Not compositional — "laptop on top of a table" combines two objects and a spatial relationship; no class for every composition. Natural language has none of these limitations, and the internet has billions of image-text pairs for free.

A piece of history — DeViSE (Frome et al., NeurIPS 2013). Deep Visual-Semantic Embedding — the earliest attempt to learn a joint visual-textual embedding space. Pre-dated CLIP by 8 years; lacked the scale. The idea of image-text joint embedding is from 2013; CLIP just made it work at scale.

CLIP architecture. Two encoders, one shared embedding space. **Image encoder $f_I$** — typically a ResNet or ViT — outputs an image embedding $v$. **Text encoder $f_T$** — a Transformer over text tokens — outputs a text embedding $t$. Both project to a shared $d$-dimensional space via learned projections $W_I, W_T$; both are L2-normalised.

The CLIP loss — symmetric InfoNCE (memorise the pseudocode). Compute features $v_i = f_I(x_i)$ and $t_j = f_T(y_j)$; project and L2-normalise to get $\hat{v}_i, \hat{t}_j$; logits $L_{ij} = \hat{v}_i^\top \hat{t}_j \cdot e^{\tau}$ for learned temperature $\tau$. The $N \times N$ matrix has *positives on the diagonal* (image $i$ ↔ text $i$). Cross-entropy along rows (image→text retrieval) + cross-entropy along columns (text→image retrieval), averaged: $\mathcal{L} = \tfrac{1}{2}(\mathcal{L}_{\text{i}\to\text{t}} + \mathcal{L}_{\text{t}\to\text{i}})$. SigLIP (later) replaces this softmax with independent sigmoids; CLIP is the softmax-InfoNCE original.
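
A PyTorch rendering in the spirit of the paper's numpy pseudocode (the names `W_i`, `W_t`, `log_tau` are placeholders for the learned projections and temperature):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_feats, txt_feats, W_i, W_t, log_tau):
    v = F.normalize(img_feats @ W_i, dim=1)    # (N, d) projected + L2-normalised
    t = F.normalize(txt_feats @ W_t, dim=1)    # (N, d)
    logits = v @ t.T * log_tau.exp()           # (N, N); positives on the diagonal
    labels = torch.arange(len(v))              # image i pairs with text i
    loss_i2t = F.cross_entropy(logits, labels)    # CE along rows: image -> text
    loss_t2i = F.cross_entropy(logits.T, labels)  # CE along columns: text -> image
    return (loss_i2t + loss_t2i) / 2
```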

The secret sauce — data, not architecture. CLIP's secret was WIT (WebImageText): 400 million pairs scraped from the public internet, constructed using 500k search queries with up to 20k pairs per query for class balance. Word count comparable to GPT-2's training corpus. *CLIP's architectural contribution was modest; its data contribution was unprecedented.* Exam line: *why does CLIP work?* → *scale of natural-language supervision, 400M pairs*.

Zero-shot classification — the magic trick. For a new image classification task: *(1)* embed each class name as text — *"a photo of a {label}"* — via the text encoder; *(2)* embed the input image; *(3)* compute cosine similarity image-vs-each-text; *(4)* predict argmax. No training, no fine-tuning, no labelled images of the new task. CLIP achieves ~76% top-1 zero-shot on ImageNet, competitive with fully-supervised ResNet-50.
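
A zero-shot sketch, assuming the open-source `clip` package (`pip install git+https://github.com/openai/CLIP.git`) and a PIL image `img`; the prompt set and model choice are illustrative:

```python
import torch
import clip

model, preprocess = clip.load("ViT-B/32")
labels = ["goldfish", "tiger shark", "golden retriever"]
text = clip.tokenize([f"a photo of a {c}" for c in labels])

with torch.no_grad():
    t = model.encode_text(text)                            # (K, d): one per class
    v = model.encode_image(preprocess(img).unsqueeze(0))   # (1, d)
    t = t / t.norm(dim=-1, keepdim=True)                   # L2-normalise both
    v = v / v.norm(dim=-1, keepdim=True)
    pred = labels[(v @ t.T).argmax().item()]               # argmax cosine -> class
```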

CLIP's limitations — the "Why?" slide. *(a)* Not fine-grained — colours, exact part attributes are lost. *(b)* Not compositional — *"person on horse"* vs *"horse on person"* score similarly because CLIP matches global representations, not relational structure. *(c)* Matches global representations — insufficient for spatial relationships. *(d)* Needs hard negatives for compositional reasoning. Two papers diagnose: Winoground (Thrush et al., CVPR 2022) — CLIP performs near chance on captions that differ only in word order with matched swapped images. CLIP Association Bias (Yamada et al., EMNLP 2023, *"When are Lemons Purple?"*) — CLIP associates concepts in unexpected ways via textual co-occurrence (asked about "lemons," sometimes associates purple). These limitations motivate SigLIP, BLIP, and the MLLM line.

CLIP's impact. Its visual encoder is one of the most-used vision encoders today. LAION-5B — open-source 5-billion-pair replication of WIT. BLIP — CLIP-style ideas with added captioning. InstructBLIP — instruction-tuned, uses CLIP's visual encoder + an LLM. Stable Diffusion — uses CLIP's text encoder to condition image generation; *every Stable Diffusion image you see was generated by a model that uses CLIP inside*. SigLIP — directly succeeds CLIP, swaps softmax for sigmoid.

Definitions

  • **Self-supervised learning (SSL)** — supervision derived from structure within the data itself; no human labels. Four families: old-school pretext, contrastive, language-image contrastive, generative.
  • **Positive / negative pair** — positive: two augmented views of the same image. Negative: views from different images. Contrastive methods pull positives together and push negatives apart.
  • **InfoNCE / NT-Xent** — contrastive loss: softmax cross-entropy with a positive logit and many negative logits, scaled by temperature $\tau$. SimCLR's specific form is NT-Xent.
  • **Projection head $g_\phi$** — small MLP between the encoder and the contrastive loss. Discarded at downstream. Lets $h$ preserve broad features while $z$ enforces invariances.
  • **Momentum encoder** — slowly-updated copy of the online encoder via EMA $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, $m \approx 0.999$. MoCo's key encoder; BYOL/DINO's target encoder.
  • **Memory queue (MoCo)** — FIFO buffer of ~65k past key embeddings. Provides many negatives without large batch size. Updated each step by enqueuing the current batch's keys and dequeuing the oldest.
  • **Predictor (BYOL)** — extra MLP on the online branch only; the asymmetry that prevents collapse. Online = $f_\theta + g_\theta + q_\theta$; target = $f_\xi + g_\xi$ (no predictor).
  • **Stop-gradient** — operator that blocks gradients during backprop. BYOL/DINO/JEPA all stop-grad on the target branch — the target is updated via EMA, not gradients.
  • **Sinkhorn-Knopp algorithm (SwAV)** — iterative row/column normalisation that produces an equipartitioned soft cluster assignment. Prevents collapse to one prototype.
  • **WIT (WebImageText)** — CLIP's 400M-pair dataset; 500k queries × up to 20k pairs/query; scraped from the public internet.
  • **Zero-shot classification (CLIP)** — embed class names as text prompts, embed the image, take the argmax cosine similarity. No labelled examples of target classes are seen.
  • **DeViSE** — Deep Visual-Semantic Embedding (Frome et al., NeurIPS 2013) — the 2013 precursor of CLIP; introduced image-text joint embedding 8 years earlier, at smaller scale.
  • **Winoground** — CVPR 2022 benchmark probing CLIP's compositional reasoning; pairs differ only in word order with matched swapped images. CLIP performs near chance.

Formulas
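
Collected in one place (symbols as defined in the Explanation above):

  • NT-Xent (SimCLR): $\ell_{i,j} = -\log \frac{\exp(s_{ij}/\tau)}{\sum_{k \neq i} \exp(s_{ik}/\tau)}$, where $s_{ij} = z_i^\top z_j / (\|z_i\|\,\|z_j\|)$.
  • Momentum update (MoCo key encoder; BYOL/DINO targets): $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, $m \approx 0.999$.
  • BYOL loss: $\mathcal{L} = \|\bar{q}_\theta(z_\theta) - \bar{z}'_\xi\|_2^2$ (bars denote L2-normalisation; symmetrised over the two views).
  • CLIP loss: $\mathcal{L} = \tfrac{1}{2}\big(\mathrm{CE}_{\text{rows}}(L) + \mathrm{CE}_{\text{cols}}(L)\big)$ with $L_{ij} = \hat{v}_i^\top \hat{t}_j \cdot e^{\tau}$.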

Derivations

NT-Xent ≡ softmax cross-entropy. With one positive logit $s_+/\tau$ and $2N-2$ negative logits $s_k/\tau$, softmax cross-entropy for the positive class is $-\log \frac{e^{s_+/\tau}}{e^{s_+/\tau} + \sum_k e^{s_k/\tau}}$. NT-Xent is exactly this with logits = similarities scaled by $1/\tau$ and the positive picked by augmentation pairing. *Same equation, two names.*

**Why a projection head improves downstream.** The contrastive loss is *aggressive* — it forces $z_1 \approx z_2$ for augmentation pairs, which removes exactly the information the augmentations vary (colour, position). If you put the loss directly on $h$, the encoder has to throw away colour and position too. Inserting $g_\phi$ between $f_\theta$ and the loss lets $h$ preserve broad image information; $z = g_\phi(h)$ learns a contrastive-specific subspace where invariances are enforced. *Discard $g_\phi$ at downstream → recover the full $h$.* Empirical gain: 5–10% on ImageNet linear probe.

Why MoCo's momentum encoder gives queue consistency. Without EMA, the queue would hold keys produced by many *different* parameter snapshots over thousands of past iterations — inconsistent neighbours in feature space, training noise dominates. With $m = 0.999$, $\theta_k$ changes by ~0.1% per step; over the lifetime of a 65k-entry queue $\theta_k$ accumulates substantial total drift, but the change is *smooth* — adjacent queue entries come from nearly-identical $\theta_k$. Smoothness, not slowness per se, is what matters.

Why BYOL's stop-gradient is essential. Without stop-gradient, gradients flow symmetrically through both branches; the system can trivially minimise the loss by making both networks output the same constant — *representation collapse*. The stop-gradient + EMA forces the target to be a (delayed) function of the online, not jointly optimised with it; the asymmetric predictor then has to *learn* to match, not just copy.

The CLIP logits matrix and its symmetric CE. $L$ is $N \times N$ — entry $L_{ij}$ is the (temperature-scaled) cosine similarity between image $i$ and text $j$. *Row $i$*: softmax over $j$ → "which text caption matches image $i$?" — image-to-text. *Column $j$*: softmax over $i$ → "which image matches text $j$?" — text-to-image. Both cross-entropies are minimised when the *diagonal* is the argmax in every row and column. Averaging gives the symmetric loss.

Examples

  • SimCLR augmentation ablation. Random crop + colour jitter + Gaussian blur gives the best representation. Ablating *colour jitter* hurts more than ablating any other single augmentation — the model otherwise shortcuts on consistent colour statistics between the two views.
  • SimCLR batch-size dependency. Batch 256 → ~64% ImageNet linear probe. Batch 4096 → ~70%. (The oft-quoted ~76% figure is the wider ResNet-50 ×4 encoder.) The InfoNCE denominator's size is the bottleneck.
  • MoCo at batch 256. With a 65,536-entry queue and momentum $m = 0.999$, MoCo at batch 256 matches SimCLR at batch 4096 — *order-of-magnitude memory savings*.
  • BYOL collapse without stop-gradient. Run BYOL with grad flowing through the target branch → after a few hundred steps, both networks output the same constant vector, loss = 0, useless features.
  • Zero-shot ImageNet with CLIP. Build the prompt set "a photo of a goldfish", "a photo of a tiger shark", …, "a photo of a yellow lady's slipper" (1000 prompts). Encode all. Embed an image. Cosine similarity → argmax. 76.2% top-1 zero-shot, matching a fully-supervised ResNet-50.
  • CLIP retrieval matrix. Batch of 32 images and their captions. Compute the 32×32 similarity matrix. Count the rows whose argmax falls on the diagonal and divide by 32 → batch-level recall@1 (see the sketch after this list).
  • Winoground failure. "A dog on a horse" vs "a horse on a dog" — two captions, two correct matched images. CLIP scores both with near-identical cosines; accuracy at distinguishing them is barely above chance.
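
A sketch of the recall@1 computation from the retrieval example above (a row or column scores as correct when its diagonal entry is the argmax):

```python
import torch

def recall_at_1(logits):
    """logits: (N, N) image-text similarity matrix, positives on the diagonal."""
    idx = torch.arange(len(logits))
    i2t = (logits.argmax(dim=1) == idx).float().mean()  # image -> text, over rows
    t2i = (logits.argmax(dim=0) == idx).float().mean()  # text -> image, over columns
    return i2t, t2i
```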

Diagrams

  • SimCLR pipeline. Image → two augmentations → two views → shared encoder $f_\theta$ → projection $g_\phi$ → NT-Xent against the other views.
  • MoCo pipeline. Query encoder + slowly-updated momentum key encoder + memory queue of past keys. Query matches its positive key against the queue's negatives. Backprop updates the query encoder only; the key encoder updates by EMA; the queue updates FIFO (enqueue new keys, dequeue oldest).
  • BYOL pipeline. Online (encoder + projector + predictor) on top branch; target (encoder + projector, no predictor) on bottom branch with stop-gradient; $q_\theta(z_\theta)$ predicts $z'_\xi$.
  • SwAV with prototypes. Two views → encoder → features $z_1, z_2$; each compared against $K$ prototype vectors; Sinkhorn-Knopp gives equipartitioned soft assignments $q_1, q_2$; cross-entropy swaps: predict $q_2$ from $z_1$ and $q_1$ from $z_2$.
  • CLIP training matrix. $N \times N$ image-text cosine matrix; diagonal positives; rows = image-to-text CE; columns = text-to-image CE; loss = average.
  • CLIP zero-shot inference. $K$ class names → $K$ text embeddings; one image → one image embedding; $K$ cosine similarities → argmax → predicted class.

Edge cases

  • SimCLR with batch 256 underperforms supervised. Needs batch 4k+ for enough negatives — the denominator is the bottleneck.
  • BYOL without stop-gradient collapses to constant. Both networks output the same vector; loss = 0; features useless. Confirms the stop-gradient is the load-bearing element.
  • CLIP fine-grained classification (dog breeds). Zero-shot drops sharply on Stanford Dogs vs ImageNet — global representations don't separate "golden retriever" from "yellow Labrador".
  • CLIP compositional reasoning fails on Winoground; on relational scenes ("laptop *on* a table" vs "laptop *under* a table"), accuracy is near chance.
  • MoCo queue staleness. If $m$ is too low (e.g. $m = 0.9$), queue keys come from rapidly-changing encoders → inconsistent → poor features. $m = 0.999$ is the standard sweet spot.
  • SwAV cluster collapse without Sinkhorn. Without the equipartition constraint, all views map to a single prototype; Sinkhorn is the load-bearing piece.
  • CLIP prompt engineering. "a photo of a {}" works; the bare class name "{}" works less well; multi-prompt ensembles ("a photo of a {}", "a sketch of a {}", "an image of a {}", ...) give 1–3% extra zero-shot accuracy. CLIP is sensitive to text prompts.

Common mistakes

  • Stating SimCLR needs no negatives — it does (BYOL is the one without).
  • Keeping the projection head for downstream — the opposite is correct: discard $g_\phi$, keep only the encoder $f_\theta$.
  • Treating CLIP's loss as one-directional CE — it's symmetric over rows AND columns.
  • Calling CLIP zero-shot *"few-shot"* — no labelled examples of the target classes are seen.
  • Saying MoCo's momentum encoder is *trained by backprop* — it's not; only by EMA of the online encoder.
  • Confusing BYOL's *predictor* with MoCo's *momentum encoder* — different mechanisms (asymmetric MLP on online vs slow EMA on the key side).
  • Stating *"CLIP was a new architecture"* — its architecture was modest; the contribution was the 400M-pair WIT data.
  • Writing the InfoNCE denominator with only negatives — it includes the positive too (the positive logit appears in both numerator and denominator).

Shortcuts

  • Four ingredients of SimCLR: *augmentation → encoder $f_\theta$ → projection $g_\phi$ → NT-Xent*. Discard $g_\phi$ at downstream.
  • MoCo memorisation: queue + momentum encoder. Decouples #negatives from batch size.
  • BYOL memorisation: predictor + stop-gradient + EMA. No negatives. Three required for no collapse.
  • SwAV memorisation: prototypes + Sinkhorn-Knopp equipartition + swap cluster assignments.
  • NT-Xent = softmax cross-entropy with similarity logits and the positive index as the "label".
  • **CLIP loss = ½ (row CE + column CE)**, symmetric.
  • Zero-shot template: *"a photo of a {class}"* + cosine + argmax. Memorise.
  • WIT scale: 400M pairs, 500k queries, 20k pairs/query.

Proofs / Algorithms

NT-Xent equals softmax cross-entropy. Define logits $u_k = s_{ik}/\tau$. The softmax-CE for the positive at index $j$ is $-\log\big(e^{u_j} / \sum_{k \neq i} e^{u_k}\big)$. This is exactly NT-Xent's $\ell_{i,j}$.

BYOL's anti-collapse intuition. With stop-grad and EMA, the target is a delayed function of the online; the predictor must learn an explicit map online → target. If both branches output the same constant, the predictor sees no signal — but minimising the loss over varied augmentations requires the online representation to stay informative; under suitable initialisation + augmentation, the dynamics converge to a non-trivial fixed point. (Formal analyses by Tian et al. show that the predictor's eigenstructure prevents collapse.)

Why bigger SimCLR batches help. InfoNCE is a variational lower bound on mutual information; the bound becomes tight only as the number of negatives grows. Concretely, $\log N$ enters as a cap on the MI estimate — InfoNCE can never certify more than $\log N$ nats — so increasing $N$ (negatives) raises the achievable bound. Empirically, performance saturates around batch 4–8k for SimCLR; MoCo reaches the same with a ~65k-key queue.