
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits
Revision Notes / Unit 9 — SSL: DINO, MAE, JEPA

DINO, MAE, JEPA — Modern SSL Beyond Contrastive


Intuition

The previous unit's contrastive recipe — pull positives, push negatives — works but has problems: it needs many negatives (big batches or memory banks), is sensitive to augmentation choice, and produces good-but-not-great features. In 2021, three papers in quick succession asked: do we even need negative samples? One by one they showed you don't. DINO (self-distillation), MAE (masked reconstruction in pixel space), and JEPA (prediction in representation space) form the second great family of SSL alongside contrastive methods.

Explanation

DINO — self-DIstillation with NO labels (Caron et al., ICCV 2021). Take a ViT. Make *two copies*. Student — trained with backprop. Teacher — identical architecture, *different weights, no backprop*. Both output a probability distribution over $K$ dimensions (a learned codebook; $K = 65536$ in the paper) via softmax: $P_s(x) = \mathrm{softmax}\big(g_{\theta_s}(x)/\tau_s\big)$, $P_t(x) = \mathrm{softmax}\big(g_{\theta_t}(x)/\tau_t\big)$.

DINO loss — teacher's distribution as a soft target. The student learns to match the teacher's output distribution for the same input image (under different augmentations): $\mathcal{L} = -\sum_{k=1}^{K} P_t(x)_k \,\log P_s(x')_k$, where $x$ and $x'$ are two augmented views. This is *cross-entropy between teacher and student distributions*. Gradients flow only through the student. The teacher acts as a soft-label supplier; the student tries to reproduce the teacher's belief.

Teacher update — EMA of the student. $\theta_t \leftarrow \lambda\,\theta_t + (1-\lambda)\,\theta_s$, with $\lambda$ on a cosine schedule from 0.996 to 1. Early training: $\lambda = 0.996$ → teacher updates relatively quickly (4 parts per thousand of student per step). Late training: $\lambda \to 1$ → teacher freezes. The student is always trying to imitate a *slightly delayed, smoothed version of itself* — small training noise is averaged out. Directly borrowed from MoCo / BYOL.
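
A minimal PyTorch sketch of one DINO step under these definitions; `student`, `teacher`, and the `center` argument are placeholder names (the multi-crop pairing and the centre update are shown further below):

```python
import torch
import torch.nn.functional as F

def dino_step(student, teacher, x_student, x_teacher,
              tau_s=0.1, tau_t=0.04, center=0.0):
    # Teacher target: centred, sharpened softmax; no gradients flow here.
    with torch.no_grad():
        p_t = F.softmax((teacher(x_teacher) - center) / tau_t, dim=-1)
    # Student prediction at a higher (softer) temperature.
    log_p_s = F.log_softmax(student(x_student) / tau_s, dim=-1)
    # Cross-entropy between teacher and student distributions.
    return -(p_t * log_p_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(student, teacher, lam):
    # theta_t <- lam * theta_t + (1 - lam) * theta_s
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(lam).add_(p_s, alpha=1 - lam)
```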

Multi-crop strategy — DINO's signature. From a single input image, generate multiple crops at two scales. 2 "global" views — large crops (>50% of the image), $224^2$ pixels, fed to both the teacher and the student. 6–10 "local" views — small crops (<50%), $96^2$ pixels, fed to the student only. Compute cross-entropy for every (teacher-global, student-other) pair and sum. The teacher only sees big crops; the student sees both. So the student is forced to predict a global picture's distribution from a small local crop — *"this little patch of grass corresponds to the same scene as that wide image of the meadow."* This is what gives DINO its remarkable local-to-global consistency.
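
A sketch of how the multi-crop pairs could be combined, reusing the per-pair cross-entropy above; list names and ordering are illustrative assumptions:

```python
def dino_multicrop_loss(teacher_probs, student_log_probs):
    """teacher_probs: softmax outputs for the 2 global crops (teacher branch).
    student_log_probs: log-softmax outputs for all crops (student branch),
    ordered so the first len(teacher_probs) entries are the same global crops."""
    total, n_pairs = 0.0, 0
    for t_idx, p_t in enumerate(teacher_probs):               # teacher: globals only
        for s_idx, log_p_s in enumerate(student_log_probs):   # student: every crop
            if s_idx == t_idx:
                continue                                      # skip the identical view
            total = total + (-(p_t * log_p_s).sum(dim=-1).mean())
            n_pairs += 1
    return total / n_pairs
```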

The collapse problem — the central technical danger. Trivial solution: student and teacher both output the same *constant* vector for every image → loss minimised, all signal destroyed. Two specific failure modes: *(a)* single-dim domination — softmax always peaks on the same component regardless of input; *(b)* uniform output for every input. Both kill the signal. DINO must prevent both.

Trick 1 — Centering (prevents single-dim collapse). Subtract a running-mean bias from teacher logits before softmax: $g_t(x) \leftarrow g_t(x) - c$; $c \leftarrow m\,c + (1-m)\,\frac{1}{B}\sum_{i=1}^{B} g_{\theta_t}(x_i)$. $c$ is updated by EMA over the batch. This discourages any one logit dimension from systematically dominating. Slide line: "centering done through bias term added to logits." Without it, the network would learn to output the same peaked distribution regardless of input.

Trick 2 — Sharpening (prevents uniform collapse). The teacher uses a *very low* temperature $\tau_t \approx 0.04$, much smaller than the student's $\tau_s \approx 0.1$. A small temperature sharpens the softmax — making the teacher's output close to one-hot. If the teacher's outputs were uniform, the soft labels would carry no signal; forcing the teacher to be *confident* (peaked) ensures labels are always informative. Slide line: "sharpening done using low value of temperature."

Why both tricks are needed (the central insight). *Centering alone* → uniform collapse (everything looks flat). *Sharpening alone* → single-dim domination (everything peaks on the same dim). Together → neither fails, training is stable. Memorise as opposing forces: centering spreads, sharpening peaks.
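
Both tricks in one small sketch: a helper that turns raw teacher logits into centred, sharpened targets. The class name, the centre momentum `m`, and the default values are illustrative:

```python
import torch
import torch.nn.functional as F

class TeacherTargets:
    """Centred and sharpened teacher distributions (sketch)."""
    def __init__(self, out_dim, tau_t=0.04, m=0.9):
        self.center = torch.zeros(out_dim)  # running mean of teacher logits
        self.tau_t = tau_t                  # low temperature -> peaked targets
        self.m = m                          # EMA momentum for the centre

    @torch.no_grad()
    def __call__(self, teacher_logits):
        self.center = self.center.to(teacher_logits.device)
        # Centering: subtract the bias; Sharpening: divide by a low temperature.
        probs = F.softmax((teacher_logits - self.center) / self.tau_t, dim=-1)
        # Update the centre with the current batch mean (EMA).
        batch_mean = teacher_logits.mean(dim=0)
        self.center = self.m * self.center + (1 - self.m) * batch_mean
        return probs
```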

DINO headline results. *78.3% top-1 ImageNet accuracy using just k-NN on frozen features* (no fine-tuning, no linear probe). *76.1% ImageNet linear probe* after training on 2 × 8-GPU servers for 3 days. Works on both CNNs and ViTs. The killer property: DINO ViTs spontaneously learn object-level attention maps — the $[\text{CLS}]$ token's attention produces usable segmentation-like masks with no segmentation supervision. *Segmentation emerges from self-supervision* — what made DINO famous beyond benchmarks.

DINOv2 (Oquab et al., 2023). Scaled DINO to 142M curated images with a more carefully engineered training pipeline. Became the standard pretrained vision encoder for dense prediction (depth, segmentation, retrieval). Many modern multimodal models use DINOv2 features (e.g. OpenVLA uses DINOv2 + SigLIP).

DINOv2 + Registers (Darcet et al., 2023). Adds register tokens — extra learnable tokens prepended to the sequence with no positional meaning. The model uses them as a scratchpad for global information, freeing real patch tokens from acting as scratchpad → cleaner attention maps and better dense prediction. The "garbage collector tokens" idea.

MAE — Masked Autoencoder (He, Chen et al., CVPR 2022). If DINO is vision's BYOL, MAE is vision's BERT. BERT pretrains language models by masking ~15% of tokens and asking the model to predict them. Naïvely doing the same for vision — mask 15% of image patches and reconstruct — doesn't work well. The model *shortcuts* by copying texture from neighbouring patches without learning any real semantics. Vision is more redundant than language: a missing patch is almost always predictable from its neighbours' colours and textures.

MAE's resolution — mask aggressively. The lecture line: *"BERT-like LMs mask 15% of the tokens. MAEs choose to remove 75% of the image tokens."* And: *"Make pretraining difficult, otherwise model will shortcut and not learn meaningful stuff."* Forcing reconstruction of 75% from only 25% visible patches makes the task genuinely hard — there's no local-texture shortcut. The model is forced to learn global structure.

The asymmetric encoder-decoder architecture (MAE's efficiency trick). *(1)* Patch the image (ViT-style): e.g. $14 \times 14 = 196$ patches for ViT-B/16 at $224^2$. *(2)* Randomly mask 75% → keep only 25% visible. *(3)* Pass the 25% visible patches through a deep ViT encoder — *the encoder never sees the mask tokens*. *(4)* Insert learnable mask tokens at masked positions, each with its positional embedding. *(5)* Concatenate visible-encoded tokens + mask tokens. *(6)* Pass through a small, lightweight ViT decoder. *(7)* Decoder predicts the *pixels* of the masked patches.

Why the asymmetric architecture matters. The encoder only ever sees 25% of the tokens. *Encoder cost drops by ~4×* — you're paying compute for only a quarter of the sequence. The encoder learns to encode the visible signal into something useful, with no need to think about masked positions during the forward pass. The decoder is lightweight — it does the reconstruction, then is thrown away after pretraining. Only the encoder is kept; the decoder is task-specific scaffolding.

MAE loss — pixel-space MSE on masked patches only: $\mathcal{L} = \frac{1}{|M|}\sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2$, where $M$ is the set of masked patches, $x_i$ is the original patch, and $\hat{x}_i$ is the decoder's prediction. No loss on visible patches (they were given as input).
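
A compact sketch of the whole MAE forward pass and loss. It assumes `patchify` returns flattened raw patch pixels, `mask_token` is a learnable `(1, 1, D)` parameter, and, for brevity, that encoder width, decoder width, and patch-pixel dimension are all the same `D` (the real model uses different widths and re-adds positional embeddings before the decoder):

```python
import torch

def mae_loss(imgs, patchify, encoder, decoder, mask_token, mask_ratio=0.75):
    patches = patchify(imgs)                        # (B, N, D) flattened patch pixels
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))              # 25% of patches stay visible

    # Per-image random permutation of patch indices; keep the first n_keep.
    ids_shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep[..., None].expand(-1, -1, D))

    latent = encoder(visible)                       # deep ViT sees visible tokens only

    # Append mask tokens, then un-shuffle so every token sits at its true position.
    masks = mask_token.expand(B, N - n_keep, D)
    full = torch.cat([latent, masks], dim=1)
    full = torch.gather(full, 1, ids_restore[..., None].expand(-1, -1, D))

    pred = decoder(full)                            # lightweight decoder -> pixels

    # Binary mask: 1 where the patch was hidden from the encoder.
    mask = torch.ones(B, N, device=patches.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)

    per_patch = ((pred - patches) ** 2).mean(dim=-1)  # per-patch MSE
    return (per_patch * mask).sum() / mask.sum()      # loss on masked patches only
```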

The masking-ratio ablation. Performance peaks at 75% masking, much higher than BERT's 15%. Memorise this number — exam fodder. Why high masking helps in vision but not language: language tokens carry high information density per token (entire words/concepts); 15% already creates a hard task. Image patches are highly redundant — a patch shares textures with its neighbours. You need to remove most of them to force genuine learning.

Why MAE matters. Simple, scalable, effective. After pretraining, the encoder ViT transfers to detection, segmentation, classification — competitive with or better than supervised pretraining at scale. The most successful *generative* SSL approach for vision; one of the standard pretrained encoders for transfer learning.

JEPA — Joint-Embedding Predictive Architecture (Yann LeCun's program). Both DINO and MAE have hidden costs. DINO learns through pixel-augmentation invariance — but augmentations are *handcrafted choices*. MAE learns by pixel reconstruction — but most pixel-level detail (texture, lighting, noise) is *irrelevant for high-level semantics*. LeCun's argument: why predict pixels at all? The hard, semantically meaningful task is to predict the *representation* of a missing region, not the literal RGB values.

I-JEPA architecture (Assran et al., CVPR 2023). Three components. *(1)* **Context encoder $f_\theta$** — sees a context block (a portion of the image) and produces representations. *(2)* **Target encoder $f_{\bar\theta}$** — sees target blocks (other portions of the image) and produces target representations; trained via EMA of $f_\theta$ (just like DINO's teacher), no gradients. *(3)* **Predictor $g_\phi$** — given the context representations and the spatial positions of the target blocks, predicts the target representations. Loss: **L2 distance in representation space**, with stop-gradient on the target encoder: $\mathcal{L} = \sum_j \lVert \hat{s}_j - \mathrm{sg}(s_j) \rVert_2^2$.
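
A minimal sketch of one I-JEPA step. `context_enc`, `target_enc` (its EMA copy), and `predictor` are assumed modules, and `ctx_idx` / `tgt_idx` / `tgt_pos` stand in for the sampled context block and target blocks; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def ijepa_step(context_enc, target_enc, predictor, patch_tokens,
               ctx_idx, tgt_idx, tgt_pos):
    # Target representations s_j from the EMA encoder; gradients are blocked,
    # which is the stop-gradient in the loss.
    with torch.no_grad():
        s = target_enc(patch_tokens)[:, tgt_idx]

    # Context representations z from the trained encoder (context block only).
    z = context_enc(patch_tokens[:, ctx_idx])

    # Predictor maps (context features, target positions) to predicted targets.
    s_hat = predictor(z, tgt_pos)

    # L2 in representation space; the target encoder is afterwards updated by
    # EMA of the context encoder, exactly like DINO's teacher.
    return F.mse_loss(s_hat, s)
```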

MAE vs JEPA — the comparison table.
  • What is predicted: pixels of masked patches (MAE) vs representations of masked patches (JEPA).
  • Loss: pixel-space MSE vs representation-space L2.
  • Target source: the original image vs the target encoder (separate network, EMA).
  • Wastes capacity on: textures, noise, exact colours vs nothing — only semantic content.
  • Computationally expensive part: the pixel decoder vs the target encoder.
JEPA's claim: by predicting in abstract representation space, the network doesn't waste capacity on irrelevant pixel-level details.

JEPA variants. I-JEPA (Image, 2023) — the original. V-JEPA (Video, 2024) — extends to video; predicts representations of masked spatio-temporal regions. VL-JEPA (Vision-Language, 2025) — adds a language modality. Research frontier; you should know they exist and what each means.

The big-picture summary slide — Vision SSL today. *Old-school pretext tasks* (jigsaw, colorisation, inpainting); *Contrastive* (SimCLR, MoCo, CLIP); *Self-distillation* (DINO, BYOL, MoCo — no negatives); *Image-only vs Image-language pretraining*; *Generative* (MAE); *JEPAs* (predict representations, not pixels). All modern vision encoders are SSL-pretrained.

Definitions

  • Self-distillation — Student trained to match teacher's output distribution; teacher and student share architecture; teacher updated via EMA of student. Cross-entropy loss. No negatives.
  • EMA teacher update — $\theta_t \leftarrow \lambda\,\theta_t + (1-\lambda)\,\theta_s$ with $\lambda$ on a cosine schedule from 0.996 to 1. Teacher is a slowly-updated, smoothed version of the student.
  • Centering (DINO) — Subtract a running-mean bias $c$ from teacher logits before softmax. Prevents collapse to a single-dimension-dominated output. "Bias term added to logits."
  • Sharpening (DINO) — Apply a very low temperature ($\tau_t \approx 0.04$) to teacher logits before softmax. Produces a peaky, confident target. Prevents collapse to uniform output.
  • Multi-crop — DINO's augmentation strategy: 2 global views (>50% area, 224 px) fed to both teacher and student + 6–10 local views (<50%, 96 px) fed to student only. Forces local-to-global consistency.
  • $[\text{CLS}]$ attention as emergent segmentation — In a DINO-pretrained ViT, $[\text{CLS}]$'s attention over patches concentrates on the salient object — producing usable segmentation-like maps with zero segmentation supervision.
  • Registers (DINOv2) — Extra learnable tokens prepended to the sequence with no positional embedding; absorb global scratchpad activity so real patch tokens have cleaner attention maps.
  • Masked Autoencoder (MAE) — Patchify image → randomly mask 75% → deep encoder on visible 25% only → light decoder reconstructs masked-patch pixels via MSE. Asymmetric architecture; encoder kept, decoder discarded.
  • Mask ratio (MAE) — Fraction of patches masked. 75% in MAE vs 15% in BERT — images have more spatial redundancy, so higher masking is required to prevent the texture-copying shortcut.
  • JEPA (Joint-Embedding Predictive Architecture) — LeCun's program: predict target *representations* (from a separate EMA-updated target encoder) rather than pixels. Context encoder + target encoder + predictor; L2 in feature space with stop-gradient on target.
  • I-JEPA / V-JEPA / VL-JEPA — Image JEPA (Assran et al., CVPR 2023) / Video JEPA (2024) / Vision-Language JEPA (2025). Same recipe, different modalities.
  • Stop-gradient — Operator that blocks gradients during backprop. DINO/JEPA/BYOL all stop-grad the target branch — the target is updated via EMA, not gradients.

Formulas
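
The key formulas of the unit, collected from the sections above:

  • DINO loss: $\mathcal{L}_{\text{DINO}} = -\sum_{k=1}^{K} P_t(x)_k \,\log P_s(x')_k$, with $P_s = \mathrm{softmax}(g_{\theta_s}/\tau_s)$ and $P_t = \mathrm{softmax}((g_{\theta_t} - c)/\tau_t)$.
  • Teacher EMA: $\theta_t \leftarrow \lambda\,\theta_t + (1-\lambda)\,\theta_s$, $\lambda$: 0.996 → 1 on a cosine schedule.
  • Centering: $c \leftarrow m\,c + (1-m)\,\frac{1}{B}\sum_{i=1}^{B} g_{\theta_t}(x_i)$.
  • MAE loss: $\mathcal{L}_{\text{MAE}} = \frac{1}{|M|}\sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2$, over masked patches $M$ only.
  • I-JEPA loss: $\mathcal{L}_{\text{JEPA}} = \sum_j \lVert \hat{s}_j - \mathrm{sg}(s_j) \rVert_2^2$, in representation space.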

Derivations

Why MAE masks 75% and BERT only 15%. Language tokens carry distinct semantic content per token; masking 15% leaves enough redundancy for the task to be non-trivial but not impossible. Image patches are far more redundant *spatially* — neighbouring patches share textures and colours. At 15% mask, the model essentially copies nearby patches; at 75%, interpolation fails and only semantic understanding can complete the reconstruction. The ablation curve peaks at 75% — too little masking shortcuts, too much loses signal.

Why centering alone causes uniform collapse. Centering subtracts the running mean from teacher logits, equivalent to enforcing $\mathbb{E}_x[g_t(x) - c] \approx 0$ (zero-mean logits on average). The softmax of zero-mean logits at *any* finite temperature is biased toward uniform (zero logits → uniform output). Without an opposing force, the teacher's distribution flattens, the cross-entropy carries no signal, and features become meaningless.

Why sharpening alone causes single-dim domination. A low temperature amplifies logit differences before the exponential; if all images have approximately the same logit pattern (because the teacher hasn't been trained to discriminate), the softmax peaks on the dominant dimension for every input. The teacher outputs the same near-one-hot regardless of input — collapse onto one dimension. The fix: add centering to spread the distribution.

Why MAE's encoder cost drops by ~4× under 75% masking. A standard ViT processes $N$ patches at attention cost $O(N^2)$; the MAE encoder sees only $N/4$ patches → $1/16$ of the attention compute, plus $1/4$ of the MLP and projection compute. The wall-clock encoder speedup is ~4× rather than 16× because the linear-in-$N$ terms (token projections and MLP) dominate at ViT-B/16 sequence lengths. The lightweight decoder operates on the full sequence but at much smaller depth (8 layers, narrower hidden dim), so total wall-clock still drops.
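
The same arithmetic as a two-line check (196 patches and the 75% mask ratio come from the text above):

```python
# Attention cost scales as N^2, MLP/projection cost as N (sketch arithmetic).
N_full, N_visible = 196, 49              # ViT-B/16 at 224^2, 75% masked
print((N_full / N_visible) ** 2)         # 16.0 -> attention FLOPs ratio
print(N_full / N_visible)                # 4.0  -> MLP / projection FLOPs ratio
```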

Why JEPA's L2-in-representation-space doesn't trivially collapse. Like BYOL, JEPA uses stop-gradient on the target encoder and EMA updates → asymmetric system. The predictor must *learn* the explicit map context → target, not just copy. Empirically the dynamics avoid collapse for the same reasons BYOL does. KoLeo regulariser (uniformity in feature space) and explicit anti-collapse terms further stabilise some implementations.

Examples

  • DINO ViT-B/16 training run. 2 × 8-GPU servers, 3 days, ImageNet-1k. Output: 78.3% k-NN top-1 (no probe), 76.1% linear probe. Comparable to a fully-supervised ResNet-50 in accuracy, with *no labels*.
  • Multi-crop counts. 2 global views (224²) + 8 local views (96²) = 10 student forwards, 2 teacher forwards per image. The teacher's expensive forward runs only on the big crops.
  • Centering update. Batch of $B = 256$ images, teacher logits $g_t(x_i) \in \mathbb{R}^K$ each. $c$ is a $K$-dim vector updated as $c \leftarrow m\,c + (1-m)\,\frac{1}{B}\sum_i g_t(x_i)$. After training, $c$ stabilises around the mean teacher logit, removing the bias.
  • Sharpening illustration. Take teacher logits with a modest gap between the top two dimensions. At $\tau = 1$ the softmax is soft; at $\tau_t = 0.04$, dividing by 0.04 amplifies the gaps 25× before the exponential, so the output is effectively one-hot. *Same logits, very different distributions.* (See the numeric sketch after this list.)
  • MAE encoder cost example. ViT-B/16 on a $224^2$ image → 196 patches. The MAE encoder sees 49 patches. Attention compute scales as $N^2$ — about 16× fewer attention FLOPs per layer than the full ViT, ~4× wall-clock speedup once the linear-in-$N$ MLP and projection costs are included.
  • MAE reconstruction quality. Masked patches reconstructed by MAE are *blurry but globally coherent* — the model captures structure (object outline, lighting) but not fine texture. *That's the point* — the encoder learns semantic features, not pixel-perfect texture.
  • Registers in DINOv2. 4 extra learnable tokens, no positional embedding, prepended to the patch sequence. Pre-register: heatmaps show puzzling "high attention on sky" — patch tokens being misused as scratchpad. Post-register: register tokens absorb these patterns; patch tokens have clean, object-centric attention maps.
  • I-JEPA training. Pre-compute target representations on a batch via $f_{\bar\theta}$; mask the original image to leave context; encode context via $f_\theta$; predictor takes (context features, target positions) → predicted target features; L2 loss between predicted and stop-gradded target features.
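
A numeric sketch of the sharpening bullet above, with hypothetical logits chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])    # hypothetical teacher logits
print(F.softmax(logits / 1.0,  dim=0))    # ~[0.63, 0.23, 0.14]  soft
print(F.softmax(logits / 0.04, dim=0))    # ~[1.00, 0.00, 0.00]  near one-hot
```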

Diagrams

  • DINO architecture. Two side-by-side ViTs — student (backprop) and teacher (EMA). 2 global crops fed to both, 8 local crops fed to student only. Teacher logits → subtract $c$ → divide by low $\tau_t$ → softmax → soft target. Student logits → divide by $\tau_s$ → softmax. Cross-entropy loss, gradient through student only.
  • The two collapse modes. Side-by-side bar charts of the teacher's $K$-dim output: *single-dim collapse* (one bar at 1.0, all others 0) vs *uniform collapse* (all bars at $1/K$). Annotate which trick prevents which.
  • MAE pipeline. Image → patchify (196 tokens for ViT-B/16) → randomly drop 75% (49 visible) → deep ViT encoder on 49 tokens → re-insert 147 mask tokens at positions → lightweight decoder on all 196 → reconstruct pixels of the 147 masked patches only.
  • JEPA architecture. Image → split into context block + target blocks. *Context encoder $f_\theta$* → $z$. *Target encoder $f_{\bar\theta}$ (EMA, stop-grad)* → $s_j$. *Predictor $g_\phi$* → $\hat{s}_j$. L2 in representation space between $\hat{s}_j$ and $s_j$.
  • MAE vs JEPA target spaces. Visual: MAE's loss is on the *image grid* (pixel reconstruction); JEPA's loss is on the *feature grid* (representation prediction). Highlight that MAE wastes capacity on texture; JEPA doesn't.
  • $[\text{CLS}]$ attention emergent segmentation. Trained DINO ViT-B/16; $[\text{CLS}]$'s attention over the 196 patches visualised as a heatmap → outlines the salient object. Compare to a supervised-pretrained ViT-B/16 where attention is diffuse.

Edge cases

  • Without centering OR without sharpening, DINO collapses. Ablating either trick alone produces the corresponding collapse mode. Both are essential — exam classic.
  • **Late-training $\lambda \to 1$** freezes the teacher. If $\lambda$ stays at 0.996 too long, the teacher remains too noisy; if $\lambda$ ramps to 1 too early, the teacher stops improving and limits student gains.
  • MAE pixel loss can be dominated by low-level texture. Some variants (HOG targets, perceptual loss) trade reconstruction fidelity for semantic features.
  • JEPA latent collapse. Predictor could output a constant target embedding; mitigated by stop-grad + EMA + (sometimes) explicit anti-collapse regularisers like KoLeo.
  • Multi-crop ratio matters. Too few local crops (< 4) underdrives the local-to-global signal; too many (> 12) over-stresses the small-context branch and slows training.
  • Mask token leakage in MAE. If positional embeddings aren't strong enough, the decoder cannot distinguish between mask positions → reconstruction degrades. Always re-add positional embeddings before the decoder.

Common mistakes

  • Stating DINO uses negatives — it does not; it's self-distillation, no contrastive negatives.
  • Confusing DINO's EMA teacher with MoCo's momentum encoder — different mechanisms (teacher of self-distillation vs key encoder of contrastive loss), similar mathematical form.
  • Saying MAE uses BERT-like 15% mask — it's 75%.
  • Treating JEPA as "MAE in latent space" — JEPA explicitly predicts EMA-target representations, not reconstructed pixels. Different architecture and different anti-collapse mechanism.
  • Claiming MAE's decoder is kept after pretraining — no, only the encoder is retained for downstream tasks.
  • Stating DINO's anti-collapse is one trick — it's two: centering AND sharpening. Either alone causes the opposite collapse mode.
  • Writing $\tau_t > \tau_s$ — backwards. $\tau_t < \tau_s$ (the teacher is *sharper*).

Shortcuts

  • DINO anti-collapse pair: centering + sharpening ($\tau_t \approx 0.04 \ll \tau_s \approx 0.1$).
  • Teacher EMA schedule: $\lambda$ from 0.996 to 1 on a cosine schedule. Borrowed from MoCo/BYOL.
  • Multi-crop: 2 globals (>50%, 224²) to teacher + student; 6–10 locals (<50%, 96²) to student only.
  • DINO output dim: $K = 65536$ — large softmax space discourages collapse on one dimension.
  • MAE asymmetry: encoder sees 25% of patches (deep ViT); decoder sees 100% (mask tokens + visible features, lightweight). Encoder kept, decoder discarded.
  • MAE mask ratio: 75%. Quote the slide: *"make pretraining difficult, otherwise model will shortcut."*
  • JEPA = predict representations, not pixels. Context encoder (trained) + target encoder (EMA, frozen) + predictor.
  • JEPA variants: I-JEPA (image, 2023), V-JEPA (video, 2024), VL-JEPA (vision-language, 2025).

Proofs / Algorithms

Centering removes mean bias from teacher logits. Let $c = \mathbb{E}_x[g_t(x)]$ (the running mean). Subtracting $c$ before softmax gives logits satisfying $\mathbb{E}_x[g_t(x) - c] = 0$ at convergence. The softmax of zero-mean logits *averages* toward uniform, but for any particular $x$ the output depends on $g_t(x)$'s deviation from zero — preserving discrimination while preventing the same single dimension from systematically being the argmax across all inputs.

MAE's loss-on-masked-only argument. Including visible patches in the reconstruction loss is redundant — they were given as input, so the decoder can trivially copy them. Restricting the loss to masked patches forces the model to spend capacity on the actually informative part (inferring hidden content from context) rather than the trivial part (memorising what it was told).

JEPA's stop-gradient prevents trivial collapse. Without stop-grad, both encoders would be jointly optimised by gradients, and the trivial fixed point — both encoders mapping every input to the same constant — minimises the L2 perfectly. Stop-grad makes the target a *delayed function of $f_\theta$* (via EMA); the predictor must learn the explicit map context → target, which is non-trivial whenever the contexts and targets are sampled from different image regions.