
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

DINO, MAE, JEPA — Modern SSL Beyond Contrastive

Unit 9 — SSL: DINO, MAE, JEPA

The Student Becomes the Teacher

The previous unit taught you contrastive self-supervised learning — SimCLR, MoCo, CLIP. The recipe was always the same: bring positive pairs close, push negative pairs apart. It works, but it has problems. Contrastive learning needs negative samples (lots of them — large batches or memory banks), it's sensitive to augmentation choice, and the resulting features are good but not great.

Then, starting in 2021, a series of papers asked the same question: do we even need negative samples? One by one they showed you don't. You just need smarter ways of looking at the same image.

This unit covers three of those approaches:

  • DINO — self-distillation: the network teaches itself, no negatives at all.
  • MAE — masked autoencoder: vision's answer to BERT.
  • JEPA — predict representations, not pixels. The 2023+ frontier.

None of the three is contrastive: DINO is self-distilling, MAE is generative, and JEPA is predictive. Together they form the second great family of SSL alongside contrastive methods, and they're what most modern vision encoders are pretrained with.

Part 1 — DINO: self-distillation with no labels

DINO (Caron et al., ICCV 2021, *"Emerging Properties in Self-Supervised Vision Transformers"*) stands for "self-DIstillation with NO labels."

Two copies of the same network

Take a ViT. Make two copies of it:

  • A student network — trained with backpropagation.
  • A teacher network — *identical architecture, different weights, no backpropagation*.

Both networks output a probability distribution over $K$ dimensions (a learned codebook, typically $K = 65{,}536$). They use a softmax with temperature on top of the ViT's output:

$$P_s(x)^{(i)} = \frac{\exp\!\big(g_{\theta_s}(x)^{(i)} / \tau_s\big)}{\sum_{k=1}^{K} \exp\!\big(g_{\theta_s}(x)^{(k)} / \tau_s\big)}$$

with student temperature $\tau_s$ and teacher temperature $\tau_t < \tau_s$ (much smaller — the teacher is *sharper*).

The loss — teacher's distribution as a soft target

The student learns to match the teacher's output distribution for the same input image (under different augmentations $x$ and $x'$):

$$\mathcal{L} = -\sum_{k=1}^{K} P_t(x)^{(k)} \log P_s(x')^{(k)}$$

This is cross-entropy between teacher and student distributions. The teacher acts as a "soft label" supplier, and the student tries to reproduce the teacher's belief. Gradients flow only through the student. The teacher is not trained by backprop.
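A minimal PyTorch sketch of this loss (the temperatures 0.1 and 0.04 are the paper's defaults; the function name and shapes are illustrative, and centering, Trick 1 below, is omitted for now):

```python
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    # Teacher distribution: sharpened by a low temperature, no gradients.
    # (The real DINO also centers teacher_logits first; see Trick 1.)
    t = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    # Student log-probabilities at a higher temperature.
    log_s = F.log_softmax(student_logits / tau_s, dim=-1)
    # Cross-entropy H(P_t, P_s), averaged over the batch.
    return -(t * log_s).sum(dim=-1).mean()
```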

How is the teacher updated?

EMA — exponential moving average — borrowed from MoCo/BYOL:

$$\theta_t \leftarrow \lambda\, \theta_t + (1 - \lambda)\, \theta_s$$

with $\lambda$ on a cosine schedule from $0.996$ to $1$ over training.

  • Early training: $\lambda \approx 0.996$ — teacher updates relatively quickly.
  • Late training: $\lambda \to 1$ — teacher is effectively frozen.

The student is always trying to imitate a slightly *delayed, smoothed* version of itself. Small training noise is averaged out.
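In code, the EMA update is a few lines (a sketch; `teacher` and `student` are assumed to be architecturally identical modules):

```python
import math
import torch

@torch.no_grad()
def ema_update(teacher, student, lam):
    # theta_t <- lam * theta_t + (1 - lam) * theta_s  (no backprop here).
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(lam).add_(p_s, alpha=1.0 - lam)

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    # Cosine schedule: lambda rises from 0.996 to 1 over training.
    cos = math.cos(math.pi * step / total_steps)
    return final - (final - base) * (cos + 1) / 2
```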

The multi-crop strategy

Here's where DINO gets clever. From a single input image, generate multiple crops at two scales:

  • 2 "global" views — large crops (>50% of the image area), pixels. Fed to both teacher and student.
  • 6–10 "local" views — small crops (<50%), pixels. Fed to the student only.

For each (global view → teacher, all-other-views → student) pair, compute cross-entropy. Sum across all pairs.

The teacher only sees big crops. The student sees both big and small. So the student is forced to predict a *global picture's distribution from a small local crop* — "this little patch of grass corresponds to the same scene as that wide image of the meadow."

This is what gives DINO its remarkable property: features that are simultaneously consistent across scales. From a patch you can recover the global semantics.
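As a sketch, the pairing rule on top of `dino_loss` from above (each view is assumed to be a batch of already-augmented crops):

```python
def multicrop_loss(student, teacher, global_views, local_views):
    # Teacher sees only the 2 global crops; student sees every crop.
    teacher_out = [teacher(v) for v in global_views]
    student_out = [student(v) for v in global_views + local_views]
    total, n_pairs = 0.0, 0
    for ti, t_logits in enumerate(teacher_out):
        for si, s_logits in enumerate(student_out):
            if si == ti:  # never pair a view with itself
                continue
            total += dino_loss(s_logits, t_logits)
            n_pairs += 1
    return total / n_pairs
```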

The collapse problem — the central technical danger

A trivial solution exists: the student and teacher both output the same constant vector for every image. Then the loss is trivially minimised. This is mode collapse — the network throws away all useful information.

Two failure modes:

1. Single-dimension domination — the softmax always peaks on the same component, regardless of input.
2. Uniform output — the same flat distribution for every input.

Both kill the signal. DINO must prevent both. The lecture states the two fixes explicitly — this is the exam question waiting to happen.

Trick 1 — Centering (prevents single-dim domination)

The teacher subtracts a running mean $c$ from its logits before the softmax:

$$g_{\theta_t}(x) \leftarrow g_{\theta_t}(x) - c, \qquad c \leftarrow m\, c + (1 - m)\, \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)$$

$c$ is an additive bias term, updated by EMA over the batch. This subtracts off the running mean of the teacher's outputs — equivalently, it *discourages any one logit dimension from systematically dominating*.

Slide line: *"centering done through bias term added to logits."*

Trick 2 — Sharpening (prevents uniform collapse)

The teacher uses a very low temperature $\tau_t$ (around $0.04$–$0.07$). A small temperature *sharpens* the distribution — making the teacher's output close to a one-hot vector. If the teacher's outputs were uniform, the soft labels would carry no signal. By forcing the teacher to be confident, the labels are always informative.

Slide line: *"sharpening done using low value of temperature."*

Two opposing forces

  • Centering alone → uniform collapse (everything flat).
  • Sharpening alone → single-dim domination (everything peaks on the same dim).
  • Together → neither fails, training is stable.

Memorise the two fixes as opposing forces. This is the central insight of DINO.

The headline results

  • 78.3% top-1 ImageNet accuracy using just k-NN on DINO features (no probe).
  • 76.1% ImageNet linear probe after training on 2 × 8-GPU servers for 3 days.
  • Works on both CNNs and ViTs.
  • The features have emergent properties: DINO ViTs spontaneously learn to attend to object-level structures in images, producing usable segmentation-like attention maps with no segmentation supervision.

That last property — *segmentation emerges from self-supervision* — is what made DINO famous beyond benchmarks.

DINOv2 and registers

  • DINOv2 (Oquab et al., 2023) — scaled DINO to 142M curated images. Became the standard pretrained vision encoder for dense prediction.
  • DINOv2 + Registers (Darcet et al., 2023) — adds extra learnable tokens with no positional meaning; the model uses them as a scratchpad for global information, freeing real patch tokens. Cleaner attention maps, better dense prediction.

Part 2 — Masked Autoencoders (MAE)

If DINO is vision's BYOL, then MAE (He, Chen et al., CVPR 2022) is vision's BERT.

The core idea

BERT pretrains language models by masking ~15% of tokens and asking the model to predict them. Can we do the same for vision?

Naïvely: mask 15% of image patches and reconstruct. Doesn't work well. The model shortcuts — it copies texture from neighbouring patches without learning any real semantics. Vision is more redundant than language: a missing patch is almost always predictable from its neighbours' colours and textures.

MAE's resolution: mask much more aggressively. The lecture states this explicitly:

*BERT-like LMs mask 15% of the tokens. MAEs choose to remove 75% of the image tokens.*
*Key takeaway: make pretraining difficult, otherwise model will shortcut and not learn meaningful stuff.*

Forcing the model to reconstruct 75% from only 25% visible patches makes the task genuinely hard — there's no local-texture shortcut. The model is forced to learn global structure.

The asymmetric encoder-decoder

MAE's architecture has a brilliant efficiency trick:

1. Patch the image (ViT-style) into a sequence of non-overlapping patches ($16 \times 16$ pixels in the standard ViT).
2. Randomly mask 75% of patches → keep only 25% visible.
3. Encoder (deep ViT): processes the visible 25% only.
4. Re-insert mask tokens at masked positions, each with its positional embedding.
5. Concatenate visible-encoded tokens + mask tokens.
6. Decoder (small ViT): processes the full sequence.
7. Reconstruct pixels of the masked patches.

The encoder only ever sees 25% of the tokens:

  • Encoder cost drops by roughly 4× — you're paying compute for only a quarter of the sequence (and the quadratic attention term shrinks even more).
  • The encoder learns to encode the visible signal into something useful (no need to think about masked positions during the encoder pass).
  • The decoder is lightweight — it does the reconstruction, then is thrown away after pretraining.

After pretraining, only the encoder is kept.
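A sketch of the asymmetric pass; the names (`encoder`, `decoder`, `mask_token`) are stand-ins, and real MAE details such as the decoder's narrower width and re-added positional embeddings are omitted:

```python
import torch

def mae_forward(patches, encoder, decoder, mask_token, mask_ratio=0.75):
    # patches: [B, N, D] patch embeddings, positional encodings included.
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    # Random permutation per image; keep the first 25% of patches.
    perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
    keep = perm[:, :n_keep]
    idx = keep.unsqueeze(-1).expand(-1, -1, D)
    visible = torch.gather(patches, 1, idx)
    latent = encoder(visible)                  # deep ViT sees 25% only
    # Re-insert mask tokens at the masked positions, in original order.
    full = mask_token.expand(B, N, D).clone()
    full.scatter_(1, idx, latent)
    return decoder(full)                       # lightweight ViT -> pixels
```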

The loss

Pixel-space MSE on masked patches only:

$$\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \big\| \hat{x}_i - x_i \big\|_2^2$$

where $M$ is the set of masked patch indices, $x_i$ the ground-truth pixels of patch $i$, and $\hat{x}_i$ the reconstruction.

No loss on visible patches (they were given as input — no need to reconstruct).
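In code, a minimal sketch (`pred` and `target` hold per-patch pixel values; `mask` marks which patches were hidden):

```python
def mae_loss(pred, target, mask):
    # pred, target: [B, N, patch_pixels]; mask: [B, N] with 1 = masked.
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # MSE per patch
    return (per_patch * mask).sum() / mask.sum()     # masked patches only
```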

The masking-ratio ablation

Performance peaks at 75% masking, much higher than BERT's 15%. Memorise this number — exam fodder.

Why higher masking helps in vision but not language: language tokens carry high information density per token (entire words/concepts), so 15% masking already creates a hard task. Image patches are highly redundant — a patch shares textures and colours with its neighbours. You need to remove most of them to force genuine learning.

Why MAE matters

Simple, scalable, effective. After pretraining, the encoder ViT transfers to detection, segmentation, classification — competitive with or better than supervised pretraining at scale. The most successful "generative" SSL approach for vision.

Part 3 — JEPA

Both DINO and MAE work. Both have a hidden cost.

  • DINO learns through pixel-augmentation invariance — but augmentations are handcrafted choices.
  • MAE learns by pixel reconstruction — but most pixel-level detail (texture, lighting, noise) is irrelevant for high-level semantics.

LeCun and collaborators argue: why predict pixels at all? The hard, semantically meaningful task is to predict the *representation* of a missing region, not the literal RGB values.

JEPA — Joint-Embedding Predictive Architecture — is built on this idea.

I-JEPA architecture (Assran et al., CVPR 2023)

Three components:

Context encoder — sees a context block (a portion of the image), produces representations $s_x$.
Target encoder — sees target blocks (other portions of the image), produces target representations $s_y$. **Updated via EMA of the context encoder**, no gradient.
Predictor — given the context representations and the spatial positions of the target blocks, predicts $\hat{s}_y$.

Loss (averaged over the $M$ target blocks):

$$\mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \big\| \hat{s}_{y_i} - s_{y_i} \big\|_2^2$$

L2 distance in *representation space*, with stop-gradient on the target encoder.
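A sketch of one I-JEPA step, with all module names as stand-ins; the target encoder is then updated with the same EMA pattern as DINO's teacher:

```python
import torch
import torch.nn.functional as F

def ijepa_step(context_enc, target_enc, predictor,
               context_block, target_blocks, target_positions):
    s_x = context_enc(context_block)            # context representations
    with torch.no_grad():                       # stop-gradient on targets
        s_y = target_enc(target_blocks)
    s_y_hat = predictor(s_x, target_positions)  # predict in feature space
    return F.mse_loss(s_y_hat, s_y)             # L2 in representation space
```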

MAE vs JEPA

| | MAE | JEPA |
| --- | --- | --- |
| What is predicted | Pixels of masked patches | Representations of masked patches |
| Loss | Pixel-space MSE | Representation-space L2 |
| Target source | The original image | The target encoder (separate network, EMA) |
| Wastes capacity on | Textures, noise, exact colours | Nothing — only semantic content |
| Computationally expensive part | The pixel decoder | The target encoder |

JEPA's claim: by predicting in abstract representation space, the network doesn't waste capacity on irrelevant pixel-level details. The representation encoder learns features that are useful *as features*, not as paint-by-numbers reconstruction.

Variants

  • I-JEPA (Image, 2023) — the original.
  • V-JEPA (Video, 2024) — extends to video; predicts representations of masked spatio-temporal regions.
  • VL-JEPA (Vision-Language, 2025) — adds a language modality.

These are research frontiers. You should know they exist and what each means.

The big-picture summary

| Family | Examples | Idea |
| --- | --- | --- |
| Old-school pretext | Jigsaw, colorisation, inpainting | Pretrain on a synthetic task with no labels |
| Contrastive | SimCLR, MoCo, CLIP | Bring positive pairs close, push negatives apart |
| Self-distillation | DINO, BYOL | No negatives; student matches EMA teacher |
| Generative | MAE | Reconstruct masked content in pixel space |
| Predictive | JEPA | Predict representations, not pixels |

Virtually every modern vision encoder is SSL-pretrained.

What you carry into the exam

  • DINO = self-distillation with no labels: the student backprops, the teacher is EMA. Know the cross-entropy loss and the teacher-update rule with its cosine schedule.
  • Multi-crop: 2 global views to both networks, 6–10 local views to the student only.
  • The two collapse modes and the two opposing fixes — *centering prevents single-dim domination, sharpening prevents uniform collapse; you need both*.
  • DINO results: 78.3% k-NN, 76.1% linear probe; emergent object-localising attention. DINOv2 + registers as scratchpad tokens.
  • MAE = vision's BERT: 75% mask ratio (vs BERT's 15%); asymmetric architecture (encoder sees 25%, decoder reconstructs pixels, only the encoder is kept). The lecture's punchline: *"make pretraining difficult, otherwise model will shortcut."*
  • JEPA = predict representations, not pixels: context encoder + EMA target encoder + predictor; L2 in feature space with stop-grad. Variants: I-JEPA, V-JEPA, VL-JEPA.

That's DINO, MAE, and JEPA — three of the most important ideas in recent computer vision, and the foundation for the multimodal models in the unit ahead.