
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

ViT Pipeline, Scaling, and Swin


Intuition

For 9 years (AlexNet 2012 → 2021), every state-of-the-art vision model was a CNN. The locality + translation-equivariance inductive bias of convolutions was treated as a truth about how vision must work. Then Google's *"An Image is Worth 16×16 Words"* (Dosovitskiy et al., ICLR 2021) showed that a plain Transformer — the same architecture used to translate English to French — matched or beat the best CNNs on ImageNet, *provided you fed it enough data*. The recipe is literally in the title: cut the image into patches, treat each as a token, feed a standard Transformer encoder.

Explanation

Why "pixels as a sequence" doesn't work. A 224×224 RGB image has 224·224 = 50,176 pixels. Self-attention is quadratic in sequence length — that's 50,176² ≈ 2.5 billion attention scores per layer, computationally impossible. So you must compress the sequence. The simplest compression: chunk the image into non-overlapping patches and treat each as one token.

Patch tokenisation. Given an H×W image and patch size P: N = (H/P)·(W/P) patches. For H = W = 224, P = 16: N = 14×14 = 196. Each patch is flattened to a (P²·3)-dim vector and linearly projected to D dimensions via a learnable (P²·3)×D matrix E. For ViT-B with P = 16, D = 768: P²·3 = 16·16·3 = 768, so E is 768×768 — the patch projection is square (not a coincidence; ViT-B was sized to make these match). Equivalent implementation: a single Conv2d with kernel P×P and stride P.
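
A minimal sketch of the patch-embedding equivalence in PyTorch (illustrative, not the lecture's reference code; shapes assume ViT-B/16 at 224×224):

```python
import torch
import torch.nn as nn

# Patch embedding as a strided convolution: kernel = stride = patch size.
patch, embed_dim = 16, 768                   # ViT-B/16
proj = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)

x = torch.randn(1, 3, 224, 224)              # one RGB image
tokens = proj(x)                             # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): N = 196 patch tokens of dim D = 768
print(tokens.shape)                          # torch.Size([1, 196, 768])
```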

**The [CLS] token.** Following BERT, ViT *prepends* a special learnable [CLS] token at position 0: the input sequence is [x_CLS, x_1, …, x_196], total length N + 1 = 197. After the L Transformer encoder layers, the final hidden state of [CLS] is passed through a single linear layer for the class logits. **Why [CLS] and not average-pool?** Average-pooling all patch tokens works *similarly* (within ~1% on ImageNet) — ViT keeps [CLS] because BERT did and the convention stuck. [CLS] also gives interpretable attention maps showing which regions drove the prediction.

Position embeddings — 1D learned wins. The original paper compared three choices: *none* (drop position entirely — accuracy drops ~3%, surprising); *1D learned* (each patch gets a 1D index by raster scan, learn a D-dim embedding per index — this is what ViT uses); *2D learned* (separate row and column embeddings — marginal gain). Memorise: ViT uses 1D learned PEs despite images being 2D. The Transformer figures out the 2D structure on its own.
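
A hedged sketch of the [CLS]-prepend and 1D learned position embedding, continuing the shapes above (the names `cls_token` and `pos_embed` are illustrative, not the lecture's code):

```python
import torch
import torch.nn as nn

B, N, D = 1, 196, 768
patch_tokens = torch.randn(B, N, D)                  # output of the patch embedding

cls_token = nn.Parameter(torch.zeros(1, 1, D))       # learnable summary token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # 1D learned PE: one vector per sequence position

x = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)  # prepend [CLS]: (B, 197, D)
x = x + pos_embed                                    # add position information
print(x.shape)                                       # torch.Size([1, 197, 768])
```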

The Transformer block — Pre-Norm. Same as the 2017 original with one tweak: ViT uses Pre-Norm. For block ℓ: z' = z_{ℓ-1} + MSA(LN(z_{ℓ-1})); z_ℓ = z' + MLP(LN(z')). The "Simple notation!" slide flags this — you should be able to write the block in 2 lines from memory.

The MLP sublayer. MLP(x) = W_2 · GELU(W_1 x) with W_1 of size 4D×D and W_2 of size D×4D. The 4× expansion ratio is standard (4D = 3072 for ViT-B). The "MLP size" in lecture slides is the *intermediate* dimension 4D.
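
A minimal Pre-Norm block sketch matching the two-line recipe above (uses `nn.MultiheadAttention` as the MSA sublayer; hyperparameters are ViT-B's):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """z' = z + MSA(LN(z));  z_out = z' + MLP(LN(z'))."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),   # 768 -> 3072
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),   # 3072 -> 768
        )

    def forward(self, z):
        h = self.norm1(z)                                   # LN before the sublayer (Pre-Norm)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # MSA sublayer + residual
        z = z + self.mlp(self.norm2(z))                     # MLP sublayer + residual
        return z

block = PreNormBlock()
print(block(torch.randn(1, 197, 768)).shape)                # torch.Size([1, 197, 768])
```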

The ViT family — memorise the specs. ViT-Tiny: D = 192, L = 12, MLP = 768, 3 heads, ~5M params. ViT-Small: D = 384, L = 12, MLP = 1536, 6 heads, ~22M. **ViT-Base: D = 768, L = 12, MLP = 3072, 12 heads, ~86M.** ViT-Large: D = 1024, L = 24, MLP = 4096, 16 heads, ~307M. ViT-Huge: D = 1280, L = 32, MLP = 5120, 16 heads, ~632M. Consistent pattern: MLP = 4D; per-head dimension D/heads = 64 (ViT-Huge is the exception at 80). *ViT-B/16* means ViT-Base with 16×16 patches.

Parameter calculation (exam favourite, ViT-B/16). *One block:* attention W_Q, W_K, W_V, W_O, each D×D = 768² ≈ 0.59M → ≈ 2.36M. MLP up (768×3072 ≈ 2.36M) + down (3072×768 ≈ 2.36M) → ≈ 4.72M. **Per block: ≈ 7.1M.** All 12 blocks: ≈ 85M. Patch embedding (one-time): 768×768 ≈ 0.59M. Total: ≈ 85.5M, matching the slide's ~86M. LayerNorm, the [CLS] token, and position embeddings are individually negligible.
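
The same arithmetic as a quick sanity check (plain Python, rounding as on the slide):

```python
D, L = 768, 12                     # ViT-B/16
attn = 4 * D * D                   # W_Q, W_K, W_V, W_O
mlp = 2 * D * 4 * D                # up-projection + down-projection
block = attn + mlp                 # ~7.1M per block
patch_embed = (16 * 16 * 3) * D    # 768 x 768 one-time projection

total = L * block + patch_embed
print(f"per block {block/1e6:.2f}M, all blocks {L*block/1e6:.1f}M, total {total/1e6:.1f}M")
# per block 7.08M, all blocks 84.9M, total 85.5M
```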

The quiz — the centrepiece, and it will be on your exam. What happens to *parameters* and *sequence length* when …

**Q1 — Resolution ↑ (patch size fixed at 16).** Old: 224×224 → N = 196. New: e.g. 448×448 → N = (448/16)² = 784. **Sequence length: increases quadratically with resolution (4× here). Parameters: ~unchanged** — Transformer weights are all D×D or D×4D regardless of N; the patch embedding is unchanged because P and D are unchanged. *Subtlety:* learned positional embeddings have shape (N+1)×D — they don't fit the new length. Standard fix: bilinear interpolation of the position embeddings to the new grid; doesn't add parameters.

**Q2 — Patch size 16 → 8 (resolution fixed at 224).** Old: N = (224/16)² = 196. New: N = (224/8)² = 784. **Sequence length: increases exactly 4×. Parameters: slightly change.** The patch-embedding matrix E shrinks from 768×768 to (8²·3)×768 = 192×768 (a factor of 4 smaller). The rest of the network is unchanged. *Dominant effect:* 4× longer sequence → ~16× more attention compute, more attention memory.

**Q3 — Layers ↑ (everything else fixed). Sequence length: unchanged. Parameters: increase linearly in L.** Each block adds ≈ 7.1M; going 12 → 24 adds ≈ 85M.

**Q4 — Heads ↑ (D constant).** *The trickiest one.* In MHA, D is partitioned across heads — each head has d_head = D/h. The Q, K, V, O projections total 4D² regardless of how that D is split. Sequence length: unchanged. Parameters: unchanged. This is the surprising answer most students miss. *Heads partition D; they don't multiply parameters.* The exam line: *the number of heads doesn't change the MSA parameter count — it only changes how the D-dim space is partitioned for parallel attention computation.*
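
All four quiz answers can be checked in a few lines; `seq_len` and `vit_params` below are illustrative helpers, not from the lecture:

```python
def seq_len(res, patch):
    return (res // patch) ** 2 + 1           # +1 for [CLS]

def vit_params(D=768, layers=12, patch=16, heads=12):
    # 'heads' is never used: MSA projections total 4*D*D however D is split across heads
    block = 4 * D * D + 2 * D * 4 * D        # attention + MLP
    return layers * block + (patch * patch * 3) * D

base = vit_params()
print(seq_len(448, 16), vit_params() == base)          # Q1: 785 tokens, params unchanged
print(seq_len(224, 8), vit_params(patch=8) - base)     # Q2: 785 tokens, slightly fewer params
print(seq_len(224, 16), vit_params(layers=24) - base)  # Q3: sequence unchanged, +~85M params
print(vit_params(heads=1) == vit_params(heads=24))     # Q4: True, heads don't change params
```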

Data hunger — ViT's defining property. At small data scales (ImageNet-only), ResNets win. At large data scales (JFT-300M, 300 million images), ViT wins by a wide margin. *Reason:* CNNs come with built-in inductive biases — *locality* and *translation equivariance* — which are useful when data is small but *limiting* when data is large. ViT has no spatial inductive bias beyond the patch grid; every patch attends to every other from layer 1. With small data this overfits; with huge data it learns richer relationships than CNNs. "ViTs transfer better than ResNets" comes from this scaling: pretrain on JFT-300M, transfer to ImageNet (+18 other tasks) — ViTs win across the board.

Do ViTs see like CNNs? Raghu et al., NeurIPS 2021. Analysis uses CKA (Centered Kernel Alignment), a similarity metric for representations. *(a)* CNN representations evolve gradually across layers: early = textures, middle = parts, late = objects — a clear hierarchy. *(b)* ViT representations are remarkably uniform across layers: even early layers attend globally; no clear feature pyramid. The attention distance slide makes this concrete — in CNNs, early receptive fields are tiny (a few pixels across); in ViTs, even layer 1 has many heads attending across the full image. *Strength:* ViT captures global context immediately. *Weakness:* no multi-scale feature pyramid, making vanilla ViT worse for dense prediction (segmentation, detection).
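
A minimal linear-CKA sketch for intuition (the paper also uses other CKA variants; this assumes feature matrices of shape examples × features and is not Raghu et al.'s exact code):

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape (n_examples, n_features)."""
    X = X - X.mean(dim=0, keepdim=True)        # centre each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    num = (Y.T @ X).norm(p="fro") ** 2         # ||Y^T X||_F^2
    den = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
    return (num / den).item()

a = torch.randn(512, 768)                      # e.g. token features from one ViT block
print(linear_cka(a, a))                        # 1.0: identical representations
print(linear_cka(a, 2.0 * a + 1.0))            # ~1.0: invariant to isotropic scaling and shifts
```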

**What the [CLS] token learns.** Through training, [CLS]'s attention weights concentrate on semantically meaningful regions — the object, salient parts. This emergent object-localising behaviour is what DINO amplified into actual *segmentation* maps without any segmentation labels.

Position embeddings are self-similar. A striking finding: take ViT's learned 1D positional embeddings — one per patch — and compute the cosine similarity between patch (i, j)'s embedding and those of all other positions. Reshaped to the 14×14 grid, you get a 2D pattern centred at (i, j) with high similarity decreasing radially. The 1D embeddings spontaneously discovered the 2D layout from training, no 2D structure imposed.
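
The self-similarity check is only a few lines; a sketch assuming a position-embedding tensor of shape (1, 197, 768) (a random stand-in is used here so the snippet runs on its own):

```python
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 197, 768)          # stand-in for a trained model's PE table
pos = F.normalize(pos_embed[0, 1:], dim=-1)   # drop [CLS], unit-normalise the 196 patch PEs

i, j = 7, 7                              # pick a patch on the 14x14 grid (illustrative choice)
sims = pos @ pos[i * 14 + j]             # cosine similarity to every position, shape (196,)
heatmap = sims.reshape(14, 14)           # for a trained ViT this is a bump centred at (i, j)
print(heatmap.shape)                     # torch.Size([14, 14])
```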

Swin Transformer — fixing ViT's two weaknesses. ViT has two acknowledged problems: (1) quadratic complexity in sequence length — at high resolution, N explodes and self-attention becomes prohibitive; (2) no hierarchical features — object detection and segmentation work best with multi-scale feature pyramids (FPN), but ViT has one resolution throughout. Swin (Liu et al., ICCV 2021 Best Paper) fixes both. The name is Shifted Window Transformer.

Swin Idea 1 — Windowed self-attention. Divide the patches into non-overlapping windows of M×M patches (typically M = 7) and compute self-attention *only within each window*. Standard ViT attention: O(N²) over the whole image. Window attention: O(N·M²) — **linear in N** because M is constant. For a 56×56 patch grid (N = 3136): full attention costs 3136² ≈ 9.8M scores per layer; windowed at M = 7 costs 3136·49 ≈ 154k. About 64× cheaper.
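
A hedged sketch of the window partition that makes the cost linear (a reshape-based version of the idea; not the official Swin code):

```python
import torch

def window_partition(x, M=7):
    """(B, H, W, C) feature map -> (B * num_windows, M*M, C) window tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)
    return x

feat = torch.randn(1, 56, 56, 96)            # Swin-T stage-1 feature map (56x56 patches)
windows = window_partition(feat)             # (64, 49, 96): self-attention runs per window
print(windows.shape)
print(56 * 56 * 56 * 56, 64 * 49 * 49)       # ~9.8M full-attention scores vs ~154k windowed
```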

The cross-window problem. Pure window attention has an obvious flaw: tokens in different windows never talk. Global structure is lost.

Swin Idea 2 — Shifted windows. In *alternate* Transformer blocks, shift the window grid by ⌊M/2⌋ patches (3 for M = 7). Block ℓ = W-MSA (windows aligned); block ℓ+1 = SW-MSA (windows shifted); block ℓ+2 = W-MSA again; etc. After the shift, patches that were in different windows now share one. Information propagates across original window boundaries; after a few layers the effective receptive field expands much like a CNN's. The shift produces some edge windows of irregular size; Swin handles this with cyclic shift + masked self-attention — patches are cyclically wrapped to maintain regular window sizes, with attention masks blocking wrap-around computations.
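
The cyclic shift is just a roll of the feature map before window partitioning; a sketch (the real implementation also builds the attention masks for the wrapped-around regions):

```python
import torch

M = 7
s = M // 2                                         # shift amount: floor(M/2) = 3
feat = torch.randn(1, 56, 56, 96)                  # (B, H, W, C)

# SW-MSA: cyclic shift so the shifted windows remain regular M x M blocks.
shifted = torch.roll(feat, shifts=(-s, -s), dims=(1, 2))
# ... window_partition(shifted), masked attention within windows, merge windows back ...
restored = torch.roll(shifted, shifts=(s, s), dims=(1, 2))   # undo the shift afterwards
print(torch.equal(restored, feat))                 # True: the roll is exactly invertible
```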

Swin Idea 3 — Hierarchical patch merging. Standard ViT keeps the same number of tokens throughout. Swin progressively downsamples: Stage 1: H/4 × W/4 patches at C channels → Stage 2: H/8 × W/8 at 2C → Stage 3: H/16 × W/16 at 4C → Stage 4: H/32 × W/32 at 8C. Patch merging at the start of each new stage: take a 2×2 block of neighbouring patches, concatenate features (4C), linearly project 4C → 2C. Halves spatial dims and doubles channels — *exactly like a strided conv in a CNN*. Result: a multi-scale hierarchical feature pyramid like ResNet/FPN; drop-in replacement for CNNs in detection (Mask R-CNN with Swin) and segmentation pipelines.
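
A hedged patch-merging sketch (2×2 neighbourhood concatenated, then projected 4C → 2C; the LayerNorm before the projection follows the Swin paper):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """(B, H, W, C) -> (B, H/2, W/2, 2C): halve spatial dims, double channels."""
    def __init__(self, C):
        super().__init__()
        self.norm = nn.LayerNorm(4 * C)
        self.reduction = nn.Linear(4 * C, 2 * C, bias=False)

    def forward(self, x):
        # gather each 2x2 neighbourhood and concatenate along the channel dim
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                           # (B, H/2, W/2, 2C)

stage1 = torch.randn(1, 56, 56, 96)          # Swin-T stage 1: 56x56 tokens, C = 96
print(PatchMerging(96)(stage1).shape)        # torch.Size([1, 28, 28, 192])
```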

Swin vs ViT in one table.

| | ViT | Swin |
| --- | --- | --- |
| Attention scope | global (all patches) | local windows + shifted alternation |
| Complexity | O(N²) | O(N·M²), linear in N |
| Feature hierarchy | single resolution throughout | 4-stage pyramid |
| Best for | classification + contrastive pretraining | dense prediction (detection, segmentation) |
| When to use | large-scale pretraining target | backbone for downstream perception tasks |

Definitions

  • Patch embedding: Linear projection of flattened P×P patches to D-dim token embeddings; equivalently a Conv2d with kernel P×P and stride P.
  • [CLS] token: Learnable summary token prepended to the patch sequence; its final hidden state is passed through a linear head for classification. Alternative: global-average-pool patch tokens (~equivalent accuracy).
  • Inductive bias: Architectural priors. CNNs have locality + translation equivariance baked in. ViTs have neither — they must learn them from data. Small data: bias helps. Large data: bias limits.
  • Pre-Norm: LayerNorm placed *before* each sublayer (MSA or MLP), with the residual added *after*: z' = z + Sublayer(LN(z)). Stable for deep stacks; used by ViT.
  • ViT-B/16: Vision Transformer Base with 16×16 patches: D = 768, L = 12, MLP 3072, 12 heads, ~86M params.
  • Attention distance: Average spatial distance over which a head attends. CNNs: small in early layers, grows with depth. ViTs: spans both small and large distances even in layer 1.
  • CKA (Centered Kernel Alignment): Similarity metric for comparing representations across layers/models. Used by Raghu et al. to show ViT and CNN representations differ qualitatively.
  • JFT-300M: Google's internal 300M-image dataset; ViT's pretraining target where it overtakes ResNet by a wide margin.
  • Swin Transformer: Shifted-Window Transformer (Liu et al., ICCV 2021). Window self-attention (O(N·M²) per layer) + shifted windows in alternate blocks (cross-window communication) + hierarchical patch merging (4-stage pyramid).
  • W-MSA / SW-MSA: Window multi-head self-attention with aligned windows / with shifted windows. Swin alternates these between blocks.
  • Patch merging (Swin): A 2×2 group of patches concatenated and projected 4C → 2C; halves spatial dims and doubles channels — like strided conv in CNNs.
  • PE interpolation: Bilinearly interpolating learned 1D position embeddings to a new sequence length when fine-tuning at higher resolution; no new parameters.

Formulas

  • Sequence length: N = (H/P)·(W/P), plus 1 for the [CLS] token.
  • Per-block parameters: ≈ 12D² (4D² attention + 8D² MLP, intermediate dim 4D).
  • Pre-Norm block: z' = z + MSA(LN(z)); z_out = z' + MLP(LN(z')).
  • Full attention cost: N² scores per layer; window attention: N·M².
  • Patch merging (Swin): 2×2 neighbourhood concatenated (4C) and projected to 2C.

Derivations

ViT-B/16 parameter count, end-to-end. Per block: attention 4D² = 4·768² ≈ 2.36M; MLP 8D² ≈ 4.72M; total per block ≈ 7.08M. Twelve blocks: ≈ 85M. Patch embedding 768×768: ≈ 0.59M. Classifier head: 768×1000 ≈ 0.77M for ImageNet-1k, comparatively negligible. **Grand total ≈ 86M**, matching the slide.

The four quiz answers in one place. *Resolution ↑:* sequence ↑ quadratically, params ~unchanged (interpolate PEs). *Patch size ↓:* sequence ↑ quadratically (4× for 16 → 8), patch-embedding params shrink slightly, attention compute scales heavily. *Layers ↑:* params ↑ linearly, sequence unchanged. *Heads ↑ (D constant):* params *and* sequence unchanged — heads partition D, don't multiply.

**Window attention is linear in N.** The image has N patches partitioned into N/M² windows of M² patches each. Within-window attention costs (M²)² = M⁴ per window, (N/M²)·M⁴ = N·M² in total — linear in N because M is a constant (typically 7). Compare full ViT for N = 3136: 3136² ≈ 9.8M vs windowed 3136·49 ≈ 154k. ~64× cheaper at the same resolution.

PE interpolation when fine-tuning at higher resolution. Pretrain at 224 (PE shape 197×768, a 14×14 grid plus [CLS]); fine-tune at 384 (need 577×768, a 24×24 grid plus [CLS]). Reshape the pretrained PEs back to a 2D grid, bilinearly interpolate to the new grid size, flatten — no new parameters, smooth initialisation.
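
A hedged sketch of the interpolation step (assumes a pretrained PE tensor of shape (1, 197, 768) and a new 24×24 grid; `F.interpolate` in bilinear mode does the resizing):

```python
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 197, 768)               # pretrained: [CLS] + 14x14 grid
cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]

grid = patch_pe.reshape(1, 14, 14, 768).permute(0, 3, 1, 2)    # (1, 768, 14, 14)
grid = F.interpolate(grid, size=(24, 24), mode="bilinear", align_corners=False)
patch_pe = grid.permute(0, 2, 3, 1).reshape(1, 24 * 24, 768)   # (1, 576, 768)

new_pos_embed = torch.cat([cls_pe, patch_pe], dim=1)
print(new_pos_embed.shape)                         # torch.Size([1, 577, 768]) for a 384x384 input
```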

Position-embedding 2D self-similarity is emergent. ViT's PEs are 1D-learned with no 2D structure imposed. Yet after training, the cosine similarity between the PE at grid position (i, j) and the PEs at all other positions is approximately a 2D bump centred at (i, j) — the model discovered the 2D layout from the training signal alone.

Examples

  • Patch-count examples. 224×224 at P = 16 → 196 patches (+CLS = 197 tokens). 224×224 at P = 32 → 49 patches (4× faster, much less detail). 384×384 at P = 16 → 576 patches.
  • ViT-B/16 vs ViT-B/32 — same family, different patches. Both D = 768, L = 12, MLP = 3072, 12 heads, ~86M params. The /32 variant sees 4× fewer tokens and is faster but loses fine detail; mostly used in DINO/SigLIP for efficiency.
  • Swin-T window cost. 224×224 at P = 4 → 56×56 = 3136 patches. Full attention: 3136² ≈ 9.8M scores. Windowed at M = 7 (8×8 = 64 windows of 49 patches): 64·49² ≈ 154k. ~64× cheaper.
  • Shifted-window propagation. Block ℓ: windows aligned; a patch on a window's right edge talks to neighbours inside its window but not to the patch just across the boundary in the next window. Block ℓ+1: windows shifted by ⌊M/2⌋ — now those two boundary patches share a window. After 2 layers, every patch has seen its 8-neighbourhood.
  • [CLS] attention visualisation. Trained ViT-B; visualise the average of [CLS]'s attention weights to all 196 patches. Result: a heatmap concentrating on the object — the cat's face, the dog's nose, the bird's body. This emergent localisation is what DINO refined into segmentation.
  • Quiz Q4 head-count surprise. D = 768. One head: W_Q, W_K, W_V, W_O are 768×768 each → total ≈ 2.36M. 12 heads of d_head = 64: W_Q, W_K, W_V per head are 768×64 each (≈ 49k), ×3×12 ≈ 1.77M, plus W_O 768×768 ≈ 0.59M → ≈ 2.36M. Identical.

Diagrams

  • ViT pipeline. 224×224×3 image → 14×14 grid of 16×16 patches → flatten to 196×768 → linear projection (768 → 768) → prepend [CLS] → add 1D PE → 12 Pre-Norm Transformer blocks → take final [CLS] state → MLP head → 1000 logits.
  • Transformer block (Pre-Norm). z' = z + MSA(LN(z)); z_out = z' + MLP(LN(z')).
  • ViT vs CNN attention distance. Side-by-side: CNN heads have tiny receptive fields early, growing with depth. ViT heads include both small-distance and full-image-distance attentions in every layer.
  • Swin shifted-window mechanism. Two adjacent layers: layer 1 windows aligned to the origin (red grid); layer 2 windows shifted by ⌊M/2⌋ = 3 patches (blue grid offset); patches at original boundaries now sit in window interiors.
  • Swin hierarchical pyramid. Four stages (224 input): 56×56×C → 28×28×2C → 14×14×4C → 7×7×8C. Patch merging at each stage transition.
  • Position-embedding cosine similarity grid. 14×14 patches; pick one patch, e.g. the centre; visualise its cosine similarity to every position as a 14×14 heatmap — high at the chosen patch, decreasing radially, recovering 2D structure from 1D-learned embeddings.

Edge cases

  • ViT on small datasets underperforms CNNs. No inductive biases means more data is required to learn what CNNs get for free. ImageNet-only ViT loses to ResNet; JFT-300M ViT wins.
  • PE interpolation when fine-tuning at higher resolution. Without interpolation, learned PEs are undefined for new positions; accuracy collapses.
  • Token redundancy. Many patches contribute little (background sky, uniform texture); methods like DynamicViT drop 30–50% of tokens with minimal accuracy loss.
  • Swin edge windows after shift. Shifting goes off-grid → edge windows are irregularly shaped. Mitigation: cyclic shift + masked attention to preserve regular window shapes.
  • **[CLS] vs global-average-pool.** Within ~1% on classification, but [CLS] gives more interpretable attention maps; some downstream pipelines (DINOv2 linear probes) prefer pooled features.
  • Swin is bad for contrastive pretraining at scale. Window locality conflicts with global discrimination; plain ViT is the default backbone for CLIP-style pretraining.
  • Mixed-resolution batches. Standard ViT cannot batch images of different resolutions without padding/cropping; recent variants (NaViT, SigLIP 2) use packed sequences.

Common mistakes

  • Saying patch embedding is *"just flattening"* — it's a learned linear projection (equivalently, a Conv2d with kernel P×P and stride P).
  • Forgetting the [CLS] token in sequence-length accounting — the sequence is N + 1 = 197, not 196.
  • Stating ViT has *"no positional information"* — it has learned 1D positional embeddings.
  • Answering Quiz Q4 (heads ↑) with *"parameters increase"* — they don't. Heads partition D, they don't multiply.
  • Treating Swin's shifted window as separate from cyclic shift — the cyclic shift is the *implementation* of the shifted-window operation at the boundaries.
  • Calling Swin *"linear in the image resolution"* without qualification — it's linear *given a fixed window size M*. Compute still scales with the number of patches.
  • Comparing CNNs vs ViTs at ImageNet-only and concluding *"ViT is worse"* — the comparison only holds at small data scale; ViT wins at large scale.

Shortcuts

  • Sequence length: N = (H/P)·(W/P), plus 1 for [CLS].
  • Per-block params: ≈ 12D² (4D² attention + 8D² MLP).
  • ViT-B/16 ≈ 86M, ViT-L ≈ 307M, ViT-H ≈ 632M.
  • ViT-B specs to remember: D = 768, L = 12, MLP = 3072, 12 heads, ~86M params.
  • Pre-Norm everywhere in modern ViT (and the original).
  • **Swin gives O(N·M²) per layer + a global receptive field across layers via shifting.**
  • 1D learned PEs in ViT; interpolate them when fine-tuning at higher resolution.
  • Quiz Q4 answer: heads ↑ → params unchanged. Memorise as the counter-intuitive one.

Proofs / Algorithms

Multi-head attention has the same total parameters as single-head. Per-head projections W_Q^i, W_K^i, W_V^i are D×d_head with d_head = D/h. Summing over h heads: 3·h·D·(D/h) = 3D². Plus W_O (D×D) → total 4D². Independent of h.

**Window self-attention is O(N·M²).** N patches split into N/M² windows of M² patches each. Per-window attention: (M²)² = M⁴. Total: (N/M²)·M⁴ = N·M². For fixed M, linear in N.

ViT-B parameter count ≈ 85.5M. Params ≈ L·12D² + D² for the dominant terms (blocks + patch embedding). With L = 12, D = 768: 144·768² + 768² ≈ 84.9M + 0.6M ≈ 85.5M.