
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

ViT Pipeline, Scaling, and Swin


Intuition

For 9 years (AlexNet 2012 → 2021), every state-of-the-art vision model was a CNN. The locality + translation-equivariance inductive bias of convolutions was treated as a truth about how vision must work. Then Google's *"An Image is Worth 16×16 Words"* (Dosovitskiy et al., ICLR 2021) showed that a plain Transformer — the same architecture used to translate English to French — matched or beat the best CNNs on ImageNet, *provided you fed it enough data*. The recipe is literally in the title: cut the image into patches, treat each as a token, feed a standard Transformer encoder.

Explanation

Why "pixels as a sequence" doesn't work. A 224×224 RGB image has 224·224 = 50,176 pixels. Self-attention is quadratic in sequence length — that's 50,176² ≈ 2.5 billion attention scores per layer, computationally impossible. So you must compress the sequence. The simplest compression: chunk the image into non-overlapping patches and treat each as one token.

Patch tokenisation. Given an H×W image and patch size P: N = (H/P)·(W/P) patches. For H = W = 224, P = 16: N = 14×14 = 196. Each patch is flattened to a (P²·3)-dim vector and linearly projected to D dimensions via a learnable (P²·3)×D matrix E. For ViT-B with P = 16, D = 768: P²·3 = 16·16·3 = 768, so E is 768×768 — the patch projection is square (not a coincidence; ViT-B was sized to make these match). Equivalent implementation: a single Conv2d with kernel P×P and stride P.
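
A minimal sketch of the patch-embedding equivalence in PyTorch (illustrative, not the lecture's reference code; shapes assume ViT-B/16 at 224×224):

```python
import torch
import torch.nn as nn

# Patch embedding as a strided convolution: kernel = stride = patch size.
patch, embed_dim = 16, 768                   # ViT-B/16
proj = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)

x = torch.randn(1, 3, 224, 224)              # one RGB image
tokens = proj(x)                             # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): N = 196 patch tokens of dim D = 768
print(tokens.shape)                          # torch.Size([1, 196, 768])
```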

**The [CLS] token.** Following BERT, ViT *prepends* a special learnable [CLS] token at position 0: the input sequence is [x_CLS, x_1, …, x_196], total length N + 1 = 197. After the L Transformer encoder layers, the final hidden state of [CLS] is passed through a single linear layer for the class logits. **Why [CLS] and not average-pool?** Average-pooling all patch tokens works *similarly* (within ~1% on ImageNet) — ViT keeps [CLS] because BERT did and the convention stuck. [CLS] also gives interpretable attention maps showing which regions drove the prediction.

Position embeddings — 1D learned wins. The original paper compared three choices: *none* (drop position entirely — accuracy drops ~3%, surprising); *1D learned* (each patch gets a 1D index by raster scan, learn a D-dim embedding per index — this is what ViT uses); *2D learned* (separate row and column embeddings — marginal gain). Memorise: ViT uses 1D learned PEs despite images being 2D. The Transformer figures out the 2D structure on its own.
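
A hedged sketch of the [CLS]-prepend and 1D learned position embedding, continuing the shapes above (the names `cls_token` and `pos_embed` are illustrative, not the lecture's code):

```python
import torch
import torch.nn as nn

B, N, D = 1, 196, 768
patch_tokens = torch.randn(B, N, D)                  # output of the patch embedding

cls_token = nn.Parameter(torch.zeros(1, 1, D))       # learnable summary token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # 1D learned PE: one vector per sequence position

x = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)  # prepend [CLS]: (B, 197, D)
x = x + pos_embed                                    # add position information
print(x.shape)                                       # torch.Size([1, 197, 768])
```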

The Transformer block — Pre-Norm. Same as the 2017 original with one tweak: ViT uses Pre-Norm. For block ℓ: z' = z_{ℓ-1} + MSA(LN(z_{ℓ-1})); z_ℓ = z' + MLP(LN(z')). The "Simple notation!" slide flags this — you should be able to write the block in 2 lines from memory.

The MLP sublayer. MLP(x) = W_2 · GELU(W_1 x) with W_1 of size 4D×D and W_2 of size D×4D. The 4× expansion ratio is standard (4D = 3072 for ViT-B). The "MLP size" in lecture slides is the *intermediate* dimension 4D.
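
A minimal Pre-Norm block sketch matching the two-line recipe above (uses `nn.MultiheadAttention` as the MSA sublayer; hyperparameters are ViT-B's):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """z' = z + MSA(LN(z));  z_out = z' + MLP(LN(z'))."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),   # 768 -> 3072
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),   # 3072 -> 768
        )

    def forward(self, z):
        h = self.norm1(z)                                   # LN before the sublayer (Pre-Norm)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # MSA sublayer + residual
        z = z + self.mlp(self.norm2(z))                     # MLP sublayer + residual
        return z

block = PreNormBlock()
print(block(torch.randn(1, 197, 768)).shape)                # torch.Size([1, 197, 768])
```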

The ViT family — memorise the specs. ViT-Tiny: D = 192, L = 12, MLP = 768, 3 heads, ~5M params. ViT-Small: D = 384, L = 12, MLP = 1536, 6 heads, ~22M. **ViT-Base: D = 768, L = 12, MLP = 3072, 12 heads, ~86M.** ViT-Large: D = 1024, L = 24, MLP = 4096, 16 heads, ~307M. ViT-Huge: D = 1280, L = 32, MLP = 5120, 16 heads, ~632M. Consistent pattern: MLP = 4D; per-head dimension D/heads = 64 (ViT-Huge is the exception at 80). *ViT-B/16* means ViT-Base with 16×16 patches.

Parameter calculation (exam favourite, ViT-B/16). *One block:* attention W_Q, W_K, W_V, W_O, each D×D = 768² ≈ 0.59M → ≈ 2.36M. MLP up (768×3072 ≈ 2.36M) + down (3072×768 ≈ 2.36M) → ≈ 4.72M. **Per block: ≈ 7.1M.** All 12 blocks: ≈ 85M. Patch embedding (one-time): 768×768 ≈ 0.59M. Total: ≈ 85.5M, matching the slide's ~86M. LayerNorm, the [CLS] token, and position embeddings are individually negligible.
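
The same arithmetic as a quick sanity check (plain Python, rounding as on the slide):

```python
D, L = 768, 12                     # ViT-B/16
attn = 4 * D * D                   # W_Q, W_K, W_V, W_O
mlp = 2 * D * 4 * D                # up-projection + down-projection
block = attn + mlp                 # ~7.1M per block
patch_embed = (16 * 16 * 3) * D    # 768 x 768 one-time projection

total = L * block + patch_embed
print(f"per block {block/1e6:.2f}M, all blocks {L*block/1e6:.1f}M, total {total/1e6:.1f}M")
# per block 7.08M, all blocks 84.9M, total 85.5M
```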

The quiz — the centrepiece, and it will be on your exam. What happens to *parameters* and *sequence length* when …

**Q1 — Resolution ↑ (patch size fixed at 16).** Old: 224×224 → N = 196. New: e.g. 448×448 → N = (448/16)² = 784. **Sequence length: increases quadratically with resolution (4× here). Parameters: ~unchanged** — Transformer weights are all D×D or D×4D regardless of N; the patch embedding is unchanged because P and D are unchanged. *Subtlety:* learned positional embeddings have shape (N+1)×D — they don't fit the new length. Standard fix: bilinear interpolation of the position embeddings to the new grid; doesn't add parameters.

**Q2 — Patch size 16 → 8 (resolution fixed at 224).** Old: N = (224/16)² = 196. New: N = (224/8)² = 784. **Sequence length: increases exactly 4×. Parameters: slightly change.** The patch-embedding matrix E shrinks from 768×768 to (8²·3)×768 = 192×768 (a factor of 4 smaller). The rest of the network is unchanged. *Dominant effect:* 4× longer sequence → ~16× more attention compute, more attention memory.

**Q3 — Layers ↑ (everything else fixed). Sequence length: unchanged. Parameters: increase linearly in L.** Each block adds ≈ 7.1M; going 12 → 24 adds ≈ 85M.

**Q4 — Heads ↑ (D constant).** *The trickiest one.* In MHA, D is partitioned across heads — each head has d_head = D/h. The Q, K, V, O projections total 4D² regardless of how that D is split. Sequence length: unchanged. Parameters: unchanged. This is the surprising answer most students miss. *Heads partition D; they don't multiply parameters.* The exam line: *the number of heads doesn't change the MSA parameter count — it only changes how the D-dim space is partitioned for parallel attention computation.*
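
All four quiz answers can be checked in a few lines; `seq_len` and `vit_params` below are illustrative helpers, not from the lecture:

```python
def seq_len(res, patch):
    return (res // patch) ** 2 + 1           # +1 for [CLS]

def vit_params(D=768, layers=12, patch=16, heads=12):
    # 'heads' is never used: MSA projections total 4*D*D however D is split across heads
    block = 4 * D * D + 2 * D * 4 * D        # attention + MLP
    return layers * block + (patch * patch * 3) * D

base = vit_params()
print(seq_len(448, 16), vit_params() == base)          # Q1: 785 tokens, params unchanged
print(seq_len(224, 8), vit_params(patch=8) - base)     # Q2: 785 tokens, slightly fewer params
print(seq_len(224, 16), vit_params(layers=24) - base)  # Q3: sequence unchanged, +~85M params
print(vit_params(heads=1) == vit_params(heads=24))     # Q4: True, heads don't change params
```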

Data hunger — ViT's defining property. At small data scales (ImageNet-only), ResNets win. At large data scales (JFT-300M, 300 million images), ViT wins by a wide margin. *Reason:* CNNs come with built-in inductive biases — *locality* and *translation equivariance* — which are useful when data is small but *limiting* when data is large. ViT has no spatial inductive bias beyond the patch grid; every patch attends to every other from layer 1. With small data this overfits; with huge data it learns richer relationships than CNNs. "ViTs transfer better than ResNets" comes from this scaling: pretrain on JFT-300M, transfer to ImageNet (+18 other tasks) — ViTs win across the board.

Do ViTs see like CNNs? Raghu et al., NeurIPS 2021. Analysis uses CKA (Centered Kernel Alignment), a similarity metric for representations. *(a)* CNN representations evolve gradually across layers: early = textures, middle = parts, late = objects — a clear hierarchy. *(b)* ViT representations are remarkably uniform across layers: even early layers attend globally; no clear feature pyramid. The attention distance slide makes this concrete — in CNNs, early receptive fields are tiny (a few pixels across); in ViTs, even layer 1 has many heads attending across the full image. *Strength:* ViT captures global context immediately. *Weakness:* no multi-scale feature pyramid, making vanilla ViT worse for dense prediction (segmentation, detection).
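
A minimal linear-CKA sketch for intuition (the paper also uses other CKA variants; this assumes feature matrices of shape examples × features and is not Raghu et al.'s exact code):

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape (n_examples, n_features)."""
    X = X - X.mean(dim=0, keepdim=True)        # centre each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    num = (Y.T @ X).norm(p="fro") ** 2         # ||Y^T X||_F^2
    den = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
    return (num / den).item()

a = torch.randn(512, 768)                      # e.g. token features from one ViT block
print(linear_cka(a, a))                        # 1.0: identical representations
print(linear_cka(a, 2.0 * a + 1.0))            # ~1.0: invariant to isotropic scaling and shifts
```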

**What the [CLS] token learns.** Through training, [CLS]'s attention weights concentrate on semantically meaningful regions — the object, salient parts. This emergent object-localising behaviour is what DINO amplified into actual *segmentation* maps without any segmentation labels.

Position embeddings are self-similar. A striking finding: take ViT's learned 1D positional embeddings — one per patch — and compute the cosine similarity between patch (i, j)'s embedding and those of all other positions. Reshaped to the 14×14 grid, you get a 2D pattern centred at (i, j) with high similarity decreasing radially. The 1D embeddings spontaneously discovered the 2D layout from training, no 2D structure imposed.
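
The self-similarity check is only a few lines; a sketch assuming a position-embedding tensor of shape (1, 197, 768) (a random stand-in is used here so the snippet runs on its own):

```python
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 197, 768)          # stand-in for a trained model's PE table
pos = F.normalize(pos_embed[0, 1:], dim=-1)   # drop [CLS], unit-normalise the 196 patch PEs

i, j = 7, 7                              # pick a patch on the 14x14 grid (illustrative choice)
sims = pos @ pos[i * 14 + j]             # cosine similarity to every position, shape (196,)
heatmap = sims.reshape(14, 14)           # for a trained ViT this is a bump centred at (i, j)
print(heatmap.shape)                     # torch.Size([14, 14])
```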

Swin Transformer — fixing ViT's two weaknesses. ViT has two acknowledged problems: (1) quadratic complexity in sequence length — at high resolution, N explodes and self-attention becomes prohibitive; (2) no hierarchical features — object detection and segmentation work best with multi-scale feature pyramids (FPN), but ViT has one resolution throughout. Swin (Liu et al., ICCV 2021 Best Paper) fixes both. The name is Shifted Window Transformer.

Swin Idea 1 — Windowed self-attention. Divide the patches into non-overlapping windows of M×M patches (typically M = 7) and compute self-attention *only within each window*. Standard ViT attention: O(N²) over the whole image. Window attention: O(N·M²) — **linear in N** because M is constant. For a 56×56 patch grid (N = 3136): full attention costs 3136² ≈ 9.8M scores per layer; windowed at M = 7 costs 3136·49 ≈ 154k. About 64× cheaper.
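
A hedged sketch of the window partition that makes the cost linear (a reshape-based version of the idea; not the official Swin code):

```python
import torch

def window_partition(x, M=7):
    """(B, H, W, C) feature map -> (B * num_windows, M*M, C) window tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)
    return x

feat = torch.randn(1, 56, 56, 96)            # Swin-T stage-1 feature map (56x56 patches)
windows = window_partition(feat)             # (64, 49, 96): self-attention runs per window
print(windows.shape)
print(56 * 56 * 56 * 56, 64 * 49 * 49)       # ~9.8M full-attention scores vs ~154k windowed
```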

The cross-window problem. Pure window attention has an obvious flaw: tokens in different windows never talk. Global structure is lost.

Swin Idea 2 — Shifted windows. In *alternate* Transformer blocks, shift the window grid by ⌊M/2⌋ patches (3 for M = 7). Block ℓ = W-MSA (windows aligned); block ℓ+1 = SW-MSA (windows shifted); block ℓ+2 = W-MSA again; etc. After the shift, patches that were in different windows now share one. Information propagates across original window boundaries; after a few layers the effective receptive field expands much like a CNN's. The shift produces some edge windows of irregular size; Swin handles this with cyclic shift + masked self-attention — patches are cyclically wrapped to maintain regular window sizes, with attention masks blocking wrap-around computations.
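
The cyclic shift is just a roll of the feature map before window partitioning; a sketch (the real implementation also builds the attention masks for the wrapped-around regions):

```python
import torch

M = 7
s = M // 2                                         # shift amount: floor(M/2) = 3
feat = torch.randn(1, 56, 56, 96)                  # (B, H, W, C)

# SW-MSA: cyclic shift so the shifted windows remain regular M x M blocks.
shifted = torch.roll(feat, shifts=(-s, -s), dims=(1, 2))
# ... window_partition(shifted), masked attention within windows, merge windows back ...
restored = torch.roll(shifted, shifts=(s, s), dims=(1, 2))   # undo the shift afterwards
print(torch.equal(restored, feat))                 # True: the roll is exactly invertible
```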

Swin Idea 3 — Hierarchical patch merging. Standard ViT keeps the same number of tokens throughout. Swin progressively downsamples: Stage 1: H/4 × W/4 patches at C channels → Stage 2: H/8 × W/8 at 2C → Stage 3: H/16 × W/16 at 4C → Stage 4: H/32 × W/32 at 8C. Patch merging at the start of each new stage: take a 2×2 block of neighbouring patches, concatenate features (4C), linearly project 4C → 2C. Halves spatial dims and doubles channels — *exactly like a strided conv in a CNN*. Result: a multi-scale hierarchical feature pyramid like ResNet/FPN; drop-in replacement for CNNs in detection (Mask R-CNN with Swin) and segmentation pipelines.
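
A hedged patch-merging sketch (2×2 neighbourhood concatenated, then projected 4C → 2C; the LayerNorm before the projection follows the Swin paper):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """(B, H, W, C) -> (B, H/2, W/2, 2C): halve spatial dims, double channels."""
    def __init__(self, C):
        super().__init__()
        self.norm = nn.LayerNorm(4 * C)
        self.reduction = nn.Linear(4 * C, 2 * C, bias=False)

    def forward(self, x):
        # gather each 2x2 neighbourhood and concatenate along the channel dim
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                           # (B, H/2, W/2, 2C)

stage1 = torch.randn(1, 56, 56, 96)          # Swin-T stage 1: 56x56 tokens, C = 96
print(PatchMerging(96)(stage1).shape)        # torch.Size([1, 28, 28, 192])
```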

Swin vs ViT in one table.

| | ViT | Swin |
| --- | --- | --- |
| Attention scope | global (all patches) | local windows + shifted alternation |
| Complexity | O(N²) | O(N·M²), linear in N |
| Feature hierarchy | single resolution throughout | 4-stage pyramid |
| Best for | classification + contrastive pretraining | dense prediction (detection, segmentation) |
| When to use | large-scale pretraining target | backbone for downstream perception tasks |

Definitions

  • Patch embedding: Linear projection of flattened P×P patches to D-dim token embeddings; equivalently a Conv2d with kernel P×P and stride P.
  • [CLS] token: Learnable summary token prepended to the patch sequence; its final hidden state is passed through a linear head for classification. Alternative: global-average-pool patch tokens (~equivalent accuracy).
  • Inductive bias: Architectural priors. CNNs have locality + translation equivariance baked in. ViTs have neither — they must learn them from data. Small data: bias helps. Large data: bias limits.
  • Pre-Norm: LayerNorm placed *before* each sublayer (MSA or MLP), with the residual added *after*: z' = z + Sublayer(LN(z)). Stable for deep stacks; used by ViT.
  • ViT-B/16: Vision Transformer Base with 16×16 patches: D = 768, L = 12, MLP 3072, 12 heads, ~86M params.
  • Attention distance: Average spatial distance over which a head attends. CNNs: small in early layers, grows with depth. ViTs: spans both small and large distances even in layer 1.
  • CKA (Centered Kernel Alignment): Similarity metric for comparing representations across layers/models. Used by Raghu et al. to show ViT and CNN representations differ qualitatively.
  • JFT-300M: Google's internal 300M-image dataset; ViT's pretraining target where it overtakes ResNet by a wide margin.
  • Swin Transformer: Shifted-Window Transformer (Liu et al., ICCV 2021). Window self-attention (O(N·M²) per layer) + shifted windows in alternate blocks (cross-window communication) + hierarchical patch merging (4-stage pyramid).
  • W-MSA / SW-MSA: Window multi-head self-attention with aligned windows / with shifted windows. Swin alternates these between blocks.
  • Patch merging (Swin): A 2×2 group of patches concatenated and projected 4C → 2C; halves spatial dims and doubles channels — like strided conv in CNNs.
  • PE interpolation: Bilinearly interpolating learned 1D position embeddings to a new sequence length when fine-tuning at higher resolution; no new parameters.

Formulas

  • Sequence length: N = (H/P)·(W/P), plus 1 for the [CLS] token.
  • Per-block parameters: ≈ 12D² (4D² attention + 8D² MLP, intermediate dim 4D).
  • Pre-Norm block: z' = z + MSA(LN(z)); z_out = z' + MLP(LN(z')).
  • Full attention cost: N² scores per layer; window attention: N·M².
  • Patch merging (Swin): 2×2 neighbourhood concatenated (4C) and projected to 2C.

Derivations

ViT-B/16 parameter count, end-to-end. Per block: attention 4D² = 4·768² ≈ 2.36M; MLP 8D² ≈ 4.72M; total per block ≈ 7.08M. Twelve blocks: ≈ 85M. Patch embedding 768×768: ≈ 0.59M. Classifier head: 768×1000 ≈ 0.77M for ImageNet-1k, comparatively negligible. **Grand total ≈ 86M**, matching the slide.

The four quiz answers in one place. *Resolution ↑:* sequence ↑ quadratically, params ~unchanged (interpolate PEs). *Patch size ↓:* sequence ↑ quadratically (4× for 16 → 8), patch-embedding params shrink slightly, attention compute scales heavily. *Layers ↑:* params ↑ linearly, sequence unchanged. *Heads ↑ (D constant):* params *and* sequence unchanged — heads partition D, don't multiply.

**Window attention is linear in N.** The image has N patches partitioned into N/M² windows of M² patches each. Within-window attention costs (M²)² = M⁴ per window, (N/M²)·M⁴ = N·M² in total — linear in N because M is a constant (typically 7). Compare full ViT for N = 3136: 3136² ≈ 9.8M vs windowed 3136·49 ≈ 154k. ~64× cheaper at the same resolution.

PE interpolation when fine-tuning at higher resolution. Pretrain at 224 (PE shape 197×768, a 14×14 grid plus [CLS]); fine-tune at 384 (need 577×768, a 24×24 grid plus [CLS]). Reshape the pretrained PEs back to a 2D grid, bilinearly interpolate to the new grid size, flatten — no new parameters, smooth initialisation.
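
A hedged sketch of the interpolation step (assumes a pretrained PE tensor of shape (1, 197, 768) and a new 24×24 grid; `F.interpolate` in bilinear mode does the resizing):

```python
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 197, 768)               # pretrained: [CLS] + 14x14 grid
cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]

grid = patch_pe.reshape(1, 14, 14, 768).permute(0, 3, 1, 2)    # (1, 768, 14, 14)
grid = F.interpolate(grid, size=(24, 24), mode="bilinear", align_corners=False)
patch_pe = grid.permute(0, 2, 3, 1).reshape(1, 24 * 24, 768)   # (1, 576, 768)

new_pos_embed = torch.cat([cls_pe, patch_pe], dim=1)
print(new_pos_embed.shape)                         # torch.Size([1, 577, 768]) for a 384x384 input
```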

Position-embedding 2D self-similarity is emergent. ViT's PEs are 1D-learned with no 2D structure imposed. Yet after training, the cosine similarity between the PE at grid position (i, j) and the PEs at all other positions is approximately a 2D bump centred at (i, j) — the model discovered the 2D layout from the training signal alone.

Examples

  • Patch-count examples. 224×224 at P = 16 → 196 patches (+CLS = 197 tokens). 224×224 at P = 32 → 49 patches (4× faster, much less detail). 384×384 at P = 16 → 576 patches.
  • ViT-B/16 vs ViT-B/32 — same family, different patches. Both D = 768, L = 12, MLP = 3072, 12 heads, ~86M params. The /32 variant sees 4× fewer tokens and is faster but loses fine detail; mostly used in DINO/SigLIP for efficiency.
  • Swin-T window cost. 224×224 at P = 4 → 56×56 = 3136 patches. Full attention: 3136² ≈ 9.8M scores. Windowed at M = 7 (8×8 = 64 windows of 49 patches): 64·49² ≈ 154k. ~64× cheaper.
  • Shifted-window propagation. Block ℓ: windows aligned; a patch on a window's right edge talks to neighbours inside its window but not to the patch just across the boundary in the next window. Block ℓ+1: windows shifted by ⌊M/2⌋ — now those two boundary patches share a window. After 2 layers, every patch has seen its 8-neighbourhood.
  • [CLS] attention visualisation. Trained ViT-B; visualise the average of [CLS]'s attention weights to all 196 patches. Result: a heatmap concentrating on the object — the cat's face, the dog's nose, the bird's body. This emergent localisation is what DINO refined into segmentation.
  • Quiz Q4 head-count surprise. D = 768. One head: W_Q, W_K, W_V, W_O are 768×768 each → total ≈ 2.36M. 12 heads of d_head = 64: W_Q, W_K, W_V per head are 768×64 each (≈ 49k), ×3×12 ≈ 1.77M, plus W_O 768×768 ≈ 0.59M → ≈ 2.36M. Identical.

Diagrams

  • ViT pipeline. 224×224×3 image → 14×14 grid of 16×16 patches → flatten to 196×768 → linear projection (768 → 768) → prepend [CLS] → add 1D PE → 12 Pre-Norm Transformer blocks → take final [CLS] state → MLP head → 1000 logits.
  • Transformer block (Pre-Norm). z' = z + MSA(LN(z)); z_out = z' + MLP(LN(z')).
  • ViT vs CNN attention distance. Side-by-side: CNN heads have tiny receptive fields early, growing with depth. ViT heads include both small-distance and full-image-distance attentions in every layer.
  • Swin shifted-window mechanism. Two adjacent layers: layer 1 windows aligned to the origin (red grid); layer 2 windows shifted by ⌊M/2⌋ = 3 patches (blue grid offset); patches at original boundaries now sit in window interiors.
  • Swin hierarchical pyramid. Four stages (224 input): 56×56×C → 28×28×2C → 14×14×4C → 7×7×8C. Patch merging at each stage transition.
  • Position-embedding cosine similarity grid. 14×14 patches; pick one patch, e.g. the centre; visualise its cosine similarity to every position as a 14×14 heatmap — high at the chosen patch, decreasing radially, recovering 2D structure from 1D-learned embeddings.

Edge cases

  • ViT on small datasets underperforms CNNs. No inductive biases means more data is required to learn what CNNs get for free. ImageNet-only ViT loses to ResNet; JFT-300M ViT wins.
  • PE interpolation when fine-tuning at higher resolution. Without interpolation, learned PEs are undefined for new positions; accuracy collapses.
  • Token redundancy. Many patches contribute little (background sky, uniform texture); methods like DynamicViT drop 30–50% of tokens with minimal accuracy loss.
  • Swin edge windows after shift. Shifting goes off-grid → edge windows are irregularly shaped. Mitigation: cyclic shift + masked attention to preserve regular window shapes.
  • **[CLS] vs global-average-pool.** Within ~1% on classification, but [CLS] gives more interpretable attention maps; some downstream pipelines (DINOv2 linear probes) prefer pooled features.
  • Swin is bad for contrastive pretraining at scale. Window locality conflicts with global discrimination; plain ViT is the default backbone for CLIP-style pretraining.
  • Mixed-resolution batches. Standard ViT cannot batch images of different resolutions without padding/cropping; recent variants (NaViT, SigLIP 2) use packed sequences.

Common mistakes

  • Saying patch embedding is *"just flattening"* — it's a learned linear projection (equivalently, a Conv2d with kernel P×P and stride P).
  • Forgetting the [CLS] token in sequence-length accounting — the sequence is N + 1 = 197, not 196.
  • Stating ViT has *"no positional information"* — it has learned 1D positional embeddings.
  • Answering Quiz Q4 (heads ↑) with *"parameters increase"* — they don't. Heads partition D, they don't multiply.
  • Treating Swin's shifted window as separate from cyclic shift — the cyclic shift is the *implementation* of the shifted-window operation at the boundaries.
  • Calling Swin *"linear in the image resolution"* without qualification — it's linear *given a fixed window size M*. Compute still scales with the number of patches.
  • Comparing CNNs vs ViTs at ImageNet-only and concluding *"ViT is worse"* — the comparison only holds at small data scale; ViT wins at large scale.

Shortcuts

  • Sequence length: N = (H/P)·(W/P), plus 1 for [CLS].
  • Per-block params: ≈ 12D² (4D² attention + 8D² MLP).
  • ViT-B/16 ≈ 86M, ViT-L ≈ 307M, ViT-H ≈ 632M.
  • ViT-B specs to remember: D = 768, L = 12, MLP = 3072, 12 heads, ~86M params.
  • Pre-Norm everywhere in modern ViT (and the original).
  • **Swin gives O(N·M²) per layer + a global receptive field across layers via shifting.**
  • 1D learned PEs in ViT; interpolate them when fine-tuning at higher resolution.
  • Quiz Q4 answer: heads ↑ → params unchanged. Memorise as the counter-intuitive one.

Proofs / Algorithms

Multi-head attention has the same total parameters as single-head. Per-head projections W_Q^i, W_K^i, W_V^i are D×d_head with d_head = D/h. Summing over h heads: 3·h·D·(D/h) = 3D². Plus W_O (D×D) → total 4D². Independent of h.

**Window self-attention is O(N·M²).** N patches split into N/M² windows of M² patches each. Per-window attention: (M²)² = M⁴. Total: (N/M²)·M⁴ = N·M². For fixed M, linear in N.

ViT-B parameter count ≈ 85.5M. Params ≈ L·12D² + D² for the dominant terms (blocks + patch embedding). With L = 12, D = 768: 144·768² + 768² ≈ 84.9M + 0.6M ≈ 85.5M.