Attention Mechanism & Transformer Architecture
Intuition
Pre-2014 sequence-to-sequence models squeezed an entire source sentence through a single fixed vector — the *bottleneck*. Attention (Bahdanau 2015) let the decoder *look back* over the encoder's full history at each step, picking out the positions relevant to the word being generated. Two years later, Vaswani et al. 2017 asked the radical question: do we need the RNN at all? *Attention Is All You Need* — and every modern vision/language model from ViT to GPT to PaliGemma descends from that one paper.
Explanation
The three landmark papers — memorise the chain. *(1)* Bahdanau et al., ICLR 2015 — "Neural Machine Translation by Jointly Learning to Align and Translate" — invented attention as an *add-on* to RNNs. *(2)* Xu et al., ICML 2015 — "Show, Attend and Tell" — applied Bahdanau attention to image captioning. *(3)* Vaswani et al., NeurIPS 2017 — "Attention Is All You Need" — killed the RNN entirely; the whole model is attention. This is the Transformer.
Three Seq2Seq task types you must name. *Image captioning* — image → text (single input, sequence output). *Sentiment classification* — text → label (sequence input, single output). *Machine translation* — text → text (sequence to sequence).
The encoder-decoder paradigm and its bottleneck. Encoder RNN reads the source into hidden states $h_1, \dots, h_T$; takes the final $h_T$ as a single fixed vector $c$ summarising the entire source. Decoder RNN initialises from $c$ and generates tokens autoregressively: at each step, take the previously-generated token as input, produce a probability over the vocabulary, sample/argmax, repeat. The bottleneck: one vector cannot hold an entire sentence. *"The cat that sat on the mat"* and *"The cat sat on the mat"* must compress to the same 512-dim vector? Impossible. As source sentences get longer, performance collapses. This is the failure attention fixes.
Bahdanau attention — the four-step recipe. At decoder step $t$: *(1)* Compute alignment scores $e_{t,i} = a(s_{t-1}, h_i)$ for each encoder position $i = 1, \dots, T$. Bahdanau's original score is the additive MLP $a(s, h) = v^\top \tanh(W_s s + W_h h)$. *(2)* Softmax: $\alpha_{t,i} = \exp(e_{t,i}) / \sum_j \exp(e_{t,j})$ — these are the attention weights, summing to 1 over $i$. *(3)* Context vector $c_t = \sum_i \alpha_{t,i} h_i$ — weighted average of encoder states focused on positions relevant to step $t$. *(4)* Combine and predict: $s_t = f(s_{t-1}, y_{t-1}, c_t)$; $p(y_t \mid y_{<t}, x) = \mathrm{softmax}(g(s_t, c_t))$.
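A minimal NumPy sketch of one Bahdanau decoding step, assuming encoder states `H` of shape `(T, d)` and a previous decoder state `s_prev` of shape `(d,)`; the names `W_s`, `W_h`, `v` and the toy dimensions are illustrative, not taken from any reference implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def bahdanau_step(s_prev, H, W_s, W_h, v):
    """One decoder step of additive (Bahdanau) attention.

    s_prev: (d,) previous decoder state s_{t-1}
    H:      (T, d) encoder hidden states h_1..h_T
    Returns the context vector c_t and the attention weights alpha_t.
    """
    scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # (T,)  e_{t,i} = v^T tanh(W_s s + W_h h_i)
    alpha = softmax(scores)                            # (T,)  weights sum to 1 over source positions
    c = alpha @ H                                      # (d,)  weighted average of encoder states
    return c, alpha

d, T = 8, 5
rng = np.random.default_rng(0)
H, s_prev = rng.normal(size=(T, d)), rng.normal(size=d)
W_s, W_h, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
c, alpha = bahdanau_step(s_prev, H, W_s, W_h, v)
print(alpha.sum())   # ~1.0: a proper distribution over the T encoder positions
```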
Attention learns alignment as a byproduct. When you visualise the $\alpha_{t,i}$ matrix for a translated sentence, you see a near-diagonal pattern — generating the French word for "cat" peaks attention on the encoder state for "cat". *No alignment supervision is given*, yet alignment emerges. This is the iconic Bahdanau heatmap and a guaranteed exam talking point.
Soft vs hard attention. *Soft* — $\alpha_t$ is a continuous distribution, $c_t$ is a weighted average, differentiable. *Hard* — sample one encoder position discretely (or argmax); not differentiable, trained with REINFORCE. Image-captioning's Show-Attend-Tell compared both; soft won and became standard.
Training: teacher vs student forcing. *Teacher forcing* — feed the *ground-truth* previous word $y_{t-1}$ to the decoder. Fast, stable, but creates *exposure bias* — at inference the model only sees its own predictions, which may differ from the training distribution. *Student forcing* (scheduled sampling) — feed the decoder's own previous prediction $\hat{y}_{t-1}$. More realistic but harder to train (early mistakes propagate). At inference: always student forcing — there's no ground truth. Memorise this contrast.
Inference: greedy vs beam search. *Greedy* — pick the argmax token at each step. Fast, locally myopic. *Beam search* — keep the top-$k$ partial sequences at each step, expand and score each extension, keep the top-$k$ again. After $T$ steps, return the highest-scoring complete sequence. Trade compute for quality; typical $k = 4$ to $10$. Standard for translation and captioning.
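A toy beam-search sketch under the assumption that the decoder is abstracted as a `log_probs_fn(prefix)` callable returning next-token log-probabilities; the function names and the dummy model are hypothetical, for illustration only.

```python
import numpy as np

def beam_search(log_probs_fn, k=4, max_len=20, eos=0):
    """Toy beam search: log_probs_fn(prefix) -> (vocab,) next-token log-probs."""
    beams = [([], 0.0)]                          # (token list, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:           # finished beams are carried over unchanged
                candidates.append((seq, score))
                continue
            lp = log_probs_fn(seq)
            for tok in np.argsort(lp)[-k:]:      # only the k best extensions can survive
                candidates.append((seq + [int(tok)], score + float(lp[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]

# dummy "decoder": prefers token (len+1) for five steps, then EOS (token 0)
def toy_lp(seq):
    lp = np.full(10, -5.0)
    lp[0 if len(seq) >= 5 else len(seq) + 1] = 0.0
    return lp

print(beam_search(toy_lp, k=4))   # ([1, 2, 3, 4, 5, 0], 0.0)
```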
Softmax temperature. $\mathrm{softmax}(z / \tau)$. $\tau = 1$ is standard; $\tau > 1$ flattens (exploratory); $\tau < 1$ peaks (deterministic, near one-hot). The same idea appears in DINO's teacher sharpening (low $\tau$) and in InfoNCE.
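A quick numerical illustration of temperature scaling (plain NumPy, illustrative values):

```python
import numpy as np

def softmax_T(logits, tau=1.0):
    """Temperature-scaled softmax: softmax(z / tau)."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

z = [2.0, 1.0, 0.1]
print(np.round(softmax_T(z, 1.0), 3))     # tau = 1: standard
print(np.round(softmax_T(z, 5.0), 3))     # tau > 1: flatter, more exploratory
print(np.round(softmax_T(z, 0.1), 3))     # tau < 1: peaky, near one-hot
```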
The Transformer — kill the RNN. The recurrence in RNNs has a real cost: it *serialises* computation. You cannot process step 5 until step 4 finishes. If attention already lets the decoder look at any encoder position, why not let *every* encoder position look at every *other* encoder position, in parallel, with no recurrence at all? That's the Transformer. Original paper: 6 encoder + 6 decoder blocks; modern variants use 12–96+.
Self-attention — the engine. For an input $X \in \mathbb{R}^{n \times d}$: project to Queries, Keys, Values with three learnable matrices $W_Q$, $W_K$, $W_V$ (each $d \times d_k$, with $d_k = d$ typically for a single head). Then scaled dot-product attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big)\, V$. Read it: $QK^\top$ is $n \times n$ — every token's query dot-producted with every token's key (pairwise similarities). Softmax along the last axis gives, for each query, a probability over keys. Multiply by $V$ — each query gets a weighted average of values. Output: $n \times d_k$.
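A single-head scaled dot-product self-attention sketch in NumPy, assuming an input `X` of shape `(n, d)` and projection matrices of shape `(d, d_k)`; shapes and initialisation are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (n, d) token representations; W_q/W_k/W_v: (d, d_k). Returns (n, d_k).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # each (n, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (n, n) pairwise similarities, scaled
    A = softmax(scores, axis=-1)                    # each row: a distribution over keys
    return A @ V                                    # weighted averages of the values

n, d, d_k = 6, 16, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)       # (6, 16)
```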
**Why divide by $\sqrt{d_k}$? — exam-gold.** With $q, k$ entries i.i.d. unit-variance, the dot product $q \cdot k$ has variance $d_k$. For $d_k = 64$, std is 8 — large entries push softmax into the *saturated regime* where one logit dominates and gradients vanish. Dividing by $\sqrt{d_k}$ rescales variance back to 1, keeping softmax in its useful regime. "Why scaled?" → "to prevent softmax saturation."
Multi-head attention. Instead of one big attention with $d_k = d$, run $h$ parallel heads each with $d_k = d/h$: $\mathrm{head}_i = \mathrm{Attention}(XW_Q^i, XW_K^i, XW_V^i)$; output $\mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_O$ with $W_O \in \mathbb{R}^{d \times d}$. Each head can specialise on a different relationship — one for syntactic dependency, one for coreference, one for local patterns. Critical detail: heads do not change the total parameter count. Whether you have 1 head with $d_k = d$ or 12 heads with $d_k = d/12$, the $W_Q, W_K, W_V$ together always use $3d^2$ parameters. Heads just partition the per-head dimension.
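A sketch of multi-head attention that makes the "heads partition the dimension" point concrete: the same four `(d, d)` projection matrices are used whatever `h` is; only the reshape into heads changes. Names and shapes are assumptions for illustration.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Multi-head attention by partitioning d_model into h heads of size d/h.

    All four projections stay (d, d), so the parameter count is the same for
    any h; only the reshape into heads changes.
    """
    n, d = X.shape
    d_k = d // h
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)    # (n, d) -> (h, n, d_k)
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)             # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores); A /= A.sum(axis=-1, keepdims=True)       # per-head softmax
    heads = A @ V                                                # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)              # concatenate heads
    return concat @ W_o

rng = np.random.default_rng(0)
n, d = 6, 64
X = rng.normal(size=(n, d))
W = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4)]
print(multi_head_attention(X, *W, h=1).shape)   # (6, 64)
print(multi_head_attention(X, *W, h=8).shape)   # (6, 64) -- same weights, same param count
```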
Three flavours of attention in the Transformer. *(1)* Encoder self-attention — $Q, K, V$ all from the input sequence; no masking; bidirectional. *(2)* Decoder masked self-attention — $Q, K, V$ all from the output-so-far; causal mask prevents peeking at future tokens. Mask is upper-triangular with $-\infty$ above the diagonal; after softmax these positions become 0. Synonyms for the mask: *causal mask, autoregressive mask, look-ahead mask, left-to-right mask*. *(3)* Cross-attention — $Q$ from the decoder's current state, $K, V$ from the encoder's output. Cross-attention is exactly Bahdanau attention generalised to Q-K-V form.
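A minimal sketch of building the causal (look-ahead) mask and applying it additively before the softmax; the function names are illustrative.

```python
import numpy as np

def causal_mask(n):
    """Additive look-ahead mask: 0 on and below the diagonal, -inf above it."""
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = -np.inf
    return m

def masked_softmax(scores, mask):
    """Apply the mask BEFORE the softmax (adding it afterwards would leak)."""
    s = scores + mask
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))   # raw decoder self-attention scores
A = masked_softmax(scores, causal_mask(4))
print(np.round(A, 2))   # row i has non-zero weight only on columns j <= i
```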
Positional encoding — restoring order. Self-attention is permutation-equivariant: shuffle the input tokens and the output tokens shuffle the same way. There is no notion of "position 3 vs 5" — without help, *"the cat sat on the mat"* and *"the mat sat on the cat"* produce equivalent outputs. Vaswani's fix: sinusoidal positional encoding added to the input embeddings: $PE(pos, 2i) = \sin\!\big(pos / 10000^{2i/d}\big)$, $PE(pos, 2i+1) = \cos\!\big(pos / 10000^{2i/d}\big)$. Each position gets a unique $d$-dim vector built from sine/cosine pairs at exponentially decreasing frequencies. Why sinusoids? $PE(pos + k)$ is a *linear function* of $PE(pos)$ for any fixed offset $k$, so the network can learn to use relative positions naturally. Variants you'll meet later: *learned absolute PEs* (BERT, ViT — simpler, but don't extrapolate beyond training-time lengths); RoPE (rotary, multiplicative on $Q$ and $K$; relative-position-aware by construction; extrapolates better); 2D RoPE / M-RoPE for images and video.
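A sketch of the sinusoidal PE table, assuming an even embedding dimension `d`; the function name is illustrative.

```python
import numpy as np

def sinusoidal_pe(seq_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d // 2)[None, :]                  # (1, d/2)
    angles = pos / (10000 ** (2 * i / d))           # lower frequency as i grows
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(seq_len=50, d=128)
print(pe.shape)   # (50, 128); added to the input embeddings before block 1
```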
The Transformer block in one snapshot. Encoder block: $Z = \mathrm{LayerNorm}(X + \mathrm{MSA}(X))$; $\mathrm{out} = \mathrm{LayerNorm}(Z + \mathrm{MLP}(Z))$. Decoder block: masked-MSA → cross-attention → MLP, each with Add & Norm. The MLP is $\mathrm{Linear}(d \to 4d) \to \mathrm{ReLU} \to \mathrm{Linear}(4d \to d)$ — the 4× expansion ratio is the standard. Original 2017 paper used Post-Norm ($\mathrm{LN}(x + \mathrm{Sublayer}(x))$); modern implementations use Pre-Norm ($x + \mathrm{Sublayer}(\mathrm{LN}(x))$) for stable deep stacks (this is what we'll meet again in Transformer Advances).
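A schematic NumPy sketch contrasting the Pre-Norm and Post-Norm block wiring; the attention sub-layer is passed in as a stand-in callable (here the identity) so only the residual/normalisation order is shown. All names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the feature dimension (no learned scale/shift for brevity)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def mlp(x, W1, b1, W2, b2):
    """Position-wise feed-forward: d -> 4d -> d with ReLU."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_block_postnorm(x, attn, mlp_params):
    """Original 2017 wiring: LN(x + Sublayer(x))."""
    x = layer_norm(x + attn(x))
    return layer_norm(x + mlp(x, *mlp_params))

def encoder_block_prenorm(x, attn, mlp_params):
    """Modern wiring: x + Sublayer(LN(x)) -- more stable for deep stacks."""
    x = x + attn(layer_norm(x))
    return x + mlp(layer_norm(x), *mlp_params)

n, d = 6, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
mlp_params = (rng.normal(size=(d, 4 * d)) * 0.02, np.zeros(4 * d),
              rng.normal(size=(4 * d, d)) * 0.02, np.zeros(d))
identity_attn = lambda z: z           # stand-in for the MSA sub-layer
print(encoder_block_prenorm(x, identity_attn, mlp_params).shape)   # (6, 32)
```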
Practical batching: padding + masking. Sequences in a batch have different lengths → pad to the longest with [PAD]. Padding mask sets attention scores at padding positions to $-\infty$ so they don't influence real tokens. Decoder needs both the *padding mask* AND the *causal mask* — combine element-wise (min, or sum in log-space).
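A sketch of combining the padding mask and the causal mask additively (the sum-in-log-space option), using a large negative constant as a stand-in for $-\infty$; names and shapes are assumptions.

```python
import numpy as np

NEG_INF = -1e9   # large negative stand-in for -inf in additive masks

def padding_mask(lengths, n):
    """(batch, 1, n) additive mask: NEG_INF at [PAD] key positions, 0 elsewhere."""
    pos = np.arange(n)[None, :]
    is_pad = pos >= np.asarray(lengths)[:, None]
    return np.where(is_pad, NEG_INF, 0.0)[:, None, :]

def causal_mask(n):
    """(1, n, n) additive look-ahead mask: NEG_INF strictly above the diagonal."""
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = NEG_INF
    return m[None, :, :]

lengths, n = [3, 5], 5
combined = padding_mask(lengths, n) + causal_mask(n)    # add BOTH masks pre-softmax
print((combined < 0)[0].astype(int))   # length-3 sequence: pad keys 3,4 and all future keys blocked
```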
Show-Attend-and-Tell — the image-captioning precursor. Image → CNN → grid of spatial features $a_1, \dots, a_L$. At each LSTM step $t$, compute attention weights $\alpha_{t,i}$ over the $L$ locations conditioned on the LSTM state $h_{t-1}$ → context vector $z_t = \sum_i \alpha_{t,i} a_i$; LSTM update $h_t = \mathrm{LSTM}(h_{t-1}, y_{t-1}, z_t)$. Visualising the attention maps shows the decoder "looking" at relevant image regions for each word — but attention reveals only *where* the model looks, not *whether it sees correctly*. The adversarial-colour experiment (caption says "red traffic light" but the model's heatmap is on an unrelated red object) exposes this.
Why the Transformer won — three reasons. *(1)* Parallelisation — self-attention computes all token interactions in parallel; training time on GPUs is dramatically faster than RNNs. *(2)* Constant path length — every token sees every other in one layer; no long-distance information decay. *(3)* Universality — the same architecture handles text, images (ViT), audio, video, even DNA — just change the tokenisation. Within 4 years (2017 → 2021), Transformers had taken over NLP, vision, speech, and multimodal everything.
Definitions
- Seq2Seq bottleneck — Pre-attention encoder-decoder RNNs compressed the entire source into a single fixed hidden vector; performance collapsed on long inputs.
- Bahdanau attention — Per-decoder-step weighted sum over encoder hidden states; weights computed by an additive MLP score; learns alignment as a byproduct.
- Q / K / V — Query / Key / Value — three learned projections of the input. Self-attention: all three from the same sequence. Cross-attention: Q from decoder, K, V from encoder.
- Scaled dot-product attention — $\mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big)\, V$. The $\sqrt{d_k}$ keeps softmax in its useful (non-saturated) regime.
- Multi-head attention — $h$ parallel attention heads, each with $d_k = d/h$; concatenate outputs and project with $W_O$. Same total params as single-head; heads can specialise.
- Causal mask (look-ahead mask) — Upper-triangular mask of $-\infty$; added to attention scores pre-softmax; prevents the decoder from attending to future tokens.
- Cross-attention — Q from decoder's current state, K, V from encoder's output. Equivalent to Bahdanau attention in Q-K-V form.
- Sinusoidal positional encoding — Vaswani's $\sin/\cos$ pairs at exponentially decreasing frequencies; allows linear expression of relative position; generalises to unseen lengths.
- Pre-Norm vs Post-Norm — Post-Norm (Vaswani 2017): $\mathrm{LN}(x + \mathrm{Sublayer}(x))$; needs warmup. Pre-Norm (modern): $x + \mathrm{Sublayer}(\mathrm{LN}(x))$; stable for deep stacks.
- Teacher forcing / student forcing — Training the decoder with ground-truth previous tokens (teacher) vs predicted previous tokens (student). Inference is always student-forcing.
- Beam search — Maintain top-$k$ partial sequences at each step; expand and score; keep top-$k$ again. Trades compute for quality; $k = 4$–$10$ typical.
- Soft vs hard attention — Soft: continuous weighted average, differentiable. Hard: discrete sampling of one position, requires REINFORCE.
Formulas
- Bahdanau (additive) attention: $e_{t,i} = v^\top \tanh(W_s s_{t-1} + W_h h_i)$; $\alpha_{t,i} = \mathrm{softmax}_i(e_{t,i})$; $c_t = \sum_i \alpha_{t,i} h_i$.
- Scaled dot-product attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big)\, V$.
- Multi-head attention: $\mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_O$ with $\mathrm{head}_i = \mathrm{Attention}(XW_Q^i, XW_K^i, XW_V^i)$.
- Sinusoidal PE: $PE(pos, 2i) = \sin\!\big(pos / 10000^{2i/d}\big)$, $PE(pos, 2i+1) = \cos\!\big(pos / 10000^{2i/d}\big)$.
- Temperature softmax: $\mathrm{softmax}(z / \tau)$.
Derivations
**Why divide by $\sqrt{d_k}$ — the variance argument.** Assume $q, k \in \mathbb{R}^{d_k}$ have i.i.d. zero-mean, unit-variance components. Then $q \cdot k = \sum_{j=1}^{d_k} q_j k_j$ is a sum of $d_k$ i.i.d. products; each product has variance 1, so the total has variance $d_k$ and std $\sqrt{d_k}$. For $d_k = 64$, std $= 8$ — large entries push softmax into the *saturated regime* where one logit dominates and gradients flatten. Dividing by $\sqrt{d_k}$ rescales the variance back to 1, putting softmax in its useful regime.
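A quick Monte-Carlo check of the variance argument (illustrative, plain NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10_000, d_k))     # i.i.d. zero-mean, unit-variance components
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=1)             # 10k sample dot products

print(round(dots.var(), 1))                        # ~64, i.e. d_k
print(round((dots / np.sqrt(d_k)).var(), 2))       # ~1 after the 1/sqrt(d_k) rescale
```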
Why sinusoidal PEs encode relative position linearly. $PE(pos)$ uses sines and cosines at the same set of frequencies. For each frequency $\omega_i$, the angle-addition identity gives $\sin(\omega_i(pos + k)) = \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k)$ — a linear combination of $PE(pos)$'s components, with coefficients depending only on $k$. So a linear layer can recover the relative offset $k$ from the absolute PEs.
Self-attention's path-length advantage. In an RNN, position $n$ depends on position 1 through $n - 1$ sequential cell applications — $O(n)$ path length, vanishing-gradient risk. In self-attention, position $n$ attends directly to position 1 in one layer — $O(1)$ path length. Cost: $O(n^2)$ pairwise scores per layer (the bill Flash Attention pays).
Parameter count of a Transformer block. Self-attention: 4 projections ($W_Q, W_K, W_V, W_O$) each $d \times d$ → $4d^2$. MLP: $d \times 4d$ then $4d \times d$ → $8d^2$. Total per block: $\approx 12d^2$ (ignoring LayerNorm's $2d$). For $d = 512$: $\approx 3.1$M per encoder block; for $d = 768$ (ViT-B): $\approx 7.1$M.
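The same arithmetic as a two-line sanity check (illustrative):

```python
def transformer_block_params(d):
    """Approximate per-block parameter count, ignoring LayerNorm and biases."""
    attn = 4 * d * d              # W_Q, W_K, W_V, W_O
    mlp = d * 4 * d + 4 * d * d   # d -> 4d -> d
    return attn + mlp             # = 12 * d^2

print(transformer_block_params(512) / 1e6)   # ~3.1M (base Transformer width)
print(transformer_block_params(768) / 1e6)   # ~7.1M (ViT-B width)
```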
Multi-head parameter count is the same as single-head. $h$ heads, each with per-head dimension $d_k = d/h$. Per-head projections are $3 \times d \times (d/h)$. Sum over the $h$ heads: $3d^2$. Identical to one head with $d_k = d$. Heads partition the dimension; they don't multiply parameters.
Examples
- Bahdanau alignment heatmap. Translate "The agreement on the European Economic Area was signed in August 1992" to French. The $\alpha$ matrix shows near-diagonal peaks; "signé" attends to "signed", "août" attends to "August". No alignment supervision was given — the model learns it.
- **Causal mask for a 4-token target.** $M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{pmatrix}$. After softmax, the $-\infty$ positions → 0; token $i$ can only attend to tokens $1, \dots, i$.
- Cross-attention in MT. Source English "the cat"; target French "le chat". Decoder query for "le" attends most to encoder K for "the"; query for "chat" attends most to "cat". The corresponding V vectors are the encoder's contextual representations, returned to the decoder.
- Scaled dot-product worked example. $d_k = 64$, $q$ and $k$ with i.i.d. unit-variance components. Unscaled dot product: variance 64, std 8 — softmax mostly concentrates on one entry. Scaled by $\sqrt{64} = 8$: variance 1 — softmax produces a smooth distribution, gradients flow.
- Multi-head head count vs params. For $d = 768$: the $W_Q, W_K, W_V$ total $3d^2 \approx 1.8$M, regardless of whether you split into 1, 8, or 12 heads.
- Beam search example. $k = 4$. Vocab has 30k tokens; at each step expand each of 4 beams over 30k options, score 120k extensions, keep top 4. After 20 steps, return highest-scoring complete sequence. Trade 4× compute for typically +1–2 BLEU over greedy.
- Softmax temperature in DINO. Teacher uses a low $\tau$ (≈0.04) → very peaky distribution → sharp pseudo-targets. Student uses a higher $\tau$ (0.1) → softer outputs. Sharpening pushes the teacher away from collapse-to-uniform.
Diagrams
- Bahdanau attention — decoder at step $t$: alignment scores $e_{t,i}$ over all encoder positions, softmax → $\alpha_{t,i}$, weighted sum → context $c_t$, combined with decoder state $s_t$ to predict $y_t$.
- Transformer architecture — encoder stack ($N = 6$ blocks, each: self-attention + FFN) + decoder stack ($N = 6$ blocks, each: masked self-attention + cross-attention + FFN). Inputs: source + target embeddings + positional encodings.
- Scaled dot-product attention — $Q$, $K$, $V$ projections; $QK^\top / \sqrt{d_k}$; softmax along last axis; multiply by $V$; output $n \times d_k$.
- Multi-head attention — $h$ parallel heads with their own $W_Q^i, W_K^i, W_V^i$, concatenate, project with $W_O$.
- Causal mask matrix — upper triangle filled with $-\infty$; lower triangle and diagonal with 0; applied additively before softmax.
- Sinusoidal PE visualisation — heatmap of $PE(pos, i)$ over positions (rows) and embedding dimensions (columns): high-frequency sinusoids in early dimensions, low-frequency in later dimensions.
- Show-Attend-and-Tell — CNN features → spatial attention map per generated word; visualise the heatmap shifting with the caption.
Edge cases
- Long sequences are attention-bound. $O(n^2)$ memory for the attention matrix; for sequences of tens of thousands of tokens, the FP16 attention matrix can dominate the activation memory. Flash Attention reduces wall-clock without changing asymptotics.
- Padding interactions with the causal mask. A naive implementation that adds the padding mask after softmax (rather than before) leaks information from pad tokens. Always combine masks pre-softmax.
- Learned-PE extrapolation failure. Train ViT/BERT at sequence length 512; evaluate at 1024 → the model has no PE for positions 513–1024 and accuracy collapses. Mitigations: PE interpolation (ViT), RoPE, ALiBi.
- **Softmax saturation on large $d_k$.** Without the $1/\sqrt{d_k}$ scale, training stalls for any moderately large $d_k$ — gradients vanish through the saturated softmax.
- Decoder cross-attention quality is encoder-bound. A bad encoder bottlenecks the decoder; cross-attention can only retrieve information that's actually in the K, V outputs.
- Exposure bias under teacher forcing. A decoder trained only with ground-truth previous tokens may collapse at inference once it starts seeing its own (possibly wrong) outputs. Scheduled sampling and reinforcement-learning fine-tuning mitigate.
Common mistakes
- Stating multi-head attention has "$h\times$ more parameters than single-head" — no, identical. Heads partition $d$; the $W_Q, W_K, W_V$ total is $3d^2$ regardless of $h$.
- Forgetting the causal mask in the decoder's first sub-layer — leaks future tokens; the model trivially copies and fails at inference.
- Adding positional encoding after the encoder (or only after the first block) instead of to the input embeddings before block 1.
- Writing the score scaling as $1/d_k$ instead of $1/\sqrt{d_k}$ — the variance argument requires the square root.
- Saying "cross-attention has from the decoder and from the encoder" — no. from decoder, AND from encoder. (Q-K asymmetry is the Bahdanau pattern.)
- Confusing soft attention (continuous, differentiable, weighted sum) with hard attention (discrete sample, REINFORCE).
- Claiming the original Transformer used Pre-Norm — it used Post-Norm; modern Transformers (and ViT) switched to Pre-Norm for deep-stack stability.
Shortcuts
- Sub-layer count: encoder block = 2 (MSA, FFN); decoder block = 3 (masked MSA, cross-attn, FFN).
- $d_{\text{model}} = 512$, $h = 8$ heads, $d_k = 64$, $N = 6$ blocks, FFN inner dim 2048 in the base Transformer.
- Multi-head params = single-head params — heads partition $d$, don't multiply.
- **Scaling: $1/\sqrt{d_k}$, not $1/d_k$.** The variance argument requires the square root.
- Causal mask synonyms: look-ahead, autoregressive, left-to-right. Same thing.
- Cross-attention = Bahdanau attention in Q-K-V form. Memorise the connection.
- Sinusoidal PE generalises to longer sequences; learned PE does not (in the trivial implementation).
- Teacher forcing = ground-truth previous tokens (training only); student forcing = predicted previous tokens; inference is always student forcing.
Proofs / Algorithms
Self-attention is permutation-equivariant. For a permutation matrix $P$ applied to the rows of $X$: $\mathrm{Attn}(PX) = \mathrm{softmax}\!\big(P\,QK^\top P^\top / \sqrt{d_k}\big)\, PV = P\,\mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big)\, V = P\,\mathrm{Attn}(X)$. Hence shuffling inputs shuffles outputs identically — without positional encoding, the Transformer cannot tell "cat sat" from "sat cat".
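A numerical check of permutation equivariance under assumed toy shapes (illustrative NumPy):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(n)

# SelfAttn(P X) == P SelfAttn(X): shuffling the rows of X shuffles the outputs identically
print(np.allclose(self_attn(X[perm], Wq, Wk, Wv), self_attn(X, Wq, Wk, Wv)[perm]))  # True
```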
Causal-mask correctness. After adding the mask $M$ with $M_{ij} = -\infty$ for $j > i$ and softmaxing, the row-$i$ distribution has zero mass on positions $j > i$. So $\mathrm{out}_i = \sum_{j \le i} \alpha_{ij} v_j$ — output at position $i$ depends only on positions $\le i$. Autoregressive property preserved.
Sinusoidal PE: relative position is linear. For frequency $\omega$: $\sin(\omega(pos + k)) = \sin(\omega\, pos)\cos(\omega k) + \cos(\omega\, pos)\sin(\omega k)$. So $PE(pos + k)$'s sine component is a linear combination of $PE(pos)$'s sine and cosine components, with $k$-dependent coefficients. A linear layer can extract relative position from absolute PEs.