
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

VLM Architecture — Encoders, Connectors, Positional Encoding

Unit 11 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)

When Vision Meets Language

For most of computer vision history, *image* and *text* lived in separate universes. Vision researchers built models that ate pixels and spit out class labels. NLP researchers built models that ate tokens and spit out other tokens. The two never talked to each other.

Then a wave of papers — CLIP, BLIP, Flamingo, LLaVA — taught a single neural network to fuse the two. Show it a photo and ask it a question in English. Hand it an X-ray and ask *"what's the diagnosis"*. Tell it *"pick up the cup"* and have it move a robot arm. These are Multimodal Large Language Models — VLMs (vision-language) when they just describe, VLAs (vision-language-action) when they actually do things.

The lecture walks through three generations of architecture — PaliGemma → Qwen2-VL → Gemma 4 — each fixing the previous one's limits, and then extends to VLAs that close the loop with action.

Part I — Why this is hard: the Modality Gap

The fundamental problem is a mathematical mismatch between how text and images are represented:

Text is discrete. Tokens are integers in $\{1, \dots, V\}$. To get an embedding, you literally look up a row in a learned table: $e_t = E[t]$, with $E \in \mathbb{R}^{V \times d}$. Vocabulary size ~32k for most LLMs.

Images are continuous. An image is $x \in \mathbb{R}^{H \times W \times 3}$ with pixel values in $[0, 1]$. You cannot index a lookup table with a pixel value — there are infinitely many possible images. You must run a learned encoder.

So the goal of every VLM:

$$f_{\text{img}}(\text{image}) \in \mathbb{R}^{N \times d}, \qquad f_{\text{txt}}(\text{text}) \in \mathbb{R}^{M \times d} \quad \text{— one shared } \mathbb{R}^{d}$$

Both modalities mapped into the same space, with semantic correspondence preserved. That's the only thing you're doing. Everything else is engineering.
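A minimal PyTorch sketch of this asymmetry (the shapes are illustrative, not any particular model's config):

```python
# Text is a table lookup; images cannot be -- they need a learned encoder.
import torch
import torch.nn as nn

d = 2048                              # LLM embedding width
embed = nn.Embedding(32_000, d)       # text: one learned row per token id

token_ids = torch.tensor([101, 7592, 2088])   # discrete integers -> just index rows
text_emb = embed(token_ids)                   # (3, d)

image = torch.rand(3, 224, 224)       # continuous pixels in [0, 1]
# embed(image)  # impossible: you cannot index a table with real numbers.
# A learned encoder (a ViT) must MAP pixels into the same R^d space instead.
```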

The Three-Pillar VLM Blueprint

Every VLM in this unit follows this skeleton:

| Vision Encoder | Connector | LLM |
| --- | --- | --- |
| Pixels → patch feature vectors | Reshape dims to match LLM | Sequence reasoning over all tokens |
| e.g. SigLIP ViT | Linear / MLP | Gemma 2B |

One equation to memorise verbatim:

$$\text{output} = \text{LLM}\big(\,[\,W\,f_{\text{ViT}}(\text{image})\; ;\; \text{Embed}(\text{text})\,]\,\big)$$

Visual tokens and text tokens are concatenated into one sequence, then fed to a standard autoregressive LLM. There is no special "cross-attention" mechanism — vanilla self-attention across the joint sequence does all the cross-modal work. This is the key insight. If a question asks *"how does a VLM perform cross-modal reasoning,"* the answer is: *it doesn't need a separate mechanism — self-attention over the concatenated sequence handles it naturally.*

Part II — PaliGemma: the simplest VLM that works

PaliGemma (Beyer et al., 2024) is the canonical didactic VLM. Memorise its three components:

| Component | Choice | Parameters |
| --- | --- | --- |
| Vision Encoder | SigLIP-So400m | 400M, frozen in Stage 1 |
| Connector | Single linear layer, randomly initialised | small |
| LLM | Gemma-2B decoder, pretrained | 2B |
| **Total** | | < 3B |

Image tokens produced: 256 / 1024 / 4096 for input resolutions 224 / 448 / 896 respectively.

SigLIP — the vision encoder

SigLIP is a contrastive image-text encoder, conceptually identical to CLIP, with one critical mathematical difference in the loss.

Setup: two encoders, one shared space. $f_{\text{img}}$ (vision encoder, ViT-So400m, 400M params). $f_{\text{txt}}$ (text encoder, Transformer).

Difference from CLIP: CLIP uses softmax normalised over the full batch — every pair's logit is normalised against the entire row. SigLIP uses independent sigmoid binary cross-entropy for every pair.

Step 1 — pairwise logits:

$$z_{ij} = t\,\langle u_i, v_j \rangle + b, \qquad u_i = \frac{f_{\text{img}}(x_i)}{\lVert f_{\text{img}}(x_i)\rVert}, \quad v_j = \frac{f_{\text{txt}}(c_j)}{\lVert f_{\text{txt}}(c_j)\rVert}$$

where $t$ is a learnable temperature (log-parameterised for stability) and $b$ a learnable bias.

Step 2 — independent sigmoid BCE:

$$\mathcal{L}_{\text{SigLIP}} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{j=1}^{B}\log\sigma\big(y_{ij}\,z_{ij}\big)$$

with $y_{ij} = +1$ iff $i = j$, and $y_{ij} = -1$ otherwise.

Why this matters: softmax normalises across the batch — changing batch size changes the loss landscape. Sigmoid treats every pair as its own independent binary classification problem. Result: SigLIP scales to arbitrary batch size without instability. *CLIP softmax → SigLIP sigmoid* is the key transition.
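A hedged side-by-side sketch of the two losses in PyTorch — `img` and `txt` are assumed to be L2-normalised batch embeddings of shape (B, d), with `t` and `b` the learnable temperature and bias from above:

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, t):
    logits = t * img @ txt.T                       # (B, B) pairwise similarities
    labels = torch.arange(img.size(0))             # matching pair sits on the diagonal
    # softmax over each row/column couples every pair to the whole batch
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def siglip_loss(img, txt, t, b):
    logits = t * img @ txt.T + b                   # (B, B) pairwise logits z_ij
    y = 2 * torch.eye(img.size(0)) - 1             # +1 on diagonal, -1 elsewhere
    # every pair is an INDEPENDENT binary classification -> batch-size agnostic
    return -F.logsigmoid(y * logits).sum(dim=1).mean()
```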

Inside SigLIP — ViT patch tokenisation

Step 1 — patch the image. SigLIP uses $P = 14$: the image splits into $N = (H/P) \times (W/P)$ patches. For $224 \times 224$: $N = 16 \times 16 = 256$.

Step 2 — embed each patch:

$$z_i = W_e\,\text{flatten}(p_i) + \text{pos}_i \in \mathbb{R}^{D}$$

with $D = 1152$ for SigLIP-So400m.

Step 3 — Transformer:

$$\{z_1, \dots, z_N\} \;\xrightarrow{\ L \text{ encoder layers}\ }\; \{h_1, \dots, h_N\} \in \mathbb{R}^{N \times 1152}$$

**Memorise: $P = 14$, $D = 1152$.**
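A minimal sketch of the patchify-and-embed step in PyTorch, using the standard ViT trick of a strided convolution (kernel = stride = $P$); $P = 14$ and $D = 1152$ are the SigLIP numbers from above:

```python
import torch
import torch.nn as nn

P, D = 14, 1152
patch_embed = nn.Conv2d(3, D, kernel_size=P, stride=P)  # one conv = flatten + linear per patch

x = torch.rand(1, 3, 224, 224)
z = patch_embed(x)                  # (1, 1152, 16, 16) -- a 16x16 grid of patch vectors
z = z.flatten(2).transpose(1, 2)    # (1, 256, 1152): N = (224/14)^2 = 256 tokens
# a Transformer encoder then contextualises these 256 tokens (positional embeddings omitted)
```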

The Connector — a single linear layer

SigLIP outputs 1152-d vectors; Gemma-2B expects 2048-d. So:

$$v_i = W h_i, \qquad W \in \mathbb{R}^{2048 \times 1152}$$

**Critical exam point: $W$ is the ONLY randomly initialised component of PaliGemma.** Everything else starts from pretrained checkpoints. The connector is "just a dimension change" that has to learn how to align two pretrained spaces.
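A sketch of the connector plus the concatenation from the "one equation" above, assuming PaliGemma's 1152 → 2048 shapes; `llm` stands in for the Gemma-2B decoder:

```python
import torch
import torch.nn as nn

W = nn.Linear(1152, 2048)              # the ONLY randomly initialised piece of PaliGemma

vis = torch.rand(1, 256, 1152)         # SigLIP output: 256 patch tokens
txt = torch.rand(1, 12, 2048)          # embedded prompt tokens
seq = torch.cat([W(vis), txt], dim=1)  # (1, 268, 2048): one joint sequence
# out = llm(inputs_embeds=seq)         # vanilla self-attention does the cross-modal work
```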

Prefix-LM masking

The input sequence is laid out:

Image tokens | Prefix (prompt) | Suffix (answer)
  • Image + prefix = context. Attend bidirectionally — every token sees every other.
  • Suffix = the answer to generate. Attend causally — each token only sees tokens to its left.

Loss is computed only on suffix tokens. Image and prompt are free context — inputs the model doesn't have to predict.
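A sketch of how such an attention mask and loss mask could be built in PyTorch (token counts are illustrative; `True` means "may attend"):

```python
import torch

def prefix_lm_mask(n_prefix, n_suffix):
    n = n_prefix + n_suffix
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal everywhere...
    mask[:, :n_prefix] = True   # ...but every token sees the full image+prompt prefix
    return mask

mask = prefix_lm_mask(n_prefix=256 + 12, n_suffix=5)  # 256 image + 12 prompt + 5 answer
loss_mask = torch.zeros(256 + 12 + 5, dtype=torch.bool)
loss_mask[-5:] = True           # loss computed only on the suffix (answer) tokens
```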

PaliGemma's three training stages

1. Stage 1 — Unimodal pretraining. SigLIP on image-text pairs (contrastive). Gemma on text. No joint training.
2. Stage 2 — Multimodal pretraining. Freeze SigLIP, train $W$ + LLM on a large image-text corpus. Images at $224^2$ → 256 tokens.
3. Stage 3 — Task transfer. Fine-tune on specific tasks. Unfreeze all components, higher resolution.

The "everything is text" trick

PaliGemma's cleverest design choice: every task output is text — even spatial outputs.

  • Image Captioning — Prompt: "caption en". Output: natural sentence.
  • VQA — Prompt: "How many people?". Output: "3".
  • Object Detection — Prompt: "detect person". Output: "<loc0210><loc0142><loc0867><loc0821> person". The four tokens encode $(y_{\min}, x_{\min}, y_{\max}, x_{\max})$.
  • Referring Segmentation — Prompt: "segment the hammock". Output: bbox tokens + codewords.

Vocabulary extended with 1024 location tokens and 128 segmentation codewords. Exam-gold: *how does PaliGemma do detection if it only outputs text?* → special location tokens in the extended vocabulary.
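A toy sketch of the location-token encoding, assuming 1024 uniform bins over normalised coordinates; the real tokenizer's rounding may differ:

```python
def to_loc_token(coord_01: float) -> str:
    """Bin a coordinate in [0, 1] into one of 1024 tokens <loc0000>..<loc1023>."""
    k = min(int(coord_01 * 1024), 1023)
    return f"<loc{k:04d}>"

box = (0.21, 0.14, 0.86, 0.82)   # (y_min, x_min, y_max, x_max), normalised
print("".join(to_loc_token(c) for c in box) + " person")
# -> "<loc0215><loc0143><loc0880><loc0839> person" (values depend on the box)
```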

Part III — Qwen2-VL: dynamic resolution + spatial encoding

PaliGemma works, but two fixable problems:

Problem 1 — Resolution bottleneck. SigLIP processes at fixed size (224/448/896). A 3mm nodule on a chest X-ray becomes sub-pixel after resizing. Once downsampled, the detail is irrecoverable.

Problem 2 — Aspect ratio distortion. A 16:9 photo forced into a 1:1 grid distorts spatial relationships. Receipt items appear ~40% shorter.

Qwen2-VL fixes both with dynamic resolution + M-RoPE.

Naïve Dynamic Resolution — tile at native aspect ratio

Stop resizing. Tile the image with patches at its actual size:

$N = \lceil H/P \rceil \cdot \lceil W/P \rceil$ varies per image. At $448 \times 448$: $N = 32 \times 32 = 1024$. At $896 \times 448$: $N = 64 \times 32 = 2048$.

Practical implications: batch collation needs padding to the longest length (or dynamic packing). Inference cost scales with $N$ — *user controls the speed-quality trade-off*. Token budget control: clamp $N$ into $[N_{\min}, N_{\max}]$ by rescaling the image while preserving its aspect ratio (see the sketch below).
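A sketch of the token count and a budget clamp in this spirit (the real Qwen2-VL preprocessor differs in details such as patch merging; `fit_budget` is a hypothetical helper):

```python
import math

def num_tokens(h, w, P=14):
    """Patch-token count for an image kept at its native aspect ratio."""
    return math.ceil(h / P) * math.ceil(w / P)

def fit_budget(h, w, max_tokens, P=14):
    """Downscale uniformly (aspect ratio preserved) until the token budget fits."""
    n = num_tokens(h, w, P)
    if n <= max_tokens:
        return h, w
    s = math.sqrt(max_tokens / n)
    return int(h * s), int(w * s)

print(num_tokens(448, 448))                    # 1024 -- matches the fixed-res numbers above
print(fit_budget(1792, 896, max_tokens=2048))  # (896, 448) -> exactly 2048 tokens
```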

1D RoPE breakdown for images

With raster index $i = r \cdot W_p + c$, two horizontally adjacent patches at $(r, c)$ and $(r, c{+}1)$ have indices differing by 1 — they *look adjacent* — while $(r, c)$ and $(r{+}1, c)$ differ by $W_p$ — they look *far apart*, even though physically adjacent. The 1D ordering is incompatible with 2D geometry.

2D RoPE — encoding spatial position

Split the head dimension into two halves; apply RoPE independently to each — one for rows, one for columns. Patch at grid position $(r, c)$:

  • Apply RoPE with position $r$ to the first half — encodes row position.
  • Apply RoPE with position $c$ to the second half — encodes column position.

**Attention score now encodes the 2D relative displacement $(\Delta r, \Delta c)$.** Invariant to image size — generalises to any resolution and aspect ratio.

M-RoPE — Multimodal RoPE (time × height × width)

For video, extend to a third dimension. Split head dim into thirds:

Token at $(t, r, c)$:
Apply RoPE with position $t$ to the first third — temporal position.
Apply RoPE with position $r$ to the second third — row position.
Apply RoPE with position $c$ to the third third — column position.

Edge cases:

  • *Static images:* $t = 0$ for all patches.
  • *Text tokens:* the same 1D position $i$ in all three slots: $(t, r, c) = (i, i, i)$.

Why this is the cleverest part of Qwen2-VL: attention between two video patches now depends on $(\Delta t, \Delta r, \Delta c)$ — spatio-temporal displacement. The model can reason *"how did the object at $(r, c)$ move from frame $t_1$ to frame $t_2$?"*
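A simplified sketch of M-RoPE position-id construction — the real Qwen2-VL implementation offsets vision ids relative to the preceding text, which is omitted here:

```python
import numpy as np

def mrope_ids(n_text_before, grid_t, grid_h, grid_w):
    """Assign one (t, r, c) triple per token."""
    ids = [[i, i, i] for i in range(n_text_before)]  # text: t = r = c = 1D position
    for t in range(grid_t):                          # video/image patches
        for r in range(grid_h):
            for c in range(grid_w):
                ids.append([t, r, c])                # static image: grid_t = 1 -> t = 0
    return np.array(ids)                             # (n_tokens, 3)

print(mrope_ids(n_text_before=2, grid_t=1, grid_h=2, grid_w=2))
```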

Qwen2-VL vs PaliGemma — the table

| | PaliGemma | Qwen2-VL |
| --- | --- | --- |
| Vision encoder | SigLIP-So400m (400M) | ViT 600M + 2D-RoPE |
| Connector | Linear | 2-layer MLP |
| Position encoding | 1D-RoPE | M-RoPE |
| Token count | Fixed 256/1024/4096 | Dynamic |
| LLM | Gemma 2B | Qwen2 7B / 72B |

Qwen2-VL-72B matched GPT-4o on most multimodal benchmarks (October 2024).

Part IV — Gemma 4: close the connector seam

Even Qwen2-VL has a residual problem: vision and language are still trained in separate spaces before being stitched together. The connector remains an information bottleneck between two latent geometries optimised for *different* objectives:

  • SigLIP was trained for image-text cosine similarity (contrastive).
  • Gemma was trained to predict the next text token (language modelling).

A linear connector can only rotate and scale. A 2-layer MLP is more expressive but still a "narrow bridge between two independent latent spaces." Non-linear semantic relationships may not survive the projection.

Gemma 4's answer: native multimodal training from scratch. Vision and language share Transformer blocks from early layers. Visual patches and text tokens are processed by the *same* weight matrices. No dedicated projection layer — modalities meet inside the Transformer.

The lecture calls this the shift from "stitched" → "woven".

The three-generation comparison table

| Dimension | PaliGemma | Qwen2-VL | Gemma 4 |
| --- | --- | --- | --- |
| Vision Encoder | SigLIP-So400m (400M) | ViT + 2D-RoPE (600M) | Native / integrated |
| Resolution | Fixed 224/448/896 | Dynamic (native AR) | Dynamic, multi-scale |
| Connector | Linear | 2-layer MLP | Deep fusion (none) |
| Positional Encoding | 1D-RoPE (text only) | M-RoPE | Unified spatial RoPE |
| Position math | 1D index $i$ | $(t, r, c)$ triples | shared $(t, r, c)$ across modalities |
| Token Count | Fixed 256/1024/4096 | Dynamic | Dynamic, multi-res |
| LLM Backbone | Gemma-2B | Qwen2 7B / 72B | Gemma 4B–27B |
| Training Paradigm | Frozen enc + tuned proj | Joint tuning | Native joint training |

Part V — From VLMs to VLAs: when models must act

A VLM can describe a coffee cup in precise detail. It cannot pick it up. Closing this gap is the point of Vision-Language-Action (VLA) models — robots driven by LLMs.

The key insight: actions are just another type of token

VLM vocabulary: $\mathcal{V}_{\text{text}}$ (~32k text tokens).
VLA vocabulary: $\mathcal{V}_{\text{text}} \cup \mathcal{V}_{\text{action}}$ — the same text tokens plus discretised action-bin tokens.

Each continuous action dimension is discretised into 256 bins, so a 7-DOF arm action becomes a sequence of 7 bin tokens (7 × 256 = 1792 dimension-bin pairs, but only 256 distinct bin tokens are needed, since each dimension emits one token in turn). OpenVLA reuses the 256 least-frequent text tokens for this to avoid expanding the embedding table.

Action tokenisation — the math

Tokenise a dimension value $a \in [a_{\min}, a_{\max}]$ into one of $K = 256$ bins:

$$k = \operatorname{clip}\!\left(\left\lfloor \frac{a - a_{\min}}{a_{\max} - a_{\min}} \cdot K \right\rfloor,\ 0,\ K{-}1\right)$$

De-tokenise with midpoint:

$$\hat{a} = a_{\min} + \frac{k + 0.5}{K}\,(a_{\max} - a_{\min})$$

The $+0.5$ centres $\hat a$ within the bin. Quantisation error bounded by $\frac{a_{\max} - a_{\min}}{2K}$.
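A direct sketch of the bin/de-bin math in NumPy, with assumed per-dimension bounds $a_{\min}, a_{\max}$ (in practice, statistics of the training trajectories):

```python
import numpy as np

def tokenize(a, a_min, a_max, K=256):
    """Uniform binning: continuous action -> bin index in {0, ..., K-1}."""
    k = np.floor((a - a_min) / (a_max - a_min) * K).astype(int)
    return np.clip(k, 0, K - 1)

def detokenize(k, a_min, a_max, K=256):
    """Midpoint de-binning: error bounded by (a_max - a_min) / (2K)."""
    return a_min + (k + 0.5) / K * (a_max - a_min)

a = np.array([0.03, -0.11, 0.42, 0.0, 0.0, 0.0, 1.0])  # a 7-DOF action
k = tokenize(a, a_min=-1.0, a_max=1.0)
print(k, detokenize(k, -1.0, 1.0))
```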

OpenVLA architecture (Kim et al., 2024)

A specific exemplar, with one twist worth knowing:

  • Two vision encoders concatenated: DinoV2 (strong spatial features, good for manipulation) + SigLIP (strong semantic features, good for instruction following).
  • MLP projector combines them.
  • LLaMA-2 7B backbone.
  • Action de-tokeniser at the output: the last 7 generated tokens each carry logits over the 256 action bins, one token per DOF; argmax → bin → continuous action.

VLA training objective

Initialise from a pretrained VLM (inherits language + vision understanding), then fine-tune on robot trajectories with a joint loss:

$$\mathcal{L} = -\sum_{t}\log p_\theta\big(y_t \mid y_{<t},\, \text{image},\, \text{instruction}\big), \qquad y_t \in \mathcal{V}_{\text{text}} \cup \mathcal{V}_{\text{action}}$$

OpenVLA was trained on 970k robot trajectories from Open-X-Embodiment, ~7B params, fine-tunable on 1–2 hours of single-task demonstrations.

The full arc — perceive → align → reason → generate → act

👁 Perceive — ViT patches the image into feature vectors $\{h_i\} \in \mathbb{R}^{N \times D}$.
Align — Connector $W$: $h_i \mapsto v_i$, same space as text.
🧠 Reason — LLM attends over the joint sequence $[v_{1..N};\, t_{1..M}]$ with M-RoPE geometry.
Generate — Autoregressive decoder outputs text OR action tokens.
🤖 Act — Action de-tokeniser maps bins to continuous $\hat a \in \mathbb{R}^{7}$.

PaliGemma / Qwen2-VL / Gemma 4 cover steps 1–4. OpenVLA / RT-2 extend to step 5.

What you carry into the exam

  • The Modality Gap (text discrete, images continuous).
  • The Three-Pillar Blueprint and the single equation — no special cross-attention; vanilla self-attention over the concatenated sequence.
  • PaliGemma's three components and < 3B parameter count.
  • SigLIP vs CLIP — pairwise sigmoid vs softmax over batch.
  • ViT patch math: $P = 14$, $N = (H/P)(W/P)$, $D = 1152$.
  • The connector $W$ is the ONLY random-init piece.
  • Prefix-LM mask: bidirectional on image+prompt, causal on answer, loss on answer only.
  • PaliGemma's three stages (unimodal → multimodal → task transfer).
  • "Everything is text" with <loc> tokens for detection and segmentation codewords.
  • Qwen2-VL's dynamic resolution and M-RoPE.
  • The breakdown of 1D RoPE for images.
  • M-RoPE edge cases (static images: $t = 0$; text: $t = r = c$).
  • The three-generation comparison table.
  • Gemma 4 stitched → woven.
  • VLAs: actions as tokens, 7-DOF × 256 bins, bin/de-bin math, joint loss.
  • OpenVLA's DinoV2 + SigLIP + LLaMA-2 stack.
  • The five-stage perceive-align-reason-generate-act arc.

That's the heaviest unit in the course. You now have every equation the exam can possibly ask you to write.