VLM Architecture — Encoders, Connectors, Positional Encoding
Intuition
For most of computer vision history, *image* and *text* lived in separate universes. Vision built models that ate pixels and spit out class labels. NLP built models that ate tokens and spit out tokens. The two never talked. Then CLIP → BLIP → Flamingo → LLaVA taught a single network to fuse them. The recipe in 2024 settled into the three-pillar VLM: pretrained vision encoder, small connector, pretrained LLM. Show it a photo and ask in English. Hand it an X-ray and ask the diagnosis. Tell it *"pick up the cup"* and have a robot arm move. PaliGemma → Qwen2-VL → Gemma 4 is three generations of this architecture, each fixing the last's bottleneck.
Explanation
The Modality Gap — the fundamental problem. Text is discrete: tokens are integers in $\{1, \dots, V\}$; embedding is a table lookup $e_i = E[t_i]$; vocab $V \approx 32$k for most LLMs. Images are continuous: $x \in \mathbb{R}^{H \times W \times 3}$ with pixel values in $[0, 255]$; you cannot index a lookup table — you must run a *learned encoder*. The goal of every VLM: $f_{\text{img}}(x) \approx f_{\text{text}}(t)$ whenever $t$ describes $x$ — both modalities mapped to the same space with semantic correspondence preserved. Everything else is engineering.
The Three-Pillar Blueprint — every VLM in this unit follows it. *Vision Encoder* → *Connector / Adapter* → *LLM Backbone*. One equation to memorise verbatim: $\hat{y} = \mathrm{LLM}\big([\,W \cdot \mathrm{ViT}(x_{\text{img}})\,;\ \mathrm{Embed}(t_{\text{text}})\,]\big)$. Visual tokens and text tokens are concatenated into one sequence, then fed to a standard autoregressive LLM. *There is no special cross-attention mechanism — vanilla self-attention across the joint sequence does all the cross-modal work.* Exam-gold: if asked *"how does a VLM perform cross-modal reasoning?"*, the answer is that self-attention over the concatenated sequence handles it naturally.
PaliGemma — the canonical didactic VLM (Beyer et al., 2024). Three components: *Vision Encoder* SigLIP-So400m (~400M params, frozen in Stage 1); *Connector* a single linear layer, randomly initialised; *LLM* Gemma-2B decoder (pretrained). Total < 3B parameters. Image tokens produced: 256 / 1024 / 4096 for input resolutions 224 / 448 / 896 respectively.
SigLIP vs CLIP — the loss difference. Same setup as CLIP: two encoders $f_I, f_T$, a shared embedding space, a batch of $N$ image-caption pairs; the matched diagonal should be high-similarity, the off-diagonal low. Difference: CLIP uses a *softmax* normalised over the full batch — every pair's logit is normalised against the entire row (synchronisation cost; the loss landscape depends on batch size). SigLIP uses an independent sigmoid binary cross-entropy for every pair. Step 1: pairwise logits $z_{ij} = t \cdot \langle f_I(x_i), f_T(y_j) \rangle + b$, where $t$ is a learnable temperature and $b$ a learnable bias. Step 2: $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma(\ell_{ij}\, z_{ij})$ with $\ell_{ij} = +1$ iff $i = j$, else $-1$. SigLIP scales to arbitrary batch size without instability. *CLIP softmax → SigLIP sigmoid* is the key transition.
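A minimal sketch of the two losses on toy data (assumptions: NumPy, random unit-norm "embeddings", fixed temperature and bias, and only the image→text direction of CLIP for brevity — real CLIP averages the row and column softmaxes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                                   # batch of 4 image-caption pairs, toy embed dim
img = rng.normal(size=(N, d)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(N, d)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
t, b = 10.0, -10.0                            # learnable temperature and bias (fixed here)

z = t * img @ txt.T + b                       # pairwise logits z_ij, shape (N, N)

# CLIP: softmax cross-entropy over each row (image -> text); every cell's gradient
# depends on the whole row, so the loss shape changes with batch size N.
clip_rows = -np.log(np.exp(np.diag(z)) / np.exp(z).sum(axis=1))
clip_loss = clip_rows.mean()

# SigLIP: each of the N*N pairs is an independent binary problem
# (label +1 on the diagonal, -1 off it); the loss factorises over cells.
labels = 2 * np.eye(N) - 1                    # +1 matched, -1 unmatched
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
siglip_loss = -np.log(sigmoid(labels * z)).mean()

print(f"CLIP loss {clip_loss:.3f} | SigLIP loss {siglip_loss:.3f}")
```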
Inside SigLIP — ViT patch tokenisation. Step 1 — patch the image at $P = 14$: a $224 \times 224$ input gives $(224/14)^2 = 16 \times 16 = 256$ patches. Step 2 — embed each patch: $z_i = W_p \cdot \mathrm{flatten}(p_i) + b_p$ with $W_p \in \mathbb{R}^{1152 \times 588}$ ($14 \cdot 14 \cdot 3 = 588$), $d = 1152$ for SigLIP-So400m. Step 3 — Transformer pass: $[v_1, \dots, v_{256}] = \mathrm{ViT}([z_1, \dots, z_{256}])$. **Memorise: patch size $14$, $d = 1152$.**
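A minimal shape-check of the patchify + embed steps (assumptions: NumPy, a random stand-in for the learned projection; the ViT blocks themselves are omitted):

```python
import numpy as np

H = W = 224; P = 14; d = 1152
image = np.random.rand(H, W, 3)

# Step 1: cut into (H/P) x (W/P) = 16 x 16 = 256 non-overlapping patches.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)           # (256, 588)

# Step 2: linear patch embedding to the encoder width (weights are a random stand-in).
W_p = np.random.randn(P * P * 3, d) * 0.02
tokens = patches @ W_p                             # (256, 1152)
print(tokens.shape)                                # -> then the ViT Transformer blocks
```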
The Connector — a single linear layer. SigLIP outputs 1152-d vectors; Gemma-2B expects 2048-d token embeddings. $v'_i = W_{\text{conn}}\, v_i$ with $W_{\text{conn}} \in \mathbb{R}^{2048 \times 1152}$. **Critical exam point: $W_{\text{conn}}$ is the ONLY randomly initialised component of PaliGemma.** Everything else (SigLIP, Gemma) starts from pretrained checkpoints. The connector is *just a dimension change* that learns how to align two pretrained spaces. Result: 256 visual tokens in $\mathbb{R}^{2048}$ — the same shape as text token embeddings.
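A minimal sketch of the connector and the concatenation into one sequence (assumptions: NumPy, random stand-ins for the SigLIP features, connector weights, and prompt embeddings):

```python
import numpy as np

vision_feats = np.random.randn(256, 1152)        # stand-in for SigLIP output for one image
W_conn = np.random.randn(1152, 2048) * 0.02      # the only randomly-initialised weights
visual_tokens = vision_feats @ W_conn            # (256, 2048)

text_tokens = np.random.randn(12, 2048)          # stand-in for embedded prompt tokens
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)                            # (268, 2048) -> fed to the Gemma decoder
```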
Prefix-LM — the attention masking strategy. Sequence layout: $[\underbrace{v_1, \dots, v_{256}}_{\text{image}},\ \underbrace{p_1, \dots, p_m}_{\text{prefix}},\ \underbrace{s_1, \dots, s_n}_{\text{suffix}}]$. Image + prefix = context (e.g. "caption en"): attend *bidirectionally* — every token in this region sees every other. Suffix = the answer to generate: attend *causally* — each suffix token only sees tokens to its left. Mask: $M_{ij} = 1$ if $i, j \le 256 + m$ (bidirectional); $M_{ij} = 1$ if $i > 256 + m$ and $j \le i$ (causal); $M_{ij} = 0$ otherwise. Loss is computed ONLY on suffix tokens. Image and prompt are free context — inputs the model doesn't have to predict.
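A minimal sketch of the Prefix-LM mask (assumptions: NumPy, tiny segment sizes for readability; `True` means attention is allowed):

```python
import numpy as np

n_img, n_prefix, n_suffix = 4, 2, 3              # tiny sizes for readability
n_ctx = n_img + n_prefix
n = n_ctx + n_suffix

mask = np.zeros((n, n), dtype=bool)
mask[:n_ctx, :n_ctx] = True                      # image + prefix: every token sees every token
for i in range(n_ctx, n):                        # suffix rows: causal, see all positions <= i
    mask[i, : i + 1] = True

print(mask.astype(int))
# Prefix rows never see suffix columns; the loss is computed only on the last n_suffix positions.
```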
PaliGemma's three training stages. *Stage 1 — unimodal pretraining:* SigLIP pretrained on image-text pairs (contrastive); Gemma pretrained on text. No joint training yet. *Stage 2 — multimodal pretraining:* freeze SigLIP, train connector + LLM on a large image-text corpus (WebLI, CC12M, …). Images at $224 \times 224$ → 256 tokens. *Stage 3 — task transfer:* fine-tune on specific tasks (VQA, captioning, detection, segmentation). Unfreeze all components, use higher resolution.
"Everything is text" — PaliGemma's task design. Cleverest design choice: every task output is text, even spatial outputs. *Captioning:* prompt "caption en" → natural sentence. *VQA:* prompt "How many people are visible?" → "3". *Object detection:* prompt "detect person" → " person" — four tokens encode as normalised coordinates. *Referring segmentation:* bbox tokens + codewords from a VQ-VAE codebook. **Vocab extended with 1024 location tokens and 128 segmentation codewords.** Exam-gold: *"how does PaliGemma do detection if it only outputs text?"* → *special location tokens in the extended vocabulary.*
Qwen2-VL — fixing two PaliGemma limitations. *Problem 1 — Resolution bottleneck.* SigLIP processes at fixed size (224, 448, 896). A 3mm nodule on a chest X-ray becomes sub-pixel after resizing. Information-theoretic bound: once downsampled, the detail is irrecoverable. *Problem 2 — Aspect ratio distortion.* A 16:9 photo forced into a 1:1 grid distorts spatial relationships; receipt items appear ~40% shorter. Qwen2-VL fixes both with dynamic resolution + M-RoPE.
Naïve Dynamic Resolution — tile at native aspect ratio. Just stop resizing. Tile the image with $14 \times 14$ patches at its actual size: $n_h = \lceil H/14 \rceil$, $n_w = \lceil W/14 \rceil$, $N = n_h \cdot n_w$ — varies per image. E.g. a $448 \times 448$ image gives $32 \times 32 = 1024$ tokens; a $1344 \times 896$ image gives $96 \times 64 = 6144$ tokens. *Practical:* batch collation requires padding to the longest length (or dynamic packing — multiple small images share one sequence window). Inference cost scales with $N$ → the user controls the resolution-speed trade-off. Token budget control: $N \le N_{\max}$ (clamp; resize minimally to satisfy the budget).
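A minimal sketch of the token-budget clamp (assumptions: patch size 14, a simple "shrink both sides equally, then round the patch grid down" rule — the real resizing heuristic may differ):

```python
import math

def token_count(h, w, patch=14):
    return math.ceil(h / patch) * math.ceil(w / patch)

def fit_to_budget(h, w, max_tokens, patch=14):
    """Keep the native aspect ratio; downscale only if the patch grid exceeds the budget."""
    n = token_count(h, w, patch)
    if n <= max_tokens:
        return h, w, n
    scale = math.sqrt(max_tokens / n)                  # shrink both sides by the same factor
    gh = max(1, math.floor(h * scale / patch))         # round the grid down so the budget holds
    gw = max(1, math.floor(w * scale / patch))
    return gh * patch, gw * patch, gh * gw

print(token_count(448, 448))                           # 32 * 32 = 1024 tokens
print(fit_to_budget(4032, 3024, max_tokens=1280))      # large photo clamped under the budget
```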
1D RoPE breaks for images. Raster index $p = r \cdot n_w + c$. Two horizontally adjacent patches at $(r, c)$ and $(r, c+1)$ have indices differing by 1 — they *look adjacent* — while $(r, c)$ and $(r+1, c)$ differ by $n_w$ — they *look far apart* even though physically adjacent. 1D ordering is incompatible with 2D geometry.
2D-RoPE — encoding spatial position. Split the head dimension into two halves; apply RoPE independently to each — one for rows, one for columns. Patch at $(r, c)$: rotate the first half by angles proportional to $r$ (row position); rotate the second half by angles proportional to $c$ (column position). **The attention score now encodes the 2D relative displacement $(\Delta r, \Delta c)$.** Invariant to image size — generalises to any resolution and aspect ratio.
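A minimal sketch of the 2D-RoPE relative-position property (assumptions: NumPy, a single rotary frequency per half for brevity — real RoPE uses a spectrum of frequencies):

```python
import numpy as np

def rot(x, angle):
    """Rotate consecutive (even, odd) pairs of x by the same angle."""
    c, s = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2], out[1::2] = c * x1 - s * x2, s * x1 + c * x2
    return out

def rope_2d(x, r, c, theta=0.1):
    half = len(x) // 2
    return np.concatenate([rot(x[:half], theta * r), rot(x[half:], theta * c)])

q, k = np.random.randn(8), np.random.randn(8)
# Same relative displacement (dr = -1, dc = -2) at two different absolute positions:
s1 = rope_2d(q, 5, 7) @ rope_2d(k, 4, 5)
s2 = rope_2d(q, 9, 3) @ rope_2d(k, 8, 1)
print(np.isclose(s1, s2))   # True: the score depends only on the displacement
```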
M-RoPE — Multimodal RoPE (time × height × width). For video, extend to a *third* dimension. Split the head dim into thirds. Token at $(t, r, c)$: rotate the first third by $t$ (temporal); the second by $r$ (row); the third by $c$ (column). Edge cases: *static images* — $t$ is the same constant for all patches. *Text tokens* — the same index for all three dimensions: $t = r = c = $ token position. Why this is the cleverest part: attention between two video patches now depends on $(\Delta t, \Delta r, \Delta c)$ — spatio-temporal displacement. The model can reason *"how did the object at $(r, c)$ move from frame $t_1$ to frame $t_2$?"*
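A minimal sketch of how $(t, h, w)$ position triples are assigned to a mixed text + video sequence (assumptions: NumPy; the rotary rotations themselves are omitted and the position bookkeeping is simplified relative to the released Qwen2-VL implementation):

```python
import numpy as np

def mrope_positions(n_text, frames, grid_h, grid_w):
    """Return (t, h, w) triples for [text prompt, video patches]."""
    pos = []
    for i in range(n_text):                   # text: all three coordinates = token index
        pos.append((i, i, i))
    t0 = n_text                               # vision positions start after the text span (simplified)
    for f in range(frames):                   # temporal index per frame
        for r in range(grid_h):               # row index per patch
            for c in range(grid_w):           # column index per patch
                pos.append((t0 + f, r, c))
    return np.array(pos)

print(mrope_positions(n_text=3, frames=2, grid_h=2, grid_w=2))
# A static image is the frames == 1 case: every patch shares the same temporal index.
```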
Qwen2-VL vs PaliGemma — the architectural delta table.

|  | PaliGemma | Qwen2-VL |
| --- | --- | --- |
| Vision encoder | SigLIP-So400m (~400M) | ViT ~600M + 2D-RoPE |
| Connector | single linear layer | 2-layer MLP |
| Positional encoding | 1D-RoPE (text only) | M-RoPE $(t, h, w)$ |
| Token count | fixed 256 / 1024 / 4096 | dynamic (up to ~32k for video) |
| LLM | Gemma 2B | Qwen2 7B / 72B |

Qwen2-VL-72B matched GPT-4o on most multimodal benchmarks (Oct 2024).
Gemma 4 — close the connector seam. Even Qwen2-VL has a residual problem: vision and language are *still* trained in separate spaces before being stitched. The connector remains an information bottleneck between two latent geometries optimised for *different* objectives — SigLIP for image-text cosine similarity (contrastive); Gemma for next-text-token prediction (LM). A linear connector can only rotate and scale; a 2-layer MLP is more expressive but still a narrow bridge between two independent latent spaces. Gemma 4's answer: native multimodal training from scratch. Vision and language share Transformer blocks from early layers. Visual patches and text tokens are processed by the *same* weight matrices. No dedicated projection layer — modalities meet inside the Transformer. The lecture's framing: *stitched → woven*. Architecture variants: Gemma 4B / 12B / 27B; 128k context; 32-frame video clips inline.
The three-generation comparison table — likely exam target.

|  | PaliGemma | Qwen2-VL | Gemma 4 |
| --- | --- | --- | --- |
| Vision encoder | SigLIP-So400m (400M) | ViT + 2D-RoPE (600M) | native / integrated |
| Resolution | fixed 224/448/896 | dynamic (native AR) | dynamic, multi-scale |
| Connector | linear | 2-layer MLP | deep fusion (none) |
| Positional encoding | 1D-RoPE (text only) | M-RoPE | unified spatial RoPE |
| Position math | scalar index $p$ | triple $(t, h, w)$ | unified across modalities |
| Token count | fixed | dynamic | dynamic multi-res |
| Training | frozen encoder + tuned projection | joint tuning | native joint training |
From VLMs to VLAs — when models must act. A VLM can describe a coffee cup in detail; it cannot pick it up. Closing this gap is the point of Vision-Language-Action (VLA) models — robots driven by LLMs. The key insight: actions are just another type of token. Extended vocab $V' = V_{\text{text}} \cup V_{\text{action}}$. Each continuous action dimension is discretised into 256 bins. A 7-DOF arm → 7 action tokens per timestep, each drawn from 256 shared bin tokens. OpenVLA reuses the 256 least-frequent text tokens for the action bins to avoid expanding the embedding table.
Action tokenisation — the math. Discretise each dim with range $[a_{\min}, a_{\max}]$ and $B = 256$ bins: $k = \mathrm{clip}\!\left(\left\lfloor B \cdot \frac{a - a_{\min}}{a_{\max} - a_{\min}} \right\rfloor,\ 0,\ B-1\right)$. De-tokenise with the midpoint: $\hat{a} = a_{\min} + (k + 0.5)\,\frac{a_{\max} - a_{\min}}{B}$. The $+0.5$ centres $\hat{a}$ within the bin. **Quantisation error bounded by $\frac{a_{\max} - a_{\min}}{2B}$.**
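A minimal sketch of the bin / de-bin round trip and its quantisation error (assumptions: NumPy, an illustrative ±0.5 m range per dimension):

```python
import numpy as np

B = 256                                            # bins per action dimension

def to_bins(a, lo, hi):
    k = np.floor(B * (a - lo) / (hi - lo)).astype(int)
    return np.clip(k, 0, B - 1)                    # bin index in [0, 255]

def from_bins(k, lo, hi):
    return lo + (k + 0.5) * (hi - lo) / B          # midpoint of the bin

lo, hi = -0.5, 0.5                                 # e.g. end-effector delta range in metres
a = np.array([0.1234, -0.0071, 0.4999])            # three of the 7 action dims
k = to_bins(a, lo, hi)
a_hat = from_bins(k, lo, hi)
print(k, a_hat, np.abs(a - a_hat).max())           # max error <= (hi - lo) / (2 * B) ~ 2 mm
```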
OpenVLA architecture (Kim et al., 2024). Two vision encoders concatenated: DinoV2 (strong spatial features, good for manipulation) + SigLIP (strong semantic features, good for instruction following). An MLP projector combines them. LLaMA-2 7B backbone. Action de-tokeniser at the output: the last 7 generated tokens are read as bin indices (256 bins per dimension); argmax → bin → continuous action. Trained on 970k robot trajectories from Open-X-Embodiment; ~7B params; fine-tunable on 1–2 hours of single-task demonstrations.
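A minimal sketch of the dual-encoder fusion (assumptions: NumPy; the feature dims, patch count, and the random 2-layer MLP weights are illustrative stand-ins, not the released checkpoint's shapes):

```python
import numpy as np

n_patches = 256
dino = np.random.randn(n_patches, 1024)            # stand-in for DinoV2 spatial features
siglip = np.random.randn(n_patches, 1152)          # stand-in for SigLIP semantic features

fused = np.concatenate([dino, siglip], axis=1)     # (256, 2176): channel-wise concat per patch
W1 = np.random.randn(2176, 4096) * 0.02            # 2-layer MLP projector (random stand-in)
W2 = np.random.randn(4096, 4096) * 0.02
visual_tokens = np.maximum(fused @ W1, 0) @ W2     # (256, 4096) -> LLaMA-2 token space
print(visual_tokens.shape)
```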
VLA training objective. Initialise from a pretrained VLM (inherits language + vision understanding); fine-tune on robot trajectories with a joint loss: $\mathcal{L} = \mathcal{L}_{\text{text}} + \lambda\, \mathcal{L}_{\text{action}}$, where $\mathcal{L}_{\text{text}}$ is cross-entropy on text tokens (keeps language ability) and $\mathcal{L}_{\text{action}}$ is cross-entropy on each discretised action token.
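A minimal sketch of the joint objective on toy logits (assumptions: NumPy, an illustrative weight $\lambda$, and a vocab extended by 256 action-bin ids for clarity — OpenVLA instead reuses the 256 least-frequent text tokens):

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean CE over positions; logits (n, vocab), targets (n,) integer ids."""
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

vocab = 32_000 + 256                               # text vocab + 256 action-bin tokens
text_logits, text_tgts = np.random.randn(5, vocab), np.random.randint(0, 32_000, 5)
act_logits, act_tgts = np.random.randn(7, vocab), 32_000 + np.random.randint(0, 256, 7)

lam = 1.0                                          # illustrative weight on the action term
loss = cross_entropy(text_logits, text_tgts) + lam * cross_entropy(act_logits, act_tgts)
print(loss)
```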
The full arc — perceive → align → reason → generate → act. *Perceive:* ViT patches → feature vectors. *Align:* connector → same space as text tokens. *Reason:* LLM self-attention over the concatenated sequence with M-RoPE. *Generate:* autoregressive decoder outputs text OR action tokens. *Act:* action de-tokeniser → continuous deltas. PaliGemma / Qwen2-VL / Gemma 4 cover steps 1–4; OpenVLA / RT-2 extend to step 5.
Definitions
- Modality gap — Text is discrete (vocab → lookup); images are continuous (pixels → encoder). A VLM aligns them into a shared latent space.
- Three-pillar blueprint — Vision Encoder → Connector → LLM Backbone. Visual + text tokens concatenated, vanilla self-attention handles cross-modal reasoning.
- Connector / adapter — Small (linear or MLP) projection from vision-encoder output dim to LLM token dim. In stitched VLMs, the only randomly initialised component.
- SigLIP — Sigmoid CLIP. Pairwise binary cross-entropy with logits $z_{ij} = t\,\langle f_I(x_i), f_T(y_j)\rangle + b$. Scales without batch-wide softmax sync.
- Prefix-LM mask — Bidirectional attention over image + prompt; causal over answer. Loss computed only on answer tokens.
- Location tokens — 1024 extended-vocab tokens encoding normalised bounding-box coordinates. Detection output is pure text.
- Segmentation codewords — 128 extended-vocab tokens from a learned VQ-VAE codebook. Segmentation mask = bbox + codeword sequence decoded to pixels.
- Dynamic resolution (Qwen2-VL) — Process at native aspect ratio + resolution; tile count clamped by a max-token budget $N_{\max}$. Fixes resolution bottleneck + aspect distortion.
- 2D-RoPE — Split head dim in halves; rotate first half by row, second half by column. Attention depends on 2D relative displacement.
- M-RoPE — Multimodal RoPE (Qwen2-VL): head dim split in thirds for $(t, h, w)$ rotations. Static images: constant $t$. Text tokens: all three = token index.
- Native multimodal (Gemma 4) — Vision and language share Transformer blocks from early layers; no connector. Modalities meet inside the Transformer with shared weights.
- VLA (Vision-Language-Action) — VLM + action de-tokeniser. Continuous actions discretised into 256 bins per dimension; treated as just another token type.
- OpenVLA — DinoV2 + SigLIP image encoders + MLP connector + LLaMA-2 7B + action de-tokeniser. Trained on 970k robot trajectories. ~7B params.
Formulas
Derivations
Why CLIP softmax doesn't scale and SigLIP sigmoid does. CLIP loss for one image $i$: $\mathcal{L}_i = -\log \frac{\exp(z_{ii})}{\sum_{j=1}^{N} \exp(z_{ij})}$ — the denominator sums over *all candidates in the batch*, so the gradient for image $i$ depends on every other text $y_j$. At distributed scale, this requires an all-to-all softmax sync across GPUs every step, and the optimisation landscape changes with batch size. SigLIP replaces this with $-\frac{1}{N}\sum_{i}\sum_{j}\log\sigma(\ell_{ij} z_{ij})$ — each pair is its own independent binary problem, the loss factorises, no all-to-all sync, and batch size only affects how many *examples* the gradient sees, not the loss *shape*. SigLIP scales to arbitrarily large batches without instability.
Why Prefix-LM beats pure causal for VLMs. With *causal* masking on the entire sequence, image patches can only attend to previously processed patches — and the order is artificial (raster scan). The encoder cannot fully digest the image before the LM starts generating. With *Prefix-LM*, the image + prompt region is bidirectional (encoder-like: every patch sees every patch + every prompt token); only the answer is causal. The model gets *encoder-quality understanding of the input* AND autoregressive generation — best of both. Loss is on the answer only, so the bidirectional region doesn't violate the LM objective.
Why 1D RoPE breaks images, with concrete math. Image of $H \times W$ patches. Raster index: $p = r \cdot W + c$. The patches $(2, W-1)$ and $(3, 0)$ are raster-adjacent — $\Delta p = 1$ — yet they sit at opposite ends of the image, in different rows. The physically adjacent patches $(2, 0)$ and $(3, 0)$ (same column, neighbouring rows) have $\Delta p = W$. The 1D ordering calls the row-wrap-around pair "close" and the truly adjacent pair "far", which is geometrically nonsense. 2D-RoPE separates row and column dimensions so adjacency in either direction is encoded correctly.
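A tiny numeric check of the mismatch (assumptions: a 16-patch-wide grid, raster ordering):

```python
W = 16
raster = lambda r, c: r * W + c

print(abs(raster(2, 15) - raster(3, 0)))   # 1  -> "adjacent" in 1D, but at opposite image edges
print(abs(raster(2, 0) - raster(3, 0)))    # 16 -> "far" in 1D, but physically one patch apart
```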
Why the connector is the bottleneck in stitched VLMs. SigLIP's latent geometry was shaped by a contrastive objective: nearby vectors mean *visually similar*. Gemma's latent geometry was shaped by a language-modelling objective: nearby vectors mean *similar next-token distribution*. These geometries are not isometric. A linear map can only rotate and scale — at best, it aligns axes; it cannot remap *non-linear* semantic relationships. A 2-layer MLP is more expressive but still a narrow bridge. Gemma 4's solution: don't reconcile separate geometries — train them jointly from scratch so they share a geometry.
Quantisation error in VLA action discretisation. Bin width $\Delta = \frac{a_{\max} - a_{\min}}{256}$. Midpoint de-bin: $\hat{a} = a_{\min} + (k + 0.5)\,\Delta$. Max error: half a bin width, $\Delta / 2$. For $B = 256$ and a 1-metre arm range, $\Delta \approx 3.9$ mm and the max error is ~2 mm — fine for most manipulation.
Examples
- **PaliGemma forward pass on a $224 \times 224$ image.** Image → SigLIP-So400m → 256 patch features (each 1152-d) → linear connector → 256 visual tokens (2048-d) → concat with text → Gemma decoder under Prefix-LM mask → causal next-token loss on suffix only.
- SigLIP loss on a batch of 4. The logits $z_{ij}$ form a $4 \times 4$ matrix; the diagonal should be large, the off-diagonal small. The loss treats each of the 16 pairs independently as a binary classification (is this pair matched?). No batch-wide softmax sync.
- Detection through text. Input: image of a dog + prompt "detect dog". Output: "`<loc…><loc…><loc…><loc…>` dog". The four location tokens decode to $(y_1, x_1, y_2, x_2)$ in normalised coords. The model never outputs floats — only tokens.
- **M-RoPE for a video-frame patch at $(t, r, c)$.** Rotations are applied to the three thirds of the head dim by angles determined by $t$, $r$, $c$ respectively. Attention between this token and another at $(t', r, c)$ — same spatial position — depends only on $t - t'$, a purely temporal displacement.
- **Qwen2-VL with a token budget $N_{\max}$.** An image whose native tiling exceeds $N_{\max}$ is resized minimally until its patch grid fits the budget; an even larger image is still clamped to the same $N_{\max}$. The user controls the cost-quality trade-off explicitly.
- OpenVLA inference. Image of a mug + instruction "pick up the red mug". DinoV2 + SigLIP encode the image; the MLP projects; LLaMA-2 generates 7 action tokens (one bin index per dimension: end-effector deltas, rotation deltas, gripper). The de-tokeniser converts them to a continuous action. Repeat at 10 Hz.
- VLA bin error. 7-DOF arm; range 1 m; $B = 256$. Bin width: $1/256 \approx 3.9$ mm. Max error: ~2 mm. Fine for pick-and-place; not for surgical precision.
Diagrams
- Three-pillar VLM (PaliGemma). SigLIP (frozen, 400M) → linear connector (random init, the only trained-from-scratch piece) → Gemma decoder (pretrained, 2B). Annotate frozen vs trained vs random-init.
- Prefix-LM attention mask. An $n \times n$ matrix. Image + prefix block: fully allowed (bidirectional). Suffix block: lower-triangular allowed (causal). Cross blocks: image + prefix rows cannot attend to suffix columns; suffix rows attend to the whole prefix plus earlier suffix tokens.
- SigLIP vs CLIP loss. Side-by-side. CLIP: matrix → row softmax + col softmax → average. SigLIP: matrix → element-wise sigmoid + BCE per cell → average.
- ViT patch tokenisation. $224 \times 224$ image → $16 \times 16$ grid of $14 \times 14$ patches → flatten + linear → 256 tokens of dim 1152.
- Detection-through-text. Image + prompt "detect cat" → output sequence "`<loc…><loc…><loc…><loc…>` cat" → decode the loc tokens to a bbox.
- M-RoPE head-dim split. A 64-dim head split into three groups (e.g. 22 / 21 / 21 dims); each group rotated by its own position coordinate ($t$, $h$, or $w$).
- Three-generation evolution. Timeline: PaliGemma (linear) → Qwen2-VL (MLP + M-RoPE + dynamic) → Gemma 4 (woven, no connector).
- VLA action de-tokeniser. Last 7 logit outputs → argmax per dim → 7 bin indices → midpoint formula → continuous 7-DOF action.
- The five-stage arc. Perceive (ViT) → Align (connector) → Reason (LLM self-attention) → Generate (decoder) → Act (de-tokeniser).
Edge cases
- **PaliGemma at a fixed $224 \times 224$:** fine detail (a 3 mm nodule on an X-ray) is irrecoverably lost — downsampling is one-way.
- Aspect-ratio distortion: resizing a 16:9 photo to 1:1 stretches horizontally; receipt items appear ~40% shorter.
- Connector bottleneck: a linear map can only rotate/scale; non-linear semantic relationships across the modality gap may not survive.
- Qwen2-VL batch padding: mixed-resolution batches require padding to longest length or dynamic packing; naïve implementations waste compute.
- M-RoPE edge cases: static images use a constant temporal index $t$ for all patches; text tokens use $t = r = c = $ token index — this lets text and image positions share the rotation machinery, with degenerate spatial structure for text.
- Action quantisation error is half-bin-width at midpoint de-bin. For very fine manipulation (sub-mm), 256 bins per dim may be insufficient.
- **SigLIP bias term $b$** is needed for a sane start: with all logits near zero, $\sigma(z_{ij}) \approx 0.5$ for every pair, yet almost all pairs are negatives — the initial loss is dominated by that imbalance. Initialising $b$ to a large negative value pushes initial predictions toward "unmatched" and stabilises early training.
Common mistakes
- Saying *"CLIP and SigLIP have the same loss with different normalisation"* — SigLIP is pairwise sigmoid BCE, fundamentally independent across pairs.
- Forgetting that loss is computed ONLY on suffix tokens in PaliGemma — image+prompt tokens don't contribute to the loss.
- Treating M-RoPE as "RoPE with three encodings stacked" — it splits the head dim into thirds and rotates each by a different position.
- Claiming Qwen2-VL processes a fixed square resolution — it processes the native resolution, clamped by a max-token budget $N_{\max}$.
- Saying VLMs have a *"cross-attention mechanism"* — they don't. Vanilla self-attention over the concatenated sequence does all the cross-modal work.
- Stating PaliGemma outputs *bounding box coordinates as floats* — **it outputs tokens**; coordinates are decoded from token indices.
- Confusing prefix (image + prompt) with suffix (answer) in Prefix-LM; only the suffix is autoregressive and only it contributes to the loss.
- Stating Gemma 4 *"uses an MLP connector"* — Gemma 4 is native multimodal, no connector at all.
Shortcuts
- Three pillars to recite: Vision Encoder + Connector + LLM.
- Connector = only random-init component in stitched VLMs.
- Prefix-LM: bidirectional on prompt, causal on answer, loss on answer only.
- SigLIP: pairwise sigmoid BCE; CLIP: softmax row+col CE.
- **M-RoPE = triple $(t, h, w)$**, head dim split into thirds. Text: all three = token idx.
- Detection through text: `<loc>` tokens encode normalised $(y_1, x_1, y_2, x_2)$.
- Three generations: PaliGemma (linear, 1D-RoPE, fixed) → Qwen2-VL (MLP, M-RoPE, dynamic) → Gemma 4 (woven, native).
- VLA = VLM + action de-tokeniser; 7-DOF × 256 bins.
Proofs / Algorithms
SigLIP loss factorises across pairs. $\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\log\sigma(\ell_{ij} z_{ij})$ — each term is its own binary cross-entropy, independent of every other pair at the loss-shape level. In contrast, CLIP's softmax loss has every term depending on the partition function $\sum_{j}\exp(z_{ij})$ — adding negatives changes the loss landscape. Hence SigLIP scales to arbitrary $N$ without batch-size-dependent dynamics.
Prefix-LM preserves the autoregressive property of the suffix. For suffix positions $i > 256 + m$, the mask allows attention only to positions $j \le i$. Cross-entropy is computed only at suffix positions, where the prediction of the next suffix token depends only on tokens at positions $\le i$ (the bidirectional prefix is given, not predicted). So the conditional factorisation $\prod_i p(s_i \mid s_{<i}, \text{prefix}, \text{image})$ holds, making it a valid LM despite the bidirectional prefix region.
M-RoPE preserves spatio-temporal relative position. With the head dim split into three groups, the attention score is a sum of three RoPE inner products, one per group. Each group $g \in \{t, r, c\}$ contributes a function of $g_i - g_j$ only — hence the score depends only on the relative triple $(\Delta t, \Delta r, \Delta c)$, not absolute positions. Generalises across resolutions, aspect ratios, and video lengths.