Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits
Revision Notes · Unit 12: Video Understanding

Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer


Intuition

Until now, you've extracted meaning from a single frozen rectangle of pixels. But the world doesn't hold still. A person *walking* is not the same as a person *standing*. A glass *falling* is not the same as a glass *resting*. Time is the missing dimension. A video is a stack of images — technically — but that's like saying a sentence is a stack of letters. Video understanding asks: how do you teach a network to see motion, sequence, and intent? Two halves to treat as separate exam topics: *(1)* the landscape of tasks and datasets, and *(2)* the architectural families that solve them.

Explanation

Where videos come from — the two-axis taxonomy. *Edited vs unedited* × *first-person (ego) vs third-person*. Edited third-person: movies, sports, HowTo. Unedited third-person: mobile recordings, livestreaming. First-person (ego): VLOGs, body-cam footage. *Why this matters:* methods that work on clean edited movie clips often crash on shaky GoPro footage. The type of video constrains the architecture.

The dataset chronology — four eras. *Era 1 — The Golden Years (silhouette-based):* KTH (Laptev, ICCV 2003) — six simple actions (walk, jog, run, box, hand-wave, hand-clap) on plain backgrounds; tiny. Weizmann (Gorelick, ICCV 2005) — actions as space-time shapes; also tiny. Lab-grade testbeds before deep learning. *Era 2 — Sub-100 classes:* HMDB-51 (51 classes, ICCV 2011), UCF-101 (101 classes, 2012), Something-Something (174 classes, ICCV 2017). UCF/HMDB are YouTube/movie clips. Something-Something focuses on *fine-grained interactions* ("putting something on top of something") — tests whether models understand physics rather than memorising scene context.

Era 3 — The "ImageNet for videos": Kinetics (Carreira & Zisserman, CVPR 2017) — hundreds of thousands of clips, 400/600/700 action classes depending on version. This is the answer when somebody asks "is there an ImageNet for videos?" — yes, Kinetics. Most modern video models pretrain on it. *Era 4 — Beyond classification:* AVA (Gu, CVPR 2018) — *Atomic Visual Actions*, spatio-temporally localised (every action has bbox + time interval + label across pose / person-object / person-person). Charades ("Hollywood in Homes," everyday home activity, crowdsourced). MSR-VTT (10k clips with captions). LSMDC (movie clips with audio descriptions; used for *identity-aware captioning*).

Task taxonomy — six tasks, in lecture order. *(1)* Action Classification — "is this person dancing?" One label per clip. Datasets: UCF, HMDB, Kinetics. *(2)* Temporal Action Localization — "*when* in this untrimmed video?" Return time intervals like "dance: 12.4s–18.7s". Detection in time. *(3)* Spatio-temporal Action Localization — "when AND where?" Bounding boxes that exist during specific time intervals. AVA is canonical. *(4)* Video Captioning — one NL sentence per clip (MSR-VTT, LSMDC). *(5)* Text-to-Video Retrieval — "find me a clip matching this description." Inverse of captioning; requires joint text-video embedding. *(6)* Video Situation Recognition — structured roles (agent, patient, tool).

The long-form frontier. Real videos are hours long. Dense Event Captioning (Krishna, ICCV 2017) — list events with time intervals AND captions; multiple overlapping events. Identity-aware Captioning (LSMDC) — captions that name specific characters consistently across a long video ("John walks in" not "a man walks in"); requires re-identification.

The six challenges of action understanding (Prof's list — short-answer gold). *Who* is doing the action? *When* does it start? *How long* is it? *Actions vs interactions* (one person dancing vs two)? What are the *essential components*? What is the role of the background scene? A person in a kitchen is *probably cooking* — the scene leaks the answer. A model can score 80% on Kinetics by looking only at the room. This is why Something-Something is harder — backgrounds intentionally uninformative.

Family 1 — Frame-by-frame 2D CNNs (the naïve baseline). Treat each video as 30 independent images. Run ResNet50 on each frame. Average predictions. Works surprisingly well on Kinetics — many actions are recognisable from a single frame (kitchen → cooking). But blind to motion: *walking forward* and *walking backward* look identical to this baseline. So we need temporal modelling.
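
A minimal sketch of this baseline, assuming PyTorch (the tiny backbone below is a stand-in for a pretrained ResNet50): every frame goes through the same 2D CNN, and the per-frame logits are simply averaged.

```python
import torch
import torch.nn as nn

# Stand-in 2D backbone (in practice: a pretrained ResNet50).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 101),                            # e.g. 101 UCF classes
)

video = torch.randn(2, 30, 3, 224, 224)            # (batch, frames, C, H, W)
b, t, c, h, w = video.shape
frame_logits = backbone(video.reshape(b * t, c, h, w))     # frames as one big image batch
clip_logits = frame_logits.reshape(b, t, -1).mean(dim=1)   # average over time: motion is ignored
print(clip_logits.shape)                           # torch.Size([2, 101])
```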

Family 2 — 3D Convolutions (Conv3D). Natural generalisation: a 2D conv slides a k×k kernel over an image; a 3D conv slides a t×h×w kernel over the video volume to capture spatio-temporal neighbours. Exam-bait shapes: Input: C×T×H×W. Filter bank: C_out filters of size C×t×h×w → C_out·C·t·h·w weights, C_out biases. Output: C_out×T'×H'×W', each dimension following the usual conv formula, e.g. T' = ⌊(T + 2p_t − t) / s_t⌋ + 1. C3D (Tran, ICCV 2015) first scaled this. Then I3D.
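
A shape check with PyTorch's nn.Conv3d, using illustrative sizes that match the worked example in the Examples section below: the kernel gains a temporal extent, and each output dimension follows the usual conv formula.

```python
import torch
import torch.nn as nn

# 64 filters, each spanning 3 input channels x (3 frames x 7 x 7 pixels).
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7), stride=2, padding=(1, 3, 3))

clip = torch.randn(1, 3, 16, 112, 112)             # (batch, C, T, H, W)
out = conv3d(clip)
print(out.shape)                                   # torch.Size([1, 64, 8, 56, 56])

# Parameter count: 64 * (3*3*7*7) weights + 64 biases.
print(conv3d.weight.numel(), conv3d.bias.numel())  # 28224 64
```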

Family 3 — I3D (Inflated 3D ConvNets), Carreira & Zisserman, CVPR 2017. *The clever observation:* training 3D ConvNets from scratch is hard because video datasets are smaller than ImageNet. But pretrained 2D ConvNets exist. The inflation trick: take a pretrained 2D CNN (e.g., Inception); *inflate* every 2D filter into a 3D filter by repeating the 2D weights along a new time dimension — a k×k filter becomes t×k×k, where the same 2D weights are stacked t times and then divided by t to preserve activation magnitude. Fine-tune on Kinetics. Result: warm-start with image features that already understand objects/scenes/textures; only the temporal part has to be learned. **I3D + Kinetics pretraining is the classical recipe for video classification** — like ImageNet + ResNet for images.
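
A minimal sketch of the inflation trick, assuming PyTorch; `inflate_conv` is an illustrative helper, not the official I3D code:

```python
import torch
import torch.nn as nn

def inflate_conv(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a 2D conv into a 3D conv: replicate weights along time, divide by t."""
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, kh, kw),
                       stride=(1, *conv2d.stride),
                       padding=(time_dim // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        w2d = conv2d.weight                        # (out, in, kh, kw)
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

inflated = inflate_conv(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), time_dim=3)
print(inflated.weight.shape)                       # torch.Size([64, 3, 3, 7, 7])
```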

Family 4 — Two-Stream Networks (Simonyan & Zisserman, NeurIPS 2014). A different attack: separate appearance from motion entirely. Two parallel CNNs. *Spatial stream:* takes a single RGB frame → captures appearance (what objects, what scene). *Temporal stream:* takes a stack of optical flow fields between consecutive frames → captures motion (how things move). Late fusion: average the streams' predictions.

Optical flow — exam definition. A dense pixel-level field of 2D vectors describing how each pixel moved between frame t and frame t+1. The *explicit motion signal* that frame-by-frame ConvNets miss. **Two-stream's temporal input has 2L channels** (a u map and a v map per flow field × L consecutive flow fields), not L. Why two separate streams instead of one network? Optical flow is a handcrafted feature that already isolates motion — making the network's job easier. Spatial stream learns *what's in the scene*, temporal stream learns *how it's moving*; combining gives a strong action recognizer.
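
A sketch of how the temporal stream's input is assembled, assuming PyTorch (the random tensors stand in for pre-computed flow fields): L flow fields, each a u map and a v map, are stacked into a 2L-channel input.

```python
import torch
import torch.nn as nn

L = 10
# Each flow field: a (u, v) pair of per-pixel displacements between consecutive frames.
flow_fields = [torch.randn(2, 224, 224) for _ in range(L)]
temporal_input = torch.cat(flow_fields, dim=0).unsqueeze(0)   # (1, 2L, H, W) = (1, 20, 224, 224)

# The temporal CNN's first conv therefore takes 2L input channels, not 3.
temporal_conv1 = nn.Conv2d(in_channels=2 * L, out_channels=96, kernel_size=7, stride=2)
print(temporal_conv1(temporal_input).shape)        # torch.Size([1, 96, 109, 109])
```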

Family 5 — CNN + RNN (LRCN, Donahue et al., CVPR 2015). *Long-term Recurrent Convolutional Networks.* Use a CNN as a frame encoder, then feed the sequence of frame features to an LSTM that reasons over time. Frame → CNN → feature → LSTM → action label / caption. Factors the problem: CNN handles space, RNN handles time. Excellent for variable-length outputs like captions, where fixed-clip 3D CNNs struggle.
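
A minimal CNN + LSTM sketch in the LRCN spirit, assuming PyTorch (the tiny encoder stands in for a real per-frame CNN):

```python
import torch
import torch.nn as nn

class TinyLRCN(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, n_classes=101):
        super().__init__()
        # Per-frame CNN encoder (space).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # LSTM over the sequence of frame features (time).
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.encoder(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])               # classify from the last hidden state

logits = TinyLRCN()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)                                # torch.Size([2, 101])
```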

Family 6 — SlowFast (Feichtenhofer et al., CVPR 2019). One of the cleverest video architectures. The insight: objects and motion live at different timescales. *Category of an object* (car, person, dog) changes slowly — you don't need every frame to know it's a person. *Motion* (running vs walking) changes fast — you need fine temporal resolution to distinguish them. SlowFast runs two parallel pathways: *Slow pathway* — low frame rate (e.g., 4 frames), high channel capacity → rich spatial/semantic features. *Fast pathway* — high frame rate (e.g., 32 frames), low channel capacity → motion patterns. Lateral connections let the Fast pathway inject motion info into the Slow pathway during processing. The whole thing is one network with two branches at different temporal speeds.
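
A sketch of the two-pathway idea, assuming PyTorch; channel counts follow the example later in these notes, and the lateral connection is reduced to a single time-strided conv for illustration:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 64, 224, 224)             # (B, C, T, H, W), 64-frame clip

slow_frames = clip[:, :, ::16]                     # 4 frames  -> semantics (high channels)
fast_frames = clip[:, :, ::2]                      # 32 frames -> motion    (low channels)

slow_stem = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
fast_stem = nn.Conv3d(3, 8,  kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))

slow = slow_stem(slow_frames)                      # (1, 64, 4, 112, 112)
fast = fast_stem(fast_frames)                      # (1,  8, 32, 112, 112)

# Lateral connection: a time-strided conv brings Fast features to Slow's frame rate,
# then the two are fused (here by concatenation along channels).
lateral = nn.Conv3d(8, 16, kernel_size=(5, 1, 1), stride=(8, 1, 1), padding=(2, 0, 0))
fused = torch.cat([slow, lateral(fast)], dim=1)    # (1, 64 + 16, 4, 112, 112)
print(fused.shape)
```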

Family 7 — Video Transformers. ViViT (Arnab, ICCV 2021) — Video Vision Transformer. Generalises ViT's patch tokenisation to *spatio-temporal tubes*. Instead of cutting the image into spatial patches, cut the video into 3D tubelets and treat each as a token. Two token-extraction strategies: *uniform frame sampling* (treat each frame as ViT input, concat patches) or *tubelet embedding* (a t×h×w 3D tubelet → linear projection → token, carries spatio-temporal info from the start). Standard Transformer encoder over tokens. ViViT explores different *attention factorisations*: full joint space-time (expensive) vs factorised (cheaper).
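
A sketch of tubelet embedding, assuming PyTorch and illustrative sizes: a Conv3d whose kernel and stride equal the tubelet size cuts the clip into non-overlapping tubelets and linearly projects each one to a token.

```python
import torch
import torch.nn as nn

embed_dim, tubelet = 192, (2, 16, 16)              # (t, h, w) tubelet size
to_tokens = nn.Conv3d(3, embed_dim, kernel_size=tubelet, stride=tubelet)

clip = torch.randn(1, 3, 16, 224, 224)             # (B, C, T, H, W)
tokens = to_tokens(clip)                           # (1, 192, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)         # (1, 8*14*14, 192) = (1, 1568, 192)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2)
print(encoder(tokens).shape)                       # torch.Size([1, 1568, 192])
```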

TimeSformer (Bertasius, ICML 2021). Asks: *"Is space-time attention all you need for video?"* Yes — with careful choice of *where* to attend. Four variants tested: *(a)* space-only (per-frame); *(b)* joint space-time (every token attends to every other — quadratic in the total number of space-time tokens, prohibitive); *(c)* divided space-time (temporal attention first — each spatial location attends across time — then spatial attention within each frame); *(d)* sparse local. Divided won — near-linear cost, best accuracy/efficiency. Memorise the recipe: *"divided attention — temporal then spatial, alternating."*
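
A sketch of divided space-time attention via reshaping, assuming PyTorch (a simplified block without the class token, residuals, or layer norm): temporal attention treats each spatial location as a length-T sequence, then spatial attention treats each frame as a length-S sequence.

```python
import torch
import torch.nn as nn

B, T, S, D = 2, 8, 196, 192                        # batch, frames, spatial tokens per frame, dim
x = torch.randn(B, T, S, D)

temporal_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
spatial_attn  = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

# 1) Temporal: each spatial location attends across the T frames.
xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)    # (B*S, T, D)
xt, _ = temporal_attn(xt, xt, xt)
x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3)     # back to (B, T, S, D)

# 2) Spatial: each frame attends within its own S tokens.
xs = x.reshape(B * T, S, D)
xs, _ = spatial_attn(xs, xs, xs)
x = xs.reshape(B, T, S, D)
print(x.shape)                                     # torch.Size([2, 8, 196, 192])
```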

CNNs vs Transformers — the comparison. *CNN family (I3D, SlowFast):* convolutions, pooling layers, local receptive fields, inductive bias toward locality, better with small data. *Transformer family (ViViT, TimeSformer):* patches/tubelets as tokens, self-attention across tokens, global modelling, less inductive bias, better with very large data. Same trade-off as ViT vs ResNet in the image world.

The exam-bait quiz from your slides. *"You downloaded a 3D ConvNet that expects 64-frame inputs. Your videos range from 48–240 frames. Suggest 1 idea for shorter videos and 2 ideas for longer ones."* Shorter (e.g., 48 frames): *pad* — repeat last frame, loop, or zero-pad; OR *temporally upsample* — interpolate to 64. Longer (e.g., 240 frames): *uniform subsampling* — pick every 4th frame; *sliding window* — process as multiple 64-frame chunks with stride, aggregate via mean/max pool; *random temporal crop during training + multi-crop average at test*.
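
A sketch of the three quiz answers, assuming PyTorch; the helper names are illustrative:

```python
import torch

TARGET = 64

def loop_pad(clip):                       # clip: (T, C, H, W), T < 64 -> repeat from the start
    reps = -(-TARGET // clip.shape[0])    # ceil division
    return clip.repeat(reps, 1, 1, 1)[:TARGET]

def uniform_subsample(clip):              # T > 64 -> pick 64 evenly spaced frames
    idx = torch.linspace(0, clip.shape[0] - 1, TARGET).long()
    return clip[idx]

def sliding_windows(clip, stride=32):     # T > 64 -> overlapping 64-frame chunks
    return [clip[s:s + TARGET] for s in range(0, clip.shape[0] - TARGET + 1, stride)]

short, long_ = torch.randn(48, 3, 112, 112), torch.randn(240, 3, 112, 112)
print(loop_pad(short).shape, uniform_subsample(long_).shape, len(sliding_windows(long_)))
# torch.Size([64, 3, 112, 112]) torch.Size([64, 3, 112, 112]) 6
```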

Definitions

  • Action classification: One label per trimmed clip — "is this dancing?" Datasets: UCF, HMDB, Kinetics.
  • Temporal action localisation: Return start/end intervals of actions in an untrimmed video. "Dance: 12.4s–18.7s".
  • Spatio-temporal action localisation: Bounding box AND time interval per action. AVA is canonical.
  • Kinetics: Carreira & Zisserman, CVPR 2017. 400/600/700 action classes. The ImageNet for videos — pretraining target for nearly every modern video model.
  • AVA: Atomic Visual Actions (Gu, CVPR 2018). Spatio-temporally localised — box + time interval + label across pose, person-object, person-person categories.
  • Optical flow: Per-pixel 2D vector describing motion between consecutive frames. Explicit motion signal that RGB-only models miss.
  • 3D convolution (Conv3D): Convolution with a t×h×w kernel sliding over a video volume. Captures spatio-temporal neighbours.
  • I3D (Inflated 3D ConvNet): Carreira & Zisserman, CVPR 2017. Take a 2D ImageNet CNN, inflate each k×k filter to t×k×k by replicating along time and dividing by t. Fine-tune on Kinetics.
  • Two-Stream networks: Simonyan & Zisserman, NeurIPS 2014. Parallel spatial (RGB) + temporal (optical flow, 2L channels) CNNs, late fusion.
  • LRCN: Long-term Recurrent Convolutional Networks (Donahue et al., CVPR 2015). Per-frame CNN encoder → LSTM. Good for variable-length outputs (captions).
  • SlowFast: Feichtenhofer et al., CVPR 2019. Slow pathway (low fps, high channels — semantics) + Fast pathway (high fps, low channels — motion) with lateral fusion.
  • ViViT: Video Vision Transformer (Arnab, ICCV 2021). Two token-extraction strategies: per-frame patches or 3D *tubelets* (t×h×w). Explores attention factorisations.
  • Tubelet embedding: ViViT's strategy of linearly projecting 3D spatio-temporal cubes (e.g., 2×16×16) into tokens — encodes spatio-temporal info from the start.
  • Divided space-time attention: TimeSformer's winning factorisation: each block does temporal attention (per spatial location across time) then spatial attention (per frame). Near-linear cost.
  • Dense Event Captioning: Krishna, ICCV 2017. Given a long video, output a list of events with time intervals AND captions; multiple overlapping events allowed.

Formulas

Derivations

**Why divide by t in I3D inflation.** A 2D conv output at one location is y = Σ_{i,j} w[i, j] · x[i, j]. The inflated 3D version, with the 2D weights replicated t times along time, applied to a *boring video* of a static repeating image, computes Σ_τ Σ_{i,j} w[i, j] · x[i, j] = t · y. To keep activation magnitudes matched (so the inflated 3D net on a boring video produces the same activations as the 2D net on the original image), divide the 3D weights by t. This is what makes inflation a *sensible initialisation* rather than an arbitrary scaling.

Optical flow's role as motion isolation. A 2D CNN on RGB frames must learn both *what* (objects, textures) and *how it moves* (motion patterns). Pre-computing optical flow between consecutive frames produces a representation in which the *motion* is already explicit — the network on flow inputs only has to learn which motion patterns correspond to which actions. Decomposition reduces learning burden; the spatial stream covers appearance independently. Late fusion combines the two.

SlowFast's parameter asymmetry. Slow path: T frames, C channels → cost ∝ T·C². Fast path: αT frames (α = 8), βC channels (β = 1/8) → cost ∝ αT·(βC)² = αβ²·T·C² = (1/8)·T·C². The Fast path is ~8× cheaper despite seeing 8× more frames — by carrying fewer channels. Motion is *low-dimensional* (a velocity field) — high channel capacity is wasted on it.
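
A quick numeric check of the asymmetry (plain Python, taking per-layer cost as proportional to frames × channels², per the argument above):

```python
T, C = 4, 64                    # Slow pathway: 4 frames, 64 channels
alpha, beta = 8, 1 / 8          # Fast pathway: 8x the frames, 1/8 the channels

slow_cost = T * C ** 2
fast_cost = (alpha * T) * (beta * C) ** 2
print(fast_cost / slow_cost)    # 0.125 -> the Fast pathway costs ~1/8 of the Slow pathway
```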

Divided space-time is near-linear. Joint attention over N = T·S tokens: (T·S)² pairwise scores per layer. Divided: temporal attention (each of the S spatial locations attends across T frames) costs S·T²; spatial attention (each of the T frames does its own attention over S tokens) costs T·S². Total: S·T² + T·S² = T·S·(T + S) — for T = 8, S = 196: joint ≈ 2.46M scores, divided ≈ 320k. ~8× cheaper at moderate sizes; the ratio T·S/(T + S) improves as the clip grows.
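
Plugging in the sizes used in the TimeSformer example below (T = 8 frames, S = 196 spatial tokens), a quick count of attention pairs per layer:

```python
T, S = 8, 196                          # frames, spatial tokens per frame (14 x 14)

joint   = (T * S) ** 2                 # every token attends to every other
divided = S * T ** 2 + T * S ** 2      # temporal pass + spatial pass
print(joint, divided, joint / divided) # 2458624 319872 ~7.7x cheaper
```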

Variable-length-clip strategies — the trade-offs. *Shorter via padding:* doesn't add information; safe baseline. *Shorter via interpolation:* introduces synthetic frames; can confuse motion-sensitive models. *Longer via subsampling:* fast; *loses fast motion* (e.g., a 240-frame golf swing sub-sampled to 64 frames misses the impact moment). *Longer via sliding window + aggregation:* preserves temporal resolution; multiplies inference cost by the number of windows. *Multi-crop ensemble:* highest accuracy; highest inference cost.

Examples

  • Dataset chronology in one timeline. 2003: KTH (6 classes). 2005: Weizmann. 2011: HMDB-51. 2012: UCF-101. 2016: MSR-VTT. 2017: Something-Something (174 fine-grained). 2017: Kinetics (400/600/700). 2017: Dense Event Captioning. 2018: AVA (spatio-temporal).
  • 3D conv shape calculation. Input: say 3×16×112×112 (C×T×H×W). Filter 3×3×7×7 (C×t×h×w) with 64 output channels, stride 2, padding (1, 3, 3). Output: 64×8×56×56. Per-filter weights: 3·3·7·7 = 441; total weights: 441×64 = 28,224; biases: 64.
  • I3D inflation example. An Inception block has a 3×3 conv; inflated to 3×3×3 with weights replicated 3 times along time, divided by 3. A 32-frame static-image video produces activations identical to running 2D Inception on the single image — the I3D net has been initialised to *mimic* the 2D net.
  • Two-Stream temporal input. L = 10 optical flow fields → 2L = 20 input channels to the temporal CNN. Each flow field is a (u, v) pair giving per-pixel motion between consecutive frames.
  • SlowFast on a 64-frame clip. Slow path: stride 16 → 4 frames at 64 channels. Fast path: stride 2 → 32 frames at 8 channels. After 4 stages with lateral connections, both pathways' features are concatenated for the classification head.
  • TimeSformer divided block. 8 frames × 14×14 = 1568 tokens. Temporal attention: each of 196 spatial locations attends over 8 time steps → 196·8² = 12,544 pairs. Spatial attention: each of 8 frames does its own attention over 196 tokens → 8·196² = 307,328 pairs. Total cost per layer: ≈320k pairs — vs joint 1568² ≈ 2.46M. ~8× cheaper.
  • Background bias on Kinetics. A model that classifies the *room* (kitchen, gym, court) scores ~80% top-1 on Kinetics without looking at the person. Same model on Something-Something scores ~20% because backgrounds are intentionally generic.
  • Variable-length quiz worked answer. Clip is 48 frames, model wants 64: *loop and pad* (49–64 = frames 1–16 repeated). Clip is 240 frames: *sliding windows* of 64 frames with stride 32 → 6 windows; predict on each, average softmaxes.

Diagrams

  • The two-axis dataset taxonomy. 2×2 grid: rows *edited / unedited*, columns *third-person / first-person*. Cells: movies/sports, livestreaming, edited VLOGs, body-cam.
  • Dataset chronology timeline. KTH/Weizmann (2003–05) → HMDB/UCF (2011–12) → Something-Something (2017) → Kinetics (2017) → AVA (2018).
  • 3D conv illustration. Input volume C×T×H×W; a t×h×w filter cube slides over the volume; output volume size depends on stride and padding.
  • I3D inflation. 2D k×k filter at left, arrow to 3D t×k×k filter at right, with annotation *replicate along T, divide by t*. The activation-preserving normalisation.
  • Two-Stream architecture. Top branch: RGB frame → CNN → softmax (spatial). Bottom branch: 2L-channel optical-flow stack → CNN → softmax (temporal). Late fusion: average softmaxes.
  • SlowFast pathways. Two parallel pathways: top *Slow* (4 frames, 64 channels), bottom *Fast* (32 frames, 8 channels). Lateral connections (arrows from Fast → Slow at each stage). Final concat into classification head.
  • ViViT token extraction. Two strategies side-by-side: (a) per-frame patch tokens stacked; (b) 3D tubelets (t×h×w) linearly projected to spatio-temporal tokens.
  • TimeSformer divided attention. One block: temporal MSA (token attends to its spatial twins across frames) → spatial MSA (token attends within its frame) → MLP. Annotate the S·T² + T·S² cost.
  • Sliding window for long clips. 240-frame video split into 6 overlapping 64-frame chunks with stride 32; per-chunk softmax outputs averaged into a final prediction.

Edge cases

  • Scene bias on Kinetics — temporal-shuffling sanity checks reveal models that ignore motion (shuffled-frame accuracy ≈ original-frame accuracy → relying on scene only).
  • Optical flow is expensive to pre-compute (Farneback, TV-L1). Modern methods predict flow inside the network (RAFT) or skip flow entirely (I3D, SlowFast).
  • Long video memory. Quadratic joint space-time attention is infeasible past ~30s of video at standard patch sizes. Mitigations: temporal subsampling, memory tokens, hierarchical attention.
  • Variable frame rate. A 24-fps movie and a 60-fps GoPro clip cover the same motion in different #frames; uniform subsampling must respect this or motion-sensitive models drift.
  • I3D inflation only works if the 2D pretrained net's spatial structure is preserved — depthwise-separable convs or strange topologies (e.g., NASNet) inflate poorly.
  • SlowFast lateral fusion must respect the channel-count asymmetry — fuse via a time-strided conv on the Fast path's features before adding them to the Slow path.
  • TimeSformer divided attention trades a small accuracy hit (~1%) for the speedup; for ultra-high accuracy, joint attention is still better on smaller clips.

Common mistakes

  • Treating video as just stacked images — temporal patterns (motion, sequence) are lost.
  • Confusing C3D, I3D, and SlowFast — C3D is 3D conv from scratch; I3D is INFLATION of 2D weights; SlowFast is two pathways at different fps.
  • Forgetting that **two-stream's temporal stream has 2L channels**, not L (each flow field has a u and a v map).
  • Stating divided space-time attention is purely spatial — it's *temporal first, then spatial, alternating per block*.
  • Claiming Kinetics is trimmed but AVA is untrimmed — both are trimmed clips; AVA's distinction is *spatio-temporal localisation labels* (boxes + intervals), not untrimmed video.
  • Saying SlowFast's two pathways are independent — they have lateral connections that fuse motion (Fast) into semantics (Slow) at each stage.
  • Treating ViViT and TimeSformer as the same architecture — both use spatio-temporal tokens, but TimeSformer's contribution is specifically *divided attention*; ViViT explores multiple factorisations and tubelet vs frame embedding.
  • Quoting the variable-length quiz with "just resize" — that's spatial, not temporal. The quiz is about the *time* axis.

Shortcuts

  • Six action tasks: classification, temporal localisation, spatio-temporal localisation, captioning, retrieval, situation recognition.
  • Kinetics = ImageNet for videos. AVA = spatio-temporal localisation.
  • 3D conv shape: C×T×H×W → C'×T'×H'×W'; one C×t×h×w filter per output channel.
  • I3D = inflation trick. Bridges image pretraining to video. Divide by t.
  • Two-Stream: RGB (spatial) + optical flow (temporal, 2L channels), late fusion.
  • SlowFast: Slow = high channels, low fps (semantics). Fast = low channels, high fps (motion). Lateral fusion.
  • TimeSformer winner: divided space-time attention (temporal then spatial, alternating).
  • Variable-length quiz: shorter → pad or interpolate; longer → subsample, sliding window, or multi-crop.

Proofs / Algorithms

I3D inflation preserves activations on a boring video. Let the inflated weights be w_3D[τ, i, j] = w_2D[i, j] / t for all τ = 1…t, and let the input be a *boring video* — the same image at every frame, so x[τ, i, j] = x[i, j]. Then the 3D conv output at a spatial location: Σ_τ Σ_{i,j} (w_2D[i, j] / t) · x[i, j] = (1/t) · t · Σ_{i,j} w_2D[i, j] · x[i, j] = Σ_{i,j} w_2D[i, j] · x[i, j], exactly the 2D conv output. *Identical activations.* Hence the 3D net initialised by inflation produces the same outputs as the 2D net on a static-image video — a sensible warm-start.
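
A numerical check of this argument, assuming PyTorch (inflation done inline here; sizes are illustrative): on a boring video the inflated conv reproduces the 2D activations at frames away from the temporal borders.

```python
import torch
import torch.nn as nn

t = 3
conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# Inflate: replicate the 2D weights t times along a new time axis, divide by t.
conv3d = nn.Conv3d(3, 8, kernel_size=(t, 3, 3), padding=(t // 2, 1, 1))
with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(2).repeat(1, 1, t, 1, 1) / t)
    conv3d.bias.copy_(conv2d.bias)

img = torch.randn(1, 3, 32, 32)
boring_video = img.unsqueeze(2).repeat(1, 1, 16, 1, 1)     # same image at all 16 frames

out2d = conv2d(img)                                        # (1, 8, 32, 32)
out3d = conv3d(boring_video)                               # (1, 8, 16, 32, 32)
# A frame away from the temporal borders reproduces the 2D activations exactly.
print(torch.allclose(out3d[:, :, 8], out2d, atol=1e-5))    # True
```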

Divided space-time attention asymptotic cost. With T frames and S spatial tokens per frame, total tokens N = T·S. Joint attention cost per layer ∝ N² = T²·S². Divided: the temporal step has S independent sequences of length T → cost ∝ S·T²; the spatial step has T independent sequences of length S → cost ∝ T·S². Total ∝ S·T² + T·S² = T·S·(T + S). Ratio (joint / divided) = T·S / (T + S), which grows linearly with min(T, S) — *the bigger the clip, the larger the speedup.*

Two-stream channel count. L flow fields each have a u and a v component map → 2L channels feeding the temporal CNN's first conv. With L = 10: 20 channels.