Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer
Time Comes Alive
For thirteen units, your professor has been showing you how to extract meaning from a single frozen rectangle of pixels. But the world doesn't hold still. A person walking is not the same thing as a person standing. A glass falling is not the same thing as a glass resting. Time is the missing dimension — and every model you've learned so far is blind to it.
A video is just a stack of images, sure. But that's like saying a sentence is just a stack of letters — technically true and completely missing the point. Video understanding is the question: how do you teach a network to see motion, sequence, and intent?
The unit splits cleanly into two halves, and you should treat them as separate exam topics:
- Part 1: what tasks exist in video understanding (and what datasets we use).
- Part 2: what architectures solve them.
Part 1 — The landscape of video tasks
Where videos come from
Two axes — *edited vs unedited*, and *first-person (ego) vs third-person*:
- Edited, third-person: movies, sports, HowTo videos.
- Unedited, third-person: mobile recordings, livestreaming.
- First-person (ego): VLOGs, body-cam footage.
Why does this matter? Methods that work on clean, edited movie clips often crash on shaky first-person GoPro footage. The type of video constrains the architecture.
The dataset chronology — four eras
This slide order matters. The exam may ask you to *place these datasets in chronological order*.
Era 1 — The Golden Years (silhouette-based):
- KTH (Laptev, ICCV 2003) — six simple actions on plain backgrounds.
- Weizmann (Gorelick, ICCV 2005) — actions as space-time shapes.
Lab-grade testbeds before deep learning. Tiny.
Era 2 — Sub-100 classes (the deep-learning warm-up):
- HMDB-51 (51 classes) — ICCV 2011.
- UCF-101 (101 classes) — 2012.
- Something-Something (174 classes) — ICCV 2017.
UCF and HMDB are YouTube/movie clips. Something-Something is special — fine-grained interactions like *"putting something on top of something"* — testing whether models understand physics rather than memorising scene context.
Era 3 — The "ImageNet for videos":
- Kinetics (Carreira & Zisserman, CVPR 2017) — 400/600/700 action classes. The answer when somebody asks "is there an ImageNet for videos?" — yes, Kinetics.
Era 4 — Beyond classification:
- AVA (Gu, CVPR 2018) — *Atomic Visual Actions*. Spatio-temporally localised — bounding box AND time interval AND label.
- Charades — "Hollywood in Homes," everyday activity, crowdsourced.
- MSR-VTT — 10k clips with captions.
- LSMDC — movie clips with audio descriptions; used for identity-aware captioning.
Task taxonomy — six tasks
1. Action Classification — one label per clip. *"Is this dancing?"*
2. Temporal Action Localization — *when* in an untrimmed video does the action happen? Return time intervals.
3. Spatio-temporal Action Localization — *when AND where*? Bounding boxes that exist during specific time intervals. AVA is the canonical dataset.
4. Video Captioning — describe what's happening in natural language.
5. Text-to-Video Retrieval — the inverse of captioning; learn a joint text-video embedding.
6. Video Situation Recognition — structured roles (agent, patient, tool).
The long-form frontier
Most tasks operate on short clips (5–30s). Real videos are hours long. Two long-form tasks:
- Dense Event Captioning (Krishna, ICCV 2017) — given a long video, produce a list of events with time intervals AND captions.
- Identity-aware Captioning (LSMDC) — captions that name specific characters consistently. *"John walks into the room"* not *"a man walks in"*. Requires re-identification.
The six challenges of action understanding
Great short-answer ammunition:
1. Who is doing the action?
2. When does it start?
3. How long does it last?
4. Actions vs interactions — is one person dancing different from two people dancing together?
5. What are the essential components of an action?
6. What is the role of the background scene?
That last one is a famous source of bias: a model can score 80% on Kinetics by looking at the room, never looking at the person. This is why Something-Something is harder — backgrounds are intentionally uninformative.
Part 2 — The architectures
Seven architectural families, each fixing a problem in the previous one.
Family 1 — Frame-by-frame 2D CNNs (the naïve baseline)
Treat each video as 30 independent images. Run ResNet50 on each frame. Average the predictions.
Works *surprisingly well* for some datasets, because — as we just said — many actions are recognisable from a single frame (the kitchen tells you it's cooking). But blind to motion: walking forward and walking backward look identical to this baseline.
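A minimal sketch of this baseline, assuming a recent torchvision (the ResNet-50 weights are left random here purely for the shape demo; in practice you'd load pretrained ones):

```python
import torch
import torchvision.models as models

frames = torch.randn(30, 3, 224, 224)            # a "video" treated as 30 independent RGB frames
backbone = models.resnet50(weights=None).eval()  # use pretrained weights in practice

with torch.no_grad():
    per_frame_logits = backbone(frames)          # (30, 1000): one prediction per frame
    clip_logits = per_frame_logits.mean(dim=0)   # average over time — motion is lost here

print(clip_logits.shape)                         # torch.Size([1000])
```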
Family 2 — 3D Convolutions (Conv3D)
The natural generalisation: if 2D conv slides a kernel over an image to capture spatial neighbours, 3D conv slides a kernel over a video volume.
Exam-bait shapes. For a 3D conv with C_in input channels, C_out filters, and a k_t × k_h × k_w kernel:
- Input: C_in × T × H × W (channels × frames × height × width).
- Filter: C_out × C_in × k_t × k_h × k_w weights, plus C_out biases (one per output channel).
- Output: C_out × T′ × H′ × W′, where the primed sizes follow the usual convolution arithmetic from kernel size, stride, and padding.
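A quick way to check these shapes is to instantiate a `nn.Conv3d` in PyTorch and print its weight, bias, and output tensors; the concrete numbers below (3 input channels, 64 filters, a 3×3×3 kernel, a 16-frame clip) are illustrative, not from the slides:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 112, 112)   # (batch, C_in, T, H, W): a 16-frame RGB clip
conv = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)

y = conv(x)
print(conv.weight.shape)  # torch.Size([64, 3, 3, 3, 3]) — C_out × C_in × k_t × k_h × k_w
print(conv.bias.shape)    # torch.Size([64])             — one bias per output channel
print(y.shape)            # torch.Size([1, 64, 16, 112, 112]) — T, H, W preserved thanks to padding=1
```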
C3D (Tran, ICCV 2015) was the first to scale this up. Then came I3D.
Family 3 — I3D (Inflated 3D ConvNets)
The clever observation by Carreira & Zisserman (CVPR 2017): training 3D ConvNets from scratch is hard because video datasets are smaller than ImageNet. But pretrained 2D ConvNets sitting around (Inception, ResNet) are huge.
The I3D trick — inflation:
1. Take a pretrained 2D CNN (e.g., Inception).
2. Inflate every 2D filter into a 3D filter by repeating the 2D weights along a new time dimension: a k × k filter becomes a t × k × k filter in which the same 2D weights are stacked t times, **then divided by t to preserve activation magnitude**.
3. Fine-tune on Kinetics.
Result: warm-start the 3D network with image features that already understand objects, scenes, textures — then it just has to learn the temporal part. I3D + Kinetics pretraining is the classical recipe for video classification.
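A minimal sketch of the inflation step, assuming PyTorch and a randomly initialised 2D conv standing in for a pretrained layer; the sanity check at the end uses the fact that a video of identical frames should give (away from the temporal borders) the same response as the 2D conv on one frame:

```python
import torch
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

t = 3  # temporal extent of the inflated kernel
conv3d = nn.Conv3d(64, 128, kernel_size=(t, 3, 3), padding=(1, 1, 1))

with torch.no_grad():
    # (C_out, C_in, k, k) -> (C_out, C_in, t, k, k), then rescale by 1/t
    inflated = conv2d.weight.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
    conv3d.weight.copy_(inflated)
    conv3d.bias.copy_(conv2d.bias)

# Sanity check: a "boring video" made of identical frames should respond like the 2D conv.
frame = torch.randn(1, 64, 56, 56)
video = frame.unsqueeze(2).repeat(1, 1, 8, 1, 1)          # (1, 64, 8, 56, 56)
out2d = conv2d(frame)
out3d = conv3d(video)
print((out2d - out3d[:, :, 4]).abs().max().item())        # ~0 away from the temporal borders
```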
Family 4 — Two-Stream Networks
A different attack: separate appearance from motion entirely.
Two-Stream (Simonyan & Zisserman, NeurIPS 2014) runs two parallel CNNs:
- Spatial stream: single RGB frame. Appearance (what objects, what scene).
- Temporal stream: stack of optical flow fields. Motion (how things move).
Predictions fused at the end.
Optical flow (exam definition): a dense pixel-level field of 2D vectors describing how each pixel moved between frame t and frame t+1. It is the explicit motion signal that frame-by-frame ConvNets miss. **The temporal stream's input has 2L channels** (a u and a v component per flow field × L stacked frames), not L.
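A small sketch of how that 2L-channel input is assembled (random tensors stand in for real optical flow, which in practice comes from an algorithm such as TV-L1):

```python
import torch

L, H, W = 10, 224, 224
flow = torch.randn(L, 2, H, W)              # L flow fields, each with a (u, v) channel pair

temporal_input = flow.reshape(L * 2, H, W)  # (2L, H, W) = (20, 224, 224), u/v interleaved per frame
print(temporal_input.shape)                 # the temporal stream's first conv layer must
                                            # therefore accept 2L input channels
```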
Family 5 — CNN + RNN (LRCN)
Donahue et al.'s *Long-term Recurrent Convolutional Networks* (CVPR 2015): use a CNN as a frame encoder, then feed the sequence of frame features to an LSTM that reasons over time.
Frame 1 → CNN → feature₁
Frame 2 → CNN → feature₂ ──→ LSTM → action label / caption
Frame 3 → CNN → feature₃
...
Factors the problem cleanly: CNN handles space, RNN handles time. Excellent for variable-length outputs like captions.
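A minimal LRCN-style sketch, assuming PyTorch/torchvision; `SimpleLRCN` and its hyper-parameters are illustrative names, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleLRCN(nn.Module):
    def __init__(self, num_classes=101, feat_dim=2048, hidden_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()                  # keep the 2048-d pooled frame features
        self.encoder = backbone
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, video):                        # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1))    # (B*T, 2048) — the CNN handles space
        feats = feats.view(B, T, -1)
        out, _ = self.lstm(feats)                    # the LSTM handles time
        return self.head(out[:, -1])                 # classify from the last time step

logits = SimpleLRCN()(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 101])
```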
Family 6 — SlowFast (Feichtenhofer, ICCV 2019)
One of the cleverest video architectures. The insight: objects and motion live at different timescales.
- The category of an object changes slowly — you don't need every frame to know it's a person.
- The motion changes fast — you need fine temporal resolution to distinguish running from walking.
SlowFast runs two parallel pathways:
- Slow pathway: low frame rate (4 frames), high channel capacity. Rich spatial/semantic features.
- Fast pathway: high frame rate (32 frames), low channel capacity. Motion patterns.
Lateral connections let the Fast pathway inject motion info into the Slow pathway during processing.
Memorise the two-pathway intuition — *"slow = semantics, fast = motion."*
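A sketch of the two-pathway sampling and a single lateral connection, assuming PyTorch; the strides and channel counts follow the 4-vs-32-frame intuition above, but the stems and the lateral conv are simplified stand-ins for the real architecture:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 64, 224, 224)            # (B, C, T, H, W): a 64-frame clip

slow_frames = clip[:, :, ::16]                    # 4 frames  — low temporal resolution
fast_frames = clip[:, :, ::2]                     # 32 frames — high temporal resolution

slow_stem = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
fast_stem = nn.Conv3d(3, 8,  kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))

slow_feat = slow_stem(slow_frames)                # (1, 64, 4, 112, 112)  — heavy channels
fast_feat = fast_stem(fast_frames)                # (1, 8, 32, 112, 112)  — light channels

# Lateral connection (sketch): a time-strided conv squeezes the Fast features down to the
# Slow pathway's 4 time steps so they can be fused channel-wise into the Slow pathway.
lateral = nn.Conv3d(8, 16, kernel_size=(5, 1, 1), stride=(8, 1, 1), padding=(2, 0, 0))
fused = torch.cat([slow_feat, lateral(fast_feat)], dim=1)
print(fused.shape)  # torch.Size([1, 80, 4, 112, 112])
```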
Family 7 — Video Transformers
ViViT (Arnab, ICCV 2021) — Video Vision Transformer. Generalises ViT's patch tokenisation to spatio-temporal tubelets. Instead of spatial patches, cut the video into 3D tubelets. Run a Transformer over the tokens. Explores different attention factorisations: full joint space-time (expensive) or factorised (cheaper).
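A minimal sketch of tubelet embedding, assuming PyTorch: a single `nn.Conv3d` whose kernel size equals its stride carves the clip into non-overlapping 3D tubelets and projects each one to a token, in direct analogy to ViT's patch embedding (the 2×16×16 tubelet size is illustrative):

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 16, 224, 224)            # (B, C, T, H, W)
embed = nn.Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))

tokens = embed(video)                              # (1, 768, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)         # (1, 8*14*14, 768) = 1568 space-time tokens
print(tokens.shape)                                # these tokens feed the Transformer encoder
```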
TimeSformer (Bertasius, ICML 2021) — *"Is space-time attention all you need for video?"* Yes, with a careful study of *where* to attend. The four variants tested:
- Space-only — per frame.
- Joint space-time — every token attends to every other token; cost grows quadratically in the number of space-time tokens (T × S), prohibitive.
- Divided space-time — temporal attention first (each spatial location attends across time), then spatial attention within each frame.
- Sparse local-global.
Divided won. Dramatically cheaper than joint attention while keeping most of the accuracy. Memorise the recipe: *"divided attention — temporal then spatial, alternating."*
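A compact sketch of one divided space-time attention block, assuming PyTorch's `nn.MultiheadAttention`; the reshapes are the whole trick: fold space into the batch to attend over time, then fold time into the batch to attend over space:

```python
import torch
import torch.nn as nn

B, T, S, D = 2, 8, 196, 768                         # batch, frames, patches per frame, dim
x = torch.randn(B, T, S, D)

temporal_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
spatial_attn  = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# Temporal attention: fold the spatial axis into the batch, attend over the T frames.
xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
xt, _ = temporal_attn(xt, xt, xt)
x = x + xt.reshape(B, S, T, D).permute(0, 2, 1, 3)  # residual

# Spatial attention: fold the time axis into the batch, attend over the S patches.
xs = x.reshape(B * T, S, D)
xs, _ = spatial_attn(xs, xs, xs)
x = x + xs.reshape(B, T, S, D)                      # residual

# Attention matrices are now T×T per location and S×S per frame,
# instead of one (T*S)×(T*S) matrix for joint space-time attention.
print(x.shape)  # torch.Size([2, 8, 196, 768])
```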
CNNs vs Transformers — the comparison
| 2D / 3D CNNs (SlowFast, I3D) | Transformer video models (ViViT, TimeSformer) |
| --- | --- |
| Convolutions | Patches/tubelets as tokens |
| Pooling layers | Self-attention across tokens |
| Local receptive fields | Global modelling across the image/video |
| Inductive bias toward locality | Less inductive bias, learns from data |
| Better with small data | Better with very large data |
The exam-bait quiz from your slides
*"You downloaded a 3D ConvNet that expects 64-frame inputs. Your videos range from 48–240 frames. Suggest 1 idea for shorter videos and 2 ideas for longer ones."*
Walk in with these answers ready:
Shorter than 64 frames:
- Pad the clip — repeat the last frame, loop, or zero-pad.
- Temporally upsample — interpolate intermediate frames.
Longer than 64 frames:
- Uniform subsampling — pick frames at a fixed stride (e.g., every 4th frame of a 240-frame video gives 60, then pad to 64), or pick 64 evenly spaced frames directly. Simple, but it loses fast motion.
- Sliding window — process the video as multiple 64-frame chunks (stride 32 for overlap), aggregate predictions (mean or max pool).
- Random temporal crop during training, multi-crop average at test — random 64-frame window per training step; at inference, crop multiple windows and average.
Any two of those longer-video answers get full marks.
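Hedged sketches of these strategies as PyTorch helpers operating on a `(T, C, H, W)` video tensor; the function names are mine, not from the slides:

```python
import torch

def pad_short(video, target=64):
    """Repeat the last frame until the clip reaches `target` frames."""
    T = video.shape[0]
    if T >= target:
        return video
    pad = video[-1:].repeat(target - T, 1, 1, 1)
    return torch.cat([video, pad], dim=0)

def uniform_subsample(video, target=64):
    """Pick `target` evenly spaced frames from a longer clip."""
    T = video.shape[0]
    idx = torch.linspace(0, T - 1, target).long()
    return video[idx]

def sliding_windows(video, window=64, stride=32):
    """Split a long clip into overlapping chunks; per-chunk predictions get averaged later."""
    T = video.shape[0]
    return [video[s:s + window] for s in range(0, T - window + 1, stride)]

def random_temporal_crop(video, window=64):
    """Take one random 64-frame window (training-time augmentation)."""
    T = video.shape[0]
    start = torch.randint(0, T - window + 1, (1,)).item()
    return video[start:start + window]

video = torch.randn(240, 3, 112, 112)
print(uniform_subsample(video).shape)                  # torch.Size([64, 3, 112, 112])
print(len(sliding_windows(video)))                     # 6 overlapping 64-frame chunks
print(pad_short(torch.randn(48, 3, 112, 112)).shape)   # torch.Size([64, 3, 112, 112])
```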
What you carry into the exam
The two-axis dataset taxonomy. The dataset chronology: KTH/Weizmann → HMDB/UCF/Something-Something → Kinetics → AVA. The six tasks. The six challenges (especially scene bias). The 3D conv input/filter/output shapes. The I3D inflation trick with the divide-by-t normalisation. Two-Stream: the 2L temporal input channels and the optical-flow definition. LRCN: CNN-per-frame + LSTM, good for variable-length outputs. SlowFast: slow/fast pathways at different frame rates with lateral fusion. ViViT's tubelet embedding. TimeSformer's divided attention recipe. The CNN-vs-Transformer trade-off. The variable-length-clip quiz answers.
That's Video Understanding — the most recent thing in your professor's mind, and now in yours.