Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer
Time Comes Alive
For thirteen units, your professor has been showing you how to extract meaning from a single frozen rectangle of pixels. But the world doesn't hold still. A person walking is not the same thing as a person standing. A glass falling is not the same thing as a glass resting. Time is the missing dimension — and every model you've learned so far is blind to it.
A video is just a stack of images, sure. But that's like saying a sentence is just a stack of letters — technically true and completely missing the point. Video understanding is the question: how do you teach a network to see motion, sequence, and intent?
The unit splits cleanly into two halves, and you should treat them as separate exam topics:
- Part 1: what tasks exist in video understanding (and what datasets we use).
- Part 2: what architectures solve them.
Part 1 — The landscape of video tasks
Where videos come from
Two axes — *edited vs unedited*, and *first-person (ego) vs third-person*:
- Edited, third-person: movies, sports, HowTo videos.
- Unedited, third-person: mobile recordings, livestreaming.
- First-person (ego): VLOGs, body-cam footage.
Why does this matter? Methods that work on clean, edited movie clips often crash on shaky first-person GoPro footage. The type of video constrains the architecture.
The dataset chronology — four eras
This slide order matters. The exam may ask you to *place these datasets in chronological order*.
Era 1 — The Golden Years (silhouette-based):
- KTH (Laptev, ICCV 2003) — six simple actions on plain backgrounds.
- Weizmann (Gorelick, ICCV 2005) — actions as space-time shapes.
Lab-grade testbeds before deep learning. Tiny.
Era 2 — Sub-100 classes (the deep-learning warm-up):
- HMDB-51 (51 classes) — ICCV 2011.
- UCF-101 (101 classes) — 2012.
- Something-Something (174 classes) — ICCV 2017.
UCF and HMDB are YouTube/movie clips. Something-Something is special — fine-grained interactions like *"putting something on top of something"* — testing whether models understand physics rather than memorising scene context.
Era 3 — The "ImageNet for videos":
- Kinetics (Carreira & Zisserman, CVPR 2017) — 400/600/700 action classes. The answer when somebody asks "is there an ImageNet for videos?" — yes, Kinetics.
Era 4 — Beyond classification:
- AVA (Gu, CVPR 2018) — *Atomic Visual Actions*. Spatio-temporally localised — bounding box AND time interval AND label.
- Charades — "Hollywood in Homes," everyday activity, crowdsourced.
- MSR-VTT — 10k clips with captions.
- LSMDC — movie clips with audio descriptions; used for identity-aware captioning.
Task taxonomy — six tasks
1. Action Classification — one label per clip. *"Is this dancing?"*
2. Temporal Action Localization — *when* in an untrimmed video does the action happen? Return time intervals.
3. Spatio-temporal Action Localization — *when AND where*? Bounding boxes that exist during specific time intervals. AVA is the canonical dataset.
4. Video Captioning — describe what's happening in natural language.
5. Text-to-Video Retrieval — the inverse of captioning; learn a joint text-video embedding.
6. Video Situation Recognition — structured roles (agent, patient, tool).
The long-form frontier
Most tasks operate on short clips (5–30s). Real videos are hours long. Two long-form tasks:
- Dense Event Captioning (Krishna, ICCV 2017) — given a long video, produce a list of events with time intervals AND captions.
- Identity-aware Captioning (LSMDC) — captions that name specific characters consistently. *"John walks into the room"* not *"a man walks in"*. Requires re-identification.
The six challenges of action understanding
Great short-answer ammunition:
1. Who is doing the action?
2. When does it start?
3. How long does it last?
4. Actions vs interactions — is one person dancing different from two people dancing together?
5. What are the essential components of an action?
6. What is the role of the background scene?
That last one is a famous source of bias: a model can score 80% on Kinetics by looking at the room, never looking at the person. This is why Something-Something is harder — backgrounds are intentionally uninformative.
Part 2 — The architectures
Seven architectural families, each fixing a problem in the previous one.
Family 1 — Frame-by-frame 2D CNNs (the naïve baseline)
Treat each video as 30 independent images. Run ResNet50 on each frame. Average the predictions.
Works *surprisingly well* for some datasets, because — as we just said — many actions are recognisable from a single frame (the kitchen tells you it's cooking). But blind to motion: walking forward and walking backward look identical to this baseline.
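A minimal sketch of this baseline, assuming a recent torchvision (the ResNet-50 weights are left random here purely for the shape demo; in practice you'd load pretrained ones):

```python
import torch
import torchvision.models as models

frames = torch.randn(30, 3, 224, 224)            # a "video" treated as 30 independent RGB frames
backbone = models.resnet50(weights=None).eval()  # use pretrained weights in practice

with torch.no_grad():
    per_frame_logits = backbone(frames)          # (30, 1000): one prediction per frame
    clip_logits = per_frame_logits.mean(dim=0)   # average over time — motion is lost here

print(clip_logits.shape)                         # torch.Size([1000])
```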
Family 2 — 3D Convolutions (Conv3D)
The natural generalisation: if 2D conv slides a kernel over an image to capture spatial neighbours, 3D conv slides a kernel over a video volume.
Exam-bait shapes. For a 3D conv with C_in input channels, C_out filters, and a k_t × k_h × k_w kernel:
- Input: C_in × T × H × W (channels × frames × height × width).
- Filter: C_out × C_in × k_t × k_h × k_w weights, plus C_out biases (one per output channel).
- Output: C_out × T′ × H′ × W′, where the primed sizes follow the usual convolution arithmetic from kernel size, stride, and padding.
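A quick way to check these shapes is to instantiate a `nn.Conv3d` in PyTorch and print its weight, bias, and output tensors; the concrete numbers below (3 input channels, 64 filters, a 3×3×3 kernel, a 16-frame clip) are illustrative, not from the slides:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 112, 112)   # (batch, C_in, T, H, W): a 16-frame RGB clip
conv = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)

y = conv(x)
print(conv.weight.shape)  # torch.Size([64, 3, 3, 3, 3]) — C_out × C_in × k_t × k_h × k_w
print(conv.bias.shape)    # torch.Size([64])             — one bias per output channel
print(y.shape)            # torch.Size([1, 64, 16, 112, 112]) — T, H, W preserved thanks to padding=1
```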
C3D (Tran, ICCV 2015) was the first to scale this up. Then came I3D.
Family 3 — I3D (Inflated 3D ConvNets)
The clever observation by Carreira & Zisserman (CVPR 2017): training 3D ConvNets from scratch is hard because video datasets are smaller than ImageNet. But pretrained 2D ConvNets sitting around (Inception, ResNet) are huge.
The I3D trick — inflation:
1. Take a pretrained 2D CNN (e.g., Inception).
2. Inflate every 2D filter into a 3D filter by repeating the 2D weights along a new time dimension: a k × k filter becomes a t × k × k filter in which the same 2D weights are stacked t times, **then divided by t to preserve activation magnitude**.
3. Fine-tune on Kinetics.
Result: warm-start the 3D network with image features that already understand objects, scenes, textures — then it just has to learn the temporal part. I3D + Kinetics pretraining is the classical recipe for video classification.
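A minimal sketch of the inflation step, assuming PyTorch and a randomly initialised 2D conv standing in for a pretrained layer; the sanity check at the end uses the fact that a video of identical frames should give (away from the temporal borders) the same response as the 2D conv on one frame:

```python
import torch
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

t = 3  # temporal extent of the inflated kernel
conv3d = nn.Conv3d(64, 128, kernel_size=(t, 3, 3), padding=(1, 1, 1))

with torch.no_grad():
    # (C_out, C_in, k, k) -> (C_out, C_in, t, k, k), then rescale by 1/t
    inflated = conv2d.weight.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
    conv3d.weight.copy_(inflated)
    conv3d.bias.copy_(conv2d.bias)

# Sanity check: a "boring video" made of identical frames should respond like the 2D conv.
frame = torch.randn(1, 64, 56, 56)
video = frame.unsqueeze(2).repeat(1, 1, 8, 1, 1)          # (1, 64, 8, 56, 56)
out2d = conv2d(frame)
out3d = conv3d(video)
print((out2d - out3d[:, :, 4]).abs().max().item())        # ~0 away from the temporal borders
```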
Family 4 — Two-Stream Networks
A different attack: separate appearance from motion entirely.
Two-Stream (Simonyan & Zisserman, NeurIPS 2014) runs two parallel CNNs:
- Spatial stream: single RGB frame. Appearance (what objects, what scene).
- Temporal stream: stack of optical flow fields. Motion (how things move).
Predictions fused at the end.
Optical flow (exam definition): a dense pixel-level field of 2D vectors describing how each pixel moved between frame t and frame t+1. It is the explicit motion signal that frame-by-frame ConvNets miss. **The temporal stream's input has 2L channels** (a u and a v component per flow field × L stacked frames), not L.
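A small sketch of how that 2L-channel input is assembled (random tensors stand in for real optical flow, which in practice comes from an algorithm such as TV-L1):

```python
import torch

L, H, W = 10, 224, 224
flow = torch.randn(L, 2, H, W)              # L flow fields, each with a (u, v) channel pair

temporal_input = flow.reshape(L * 2, H, W)  # (2L, H, W) = (20, 224, 224), u/v interleaved per frame
print(temporal_input.shape)                 # the temporal stream's first conv layer must
                                            # therefore accept 2L input channels
```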
Family 5 — CNN + RNN (LRCN)
Donahue et al.'s *Long-term Recurrent Convolutional Networks* (CVPR 2015): use a CNN as a frame encoder, then feed the sequence of frame features to an LSTM that reasons over time.
Frame 1 → CNN → feature₁
Frame 2 → CNN → feature₂ ──→ LSTM → action label / caption
Frame 3 → CNN → feature₃
...
Factors the problem cleanly: CNN handles space, RNN handles time. Excellent for variable-length outputs like captions.
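A minimal LRCN-style sketch, assuming PyTorch/torchvision; `SimpleLRCN` and its hyper-parameters are illustrative names, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleLRCN(nn.Module):
    def __init__(self, num_classes=101, feat_dim=2048, hidden_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()                  # keep the 2048-d pooled frame features
        self.encoder = backbone
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, video):                        # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1))    # (B*T, 2048) — the CNN handles space
        feats = feats.view(B, T, -1)
        out, _ = self.lstm(feats)                    # the LSTM handles time
        return self.head(out[:, -1])                 # classify from the last time step

logits = SimpleLRCN()(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 101])
```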
Family 6 — SlowFast (Feichtenhofer, ICCV 2019)
One of the cleverest video architectures. The insight: objects and motion live at different timescales.
- The category of an object changes slowly — you don't need every frame to know it's a person.
- The motion changes fast — you need fine temporal resolution to distinguish running from walking.
SlowFast runs two parallel pathways:
- Slow pathway: low frame rate (4 frames), high channel capacity. Rich spatial/semantic features.
- Fast pathway: high frame rate (32 frames), low channel capacity. Motion patterns.
Lateral connections let the Fast pathway inject motion info into the Slow pathway during processing.
Memorise the two-pathway intuition — *"slow = semantics, fast = motion."*
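A sketch of the two-pathway sampling and a single lateral connection, assuming PyTorch; the strides and channel counts follow the 4-vs-32-frame intuition above, but the stems and the lateral conv are simplified stand-ins for the real architecture:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 64, 224, 224)            # (B, C, T, H, W): a 64-frame clip

slow_frames = clip[:, :, ::16]                    # 4 frames  — low temporal resolution
fast_frames = clip[:, :, ::2]                     # 32 frames — high temporal resolution

slow_stem = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
fast_stem = nn.Conv3d(3, 8,  kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))

slow_feat = slow_stem(slow_frames)                # (1, 64, 4, 112, 112)  — heavy channels
fast_feat = fast_stem(fast_frames)                # (1, 8, 32, 112, 112)  — light channels

# Lateral connection (sketch): a time-strided conv squeezes the Fast features down to the
# Slow pathway's 4 time steps so they can be fused channel-wise into the Slow pathway.
lateral = nn.Conv3d(8, 16, kernel_size=(5, 1, 1), stride=(8, 1, 1), padding=(2, 0, 0))
fused = torch.cat([slow_feat, lateral(fast_feat)], dim=1)
print(fused.shape)  # torch.Size([1, 80, 4, 112, 112])
```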
Family 7 — Video Transformers
ViViT (Arnab, ICCV 2021) — Video Vision Transformer. Generalises ViT's patch tokenisation to spatio-temporal tubelets. Instead of spatial patches, cut the video into 3D tubelets. Run a Transformer over the tokens. Explores different attention factorisations: full joint space-time (expensive) or factorised (cheaper).
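A minimal sketch of tubelet embedding, assuming PyTorch: a single `nn.Conv3d` whose kernel size equals its stride carves the clip into non-overlapping 3D tubelets and projects each one to a token, in direct analogy to ViT's patch embedding (the 2×16×16 tubelet size is illustrative):

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 16, 224, 224)            # (B, C, T, H, W)
embed = nn.Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))

tokens = embed(video)                              # (1, 768, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)         # (1, 8*14*14, 768) = 1568 space-time tokens
print(tokens.shape)                                # these tokens feed the Transformer encoder
```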
TimeSformer (Bertasius, ICML 2021) — *"Is space-time attention all you need for video?"* Yes, with a careful study of *where* to attend. The four variants tested:
- Space-only — per frame.
- Joint space-time — every token attends to every other token; cost grows quadratically in the number of space-time tokens (T × S), prohibitive.
- Divided space-time — temporal attention first (each spatial location attends across time), then spatial attention within each frame.
- Sparse local-global.
Divided won. Dramatically cheaper than joint attention while keeping most of the accuracy. Memorise the recipe: *"divided attention — temporal then spatial, alternating."*
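A compact sketch of one divided space-time attention block, assuming PyTorch's `nn.MultiheadAttention`; the reshapes are the whole trick: fold space into the batch to attend over time, then fold time into the batch to attend over space:

```python
import torch
import torch.nn as nn

B, T, S, D = 2, 8, 196, 768                         # batch, frames, patches per frame, dim
x = torch.randn(B, T, S, D)

temporal_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
spatial_attn  = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# Temporal attention: fold the spatial axis into the batch, attend over the T frames.
xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
xt, _ = temporal_attn(xt, xt, xt)
x = x + xt.reshape(B, S, T, D).permute(0, 2, 1, 3)  # residual

# Spatial attention: fold the time axis into the batch, attend over the S patches.
xs = x.reshape(B * T, S, D)
xs, _ = spatial_attn(xs, xs, xs)
x = x + xs.reshape(B, T, S, D)                      # residual

# Attention matrices are now T×T per location and S×S per frame,
# instead of one (T*S)×(T*S) matrix for joint space-time attention.
print(x.shape)  # torch.Size([2, 8, 196, 768])
```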
CNNs vs Transformers — the comparison
| 2D / 3D CNNs (SlowFast, I3D) | Transformer video models (ViViT, TimeSformer) |
| --- | --- |
| Convolutions | Patches/tubelets as tokens |
| Pooling layers | Self-attention across tokens |
| Local receptive fields | Global modelling across the image/video |
| Inductive bias toward locality | Less inductive bias, learns from data |
| Better with small data | Better with very large data |
The exam-bait quiz from your slides
*"You downloaded a 3D ConvNet that expects 64-frame inputs. Your videos range from 48–240 frames. Suggest 1 idea for shorter videos and 2 ideas for longer ones."*
Walk in with these answers ready:
Shorter than 64 frames:
- Pad the clip — repeat the last frame, loop, or zero-pad.
- Temporally upsample — interpolate intermediate frames.
Longer than 64 frames:
- Uniform subsampling — pick frames at a fixed stride (e.g., every 4th frame of a 240-frame video gives 60, then pad to 64), or pick 64 evenly spaced frames directly. Simple, but it loses fast motion.
- Sliding window — process the video as multiple 64-frame chunks (stride 32 for overlap), aggregate predictions (mean or max pool).
- Random temporal crop during training, multi-crop average at test — random 64-frame window per training step; at inference, crop multiple windows and average.
Any two of those longer-video answers get full marks.
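Hedged sketches of these strategies as PyTorch helpers operating on a `(T, C, H, W)` video tensor; the function names are mine, not from the slides:

```python
import torch

def pad_short(video, target=64):
    """Repeat the last frame until the clip reaches `target` frames."""
    T = video.shape[0]
    if T >= target:
        return video
    pad = video[-1:].repeat(target - T, 1, 1, 1)
    return torch.cat([video, pad], dim=0)

def uniform_subsample(video, target=64):
    """Pick `target` evenly spaced frames from a longer clip."""
    T = video.shape[0]
    idx = torch.linspace(0, T - 1, target).long()
    return video[idx]

def sliding_windows(video, window=64, stride=32):
    """Split a long clip into overlapping chunks; per-chunk predictions get averaged later."""
    T = video.shape[0]
    return [video[s:s + window] for s in range(0, T - window + 1, stride)]

def random_temporal_crop(video, window=64):
    """Take one random 64-frame window (training-time augmentation)."""
    T = video.shape[0]
    start = torch.randint(0, T - window + 1, (1,)).item()
    return video[start:start + window]

video = torch.randn(240, 3, 112, 112)
print(uniform_subsample(video).shape)                  # torch.Size([64, 3, 112, 112])
print(len(sliding_windows(video)))                     # 6 overlapping 64-frame chunks
print(pad_short(torch.randn(48, 3, 112, 112)).shape)   # torch.Size([64, 3, 112, 112])
```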
What you carry into the exam
The two-axis dataset taxonomy. The dataset chronology: KTH/Weizmann → HMDB/UCF/Something-Something → Kinetics → AVA. The six tasks. The six challenges (especially scene bias). The 3D conv input/filter/output shapes. The I3D inflation trick with the divide-by-t normalisation. Two-Stream: the 2L temporal input channels and the optical-flow definition. LRCN: CNN-per-frame + LSTM, good for variable-length outputs. SlowFast: slow/fast pathways at different frame rates with lateral fusion. ViViT's tubelet embedding. TimeSformer's divided attention recipe. The CNN-vs-Transformer trade-off. The variable-length-clip quiz answers.
That's Video Understanding — the most recent thing in your professor's mind, and now in yours.