
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer

Unit 12 — Video Understanding

Time Comes Alive

For eleven units, your professor has been showing you how to extract meaning from a single frozen rectangle of pixels. But the world doesn't hold still. A person walking is not the same thing as a person standing. A glass falling is not the same thing as a glass resting. Time is the missing dimension — and every model you've learned so far is blind to it.

A video is just a stack of images, sure. But that's like saying a sentence is just a stack of letters — technically true and completely missing the point. Video understanding is the question: how do you teach a network to see motion, sequence, and intent?

The unit splits cleanly into two halves, and you should treat them as separate exam topics:

  • Part 1: what tasks exist in video understanding (and what datasets we use).
  • Part 2: what architectures solve them.

Part 1 — The landscape of video tasks

Where videos come from

Two axes — *edited vs unedited*, and *first-person (ego) vs third-person*:

  • Edited, third-person: movies, sports, HowTo videos.
  • Unedited, third-person: mobile recordings, livestreaming.
  • First-person (ego): VLOGs, body-cam footage.

Why does this matter? Methods that work on clean, edited movie clips often crash on shaky first-person GoPro footage. The type of video constrains the architecture.

The dataset chronology — four eras

This slide order matters. The exam may ask you to *place these datasets in chronological order*.

Era 1 — The Golden Years (silhouette-based):

  • KTH (Schüldt, Laptev & Caputo, ICPR 2004) — six simple actions on plain backgrounds.
  • Weizmann (Gorelick, ICCV 2005) — actions as space-time shapes.

Lab-grade testbeds before deep learning. Tiny.

Era 2 — Sub-100 classes (the deep-learning warm-up):

  • HMDB-51 (51 classes) — ICCV 2011.
  • UCF-101 (101 classes) — 2012.
  • Something-Something (174 classes) — ICCV 2017.

UCF and HMDB are YouTube/movie clips. Something-Something is special — fine-grained interactions like *"putting something on top of something"* — testing whether models understand physics rather than memorising scene context.

Era 3 — The "ImageNet for videos":

  • Kinetics (Carreira & Zisserman, CVPR 2017) — 400/600/700 action classes. The answer when somebody asks "is there an ImageNet for videos?" — yes, Kinetics.

Era 4 — Beyond classification:

  • AVA (Gu, CVPR 2018) — *Atomic Visual Actions*. Spatio-temporally localised — bounding box AND time interval AND label.
  • Charades — "Hollywood in Homes," everyday activity, crowdsourced.
  • MSR-VTT — 10k clips with captions.
  • LSMDC — movie clips with audio descriptions; used for identity-aware captioning.

Task taxonomy — six tasks

1. Action Classification — one label per clip. *"Is this dancing?"*
2. Temporal Action Localization — *when* in this untrimmed video does the action happen? Return time intervals.
3. Spatio-temporal Action Localization — *when AND where*? Bounding boxes during specific time intervals. AVA is canonical.
4. Video Captioning — describe what's happening in natural language.
5. Text-to-Video Retrieval — the inverse of captioning; a joint text-video embedding.
6. Video Situation Recognition — structured roles (agent, patient, tool).

The long-form frontier

Most tasks operate on short clips (5–30s). Real videos are hours long. Two long-form tasks:

  • Dense Event Captioning (Krishna, ICCV 2017) — given a long video, produce a list of events with time intervals AND captions.
  • Identity-aware Captioning (LSMDC) — captions that name specific characters consistently. *"John walks into the room"* not *"a man walks in"*. Requires re-identification.

The six challenges of action understanding

Great short-answer ammunition:

1. Who is doing the action?
2. When does it start?
3. How long is it?
4. Actions vs interactions — one person dancing vs two people dancing together?
5. What are the essential components of an action?
6. What is the role of the background scene?

That last one is a famous source of bias: a model can score 80% on Kinetics by looking at the room, never looking at the person. This is why Something-Something is harder — backgrounds are intentionally uninformative.

Part 2 — The architectures

Five architectural families, each fixing a problem in the previous one.

Family 1 — Frame-by-frame 2D CNNs (the naïve baseline)

Treat each video as 30 independent images. Run ResNet50 on each frame. Average the predictions.

Works *surprisingly well* for some datasets, because — as we just said — many actions are recognisable from a single frame (the kitchen tells you it's cooking). But blind to motion: walking forward and walking backward look identical to this baseline.
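
A minimal PyTorch sketch of this baseline, assuming torchvision's ImageNet-pretrained ResNet-50; in practice you would swap the 1000-way ImageNet head for your action classes.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

# ImageNet-pretrained 2D backbone; a real setup would replace the final
# fc layer with an action-classification head.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

def classify_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W), already resized and normalised."""
    with torch.no_grad():
        logits = model(frames)          # the T frames are treated as a batch
        probs = logits.softmax(dim=-1)  # per-frame class probabilities
    return probs.mean(dim=0)            # average over time -> one prediction
```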

Family 2 — 3D Convolutions (Conv3D)

The natural generalisation: if 2D conv slides a kernel over an image to capture spatial neighbours, 3D conv slides a kernel over a video volume.

Exam-bait shapes:

Input: C_in × T × H × W
Filter: C_out × C_in × t × h × w weights, C_out biases
Output: C_out × T′ × H′ × W′ (T′, H′, W′ depend on stride and padding)
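
You can sanity-check these shapes directly in PyTorch; the sizes below (3 input channels, 16 frames, 64 filters, a 3×3×3 kernel) are illustrative.

```python
import torch
import torch.nn as nn

conv = nn.Conv3d(in_channels=3, out_channels=64,
                 kernel_size=(3, 3, 3), padding=1)  # t x h x w kernel

x = torch.randn(1, 3, 16, 112, 112)  # (N, C_in, T, H, W)
y = conv(x)

print(conv.weight.shape)  # torch.Size([64, 3, 3, 3, 3]) = C_out x C_in x t x h x w
print(conv.bias.shape)    # torch.Size([64])             = C_out
print(y.shape)            # torch.Size([1, 64, 16, 112, 112]) with padding=1
```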

C3D (Tran, ICCV 2015) was the first to scale this up. Then came I3D.

Family 3 — I3D (Inflated 3D ConvNets)

The clever observation by Carreira & Zisserman (CVPR 2017): training 3D ConvNets from scratch is hard because video datasets are smaller than ImageNet. But huge pretrained 2D ConvNets (Inception, ResNet) are already sitting around.

The I3D trick — inflation:

1. Take a pretrained 2D CNN (e.g., Inception).
2. Inflate every 2D filter into a 3D filter by repeating the 2D weights along a new time dimension. A k × k filter becomes a t × k × k filter where the same 2D weights are stacked t times, **then divided by t to preserve activation magnitude**.
3. Fine-tune on Kinetics.
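
A minimal sketch of step 2, assuming a plain `nn.Conv2d` layer with tuple padding; stride and dilation are omitted for brevity.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int) -> nn.Conv3d:
    """Inflate a pretrained 2D filter: stack it t times, divide by t."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       padding=(time_dim // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    # (C_out, C_in, h, w) -> (C_out, C_in, t, h, w), divided by t so a
    # temporally-constant input produces the same activations as the 2D net
    w2d = conv2d.weight.data
    conv3d.weight.data = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    if conv2d.bias is not None:
        conv3d.bias.data = conv2d.bias.data.clone()
    return conv3d
```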

Result: warm-start the 3D network with image features that already understand objects, scenes, textures — then it just has to learn the temporal part. I3D + Kinetics pretraining is the classical recipe for video classification.

Family 4 — Two-Stream Networks

A different attack: separate appearance from motion entirely.

Two-Stream (Simonyan & Zisserman, NeurIPS 2014) runs two parallel CNNs:

  • Spatial stream: single RGB frame. Appearance (what objects, what scene).
  • Temporal stream: stack of optical flow fields. Motion (how things move).

Predictions fused at the end.

Optical flow (exam definition): a dense pixel-level field of 2D vectors describing how each pixel moved between frame t and frame t+1. It is the explicit motion signal that frame-by-frame ConvNets miss. **The temporal stream's input has 2L channels** (u, v per flow field × L frames), not L.
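
A minimal sketch of the two input stems, assuming L = 10 stacked flow fields; only the first conv of each stream is shown, and fusion (e.g. averaging softmax scores) happens at the very end.

```python
import torch
import torch.nn as nn

L = 10                                  # number of stacked flow fields
rgb  = torch.randn(1, 3,     224, 224)  # spatial stream: one RGB frame
flow = torch.randn(1, 2 * L, 224, 224)  # temporal stream: (u, v) x L = 2L channels

spatial_stem  = nn.Conv2d(3,     96, kernel_size=7, stride=2)
temporal_stem = nn.Conv2d(2 * L, 96, kernel_size=7, stride=2)

# Each stream continues as an ordinary 2D CNN; the two streams' class
# scores are fused at the end.
print(spatial_stem(rgb).shape, temporal_stem(flow).shape)
```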

Family 5 — CNN + RNN (LRCN)

Donahue et al.'s *Long-term Recurrent Convolutional Networks* (CVPR 2015): use a CNN as a frame encoder, then feed the sequence of frame features to an LSTM that reasons over time.

Frame 1 → CNN → feature₁
Frame 2 → CNN → feature₂ ──→ LSTM → action label / caption
Frame 3 → CNN → feature₃
...

Factors the problem cleanly: CNN handles space, RNN handles time. Excellent for variable-length outputs like captions.
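
A minimal sketch of that factorisation, assuming the CNN returns one flat feature vector per frame; the dimensions and class count are illustrative.

```python
import torch
import torch.nn as nn

class LRCN(nn.Module):
    def __init__(self, cnn: nn.Module, feat_dim=2048, hidden=512, n_classes=101):
        super().__init__()
        self.cnn = cnn                                            # space
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)   # time
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, clip):                  # clip: (N, T, 3, H, W)
        n, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))  # (N*T, feat_dim) per-frame features
        feats = feats.view(n, t, -1)          # back to a sequence per video
        out, _ = self.lstm(feats)             # LSTM reasons over time
        return self.head(out[:, -1])          # last hidden state -> label
```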

Family 6 — SlowFast (Feichtenhofer, CVPR 2019)

One of the cleverest video architectures. The insight: objects and motion live at different timescales.

  • The category of an object changes slowly — you don't need every frame to know it's a person.
  • The motion changes fast — you need fine temporal resolution to distinguish running from walking.

SlowFast runs two parallel pathways:

  • Slow pathway: low frame rate (4 frames), high channel capacity. Rich spatial/semantic features.
  • Fast pathway: high frame rate (32 frames), low channel capacity. Motion patterns.

Lateral connections let the Fast pathway inject motion info into the Slow pathway during processing.

Memorise the two-pathway intuition — *"slow = semantics, fast = motion."*
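
A minimal sketch of the frame sampling alone, assuming a 64-frame raw clip and the 4-frame/32-frame split above (speed ratio α = 8); the pathway networks and lateral connections are only summarised in comments.

```python
import torch

clip = torch.randn(3, 64, 224, 224)  # (C, T_raw, H, W)

alpha = 8
fast = clip[:, ::2]                  # 32 frames: fine temporal resolution
slow = clip[:, ::2 * alpha]          # 4 frames:  coarse temporal resolution

# The Slow pathway gets far more channels (semantics); Fast stays light
# (motion). Lateral connections time-strided-conv the Fast features and
# fuse them into the Slow pathway at several stages.
print(slow.shape, fast.shape)        # (3, 4, 224, 224) and (3, 32, 224, 224)
```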

Family 7 — Video Transformers

ViViT (Arnab, ICCV 2021) — Video Vision Transformer. Generalises ViT's patch tokenisation to spatio-temporal tubelets. Instead of spatial patches, cut the video into 3D tubelets. Run a Transformer over the tokens. Explores different attention factorisations: full joint space-time (expensive) or factorised (cheaper).
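
A minimal sketch of tubelet embedding, assuming 2×16×16 tubelets and a 768-d token width (both illustrative); a strided `Conv3d` does the cutting and projection in one shot.

```python
import torch
import torch.nn as nn

tubelet = nn.Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))

video = torch.randn(1, 3, 16, 224, 224)     # (N, C, T, H, W)
tokens = tubelet(video)                     # (1, 768, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 8*14*14, 768) token sequence
print(tokens.shape)                         # torch.Size([1, 1568, 768])
```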

TimeSformer (Bertasius, ICML 2021) — *"Is space-time attention all you need for video?"* Yes, with a careful study of *where* to attend. The four variants tested:

  • Space-only — attention within each frame; time is ignored.
  • Joint space-time — every token attends to every other; attention cost is quadratic in the full T·S set of space-time tokens, prohibitive.
  • Divided space-time — temporal attention first (each spatial location attends across time), then spatial attention within each frame.
  • Sparse local — attention restricted to a local space-time neighbourhood.

Divided won. Dramatically cheaper than joint attention while keeping most of the accuracy. Memorise the recipe: *"divided attention — temporal then spatial, alternating."*
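
A minimal sketch of one divided space-time block, assuming tokens arranged as (N, T, S, D) with T frames and S spatial patches; the class token, LayerNorms, and MLP are omitted.

```python
import torch
import torch.nn as nn

class DividedAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.time_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                           # x: (N, T, S, D)
        n, t, s, d = x.shape
        # 1) temporal: each spatial location attends across the T frames
        xt = x.permute(0, 2, 1, 3).reshape(n * s, t, d)
        xt = xt + self.time_attn(xt, xt, xt)[0]
        x = xt.reshape(n, s, t, d).permute(0, 2, 1, 3)
        # 2) spatial: attention within each frame, over the S patches
        xs = x.reshape(n * t, s, d)
        xs = xs + self.space_attn(xs, xs, xs)[0]
        return xs.reshape(n, t, s, d)
```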

CNNs vs Transformers — the comparison

| 2D / 3D CNNs (SlowFast, I3D) | Transformer Video Models (ViViT, TimeSformer) |
| --- | --- |
| Convolutions | Patches/tubelets as tokens |
| Pooling layers | Self-attention across tokens |
| Local receptive fields | Global modelling across the image/video |
| Inductive bias toward locality | Less inductive bias, learns from data |
| Better with small data | Better with very large data |

The exam-bait quiz from your slides

*"You downloaded a 3D ConvNet that expects 64-frame inputs. Your videos range from 48–240 frames. Suggest 1 idea for shorter videos and 2 ideas for longer ones."*

Walk in with these answers ready:

Shorter than 64 frames:

  • Pad the clip — repeat the last frame, loop, or zero-pad.
  • Temporally upsample — interpolate intermediate frames.

Longer than 64 frames:

  • Uniform subsampling — pick every 4th frame to get 60, then pad to 64. Simple, loses fast motion.
  • Sliding window — process the video as multiple 64-frame chunks (stride 32 for overlap), aggregate predictions (mean or max pool).
  • Random temporal crop during training, multi-crop average at test — random 64-frame window per training step; at inference, crop multiple windows and average.

Any two of those longer-video answers get full marks.
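
Minimal sketches of these tricks, assuming clips shaped (C, T, H, W) and a model that expects T = 64.

```python
import torch

def pad_clip(clip, target=64):
    """Shorter clips: repeat the last frame until we reach the target."""
    c, t, h, w = clip.shape
    if t >= target:
        return clip
    last = clip[:, -1:].expand(c, target - t, h, w)
    return torch.cat([clip, last], dim=1)

def subsample(clip, target=64):
    """Longer clips: pick 64 uniformly spaced frame indices."""
    idx = torch.linspace(0, clip.shape[1] - 1, target).long()
    return clip[:, idx]

def sliding_windows(clip, target=64, stride=32):
    """Longer clips: overlapping 64-frame chunks; average the predictions."""
    t = clip.shape[1]
    return [clip[:, s:s + target] for s in range(0, t - target + 1, stride)]
```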

What you carry into the exam

The two-axis dataset taxonomy. The dataset chronology: KTH/Weizmann → HMDB/UCF/Something-Something → Kinetics → AVA. The six tasks. The six challenges (especially scene bias). 3D Conv shapes with filter C_out × C_in × t × h × w. The I3D inflation trick with the 1/t normalisation. Two-Stream: 2L temporal channels, optical-flow definition. LRCN: CNN-per-frame + LSTM, good for variable-length outputs. SlowFast: slow/fast pathways at different frame rates with lateral fusion. ViViT's tubelet embedding. TimeSformer's divided attention recipe. The CNN-vs-Transformer trade-off. The variable-length-clip quiz answers.

That's Video Understanding — the most recent thing in your professor's mind, and now in yours.