Computer Vision

CSE471

Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits

True / False (with reasoning)

These statements expose shallow understanding. For each one, say whether it is true or false, and always include the reason.

Computer vision is easy because humans do it effortlessly.

CNNs are translation equivariant AND rotation equivariant.

The Sobel and Laplacian kernels both sum to zero on a flat region.
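
One way to reason about this statement is to look at the kernel coefficients directly. The sketch below uses the common textbook 3×3 Sobel and 4-neighbour Laplacian kernels (an assumption, since the course may use a different variant) and relies on the fact that the response on a constant patch is the patch intensity times the sum of the coefficients. The helper names are hypothetical.

```python
# Textbook 3x3 kernels; assumed here, other variants (e.g. the 8-neighbour
# Laplacian) exist.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]
LAPLACIAN_4 = [[0, 1, 0],
               [1, -4, 1],
               [0, 1, 0]]


def kernel_sum(kernel):
    """Sum of all coefficients of a 2D kernel."""
    return sum(sum(row) for row in kernel)


def response_on_flat_patch(kernel, intensity):
    """Filter response when every pixel under the kernel equals `intensity`."""
    return sum(w * intensity for row in kernel for w in row)


for name, k in [("sobel_x", SOBEL_X), ("sobel_y", SOBEL_Y), ("laplacian", LAPLACIAN_4)]:
    print(name, kernel_sum(k), response_on_flat_patch(k, 137))
```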

Otsu's method MINIMISES between-class variance.

Forward warping is preferred over inverse warping in practice.

Cross-entropy loss is preferred over MSE for classification because its gradient does not vanish at saturation.
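
The gradient argument can be made concrete with a single sigmoid output, which is a simplification of the softmax case and an assumption of this sketch. The code compares the gradient of each loss with respect to the logit $z$ when the model is confidently wrong; the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_bce_wrt_logit(z, y):
    """d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z): p - y."""
    return sigmoid(z) - y


def grad_mse_wrt_logit(z, y):
    """d/dz of 0.5 * (p - y)^2 with p = sigmoid(z): (p - y) * p * (1 - p)."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)


# Confidently wrong prediction: target is 1, logit is strongly negative.
z, y = -10.0, 1.0
print("cross-entropy gradient:", grad_bce_wrt_logit(z, y))
print("MSE gradient:          ", grad_mse_wrt_logit(z, y))
```

The extra factor $p(1-p)$ in the MSE gradient is the sigmoid derivative, which shrinks towards zero as the unit saturates; with cross-entropy it cancels.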

BatchNorm at inference time uses the batch's own mean and variance.

Bagging reduces bias; boosting reduces variance.

Conv-layer parameter count depends on the input spatial size $H, W$.
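
To reason about this one, it helps to write the count down. The sketch below assumes a standard (ungrouped) 2D convolution with an optional bias per output channel; the function names are hypothetical. Note which quantities each function takes as arguments.

```python
def conv2d_param_count(c_in, c_out, k_h, k_w, bias=True):
    """Learnable parameters of a standard 2D conv layer (groups=1 assumed)."""
    weights = c_out * c_in * k_h * k_w
    return weights + (c_out if bias else 0)


def conv2d_output_size(h, w, k_h, k_w, stride=1, padding=0):
    """Spatial size of the output feature map (no dilation assumed)."""
    out_h = (h + 2 * padding - k_h) // stride + 1
    out_w = (w + 2 * padding - k_w) // stride + 1
    return out_h, out_w


# The same 3x3 layer applied to two different input resolutions.
params = conv2d_param_count(c_in=3, c_out=64, k_h=3, k_w=3)
for h, w in [(32, 32), (224, 224)]:
    print((h, w), "->", conv2d_output_size(h, w, 3, 3, padding=1), "params:", params)
```

The input size changes the output feature map and the compute cost, while the weight tensor is shared across all spatial positions.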

Pooling layers have learnable parameters.

Inception's 1×1 bottleneck is placed AFTER the expensive 3×3/5×5 convs.

I3D inflates 2D pretrained kernels along time and divides by $K_{t}$ to preserve activation magnitude.
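
The inflation step can be checked on a "static video", i.e. a clip made of one frame repeated $K_{t}$ times. The sketch below is a toy illustration with plain nested lists and a single-channel kernel (both assumptions made for brevity), and the helper names are hypothetical. It compares the 2D response on one frame with the inflated 3D response on the static clip.

```python
def inflate_kernel(kernel_2d, k_t):
    """Repeat a 2D kernel k_t times along time and scale each copy by 1/k_t."""
    return [[[w / k_t for w in row] for row in kernel_2d] for _ in range(k_t)]


def response_2d(kernel_2d, patch):
    """Correlation of a 2D kernel with an equally sized image patch."""
    return sum(w * x
               for k_row, p_row in zip(kernel_2d, patch)
               for w, x in zip(k_row, p_row))


def response_3d(kernel_3d, clip):
    """Correlation of a 3D kernel with a clip of equally sized patches."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


kernel = [[1.0, 2.0], [3.0, 4.0]]
frame = [[5.0, 6.0], [7.0, 8.0]]
k_t = 4
static_clip = [frame] * k_t  # the same frame repeated along time

print(response_2d(kernel, frame))
print(response_3d(inflate_kernel(kernel, k_t), static_clip))
```

Without the division, the 3D response on the static clip would be $K_{t}$ times larger than the 2D response the pretrained weights were tuned for.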

YOLO predicts a separate class distribution for each of the $B$ bounding boxes within a grid cell.

NMS is applied globally across all classes simultaneously.

RoI Align replaces RoI Pool's quantization with bilinear interpolation, yielding sub-pixel alignment.

PAFs encode the LENGTH of each limb at every pixel.

PointNet++ uses kNN in feature space to define local neighbourhoods.

3DGS has learnable weights that generalise across scenes.

Without positional encoding, a Transformer treats input tokens as a set rather than a sequence.
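
The "set rather than sequence" behaviour can be observed on a single self-attention layer. The sketch below is a minimal single-head attention in pure Python with arbitrary fixed weights (all values and helper names are hypothetical, and no positional encoding is added). It shuffles the input tokens and compares the outputs.

```python
import math


def matmul(a, b):
    """Multiply matrix a (n x d) by matrix b (d x m), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional encoding."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[0.5, -0.2], [0.1, 0.3]]
w_k = [[0.4, 0.1], [-0.3, 0.2]]
w_v = [[1.0, 0.5], [-0.5, 1.0]]

perm = [2, 0, 1]
out = self_attention(tokens, w_q, w_k, w_v)
out_shuffled = self_attention([tokens[i] for i in perm], w_q, w_k, w_v)

# Each shuffled output row matches the original output of the same token.
for new_pos, old_pos in enumerate(perm):
    print(out_shuffled[new_pos], out[old_pos])
```

Shuffling the input only shuffles the output rows in the same way (permutation equivariance), so any order-insensitive readout such as mean pooling gives the same result for every ordering.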

Doubling the number of attention heads (at fixed $d_{\text{model}}$) roughly doubles the parameter count.
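
Counting the projection weights makes this easy to reason about. The sketch below assumes the standard multi-head attention formulation in which each head has width $d_{\text{model}} / h$, with Q, K, V and output projections plus biases; the function name is hypothetical.

```python
def mha_param_count(d_model, num_heads, bias=True):
    """Parameters of a standard multi-head attention block.

    Assumes the per-head width is d_model // num_heads, as in the usual
    formulation; other designs (e.g. a fixed per-head width) differ.
    """
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    qkv = 3 * num_heads * d_model * d_head   # Q, K, V projections over all heads
    out = (num_heads * d_head) * d_model     # output projection
    biases = 4 * d_model if bias else 0
    return qkv + out + biases


for heads in (4, 8, 16):
    print(heads, mha_param_count(512, heads))
```

Under this assumption the head count only changes how the projected vectors are split, because the per-head width shrinks as the number of heads grows.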

SimCLR's projection head is kept for downstream tasks.

DINO requires negative samples to avoid collapse.

PreNorm and PostNorm are mathematically equivalent.

SigLIP's loss is computed independently per pair and does not require batch-wide synchronisation.

Two-Stream networks' temporal stream takes optical flow as input.