
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma, Spring 2025-26, 4 credits

Dense Prediction — Segmentation & Monocular Depth

Unit 2 — Dense Prediction: Segmentation + Depth

Painting Every Pixel

Object detection draws boxes around things. Useful — but a box around a cat doesn't tell you which pixels are actually cat and which are the wallpaper behind it. A box around a person doesn't separate their silhouette from the chair they're sitting on.

The harder task is dense prediction: give a label, a value, or a structure to every pixel in the image. Five flavours, in increasing difficulty: semantic segmentation (per-pixel class label, no instance distinction), instance segmentation (per-pixel class + instance ID for countable "things"), panoptic segmentation (everything labelled, instances for things), depth estimation (per-pixel continuous distance), and multi-task dense prediction (depth + normals + segmentation all at once from a shared encoder).

Your lecturer opens with a slide titled "Urge to Group". Humans don't see individual pixels; we see groups. Segmentation forces machines to do the same. But here's the trap: is APPEARANCE similarity the same as SEMANTIC similarity? Not always — two cats with very different fur colour are semantically the same; a black cat on a black couch shares appearance but not semantics. The whole field is the fight to learn semantic grouping, not appearance grouping.

The Classical Era (Mentioned For Context)

Before deep learning, segmentation was attempted four ways — memorise the categories for an exam list-question: region-based (region growing and splitting, iterative merging of similar pixels), boundary-based (find edges, close them into regions), motion-based (group pixels with consistent motion in video), and classification-based (label pixels using a region-property classifier). Adaptive thresholding (Chow & Kaneko) is the canonical worked example: vary the threshold across the image — low-pass-filter the image, subtract from the original — to handle uneven illumination. That's everything before 2015.
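
A minimal sketch of the adaptive-thresholding idea, assuming a grayscale float image; the filter size and offset are illustrative choices, not values from the lecture:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_threshold(img, size=31, offset=0.02):
    """Segment with a spatially varying threshold.

    The local threshold is a low-pass-filtered (blurred) copy of the image
    minus a small offset, so uneven illumination cancels out.
    """
    local_mean = uniform_filter(img.astype(np.float32), size=size)
    return img > (local_mean - offset)

# toy example: bright square on a background with an illumination gradient
img = np.linspace(0.2, 0.6, 128)[None, :] * np.ones((128, 128))
img[40:80, 40:80] += 0.15
mask = adaptive_threshold(img)
```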

FCN — The Paper That Started Modern Segmentation

Long, Shelhamer, Darrell — *Fully Convolutional Networks for Semantic Segmentation* — CVPR 2015. Forty-seven thousand citations. The insight is breathtakingly simple: a CNN classifier ends in a fully-connected layer that demands a fixed input size and emits a single label. What if we replace the FC layer with a convolution that produces a SPATIAL MAP of labels?

The architecture: image → conv blocks (downsampled feature map) → conv (per-pixel classifier) → upsample back to image size → per-pixel labels. No FC anywhere. The network now accepts arbitrary input sizes because only conv kernels are learned, and produces per-pixel predictions on a downsampled grid that gets upsampled at the end.
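
A minimal PyTorch sketch of the fully convolutional recipe; the backbone, channel counts, and number of classes are illustrative, not the paper's exact VGG configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # encoder: conv blocks that each downsample by 2x (stride-2 convs here)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        # 1x1 conv = per-pixel classifier on the downsampled grid (no FC layer anywhere)
        self.classifier = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.encoder(x)           # (B, 256, H/8, W/8)
        logits = self.classifier(feats)   # (B, K,   H/8, W/8)
        # upsample back to the input resolution for per-pixel labels
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)

# arbitrary input size works because only conv kernels are learned
out = TinyFCN()(torch.randn(1, 3, 200, 312))   # -> (1, 21, 200, 312)
```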

The four properties that show up in every exam question on FCN: the encoder captures semantic information (deep features are class-aware); the decoder projects back to pixel space via upsampling; the bottleneck is low-resolution (the encoder's downsampling leaves small feature maps, so recovering detail is hard and boundaries come out fuzzy); and input size is arbitrary, because there's no FC layer fixing dimensions.

To address the fuzzy-boundary problem the FCN paper introduces variants: FCN-32s upsamples directly from the deepest layer (one big jump → coarse). FCN-16s adds a skip from pool4 (stride 16), and FCN-8s additionally adds a skip from pool3 (stride 8). Each skip provides higher-resolution context that the bottleneck threw away. FCN-8s recovers genuinely sharp boundaries.

Transposed Convolution — The Upsampling Tool

How do you upsample feature maps inside a network? Bilinear interpolation works but isn't learnable. The right tool is transposed convolution — sometimes called "deconvolution", but your lecture explicitly warns: don't call it that. Deconvolution has a specific signal-processing meaning (reversing a convolution exactly to recover the original signal), and transposed conv doesn't recover anything; it just learns a useful upsampling.

The mechanism, which you should memorise: standard conv maps a large input to a smaller output — stride 2 means the filter moves 2 pixels in input per 1 pixel in output, so the output shrinks. Transposed conv FLIPS the relationship — stride 2 means the filter moves 2 pixels in OUTPUT per 1 pixel in INPUT, so the output grows. The input pixel value provides the WEIGHT for the filter placed at the output; scaled copies of the filter are placed at each output location corresponding to each input pixel; overlapping contributions are summed.

Why "transposed"? A standard convolution can be written as a matrix multiplication . Transposed convolution is — the transpose of the same matrix. Same parameters, different direction. That's the only relationship between the two.

The Local-vs-Global Dilemma — U-Net's Reason For Existing

Now we hit the central tension of semantic segmentation. Global context is essential for correct classification — to know "this is a cat", you need to see enough of the cat. Pixels alone are ambiguous. Local context is essential for correct localisation — to know where the cat's whiskers end and the background begins, you need fine-grained pixel-level information.

Pure encoder-decoder networks like vanilla FCN have a problem: the bottleneck destroys local information. The decoder must reconstruct details from an over-compressed feature map. Result: blurry boundaries.

Ronneberger, Fischer, Brox — *U-Net: Convolutional Networks for Biomedical Image Segmentation* — MICCAI 2015. Originally for cell-microscopy segmentation; became the architecture for everything dense.

The idea: when upsampling in the decoder, concatenate the corresponding encoder feature map (at matching spatial resolution) along the channel dimension before the next conv. The decoder now has access to high-resolution local information that was computed earlier in the encoder, BEFORE the bottleneck destroyed it. The diagram shape is a "U" — encoder going down on the left, decoder coming back up on the right, four horizontal CONCAT skips at each resolution.
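
A minimal sketch of one decoder step with the concat skip, assuming PyTorch; the channel counts are illustrative:

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One U-Net decoder step: upsample, concat the encoder skip, then conv."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch // 2 + skip_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # decoder feature, now at the skip's resolution
        x = torch.cat([x, skip], dim=1)  # CONCAT along channels, not addition
        return self.conv(x)

bottleneck = torch.randn(1, 256, 16, 16)        # deep, semantic
skip = torch.randn(1, 128, 32, 32)              # shallow, local, saved before the bottleneck
out = UpBlock(256, 128, 128)(bottleneck, skip)  # (1, 128, 32, 32)
```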

The result: decoder gets BOTH deep semantic features (from the bottleneck chain) and shallow local features (from the skips). Boundaries become sharp. Detail is preserved. And in a direct connection to modern generative AI — Stable Diffusion's noise-predictor network is a U-Net. The same architecture that segments cells in 2015 powers billion-dollar text-to-image generators in 2024. U-Net is one of the most successful architectures in computer vision history.

A Quick Aside on $1 \times 1$ Convolutions

The lecture flags $1 \times 1$ convolutions briefly. Worth knowing the intuition: a $1 \times 1$ conv is equivalent to applying a linear layer (an MLP layer) at each spatial location independently. It doesn't aggregate over neighbours — only mixes channels. Three uses: dimensionality reduction (squash 1024 channels to 256 for a bottleneck), cross-channel nonlinearities (after a $1 \times 1$ conv you can apply a ReLU to get nonlinear channel mixing), and per-pixel classification (the final layer of an FCN is a $1 \times 1$ conv with K output channels, giving K-way classification per pixel). You met this trick in Faster R-CNN's RPN — a $3 \times 3$ conv followed by two $1 \times 1$ heads. Same idea.
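
A tiny check of the equivalence, assuming PyTorch: a $1 \times 1$ conv gives the same result as a linear layer applied independently at each spatial location, once they share weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1024, 7, 7)                  # 1024 channels on a 7x7 grid

conv1x1 = nn.Conv2d(1024, 256, kernel_size=1)   # channel mixing only, no spatial aggregation
linear = nn.Linear(1024, 256)
linear.weight.data = conv1x1.weight.data.view(256, 1024)   # reuse the same parameters
linear.bias.data = conv1x1.bias.data

a = conv1x1(x)                                              # (1, 256, 7, 7)
b = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)       # MLP applied per pixel
print(torch.allclose(a, b, atol=1e-5))                      # True
```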

Instance Segmentation — Mask R-CNN

Semantic segmentation tells you which pixels are cat. Instance segmentation tells you which pixels belong to Cat #1 versus Cat #2. This adds the "individuality" of object detection back into the per-pixel framework.

He, Gkioxari, Dollár, Girshick — *Mask R-CNN* — ICCV 2017, forty-one thousand citations. The canonical instance segmentation architecture. The recipe is short to state: Faster R-CNN with a third head.

Faster R-CNN already has a classification head (class label per RoI) and a box regression head (box offsets per RoI). Mask R-CNN adds a mask head that outputs a binary mask per RoI per class — a tiny FCN, just a few conv layers operating on the RoI-pooled feature, trained with per-pixel binary cross-entropy on each RoI.

The clever design choice is per-class masks. For COCO's 80 classes the head produces 80 masks per RoI, then selects the mask of the class predicted by the classification head. Why decouple this way? The mask head doesn't have to also decide WHAT class the object is — that's already done. Cleaner training, and empirically better masks.
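
A minimal sketch of the per-class mask head idea, assuming PyTorch; the layer sizes are illustrative, not the paper's exact configuration. K masks come out per RoI, then the one for the class predicted by the classification head is selected:

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Tiny FCN on RoI-aligned features: one binary mask per class per RoI."""
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),  # 14x14 -> 28x28
            nn.Conv2d(256, num_classes, 1),                        # one mask per class
        )

    def forward(self, roi_feats):           # (N_roi, 256, 14, 14)
        return self.net(roi_feats)          # (N_roi, 80, 28, 28) mask logits

roi_feats = torch.randn(5, 256, 14, 14)           # 5 RoIs from RoI Align
mask_logits = MaskHead()(roi_feats)
pred_class = torch.tensor([3, 17, 3, 0, 56])      # from the classification head
masks = mask_logits[torch.arange(5), pred_class]  # pick the mask of the predicted class
# training: per-pixel binary cross-entropy on the mask of the ground-truth class only
```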

RoI Align — A Small But Critical Detail

The original Faster R-CNN used RoI Pool to extract fixed-size features from variable-size proposals. RoI Pool rounds RoI coordinates to integer pixels — fine for classification (a few pixels of misalignment don't change "is this a cat?"), but those same few pixels of error matter enormously for masks, where pixel-precise alignment is the whole game.

Mask R-CNN replaces RoI Pool with RoI Align: use bilinear interpolation at exact fractional pixel coordinates within the feature map. NO ROUNDING. Per bin you place 4 regularly-spaced sample points; at each sample point the feature value is the bilinear interpolation of the four surrounding feature-map cells. The result: pixel-aligned masks. Mask R-CNN's mask AP jumps purely from this one engineering change.
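
torchvision ships an RoI Align op; a minimal usage sketch (the feature stride, box coordinates, and output size below are illustrative):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)   # backbone feature map at stride 16 of an 800px image
# boxes as (batch_index, x1, y1, x2, y2), in IMAGE coordinates
rois = torch.tensor([[0, 100.3, 120.7, 260.1, 300.4]])

pooled = roi_align(
    feat, rois,
    output_size=(14, 14),
    spatial_scale=1 / 16,    # image coords -> feature-map coords, kept fractional
    sampling_ratio=2,        # 2x2 = 4 regularly-spaced bilinear sample points per bin
    aligned=True,            # no rounding of RoI corners
)                            # -> (1, 256, 14, 14)
```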

PointRend (Kirillov et al., CVPR 2020) takes the boundary problem further. Interiors are easy ("clearly cat"); edges are ambiguous ("is this pixel cat or carpet?"). PointRend predicts a coarse low-resolution mask, identifies pixels where the prediction is near 0.5 (likely near boundaries), and applies a separate point-based MLP at those specific points using high-resolution features. Iterate, refining only where it's hard. Adaptive subdivision — like adaptive ray tracing in graphics. Spend compute only where it matters.

Panoptic Segmentation — Semantic + Instance

The lecture defines things versus stuff on a slide. Memorise both. Things are countable objects with proper geometry: person, car, animal — they have distinct instances. Stuff is amorphous regions identified by texture or material: sky, road, water — no instance distinction. You can have instances of things (Cat #1, Cat #2). You cannot have instances of sky.

Panoptic segmentation asks: per pixel, give me BOTH a semantic label AND an instance ID for things. For stuff, just the semantic label. It's the most complete scene parsing — every pixel accounted for in a unified format. Useful for autonomous driving (you need to know road as stuff and cars as individual things), robotics, and scene understanding. EfficientPS (Mohan & Valada, 2020) is one canonical architecture: shared encoder, two parallel decoders (semantic + instance), fusion module that combines outputs into a single panoptic map.

Depth Estimation — MiDaS And The Relative-Depth Insight

Depth estimation is the same dense-prediction framework, but the per-pixel output is a continuous value — distance from camera — rather than a discrete label. Monocular depth estimation predicts depth from a single image. Geometrically ill-posed (a small near object and a large far object can look identical), but neural networks can learn priors from data — typical sizes of things, scene structure, perspective cues — that resolve the ambiguity in practice.

MiDaS (Ranftl et al., 2019+) takes a clever approach: instead of trying to predict metric depth in metres, predict relative depth — order, or ratio. This is much easier — you only need to know "the bench is closer than the tree", not "the bench is 4.2 m away". Why does this matter? Because relative-depth signals are everywhere in training data: stereo pairs (with disparity → relative depth), 3D movies, structure-from-motion outputs, synthetic data, even unlabelled web video provides ordinal depth via motion parallax. By relaxing the target from metric to relative, MiDaS can train on many heterogeneous data sources that wouldn't combine if forced to a common metric scale. The loss is a scale-and-shift-invariant L1: align prediction and target with the optimal scale $s$ and shift $t$ before computing the residual, $\mathcal{L} = \frac{1}{N}\sum_i \left| (s\,\hat{d}_i + t) - d_i \right|$. Result: a depth model that generalises to arbitrary new images.
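
A small sketch of the alignment step, assuming numpy: solve for the least-squares scale and shift that map the prediction onto the target, then take the L1 residual. This mirrors the idea described above; the published MiDaS loss adds further refinements (e.g. trimming and gradient-matching terms).

```python
import numpy as np

def scale_shift_invariant_l1(pred, target):
    """Align pred to target with least-squares (s, t), then mean L1 residual."""
    p, d = pred.ravel(), target.ravel()
    # solve  min_{s,t}  sum_i (s*p_i + t - d_i)^2  as a linear least-squares problem
    A = np.stack([p, np.ones_like(p)], axis=1)          # (N, 2)
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return np.mean(np.abs(s * p + t - d))

pred = np.random.rand(240, 320)        # relative depth, arbitrary scale/shift
target = 3.0 * pred + 1.5              # same structure, expressed in metres
print(scale_shift_invariant_l1(pred, target))   # ~0: the loss ignores global scale and shift
```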

ZoeDepth (Bhat et al., 2023) — "Zero-shot Transfer by Combining Relative and Metric Depth". The clever paper title is the algorithm: pretrain on heterogeneous data using MiDaS-style relative depth; fine-tune separately on metric depth datasets (KITTI for outdoor, NYU for indoor) with metric heads; at inference, predict relative depth (still generalises) and convert to metric using the fine-tuned head appropriate to the scene type. Best of both worlds.

And once you have a depth map, you can back-project it into a 3D point cloud using the pinhole camera model: each pixel $(u, v)$ with depth $d$ becomes the 3D point $\left( \frac{(u - c_x)\,d}{f_x},\ \frac{(v - c_y)\,d}{f_y},\ d \right)$. You now have a 3D point cloud from a single 2D photo — bridge back to the point-cloud chapter. Modern AR apps do this in real-time.
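
A minimal back-projection sketch, assuming numpy; the intrinsics $f_x, f_y, c_x, c_y$ below are illustrative values:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Pinhole back-projection: per-pixel depth map -> (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((480, 640), 2.0)                     # a flat wall 2 m away
points = backproject(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(points.shape)                                  # (307200, 3)
```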

Multi-Task Dense Prediction

A natural question: do depth, normals, and segmentation share features? Mostly yes — all require understanding scene structure. Eigen & Fergus (CVPR 2015) showed that one network with one shared encoder and three task-specific heads can predict depth, surface normals, and semantic labels simultaneously. This is the precursor to modern "foundation vision models" like Sapiens (Meta, ECCV 2024) which uses the same recipe at much larger scale. The principle generalises.
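
A minimal sketch of the shared-encoder, task-specific-heads pattern, assuming PyTorch; the encoder and head sizes are placeholders, not the Eigen & Fergus architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskDense(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        self.encoder = nn.Sequential(                       # shared features for all tasks
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Conv2d(128, 1, 1)              # continuous value per pixel
        self.normal_head = nn.Conv2d(128, 3, 1)             # unit vector per pixel
        self.seg_head = nn.Conv2d(128, num_classes, 1)      # class logits per pixel

    def forward(self, x):
        f = self.encoder(x)
        up = lambda t: F.interpolate(t, size=x.shape[-2:], mode="bilinear", align_corners=False)
        depth = up(self.depth_head(f))
        normals = F.normalize(up(self.normal_head(f)), dim=1)
        seg = up(self.seg_head(f))
        return depth, normals, seg

depth, normals, seg = MultiTaskDense()(torch.randn(1, 3, 128, 128))
```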

Metrics — What Goes On The Slide And Why

Your lecture has an entire "Review of Metrics" deck. Walk through it cleanly — these are guaranteed exam targets. For classification we already know: accuracy, balanced accuracy, F1, AP and AUROC. For detection: IoU at thresholds, AP, mAP@0.5 for VOC, mAP averaged over IoU thresholds in [0.5 : 0.95] for COCO. Your lecturer flags repeatedly: IoU = 0.5 is NOT half overlapping and does not mean intersection = 0.5.

Segmentation introduces new metrics. Per-pixel accuracy is a trap. The lecturer's example: 95% pixel accuracy looks great, but if 95% of pixels are background, predicting "background everywhere" gives 95% — useless. Better: IoU at the pixel level, per class. mIoU = mean IoU across classes — the dominant segmentation metric. Dice coefficient $= \frac{2\,|A \cap B|}{|A| + |B|}$ — the same as F1 score interpreted on pixels; the denominator is the SUM of the two areas, not the union (that's the only difference from IoU). Dice and IoU rank similarly but Dice slightly favours small objects. Mask AP — Mask R-CNN-style — treats each predicted mask like a detection and computes AP across IoU thresholds. mBIoU — boundary-aware IoU — computes IoU only on pixels near object boundaries, emphasising the hard part.
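
A small numpy sketch of per-class IoU, mIoU, and Dice from two label maps, skipping classes absent from both, just to make the denominators concrete:

```python
import numpy as np

def per_class_iou_dice(pred, gt, num_classes):
    ious, dices = [], []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        total = p.sum() + g.sum()          # SUM of the two areas, not the union
        if union == 0:
            continue                        # class absent from both maps: skip
        ious.append(inter / union)
        dices.append(2 * inter / total)
    return np.mean(ious), ious, dices

pred = np.random.randint(0, 3, (64, 64))
gt = np.random.randint(0, 3, (64, 64))
miou, ious, dices = per_class_iou_dice(pred, gt, num_classes=3)
```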

Losses — And The Focal Loss Centrepiece

What loss do you train a segmentation network with? Standard cross-entropy is the default — per-pixel CE, simple, well-understood. The problem: when one class dominates (95% of pixels are background), easy-to-classify background pixels overwhelm the loss and drown out signal from foreground. Balanced CE reweights inversely to class frequency — background gets weight 0.05, foreground gets 1.0. Dice loss is $1 - \frac{2\sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}$ (over predicted probabilities $p_i$ and ground-truth labels $g_i$) — it optimises overlap directly and works well in medical imaging where foreground is small. IoU loss is $1 - \text{IoU}$; GIoU loss extends IoU to handle non-overlapping boxes via $\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}$, where $C$ is the smallest enclosing box, adding gradient signal even when boxes don't overlap.
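
A minimal soft Dice loss sketch for binary masks, assuming PyTorch; the smoothing constant is an illustrative choice:

```python
import torch

def dice_loss(logits, target, eps=1.0):
    """1 - Dice on predicted probabilities; optimises overlap directly."""
    p = torch.sigmoid(logits).flatten(1)   # (B, H*W) probabilities
    g = target.flatten(1).float()
    inter = (p * g).sum(dim=1)
    dice = (2 * inter + eps) / (p.sum(dim=1) + g.sum(dim=1) + eps)
    return (1 - dice).mean()

logits = torch.randn(4, 1, 64, 64)
target = (torch.rand(4, 1, 64, 64) > 0.9).float()   # small foreground, large background
print(dice_loss(logits, target))
```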

And then there's Focal Loss — the quiz-slide centrepiece. From RetinaNet (Lin et al., ICCV 2017) — *Focal Loss for Dense Object Detection*, the paper that finally let single-stage detectors match two-stage accuracy.

Standard binary cross-entropy is $\mathrm{CE}(p_t) = -\log(p_t)$, where $p_t = p$ if $y = 1$ and $p_t = 1 - p$ if $y = 0$. The problem: even well-classified examples ($p_t \approx 0.9$) still contribute non-trivial loss — $-\log 0.9 \approx 0.1$. When you have 10,000 well-classified background pixels and 10 hard-to-classify foreground pixels (say $p_t \approx 0.1$, loss $\approx 2.3$ each), the background pixels collectively contribute roughly $10{,}000 \times 0.1 = 1000$ to the loss versus the foreground's $10 \times 2.3 = 23$. The easy background dominates training.

Focal Loss adds a modulating factor that suppresses easy examples: $\mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)$. The factor $(1 - p_t)^\gamma$ approaches 0 for well-classified examples ($p_t \to 1$) and 1 for poorly classified ones ($p_t \to 0$). With $\gamma = 2$ (the typical value): $p_t = 0.9$ gets $(1 - 0.9)^2 = 0.01$ — a 100× down-weighting; $p_t = 0.5$ gets $0.25$ — a 4× down-weighting; $p_t = 0.1$ gets $0.81$ — barely changed. Easy negatives contribute ~100× less; hard positives contribute fully. The gradient is now dominated by what the model is getting WRONG, not by what it's already right about. When $\gamma = 0$, Focal Loss reduces to ordinary CE. The original paper also adds an $\alpha$-balanced version: $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$. Best results: $\gamma = 2$, $\alpha = 0.25$.
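
A minimal binary focal loss sketch in PyTorch, matching the formula above; the $\gamma$ and $\alpha$ defaults follow the values stated in the text, and the toy imbalance mirrors the 10,000-vs-10 example:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """-alpha_t * (1 - p_t)^gamma * log(p_t), averaged over all pixels/anchors."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# 10,000 easy background pixels and 10 hard foreground pixels
logits = torch.cat([torch.full((10000,), -3.0), torch.full((10,), -1.0)])
targets = torch.cat([torch.zeros(10000), torch.ones(10)])
print(focal_loss(logits, targets, gamma=0.0, alpha=0.5))  # gamma=0: plain (alpha-weighted) CE
print(focal_loss(logits, targets))                        # gamma=2: easy negatives collapse
```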

Memorise the quiz answer: Focal Loss adds a multiplicative factor that approaches 0 for well-classified examples and 1 for poorly classified ones. Easy examples are suppressed by orders of magnitude, so the gradient is dominated by hard examples. This fixes the imbalance between many easy negatives and few hard positives in dense detection and segmentation.

What To Walk Into The Exam Carrying

The four classical segmentation paradigms (region-based, boundary-based, motion-based, classification-based) — context only; the deep-learning era took over. The task staircase: classification → classification+localization → object detection → semantic segmentation → instance segmentation, each row adding I/O complexity. FCN: replace FC with conv, encoder for semantics + decoder for spatial upsampling, bottleneck is blurry, arbitrary input size. Transposed convolution: learnable upsampling, input pixel provides weight for filter at output, NOT deconvolution. The local-vs-global dilemma: global for classification, local for boundaries, bottleneck networks lose local detail. U-Net's CONCAT skip connections at each resolution — decoder gets both semantic (bottleneck) and local (skip); sharp boundaries; powers Stable Diffusion. $1 \times 1$ conv = linear layer per spatial location. Mask R-CNN: Faster R-CNN plus a third mask head per RoI per class; decoupled training. RoI Align: bilinear interpolation at float coords with 4 sample points per bin, no rounding, critical for pixel-precise masks. PointRend: adaptive subdivision at uncertain pixels — compute where it matters. Things vs Stuff: countable vs amorphous; panoptic combines both. MiDaS: relative monocular depth with scale-and-shift-invariant loss; trains on heterogeneous depth sources. ZoeDepth: relative pretraining + metric fine-tuning. mIoU is the segmentation metric; Dice's denominator is SUM not union; pixel accuracy is a trap. Focal Loss equation, the modulating factor's behaviour at high and low $p_t$, why it helps class imbalance — that's the most likely high-mark question on this lecture.

That's dense prediction. From per-image labels to per-pixel everything, through FCN, U-Net, Mask R-CNN, PointRend, panoptic, MiDaS, ZoeDepth, and the loss landscape that makes all of them trainable on imbalanced real data.