
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits
Revision Notes · Unit 2

Dense Prediction — Segmentation & Monocular Depth


Intuition

Detection labels regions; dense prediction labels every pixel. The architectural challenge is restoring spatial resolution lost during the encoder's downsampling — either via transposed convolutions (learnable upsampling), skip connections that re-inject shallow high-resolution features (U-Net), or dilated convolutions that grow the receptive field without losing resolution (DeepLab). Beyond segmentation, the same dense-prediction framework handles depth and surface normals.

Explanation

The five flavours of dense prediction. Semantic segmentation: per-pixel class label, no instance distinction (all 'person' pixels share one mask). Instance segmentation: per-pixel (label + instance ID), but only for 'things' (countable objects); 'stuff' (sky, road) is unlabelled. Panoptic segmentation: per-pixel class label for everything, plus instance IDs for things — the most complete scene parsing. Depth estimation: per-pixel continuous distance from camera. Multi-task dense prediction: depth + normals + segmentation + … from one shared encoder with task-specific heads.

The 'urge to group'. Humans don't see individual pixels; we see groups. Segmentation forces machines to do the same. The fundamental question: is APPEARANCE similarity the same as SEMANTIC similarity? Not always — two cats with very different fur colour are semantically the same object; a black cat on a black couch shares appearance but not semantics. The whole field is the fight to learn semantic grouping rather than appearance grouping.

Things vs Stuff (the central distinction the lecture flags on a slide): THINGS are countable objects with proper geometry — person, car, animal — and have distinct instances (Cat #1, Cat #2). STUFF is amorphous regions identified by texture or material — sky, road, water — with no instance distinction. You can have instances of things; you cannot have instances of sky.

Classical taxonomy (pre-DL). Four families to name-drop: region-based (region growing/splitting, iterative merging); boundary-based (edge detection then close boundaries); motion-based (group pixels with consistent motion in video); classification-based (train a classifier on region properties). Adaptive thresholding (Chow & Kaneko) is the canonical example: vary the threshold across the image — low-pass-filter the image, subtract from original, threshold the residual — to handle uneven illumination. Mentioned for context; the rest of the lecture is the deep-learning era.

Fully Convolutional Networks (FCN, Long et al., CVPR 2015, 47 k+ citations) — the paper that started modern segmentation. Breathtakingly simple insight: a CNN classifier ends in a fully-connected layer that requires fixed input size and emits one label. Replace the FC with a convolution that produces a spatial map of labels. Architecture: image → conv blocks (downsampled feature map) → 1×1 conv per-pixel classifier → UPSAMPLE back to image size → per-pixel labels. Four exam-worthy properties: encoder captures semantic info; decoder projects back to pixel space via upsampling; the bottleneck is low-resolution so boundaries come out fuzzy; the model accepts ARBITRARY input size because there's no FC fixing the dimensions. FCN-32s upsamples directly from the deepest layer (coarse). FCN-16s and FCN-8s add skip connections from pool4 and pool3, summing them with the upsampled deep features — sharper boundaries.
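A minimal PyTorch sketch of the idea (shapes and layer sizes assumed, not the paper's exact head): replace the FC classifier with a 1×1 conv, then upsample 32× with a transposed conv.

```python
import torch
import torch.nn as nn

K = 21                                  # e.g., 20 classes + background (assumed)
feats = torch.randn(1, 2048, 7, 7)      # stand-in encoder output for a 224x224 input

classifier = nn.Conv2d(2048, K, kernel_size=1)   # per-position K-way scores
upsample = nn.ConvTranspose2d(K, K, kernel_size=64, stride=32, padding=16)

scores = upsample(classifier(feats))    # learnable 32x upsampling
print(scores.shape)                     # torch.Size([1, 21, 224, 224]): per-pixel logits
```

Because no FC layer fixes the dimensions, the same module runs on any input size: a 12×15 feature map would produce a 384×480 prediction.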

Transposed convolution — learnable upsampling. Standard conv: large input → smaller output (stride 2 means the filter moves 2 pixels in input per 1 pixel in output). Transposed conv flips the relationship — stride 2 means the filter moves 2 pixels in OUTPUT per 1 pixel in INPUT, so the output grows. Mechanism: the input pixel value provides the WEIGHT for the filter placed at the output; place a scaled copy of the filter at each output location corresponding to each input pixel; sum overlapping contributions. Equivalent matrix view: standard conv is $y = Ax$; transposed conv is $y = A^{\top}x$ — same parameters, transpose direction. Do NOT call it 'deconvolution' — that term has a specific signal-processing meaning (inverting a convolution to recover the original signal), and transposed conv doesn't invert anything, it just learns a useful upsampling. PyTorch: nn.ConvTranspose2d(in_channels=16, out_channels=33, kernel_size=3, stride=2).
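A tiny 1-D demonstration of the scaled-filter-copies mechanism (values chosen to make the overlap visible; not from the lecture):

```python
import torch
import torch.nn as nn

# Input [1, 2], stride 2, filter [1, 10, 100]: each input value places a
# scaled copy of the filter 2 output pixels apart; overlaps sum.
x = torch.tensor([[[1.0, 2.0]]])                    # (batch, channels, length)
t = nn.ConvTranspose1d(1, 1, kernel_size=3, stride=2, bias=False)
t.weight.data = torch.tensor([[[1.0, 10.0, 100.0]]])

# Manual: 1*[1,10,100] at outputs 0..2, plus 2*[1,10,100] at outputs 2..4
# -> [1, 10, 100+2, 20, 200]
print(t(x))                                          # tensor([[[1., 10., 102., 20., 200.]]])
```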

The local-vs-global dilemma. Global context is essential for correct classification (to know 'this is a cat' you need to see enough of the cat — pixels alone are ambiguous). Local context is essential for correct localisation (to know where the cat's whiskers end and the background begins, you need fine-grained pixel-level info). Pure encoder-decoder networks like vanilla FCN have a problem: the bottleneck destroys local information; the decoder must reconstruct details from an over-compressed feature map. Result: blurry boundaries.

U-Net (Ronneberger et al., MICCAI 2015). When upsampling in the decoder, CONCATENATE the corresponding encoder feature map (at matching spatial resolution) along the channel dimension before the next conv. The decoder now has access to high-resolution local info that was computed earlier in the encoder, before the bottleneck destroyed it. Diagram shape is a 'U' — encoder going down, decoder coming back up, horizontal skips at each resolution. Result: decoder gets BOTH deep semantic features (from the bottleneck chain) and shallow local features (from skips). Boundaries become sharp. Originally for cell-microscopy segmentation; now powers Stable Diffusion's denoising U-Net. One of the most successful architectures in CV history.
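A sketch of a single U-Net decoder step under assumed channel counts; the point is the concat, not the exact architecture:

```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
conv = nn.Sequential(nn.Conv2d(128 + 128, 128, 3, padding=1), nn.ReLU())

deep = torch.randn(1, 256, 32, 32)    # from the bottleneck side
skip = torch.randn(1, 128, 64, 64)    # saved earlier by the encoder

x = up(deep)                          # (1, 128, 64, 64): resolutions now match
x = torch.cat([x, skip], dim=1)       # (1, 256, 64, 64): CONCAT, not add
x = conv(x)                           # decoder sees semantic + local features
print(x.shape)                        # torch.Size([1, 128, 64, 64])
```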

Dilated (atrous) convolutions. Insert gaps of $r-1$ pixels between kernel taps, where $r$ is the dilation rate. Effective receptive field grows multiplicatively without adding parameters or losing resolution (unlike pooling/strided conv). DeepLab stacks dilated convs to keep the feature map at high resolution while still capturing global context. ASPP (Atrous Spatial Pyramid Pooling) in DeepLab v3 uses parallel branches with rates 6, 12, 18 to capture multi-scale context. Drawback: GRIDDING ARTIFACTS — at large dilation rates the kernel samples a sparse, regular grid, so nearby pixels are never compared; checkerboard outputs. Mitigation: use hybrid dilation rates with no common factor (co-prime rates).
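A sketch of ASPP-style parallel dilated branches (channel counts assumed); setting padding = dilation keeps every 3×3 branch at full resolution:

```python
import torch
import torch.nn as nn

feats = torch.randn(1, 256, 32, 32)
branches = [nn.Conv2d(256, 64, 3, padding=r, dilation=r) for r in (6, 12, 18)]

outs = [b(feats) for b in branches]   # each (1, 64, 32, 32); the dilated 3x3
aspp = torch.cat(outs, dim=1)         # kernels span 13/25/37 pixels respectively
print(aspp.shape)                     # torch.Size([1, 192, 32, 32])
```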

1×1 convolution. Equivalent to applying a linear layer (MLP) at each spatial location independently — no neighbour aggregation, only channel mixing. Three uses: (1) DIMENSIONALITY REDUCTION (squash 1024 channels to 256); (2) cross-channel nonlinearity (1×1 conv + ReLU mixes channels nonlinearly); (3) per-pixel classifier (final layer of FCN: 1×1 conv with K output channels gives K-way classification per pixel). Same trick used in Faster R-CNN's RPN (3×3 conv → two 1×1 conv heads).

Mask R-CNN (He, Gkioxari, Dollár, Girshick, ICCV 2017, 41 k+ citations) — the canonical instance segmentation architecture. Faster R-CNN + a THIRD head: a tiny FCN (a few conv layers on the RoI-pooled feature) that outputs a binary mask per RoI per class. At inference, take the mask of the class predicted by the classification head and upsample to the box size. Trained with per-pixel binary cross-entropy on each RoI, only on the GT-class mask. Why PER-CLASS masks (80 masks per RoI for COCO)? Decouples classification from mask prediction — the mask head doesn't have to also decide WHAT class the object is. Cleaner training. Loss: $L = L_{cls} + L_{box} + L_{mask}$.
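A minimal sketch of the GT-class-only mask loss (shapes assumed): the head emits K masks per RoI, but BCE is computed only on the ground-truth class channel.

```python
import torch
import torch.nn.functional as F

N, K = 8, 80                                   # RoIs in batch, COCO classes
mask_logits = torch.randn(N, K, 28, 28)        # mask head output
gt_class = torch.randint(0, K, (N,))           # GT class per RoI
gt_mask = torch.randint(0, 2, (N, 28, 28)).float()

# Select each RoI's GT-class channel; the other K-1 channels get no gradient.
picked = mask_logits[torch.arange(N), gt_class]             # (N, 28, 28)
loss_mask = F.binary_cross_entropy_with_logits(picked, gt_mask)
```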

RoI Align — the small detail that changes everything. Faster R-CNN's RoI Pool quantises RoI coordinates to integer pixels — fine for classification (a few pixels of misalignment doesn't change 'is this a cat?'), but FATAL for mask prediction where pixel-precise alignment matters. RoI Align replaces quantisation with bilinear interpolation: at each sample point inside a bin (typically 4 sample points per bin), compute the value as a weighted average of the four nearest feature-map cells using bilinear weights. NO ROUNDING. Result: pixel-aligned masks. The lecture frames this as 'a small but critical detail' — Mask R-CNN's mask quality jumps purely from this change.

PointRend (Kirillov et al., CVPR 2020) — adaptive subdivision. The boundary is the hard part of segmentation: interiors are easy ('clearly cat'), edges are ambiguous ('is this pixel cat or carpet?'). PointRend addresses this by: (1) predicting a coarse low-resolution mask; (2) identifying UNCERTAIN pixels (probability close to 0.5 — likely near boundaries); (3) for those specific points, applying a separate point-based MLP that takes a high-resolution feature at that location to make a better prediction; (4) iterating, refining only where it's hard. Analogous to adaptive ray tracing — spend compute only where it matters. Sharper boundaries at modest extra cost.

Panoptic segmentation (Kirillov et al., CVPR 2019). Per-pixel output gives BOTH a semantic label AND an instance ID for things; stuff just gets the semantic label. Most complete scene parsing — useful for autonomous driving (need road as stuff, cars as individual things), robotics, scene understanding. EfficientPS (Mohan & Valada, 2020) is one canonical architecture: shared encoder, two parallel decoders (semantic + instance), fusion module to merge outputs into one panoptic map.

Monocular depth estimation. Predict per-pixel depth from a single image. Geometrically ill-posed (a small near object and a large far object can look identical), but neural networks learn priors from data — typical sizes of things, scene structure, perspective cues — that resolve ambiguity in practice.

MiDaS (Ranftl et al., 2019+) — relative depth from anywhere. Instead of predicting METRIC depth in metres, predict RELATIVE depth (ordinal or ratio). Much easier — you only need to know 'the bench is closer than the tree', not 'the bench is 4.2 m away'. Why this matters: relative-depth signals are everywhere — stereo pairs (disparity → relative depth), 3D movies, structure-from-motion, synthetic data, even web video provides ordinal depth via motion parallax. Loss is SCALE-AND-SHIFT-INVARIANT: align prediction and target by the optimal scale $s$ and shift $t$ minimising $\sum_i (s\,d_i + t - d_i^{*})^2$ before computing the residual. This lets MiDaS train on many heterogeneous datasets that wouldn't combine if forced to a common metric scale, producing a model that generalises to arbitrary images.
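A simplified sketch of the alignment step (plain least-squares alignment plus an L1 residual; MiDaS itself also uses robust/trimmed variants):

```python
import torch

def scale_shift_invariant_loss(pred, target):
    p, t = pred.flatten(), target.flatten()
    # Closed-form (s, b) minimising sum((s*p + b - t)^2): a 2x2 least-squares fit.
    A = torch.stack([p, torch.ones_like(p)], dim=1)          # (M, 2)
    sb = torch.linalg.lstsq(A, t.unsqueeze(1)).solution      # optimal [s, b]
    aligned = (A @ sb).squeeze(1)                            # s*p + b
    return (aligned - t).abs().mean()                        # residual after alignment

pred = torch.rand(1, 64, 64)             # relative depth, unknown scale/shift
target = 3.0 * pred + 1.5                # 'metric' target off by scale 3, shift 1.5
print(scale_shift_invariant_loss(pred, target))   # ~0: alignment removes both
```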

ZoeDepth (Bhat et al., 2023) — relative + metric. Pretrain MiDaS-style on heterogeneous data (relative depth generalises). Then fine-tune SEPARATELY on metric depth datasets (KITTI for outdoor, NYU for indoor) with metric heads. At inference, predict relative depth (still generalises) then convert to metric via the appropriate scene-type head. Zero-shot metric depth.

3D point cloud from depth. Combine a monocular depth map with a pinhole-camera model: back-project each pixel $(u, v)$ to a 3D point via $X = (u - c_x)\,Z / f_x$, $Y = (v - c_y)\,Z / f_y$, $Z = \text{depth}(u, v)$. You now have a 3D point cloud from a single 2D photo — bridge back to the point-cloud chapter. Modern AR apps do this in real-time.
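A back-projection sketch with assumed intrinsics (fx, fy, cx, cy below are placeholders, not calibrated values; depth is in arbitrary units if it comes from a relative-depth model):

```python
import numpy as np

H, W = 480, 640
fx = fy = 525.0                      # focal lengths in pixels (assumed)
cx, cy = W / 2, H / 2                # principal point (assumed)
depth = np.random.rand(H, W) + 1.0   # stand-in depth map Z(u, v)

u, v = np.meshgrid(np.arange(W), np.arange(H))
Z = depth
X = (u - cx) * Z / fx                # X = (u - c_x) Z / f_x
Y = (v - cy) * Z / fy                # Y = (v - c_y) Z / f_y
points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # (H*W, 3) point cloud
```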

Multi-task dense prediction. Do depth, normals, and segmentation share features? Mostly yes — all require understanding scene structure. Eigen & Fergus (CVPR 2015) showed one network with one shared encoder and three task-specific heads can predict all three simultaneously. Precursor to modern 'foundation vision models' like Sapiens (Meta, ECCV 2024) which uses the same recipe at much larger scale.

Losses for dense prediction. Vanilla CE: per-pixel cross-entropy; fails badly when one class dominates (90% sky → model predicts 'sky everywhere' for low loss but useless output). Balanced CE: reweight inversely to class frequency. Dice loss $= 1 - \frac{2|A \cap B|}{|A| + |B|}$: optimises overlap directly; naturally robust to imbalance — common in medical imaging where foreground is small. IoU / GIoU loss: optimises the metric directly. GIoU (Rezatofighi et al., CVPR 2019) extends IoU to handle non-overlapping boxes via the smallest-enclosing-box penalty. FOCAL LOSS (RetinaNet, Lin et al., ICCV 2017): $FL(p_t) = -(1 - p_t)^{\gamma} \log p_t$ with $\gamma = 2$. Modulating factor approaches 0 for well-classified examples ($p_t \to 1$) — up to 100× down-weighting at $p_t = 0.9$. Easy backgrounds suppressed by orders of magnitude; hard positives keep their loss. Fixed the long-standing imbalance that let two-stage detectors beat single-stage. $\gamma = 0$ recovers ordinary CE.
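A minimal focal-loss sketch for binary dense prediction (gamma = 2 as in the paper; the paper's alpha class-weighting term is omitted here for clarity):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)     # probability of the true class
    return ((1 - p_t) ** gamma * ce).mean()       # modulated per-pixel CE

logits = torch.randn(2, 1, 64, 64)
targets = torch.randint(0, 2, (2, 1, 64, 64)).float()
print(focal_loss(logits, targets))   # gamma=0 would recover plain BCE
```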

Definitions

  • Semantic / Instance / Panoptic segmentation: per-pixel class label / class + instance ID for things only / class + instance ID for everything (things and stuff). Panoptic is the most complete.
  • Things vs Stuff: things are countable objects with distinct instances (person, car, animal); stuff is amorphous, uncountable regions identified by texture/material (sky, road, water). Panoptic mixes both correctly.
  • FCN: Fully Convolutional Network (Long et al., CVPR 2015). Replace the classifier's FC with conv → per-pixel class map. Encoder downsamples; decoder upsamples; arbitrary input size.
  • Transposed convolution: learnable upsampling — input pixel scales the filter at the output position; overlapping contributions sum. Equivalent matrix view: $y = A^{\top}x$ versus standard conv's $y = Ax$. NOT 'deconvolution'.
  • U-Net: symmetric encoder-decoder with CONCAT skip connections at every resolution. Decoder gets deep semantic + shallow local features. Originally MICCAI 2015 medical imaging; now ubiquitous (Stable Diffusion's denoiser is a U-Net).
  • Dilated/atrous convolution: conv with gaps of size $r-1$ between kernel taps (dilation rate $r$); multiplicatively expands the receptive field without parameter growth or resolution loss. DeepLab's foundation. Risk: gridding artifacts at high rates.
  • ASPP (Atrous Spatial Pyramid Pooling): parallel branches of atrous convs at multiple rates (e.g., 6, 12, 18) for multi-scale context. Used in DeepLab v3.
  • Mask R-CNN: Faster R-CNN + a third FCN head producing a binary mask per RoI per class. ICCV 2017, 41 k+ citations. Loss $= L_{cls} + L_{box} + L_{mask}$.
  • RoI Align: bilinear interpolation at exact float coordinates with 4 sample points per bin — NO rounding. Replaces RoI Pool's quantisation; critical for pixel-precise masks.
  • PointRend: adaptive boundary refinement (Kirillov et al., CVPR 2020). Coarse mask → identify uncertain pixels (prob ≈ 0.5) → point-MLP refinement on high-res features → iterate. Sharper boundaries at modest extra cost.
  • Dice coefficient: $\frac{2|A \cap B|}{|A| + |B|}$. Same ranking as IoU. Denominator is the SUM, not the union. Dice slightly favours small objects.
  • mIoU: mean IoU across classes. The standard segmentation metric — robust to class imbalance, unlike pixel accuracy.
  • MiDaS: relative monocular depth (Ranftl et al., 2019+). Scale-and-shift-invariant L1 loss lets training combine heterogeneous depth sources (stereo, SfM, synthetic, web video).
  • ZoeDepth: MiDaS-style relative depth pretraining + metric depth fine-tuning on KITTI/NYU. Zero-shot transfer to metric depth (Bhat et al., 2023).
  • Focal loss: $FL(p_t) = -(1 - p_t)^{\gamma} \log p_t$ with $\gamma = 2$. Modulating factor crushes easily classified examples; rebalances dense detection / segmentation. RetinaNet, ICCV 2017.

Formulas
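
  • IoU $= \frac{|A \cap B|}{|A \cup B|}$; Dice $= \frac{2|A \cap B|}{|A| + |B|} = \frac{2\,\text{IoU}}{1 + \text{IoU}}$; Dice loss $= 1 - \text{Dice}$.
  • Focal loss: $FL(p_t) = -(1 - p_t)^{\gamma} \log p_t$ with $\gamma = 2$; $\gamma = 0$ recovers CE.
  • Mask R-CNN loss: $L = L_{cls} + L_{box} + L_{mask}$; the mask term is per-pixel BCE on the GT-class channel only.
  • Pinhole back-projection: $X = (u - c_x)\,Z / f_x$, $Y = (v - c_y)\,Z / f_y$, $Z = \text{depth}(u, v)$.
  • Scale-and-shift-invariant depth loss: align by the $(s, t)$ minimising $\sum_i (s\,d_i + t - d_i^{*})^2$, then penalise the residual.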

Derivations

**Dice ↔ IoU.** Let $I = |A \cap B|$ and $U = |A \cup B|$. Then $\text{IoU} = I/U$. By inclusion-exclusion, $|A| + |B| = I + U$. Substituting into Dice's formula: $\text{Dice} = \frac{2I}{|A| + |B|} = \frac{2I}{I + U} = \frac{2\,\text{IoU}}{1 + \text{IoU}}$. Hence both metrics rank predictions identically (monotonic in $\text{IoU}$); their values differ except at $0$ and $1$.

**Focal loss numerics — why $\gamma = 2$ helps imbalance.** Standard CE: $CE(p_t) = -\log p_t$. Well-classified background pixel with $p_t = 0.9$: loss $\approx 0.105$. With 10,000 such background pixels: collective loss $\approx 1050$. Now consider 10 hard foreground pixels at $p_t = 0.1$: loss $\approx 2.3$ each, collective loss $\approx 23$. The easy background dominates by roughly 45×. Focal Loss multiplies by $(1 - p_t)^{\gamma}$. At $\gamma = 2$: $(1 - 0.9)^2 = 0.01$ for the easy pixel (100× suppression) and $(1 - 0.1)^2 = 0.81$ for the hard pixel (~unchanged). Now the easy background contributes $\approx 10.5$, hard foreground contributes $\approx 19$ — hard examples now dominate. The gradient is dominated by what the model gets wrong, not by what it's already right about.

Why pixel accuracy is a trap. A scene where 95% of pixels are background. A trivial model that predicts 'background everywhere' scores 0.95 pixel accuracy — looks great. But its per-class recall on the 5% foreground is 0. mIoU averages IoU per class — for the background class IoU is 0.95 (the model nailed the dominant class), but for the foreground class IoU is 0 (no overlap at all). $\text{mIoU} = (0.95 + 0)/2 = 0.475$ — much harsher and correctly informative. Always use mIoU on imbalanced segmentation.
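The same trap, numerically, in a short NumPy sketch (shapes and class count assumed):

```python
import numpy as np

def miou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.zeros((100, 100), dtype=int); gt[:5] = 1   # 95% background, 5% foreground
pred = np.zeros_like(gt)                           # 'background everywhere'
print((pred == gt).mean())                         # pixel accuracy: 0.95
print(miou(pred, gt, 2))                           # mIoU: (0.95 + 0) / 2 = 0.475
```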

RoI Align bilinear sampling. For a $7 \times 7$ RoI Align output, project the floating-point RoI onto the feature map and divide it into 49 bins (each with floating-point boundaries — no rounding!). Per bin, place 4 regularly-spaced SAMPLE POINTS. At each sample point $(x, y)$ — float coordinates — the feature value is the bilinear interpolation of the 4 surrounding feature-map cells: with $x_0 = \lfloor x \rfloor$, $y_0 = \lfloor y \rfloor$, $dx = x - x_0$, $dy = y - y_0$, the value is $(1-dx)(1-dy)\,f_{00} + dx(1-dy)\,f_{10} + (1-dx)\,dy\,f_{01} + dx\,dy\,f_{11}$. Average (or max) the 4 sampled values for the bin's output. Differentiable end-to-end. Critical for mask quality.
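The sampling step as a few lines of NumPy (single-channel feature map assumed; real RoI Align runs this at 4 points per bin):

```python
import numpy as np

def bilinear_sample(f, x, y):
    # Sample feature map f at float coordinates (x, y). No rounding anywhere.
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * f[y0,     x0    ] +
            dx       * (1 - dy) * f[y0,     x0 + 1] +
            (1 - dx) * dy       * f[y0 + 1, x0    ] +
            dx       * dy       * f[y0 + 1, x0 + 1])

f = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(f, 1.5, 2.5))   # 11.5: weighted average of f[2:4, 1:3]
```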

Examples

  • **FCN's 1×1 conv replaces the classifier.** ResNet-50 backbone outputs a $7 \times 7 \times 2048$ feature map on a $224 \times 224$ input. Replace the global-pool-then-FC head with a $1 \times 1$ conv with K output channels (where K = number of classes). Output: $7 \times 7 \times K$ — a per-position class map. Upsample via transposed conv to recover the original $224 \times 224$ prediction. Skip connections from earlier layers (pool3 at stride 8, pool4 at stride 16) sharpen boundaries (FCN-8s).
  • Pixel accuracy fails on imbalanced scenes. 90% of pixels are sky → 'predict sky everywhere' gives 0.9 pixel accuracy but ≈ 0 mIoU because the rare classes have IoU 0.
  • Medical segmentation uses Dice loss. Foreground (tumour) is typically < 5% of voxels. Vanilla CE is dominated by easy background gradients; the model converges to 'all background'. Dice loss optimises the foreground overlap directly — naturally robust. Common practical loss: $L = L_{\text{Dice}} + \lambda\,L_{\text{CE}}$ with a weighting $\lambda$ — Dice for imbalance robustness + CE for stable early gradients (see the sketch after this list).
  • DeepLab v3 ASPP uses parallel atrous branches with dilation rates 6, 12, 18. Each branch sees the same feature map at a different effective receptive field — multi-scale context without resolution loss.
  • Mask R-CNN per-class mask numerics. COCO has 80 classes. Mask head output per RoI: $28 \times 28 \times 80$ — 62,720 logits per RoI. Only the mask for the predicted class is used at inference (e.g., if class = 'cat', use mask[15] of the 80 channels). Training: loss computed only on the GT-class channel — other channels get no gradient for that example.
  • Focal loss in numbers. $\gamma = 2$ table for the modulating factor $(1 - p_t)^2$: $p_t = 0.9 \to 0.01$ (100× down-weighted); $p_t = 0.7 \to 0.09$ (~11×); $p_t = 0.5 \to 0.25$ (4×); $p_t = 0.3 \to 0.49$ (~2×); $p_t = 0.1 \to 0.81$ (barely changed). Tunable: $\gamma = 0$ recovers vanilla CE; higher $\gamma$, more aggressive focusing.
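For the medical-segmentation example above, a minimal soft-Dice sketch (binary case, epsilon smoothing assumed; the combined Dice + CE variant just adds the two terms):

```python
import torch

def dice_loss(probs, targets, eps=1e-6):
    p, t = probs.flatten(), targets.flatten()
    inter = (p * t).sum()
    return 1 - (2 * inter + eps) / (p.sum() + t.sum() + eps)

probs = torch.sigmoid(torch.randn(1, 1, 64, 64))
targets = torch.zeros(1, 1, 64, 64); targets[..., :4, :] = 1   # small foreground
print(dice_loss(probs, targets))   # driven by foreground overlap, not background
```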

Diagrams

  • FCN-32s vs FCN-8s side-by-side: 32× upsample directly from the deepest layer vs adding pool3 and pool4 skip connections fused with upsampled deep features. Annotate where boundary sharpness is gained.
  • U-Net U-shape: encoder downsamples on the left, decoder upsamples on the right, four horizontal CONCAT skips at each resolution. Mark the bottleneck.
  • Transposed conv mechanism: input pixel value scales a copy of the filter placed at the output; overlapping contributions sum.
  • RoI Pool vs RoI Align side-by-side: Pool quantises twice (RoI → cells, cell → sub-bin); Align uses 4 bilinear sample points per bin at exact float coordinates.
  • Mask R-CNN architecture: shared backbone + RPN + per-RoI three-headed module (cls + box + mask FCN); mask channel selection at inference.
  • PointRend adaptive subdivision: coarse mask → identify pixels with prediction near 0.5 → MLP refinement on high-res features at those points → iterate.
  • Things vs Stuff on an example image: cars labelled with instance IDs, sky/road labelled with category only.

Edge cases

  • Gridding artifacts in dilated convs. At high dilation rates the kernel samples a sparse, regular grid — nearby pixels are never compared, producing checkerboard outputs. Mitigate with co-prime hybrid dilation rates (e.g., rates like (1, 2, 3) instead of (2, 2, 2)).
  • RoI Pool fails on small objects. Sub-pixel quantisation error dominates when the proposal is tiny. RoI Align is the fix.
  • Pixel accuracy is misleading on imbalanced scenes — 'predict majority class everywhere' scores high. Always use mIoU (or Dice) per class then average.
  • MiDaS depth is RELATIVE. Don't use raw MiDaS values as metric distances — they're up to an unknown scale and shift. Use ZoeDepth or calibration if you need metric units.
  • Boundary errors dominate on small objects. A few pixels of boundary error on a 20-px object is half its mass. PointRend explicitly addresses this; standard FCN/U-Net produce smeared edges on small things.
  • 'Sky everywhere' adversarial input. A model trained on imbalanced data can fail catastrophically on rare-class images (e.g., a CT scan of a tumour-heavy patient where 'no tumour' isn't 95% anymore). Always evaluate on stratified test sets.
  • Stuff with instances is undefined. You cannot have 'sky #1, sky #2' — the panoptic spec forbids it. Make sure your annotation pipeline doesn't accidentally create stuff instance IDs.

Common mistakes

  • Conflating semantic and instance segmentation in panoptic. Panoptic is BOTH — per-pixel class label for everything, instance IDs only for things.
  • Using pixel accuracy as the headline metric. A 95% pixel accuracy on a 5%-foreground scene means nothing. Report mIoU.
  • Computing Dice with union in the denominator. Dice is $\frac{2|A \cap B|}{|A| + |B|}$ — the denominator is the SUM, not the union. IoU has the union.
  • Calling transposed convolution 'deconvolution'. Different operations — deconvolution inverts a known convolution; transposed conv just learns an upsampling kernel.
  • Forgetting that MiDaS depth is relative. Many students assume the output is in metres; it isn't. Plot it as a heatmap, not metric values.
  • Treating the mask head as class-agnostic. Mask R-CNN's mask head outputs K masks per RoI; only the predicted-class mask is used. Class-agnostic single-mask variants exist but lose some accuracy.
  • Mixing things-style instance IDs with stuff classes. Stuff has no instances. The panoptic format gives stuff classes a single implicit instance ID (typically 0).

Shortcuts

  • U-Net skip = CONCAT. ResNet skip = ADD. Don't confuse.
  • Transposed conv is LEARNABLE UPSAMPLING. NOT 'deconvolution' — wrong term.
  • Mask R-CNN mask head loss = per-pixel BCE on the GT-class mask only; other 79 classes get no gradient.
  • Dice ↔ IoU: $\text{Dice} = \frac{2\,\text{IoU}}{1 + \text{IoU}}$. Same ranking, different values.
  • MiDaS depth = RELATIVE. ZoeDepth = relative + metric heads.
  • Focal loss: $FL(p_t) = -(1 - p_t)^{\gamma} \log p_t$, $\gamma = 2$ in the original paper. $\gamma = 0$ recovers CE.
  • Things = countable (have instances). Stuff = amorphous (no instances).

Proofs / Algorithms

Why mIoU is fair under class imbalance. Suppose class 1 has $N_1$ pixels and class 2 has $N_2 \ll N_1$. Pixel accuracy averages over pixels, so class 1 dominates by a factor $N_1 / N_2$. mIoU is $\frac{1}{2}(\text{IoU}_1 + \text{IoU}_2)$ — each class contributes equally regardless of pixel count. A trivial 'always class 1' predictor has $\text{IoU}_1 = \frac{N_1}{N_1 + N_2}$ but $\text{IoU}_2 = 0$, so $\text{mIoU} = \frac{N_1}{2(N_1 + N_2)} \approx 0.5$ — much harsher than its pixel accuracy of $\frac{N_1}{N_1 + N_2} \approx 1$.

Focal loss bounded by CE. $FL(p_t) = (1 - p_t)^{\gamma}\,CE(p_t) \le CE(p_t)$ since $(1 - p_t)^{\gamma} \le 1$ for $p_t \in [0, 1]$ and $\gamma \ge 0$. Equality at $p_t = 0$ (or $\gamma = 0$). So FL is always no larger than CE; the difference grows with $\gamma$ on well-classified examples.