Object Detection — R-CNN family, YOLO, NMS, mAP
Intuition
Detection answers two questions per object: 'where' (a bounding box) and 'what' (a class). The history of the field is essentially the story of moving the expensive parts of that pipeline — region proposals, per-region computation, ranking — onto the GPU and into a single shared backbone. By 2016 the two-stage R-CNN family had matured (R-CNN → Fast → Faster); YOLO then asked 'why two stages at all?' and reframed detection as a single regression problem in one CNN forward pass.
Explanation
The task staircase. Classification predicts one label per image. Classification + localisation adds one box (x, y, w, h) for an image known to contain ONE object. Detection drops that assumption — variable number of objects per image, each needing a (label, box) pair. Instance segmentation goes further: per-instance pixel masks. Detection is harder than localisation for three concrete reasons: variable-length output (1 object vs 47), mixed output types (discrete class + continuous box), and resolution (classification works at ~224 × 224; detection typically needs something like ~800 × 600 because tiny objects vanish at low resolution).
Modal vs amodal boxes. A modal box covers only the visible portion (the dog's box stops at the chair that occludes it). An amodal box covers the full extent even through occlusion. PASCAL VOC, COCO and standard benchmarks use MODAL boxes — this is a common one-mark trap.
IoU = Jaccard index: IoU(A, B) = |A ∩ B| / |A ∪ B|. Symmetric. Calibration: > 0.5 decent, > 0.7 good, > 0.9 near-perfect. Important warning: IoU = 0.5 is NOT 'boxes overlap by 50%' — two equal squares each covering 50% of the other have IoU of only 1/3; they must share 2/3 of each other to reach IoU = 0.5. Don't reason from it as a percentage.
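A minimal IoU sketch in NumPy (corner-format boxes and the variable names are assumptions, not from the lecture); it also checks the 50%-overlap counter-example:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two equal unit squares sharing half their area: IoU is 1/3, not 0.5.
print(iou(np.array([0, 0, 1, 1]), np.array([0.5, 0, 1.5, 1])))  # ≈ 0.333
```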
Idea #1 — localisation as regression. With exactly one object per image, slap a regression head on a CNN that outputs four numbers (x, y, w, h), trained with L2 loss against the ground-truth box. Two flavours exam questions test: class-AGNOSTIC (4 outputs, single box) vs class-SPECIFIC (C × 4 outputs, one box per class). Generalises to K objects only if K is fixed — not real detection.
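A hedged sketch of the two head flavours on top of an assumed 2048-d backbone feature (dimensions and names are illustrative):

```python
import torch
import torch.nn as nn

C = 20            # number of classes (assumption: PASCAL-style)
feat_dim = 2048   # assumed backbone feature size

# Class-agnostic: one box regardless of class.
agnostic_head = nn.Linear(feat_dim, 4)               # (x, y, w, h)

# Class-specific: one box hypothesis per class.
specific_head = nn.Linear(feat_dim, C * 4)           # C boxes; read the scored class's row

feats = torch.randn(8, feat_dim)                      # batch of 8 image features
boxes_agnostic = agnostic_head(feats)                 # shape (8, 4)
boxes_specific = specific_head(feats).view(8, C, 4)   # shape (8, 20, 4)
# L2 regression against ground-truth boxes (zeros stand in for GT here).
loss = nn.functional.mse_loss(boxes_agnostic, torch.zeros_like(boxes_agnostic))
```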
Idea #2 — sliding window (Overfeat, ICLR 2014, ILSVRC 2013 winner). Run the network at every position and scale; classify each window; regress a box; merge. The clever trick: convert fully-connected layers into 1×1 convolutions so all positions are handled by a single forward pass — no Python-level sliding loop. Cost: still expensive across scales; can't gracefully handle variable object counts.
Idea #3 — propose then classify. Treat detection as classification on a small set of CANDIDATE windows. The class-agnostic proposer is Selective Search (Uijlings et al., IJCV 2013): a classical bottom-up segmentation algorithm that starts from oversegmentation and greedily merges similar regions using colour, texture, size and fill similarity measures. Output: ~2000 boxes per image. Not learned, runs on CPU, fast enough — and the workhorse for the entire R-CNN family.
R-CNN (Girshick, CVPR 2014). Five training steps drawn straight from the lecture: (1) pretrain a CNN on ImageNet; (2) fine-tune the classification head to 21 classes (20 PASCAL classes + background) on detection data; (3) extract pool5 features for every proposal — about 2000 per image, each warped to the fixed CNN input size (224 × 224) — and cache to disk (about 200 GB for the PASCAL set!); (4) train one binary SVM per class on those cached features; (5) train a class-specific box regressor predicting offsets in normalised coords. Test time: proposals → warp → CNN → SVM scores + box refinement → NMS → done. Three deep flaws: SLOW at test time (~47 s/image because ~2000 CNN forwards per image on overlapping crops); POST-HOC training (CNN is frozen before the SVMs and regressors are trained, so the features can't adapt to what's actually useful); MULTI-STAGE pipeline (five separate stages, none end-to-end).
Fast R-CNN (Girshick, ICCV 2015). Key insight: 2000 forward passes are wasteful because the proposals all come from the same image — the early conv layers redo the same work. Run the backbone ONCE on the full image, then crop per-proposal from the feature map via RoI Pooling: project the proposal box onto the feature map (scale by the network's stride), divide the projected region into a fixed grid (typically 7 × 7), and max-pool inside each cell. Output is a uniform tensor per proposal — same shape for every region regardless of proposal size. RoI Pool is differentiable, so the whole network trains end-to-end with one multi-task loss (CE on classification + smooth-L1 on box regression). Numbers: training ~8.8× faster and testing ~146× faster than R-CNN, mAP 66.9 on VOC 2007. Problems #2 (post-hoc) and #3 (multi-stage) are gone.
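torchvision ships an RoI Pool operator; a minimal usage sketch, with made-up feature-map size, stride and proposal coordinates:

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 512, 38, 50)             # backbone output; stride assumed to be 16
# One proposal in IMAGE coordinates (x1, y1, x2, y2), batch index prepended.
rois = torch.tensor([[0, 48.0, 64.0, 400.0, 300.0]])
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                            # torch.Size([1, 512, 7, 7]) — fixed size for any proposal
```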
Faster R-CNN (Ren et al., NIPS 2015) — the RPN. Fast R-CNN's 0.32 s test time hides a ~2 s Selective Search step running on CPU. Replace that with a Region Proposal Network that shares the CNN backbone. After the last conv layer, slide a 3×3 conv over the feature map; two 1×1 conv heads branch off — one for objectness (binary: 'is this anchor an object?'), one for box regression. At each spatial location the RPN predicts k = 9 anchors — typically 3 scales × 3 aspect ratios. Outputs per location: 9 × (2 + 4) = 54 numbers. Two properties matter: TRANSLATION INVARIANCE (same anchor set at every location — the RPN learns what objects look like, not where they live) and ANCHOR-RELATIVE OFFSETS (regress (t_x, t_y, t_w, t_h) relative to the anchor — much easier to learn than raw coordinates because the anchor already gives a rough size/shape prior). Once the RPN hands over its top-scoring proposals, the rest is Fast R-CNN. End-to-end test: ~0.2 s/img, mAP 66.9, ~250× speedup over R-CNN. Two stages, one network.
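A sketch of the anchor-relative decode step (Faster R-CNN-style parameterisation; the function and variable names are mine):

```python
import numpy as np

def decode(anchor, deltas):
    """Apply offsets (tx, ty, tw, th) to an anchor given as (cx, cy, w, h)."""
    cx_a, cy_a, w_a, h_a = anchor
    tx, ty, tw, th = deltas
    cx = cx_a + tx * w_a            # shift the centre, scaled by anchor size
    cy = cy_a + ty * h_a
    w = w_a * np.exp(tw)            # log-space size offsets
    h = h_a * np.exp(th)
    return np.array([cx, cy, w, h])

# Zero offsets return the anchor itself — the anchor is the prior.
print(decode(np.array([100, 100, 128, 256]), np.zeros(4)))
```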
YOLO v1 (Redmon et al., CVPR 2016) — single-shot. Reframe detection as a SINGLE regression problem. Resize the image to 448 × 448, divide into an S × S grid (S = 7), and let EACH CELL be responsible for the object whose CENTRE lies in that cell (not 'overlapping' — centre). Each cell outputs B = 2 candidate boxes, each with 5 numbers — centre (x, y) relative to the CELL, size (w, h) relative to the WHOLE IMAGE, objectness — plus C = 20 class probabilities SHARED across the cell's boxes. Output tensor: 7 × 7 × (B·5 + C) = 7 × 7 × 30. Why two boxes? At training time both predict and we pick whichever has higher IoU with the ground truth as the 'responsible' predictor (the other is trained toward objectness 0). The two boxes specialise — one tends to learn tall, the other wide.
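A sketch of the output tensor layout and of picking the responsible cell (the tensor values and the object centre are made up):

```python
import torch

S, B, C = 7, 2, 20
pred = torch.randn(S, S, B * 5 + C)        # 7 × 7 × 30 output tensor

# Which cell is responsible for an object? The one containing its centre.
img_size = 448
cx, cy = 310.0, 120.0                      # hypothetical object centre in pixels
cell_col, cell_row = int(cx / img_size * S), int(cy / img_size * S)
print(cell_row, cell_col)                  # this single cell owns the object

cell = pred[cell_row, cell_col]
boxes = cell[:B * 5].view(B, 5)            # two (x, y, w, h, objectness) candidates
class_probs = cell[B * 5:]                 # one shared 20-way class distribution
```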
YOLO loss decomposition. Sum of squared errors, summed cell-by-cell, with two scaling weights to fix the object/no-object imbalance: λ_coord = 5 upweights box-coord errors; λ_noobj = 0.5 downweights no-object confidence loss. Five terms: (1) box-centre (x, y) on object cells; (2) box-size (√w, √h) on object cells (the square root makes equal absolute pixel error matter more on small boxes — exam gold); (3) objectness on object cells (target = IoU with GT); (4) objectness on no-object cells (target = 0, weighted by λ_noobj); (5) classification on object cells. Test-time post-processing: per-class confidence = box objectness × class prob; threshold low scores; per-class NMS to clean overlaps.
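A sketch of that test-time scoring step on an assumed 7 × 7 × 30 output (the confidence threshold is illustrative):

```python
import torch

S, B, C = 7, 2, 20
pred = torch.rand(S, S, B * 5 + C)                  # stand-in for the network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)       # (x, y, w, h, objectness) per box
objectness = boxes[..., 4]                          # shape (7, 7, 2)
class_probs = pred[..., B * 5:]                     # shape (7, 7, 20), shared per cell

# Per-class confidence = objectness × class probability, broadcast over the B boxes.
scores = objectness.unsqueeze(-1) * class_probs.unsqueeze(2)   # (7, 7, 2, 20)
keep = scores > 0.2                                 # assumed confidence threshold
# ...then per-class NMS on the surviving boxes.
```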
YOLO v1's three limitations (the most likely exam question): (1) AT MOST 49 OBJECTS detectable — 7 × 7 grid, one object per cell, so 50+ objects can't all be represented; (2) STRUGGLES WITH SMALL CLUSTERED OBJECTS (birds in a flock, faces in a crowd — multiple centres fall in one cell, only one survives); (3) POOR LOCALISATION because direct regression is harder than refining anchors. v2 onward fixed (3) by reintroducing anchors. YOLO generalised surprisingly well to non-natural images (paintings) where Selective Search struggled.
Two-stage vs single-stage trade-off. Two-stage (R-CNN family): propose → classify; historically higher mAP; slower; output is variable per-image. Single-stage (YOLO, SSD, RetinaNet): one forward pass; real-time; fixed-size tensor; historically lower mAP, closed by v3+ and Focal Loss. Two-stage 'thinks like a careful detective', single-stage 'thinks like a reflex'. For real-time (self-driving, robotics, video surveillance) use single-stage; for accuracy-critical (medical, security) use two-stage.
Losses for better gradients. GIoU adds a penalty −|C \ (A ∪ B)| / |C| (with C = smallest enclosing box) so non-overlapping boxes still have non-zero gradient — pure IoU loss stalls at 1 − IoU = 1 when boxes are disjoint. Focal Loss, FL(p_t) = −(1 − p_t)^γ · log p_t (with γ = 2), multiplies cross-entropy by a modulating factor that crushes easy-classified examples (high p_t) — essential for single-stage detectors where the background class dominates anchor counts (tens of thousands of background anchors per image vs a handful of foreground).
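A hedged focal-loss sketch (binary, RetinaNet-style; the α balancing weight is the paper's extra term, not discussed above):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# An easy negative (logit -6) contributes almost nothing; a hard one (logit +1) dominates.
print(focal_loss(torch.tensor([-6.0, 1.0]), torch.tensor([0.0, 0.0])))
```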
Soft-NMS variants. Hard NMS zeros the score of suppressed boxes. Soft-NMS multiplies by a decay: linear s_i ← s_i (1 − IoU(M, b_i)) or Gaussian s_i ← s_i · exp(−IoU(M, b_i)² / σ). The Gaussian variant lets nearby legitimate objects (two pedestrians shoulder-to-shoulder whose boxes overlap above the suppression threshold) keep some confidence rather than vanishing entirely.
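A Gaussian Soft-NMS sketch (greedy O(n²) loop kept deliberately simple; iou_fn, σ and the score floor are assumptions):

```python
import numpy as np

def soft_nms_gaussian(boxes, scores, iou_fn, sigma=0.5, score_thresh=0.001):
    """Decay overlapping scores instead of zeroing them; iou_fn(a, b) -> scalar IoU."""
    scores = scores.astype(float).copy()
    keep = []
    idx = list(range(len(scores)))
    while idx:
        best = max(idx, key=lambda i: scores[i])      # highest remaining score
        keep.append(best)
        idx.remove(best)
        for i in idx:
            ov = iou_fn(boxes[best], boxes[i])
            scores[i] *= np.exp(-(ov ** 2) / sigma)    # Gaussian decay, never a hard zero
        idx = [i for i in idx if scores[i] > score_thresh]
    return keep, scores
```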
Evaluation in detail — Precision, Recall, AP, mAP. Fix a class and an IoU threshold (say 0.5). Each detection is a TP (matches an unmatched GT box with IoU ≥ the threshold) or an FP (matches nothing, or claims an already-matched GT); each GT box left unmatched counts as an FN. Precision = TP / (TP + FP) — fraction of your detections that are correct. Recall = TP / (TP + FN) — fraction of true objects you found. Sweep the confidence threshold: high threshold → high precision, low recall; low threshold → vice versa. AP = area under the resulting PR curve. mAP = mean over classes. VOC reports mAP@0.5 with 11-point interpolation (precisions at recall = 0, 0.1, …, 1.0 averaged). COCO reports the AVERAGE of mAP@{0.50, 0.55, …, 0.95} — ten thresholds, 101-point interpolation — which is why COCO numbers look much smaller than VOC's despite the same detector.
Datasets to name-drop. PASCAL VOC 2010: 20 classes, ~20 k images, 2.4 objects/image. ImageNet Detection (ILSVRC 2014): 200 classes, ~470 k images, but only 1.1 objects/image — almost every image has one centred object, so ImageNet is great for classification and BAD for detection benchmarking. MS-COCO 2014: 80 classes, ~120 k images, 7.2 objects/image — the modern standard for crowded-scene evaluation.
What comes next. Mask R-CNN adds an instance-segmentation mask head on top of Faster R-CNN (Unit 2). FPN (Feature Pyramid Network) adds multi-scale features via top-down lateral connections so the same detector can find both small and large objects — standard in every modern detector. RetinaNet introduced Focal Loss to let single-stage finally beat two-stage. DETR (2020) reformulates detection as set prediction with attention — no anchors, no NMS.
Definitions
- Bounding box (modal vs amodal) — Modal: covers only the visible portion of the object; standard in PASCAL VOC, COCO, KITTI. Amodal: covers the full extent including occluded parts; used in specialised benchmarks.
- Anchor — A predefined box prior at fixed scale and aspect ratio; predictions are regression offsets from it. Faster R-CNN uses 9 anchors per location (3 scales × 3 ratios).
- Selective Search — Classical bottom-up segmentation (Uijlings et al., IJCV 2013). Starts from oversegmentation, greedily merges similar regions using colour + texture + size + fill similarity. Produces ~2000 class-agnostic proposals per image. Workhorse of R-CNN and Fast R-CNN; not learned.
- RoI Pool — Project the proposal to the feature map (quantising to integer cells), divide into a fixed grid (typically 7 × 7), max-pool per cell. Differentiable, but the two roundings misalign the pooled features against the input image — fatal for masks.
- RPN (Region Proposal Network) — Faster R-CNN's learned, backbone-sharing proposal generator. 3×3 conv → two 1×1 conv heads (objectness + box regression). Translation-invariant: same anchor set at every spatial location.
- FPN (Feature Pyramid Network) — Multi-scale feature representation with top-down lateral connections (Lin et al., CVPR 2017). High-resolution shallow features + low-resolution deep features fused via lateral convs. Lets one detector head handle small + large objects together. Standard in every modern detector.
- GIoU — Generalised IoU. Adds the penalty −|C \ (A ∪ B)| / |C| where C is the smallest enclosing box. Non-zero gradient even when boxes don't overlap. Bounded in (−1, 1].
- Focal Loss — Cross-entropy multiplied by (1 − p_t)^γ with γ = 2 (Lin et al., ICCV 2017). Crushes the gradient contribution of easy-classified examples — essential for single-stage detectors where background anchors vastly outnumber foreground.
- Non-Maximum Suppression — Per-class procedure: sort detections by score; keep the top one; suppress all others with IoU > t; repeat. t ≈ 0.5 typical. Fails in dense crowds where legitimate overlapping objects get suppressed (a minimal code sketch follows this list).
- Soft-NMS — Variant that DECAYS suppressed scores instead of zeroing them. Linear: s_i ← s_i (1 − IoU(M, b_i)). Gaussian: s_i ← s_i · exp(−IoU(M, b_i)² / σ). Lets nearby legitimate objects coexist.
- mAP — Mean of per-class Average Precision (area under PR curve). VOC uses 11-point interpolation at IoU = 0.5. COCO averages mAP over IoU ∈ {0.50, 0.55, …, 0.95} (101-point interpolation per threshold). COCO numbers are smaller because the metric is stricter.
- Smooth-L1 loss — Quadratic near zero (smooth gradient at the minimum), linear elsewhere (robust to outliers). Used for box regression in Fast/Faster R-CNN: smooth_L1(x) = 0.5x² if |x| < 1, else |x| − 0.5.
- Overfeat — Sermanet et al., ICLR 2014 (ILSVRC 2013 detection winner). Sliding-window classification + box regression, made practical by converting FC layers into 1×1 convs so the whole image is processed in one forward pass.
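Hard NMS from the definition above, as a minimal per-class sketch (iou_fn is any pairwise-IoU helper like the one earlier in these notes):

```python
import numpy as np

def nms(boxes, scores, iou_fn, thresh=0.5):
    """Greedy NMS: keep the highest score, drop heavy overlaps, repeat.
    Apply separately to each class's detections."""
    order = np.argsort(scores)[::-1]        # indices sorted by score, descending
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        ious = np.array([iou_fn(boxes[best], boxes[i]) for i in rest])
        order = rest[ious <= thresh]        # suppress everything overlapping too much
    return keep
```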
Formulas
Derivations
GIoU restores gradient on non-overlapping boxes. Pure IoU is zero whenever A ∩ B = ∅, so its gradient with respect to the box coordinates is zero — no signal pulls a far-away predicted box toward the ground truth. GIoU adds −|C \ (A ∪ B)| / |C| where C is the smallest enclosing box. Off-overlap, this term varies smoothly with the relative position of A and B, restoring a useful gradient direction. Bound: when A = B, IoU = 1 and the enclosing-box term is 0 so GIoU = 1; when A and B are tiny and far apart, IoU → 0 and the term → 1, so GIoU → −1. Hence GIoU ∈ (−1, 1].
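A GIoU sketch that also shows the restored ranking between two disjoint predictions at different distances:

```python
def giou(a, b):
    """GIoU for corner-format boxes [x1, y1, x2, y2] (sketch)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box C.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (area_c - union) / area_c

# Disjoint boxes: IoU is 0 for both pairs, but GIoU still ranks the nearer one higher.
print(giou([0, 0, 1, 1], [1.5, 0, 2.5, 1]))   # ≈ -0.20
print(giou([0, 0, 1, 1], [5.0, 0, 6.0, 1]))   # ≈ -0.67
```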
**Why √w, √h in YOLO's size loss.** Without the square root: a 10-pixel error on a 400-pixel-wide box and a 10-pixel error on a 20-pixel-wide box contribute equally. But the small-box error matters far more in human terms (10 px is half the object!). Predicting √w and √h instead: d√w/dw = 1/(2√w), so the GRADIENT of the loss with respect to w scales as 1/√w — small boxes get a larger gradient per pixel of width error, big boxes a smaller one. Balances localisation across scales.
RoI Pool quantisation arithmetic — worked example (illustrative numbers). Suppose the backbone stride is 16 and Selective Search returns a proposal at image coordinates (x1, y1, x2, y2) = (50, 40, 330, 250). Project to the feature map by dividing by the stride: float bounds (3.125, 2.5, 20.625, 15.625). RoI Pool ROUNDS these to integer cells — already a shift of up to half a cell, i.e. up to 8 pixels per edge, in the original image. Now divide the snapped region into a 7 × 7 grid: the bin width is a non-integer number of cells, so RoI Pool rounds again at every bin boundary. The cumulative error can approach a full feature cell = 16 original pixels in the worst case — fatal for mask prediction (Mask R-CNN's RoI Align fixes this with bilinear sampling at 4 points per bin and no rounding).
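The first rounding step, traced numerically under the same assumed stride and proposal (RoI Align would sample at the float coordinates instead):

```python
import numpy as np

stride = 16
proposal = np.array([50.0, 40.0, 330.0, 250.0])        # image coords (x1, y1, x2, y2)

float_bounds = proposal / stride                        # [3.125, 2.5, 20.625, 15.625]
snapped = np.round(float_bounds)                        # RoI Pool's first rounding
shift_px = np.abs(snapped - float_bounds) * stride      # misalignment back in image pixels
print(float_bounds, snapped, shift_px)                  # per-edge shift of up to stride/2 px
```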
R-CNN's five training stages, traced. (1) Pretrain CNN (AlexNet) on ImageNet 1000-way classification. (2) Fine-tune classifier head to 21 classes (20 PASCAL + background); positives = proposals with IoU ≥ 0.5 to any GT, negatives otherwise. (3) Run Selective Search on every training image, warp each proposal to the fixed CNN input size (224 × 224), forward-pass through the frozen CNN, save pool5 features to disk (~200 GB total). (4) Per class, train a binary SVM (cat vs not-cat, dog vs not-dog, …) on those cached features. (5) Per class, train a regressor predicting normalised offsets from proposal to GT. Test: proposals → warp → CNN → SVM scores + box refinement → NMS. The flaw cascade: the CNN is frozen from step 3 onward, so the SVMs in step 4 and regressors in step 5 can't shape what the CNN learns. Five disjoint training runs, none end-to-end — Fast R-CNN's RoI Pool is what made the whole pipeline a single optimiser.
AP worked example (the lecture's canonical trace). Five 'dog' detections sorted by score and 3 ground-truth dogs. Score 0.99 → TP (matches a GT). Score 0.95 → TP (matches a GT). Score 0.90 → FP (no unmatched GT with IoU ≥ 0.5). Score 0.50 → FP. Score 0.10 → TP (matches the last GT). Cumulative (precision, recall) after each detection: (1.00, 0.33), (1.00, 0.67), (0.67, 0.67), (0.50, 0.67), (0.60, 1.00). AP = area under the staircase ≈ 0.86 for this example. Get AP = 1.0 only when every TP ranks above every FP and no GT is missed.
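The same trace recomputed in a few lines (interpolated AP over all recall points):

```python
import numpy as np

tp = np.array([1, 1, 0, 0, 1])                  # detections already sorted by score
n_gt = 3
cum_tp = np.cumsum(tp)
precision = cum_tp / np.arange(1, len(tp) + 1)  # [1.0, 1.0, 0.67, 0.5, 0.6]
recall = cum_tp / n_gt                          # [0.33, 0.67, 0.67, 0.67, 1.0]

# Area under the PR staircase, using the running max of precision to the right
# (standard interpolated AP).
interp = np.maximum.accumulate(precision[::-1])[::-1]
ap = np.sum(np.diff(np.concatenate(([0.0], recall))) * interp)
print(ap)                                       # ≈ 0.867, the 0.86 above after rounding
```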
Examples
- RPN output arithmetic. For a typical ~60 × 40 feature map with 9 anchors per location, total proposals = 60 · 40 · 9 ≈ 21,600 candidate boxes per image. Each location outputs 9 × (2 + 4) = 54 numbers, so the RPN head produces ≈ 130,000 scalars per image before filtering by objectness + per-batch top-N proposals (the arithmetic is spelled out in a short sketch after this list).
- Standard RPN anchor config (Faster R-CNN). Three scales (128², 256², 512² box areas), three aspect ratios (1:1, 1:2, 2:1) — nine total per location, covering most common object shapes. Anchors sit at the receptive-field centres of the feature map.
- Per-class NMS trace. Three detections after the RPN: A = (car, score 0.9), B = (car, score 0.8, IoU with A = 0.7), C = (person, score 0.85, IoU with A = 0.6). Threshold 0.5. Step 1: pick A (highest car score) — keep; suppress B (IoU > 0.5). Step 2: only C remains in its class — keep. Final: {A, C}. KEY: C is NOT suppressed by A despite IoU 0.6 because NMS is applied PER CLASS — car and person are independent.
- COCO vs VOC mAP example. Same detector. mAP@0.5 = 0.77 (VOC-style). mAP@0.55 = 0.71, mAP@0.60 = 0.65, …, mAP@0.95 = 0.20. COCO mAP = average of those ten = ~0.40. Both numbers describe the same model; COCO is just stricter.
- PASCAL VOC 2007 detector trace. 20 classes. Run on 5 k test images: get ~50 k detections total. Per class: sort detections by score, walk the sorted list, mark TP/FP at IoU ≥ 0.5, build the PR curve, compute AP via 11-point interpolation. Average the 20 class APs → mAP. Faster R-CNN gets ~66.9 here.
- Speed evolution (one-line memorisation). R-CNN 47 s/img → Fast R-CNN 0.32 s (+ 2 s Selective Search outside) → Faster R-CNN 0.2 s → YOLO v1 22 ms (45 FPS) → Fast YOLO 6 ms (155 FPS).
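The RPN arithmetic from the first example above, spelled out (the feature-map size is the Faster R-CNN paper's typical ~60 × 40 assumption):

```python
# RPN output arithmetic for an assumed 60 × 40 feature map.
H, W = 40, 60                 # feature-map size after a stride-16 backbone on a ~600 × 1000 input
k = 9                         # anchors per location (3 scales × 3 aspect ratios)
per_loc = k * (2 + 4)         # 2 objectness scores + 4 box deltas per anchor = 54

anchors = H * W * k           # candidate boxes before filtering
scalars = H * W * per_loc     # raw RPN head outputs
print(anchors, scalars)       # 21600 129600
```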
Diagrams
- R-CNN → Fast → Faster evolution: per-region CNN forwards → shared backbone + RoI Pool → shared backbone + RPN. Annotate the bottleneck removed at each step (proposal cost, redundant convs, CPU Selective Search).
- RPN at one location: 9 anchors of varying scale and aspect ratio overlaid on the receptive field; two 1×1 conv heads emit objectness (binary) + box deltas (4 numbers) per anchor.
- RoI Pool diagram: a floating-point RoI projected on the feature map → quantised to integer cells → divided into a grid → max-pool inside each cell. Highlight the two rounding steps that cause misalignment.
- YOLO grid: the image divided into 7×7 = 49 cells; each cell's centre is marked; show one object whose centre falls inside one cell, with that cell predicting B = 2 candidate boxes.
- PR curve for the 5-detection AP example: a staircase from (recall, precision) = (0.33, 1.00) down to (1.00, 0.60), with the area under the curve shaded.
- NMS trace as a small table: rows = detections sorted by score, columns = (class, score, IoU with surviving boxes, kept?, suppressed-by).
Edge cases
- NMS in dense crowds. Two pedestrians walking shoulder-to-shoulder have boxes whose IoU exceeds the suppression threshold. Hard NMS suppresses one as a duplicate even though both are real. Soft-NMS (linear or Gaussian decay) preserves both with reduced score; the lecture explicitly says of standard NMS: 'no good solution =('.
- Per-class vs global NMS. Applying NMS GLOBALLY across classes wrongly suppresses, e.g., a dog and a nearby person. Always partition by class first.
- Pure IoU loss for non-overlapping boxes. IoU = 0 (and its gradient is 0) whenever A ∩ B = ∅ — training stalls early because there's no gradient to pull the predicted box toward the GT. GIoU / DIoU / CIoU fix this.
- Small-object failure of vanilla detectors. YOLO v1's cell size is 448/7 = 64 px; objects smaller than ~32 px lose their centre information when downsampled. The RPN's smallest anchor at the final feature level is typically 128² px on the input image. Fix: FPN — predict on multiple feature levels so small objects are detected on the high-resolution P2 level.
- YOLO's 49-object cap. The 7 × 7 grid hard-limits the prediction to 49 objects. Crowds, flocks, swarms exceed this. v2 onwards reintroduced more grid cells and anchors.
- Anchor scale mismatch. If the smallest anchor at the final feature level is still larger than typical objects in your domain (e.g., aerial imagery with 8-px boats), the RPN will never match them and recall on small objects collapses.
- ImageNet detection misleading. ILSVRC 2014 has 1.1 objects/image — almost every image is a centred singleton. Models that score well there can fail on COCO (7.2 obj/img) where occlusion and crowding matter.
Common mistakes
- 'YOLO predicts B class distributions per cell.' WRONG — YOLO predicts ONE class distribution per cell, SHARED across the boxes. That's why YOLO struggles when multiple object types share a cell.
- **Confusing λ_coord and λ_noobj.** λ_coord = 5 UPWEIGHTS box-coordinate loss (localisation matters more). λ_noobj = 0.5 DOWNWEIGHTS no-object confidence loss (so the abundant background cells don't drown out the few object cells). Mirror image, easy to swap.
- 'Cell containing the object' = 'cell overlapping the object'. WRONG — it's the single cell that contains the object's CENTRE point, not every cell the object overlaps. The horse can span 20 cells; only the cell at the horse's geometric centre is responsible.
- Computing mAP from detections in raw order. Always sort by score FIRST. mAP is defined on the score-sorted PR curve.
- Applying NMS before splitting by class. Global NMS suppresses legitimate cross-class boxes. Always partition first.
- Saying 'IoU = 0.5 means 50% overlap'. Two equal squares that each share 50% of their area have IoU of only 1/3; reaching IoU = 0.5 requires sharing 2/3 of each box. IoU is intersection / union, not a percentage of either side.
- Forgetting modal vs amodal. PASCAL VOC, COCO, KITTI all use MODAL boxes (visible part only). Amodal is a separate benchmark family.
Shortcuts
- Speed chain: R-CNN 47 s → Fast R-CNN 0.3 s → Faster R-CNN 0.2 s → YOLO v1 22 ms. Remember as 'forty-seven, thirty, twenty, twenty-two'.
- RPN numbers per location: 9 anchors, 54 outputs (9 × (2 + 4)). 3 scales × 3 ratios.
- YOLO output: 7 × 7 × 30 on PASCAL VOC. 30 = B·5 + C = 2·5 + 20.
- Loss coefficients: λ_coord = 5, λ_noobj = 0.5. Five loss terms total.
- COCO vs VOC mAP: VOC at IoU 0.5 only. COCO averages over 0.50 : 0.05 : 0.95 — explains the smaller-looking numbers.
- Datasets: VOC 20 cls / 2.4 obj-per-img, ImageNet 200 cls / 1.1, COCO 80 cls / 7.2.
- Soft-NMS in one line: multiply suppressed scores by a decaying function of IoU instead of zeroing them.
Proofs / Algorithms
**GIoU ∈ (−1, 1].** Identical boxes: IoU = 1, |C \ (A ∪ B)| = 0, so GIoU = 1. Vanishingly small boxes far apart: IoU → 0, while the penalty term |C \ (A ∪ B)| / |C| → 1 because the union is negligible next to the large enclosing box, so GIoU → −1. Hence GIoU ∈ (−1, 1].
IoU is not 'fractional overlap'. Counter-example: two unit squares overlapping in a region of area 0.5 have |A ∩ B| = 0.5 and |A ∪ B| = 1.5, so IoU = 1/3. The fraction of each box covered (50%) and the IoU (33%) differ because the metric divides by the union — which human intuition does not.
Why anchor-relative offsets are easier to learn. Predicting raw (x, y, w, h) for an object anywhere in the full image is high-variance. With anchor priors (x_a, y_a, w_a, h_a), the regression targets t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a) are approximately zero-mean and unit-variance — well-suited to L2/smooth-L1 regression. The log on the size makes the loss SCALE-INVARIANT (multiplying w and w_a by the same factor leaves t_w unchanged).
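A sketch of the target encoding, including a check of the scale-invariance claim (the box values are made up):

```python
import numpy as np

def encode(gt, anchor):
    """Faster R-CNN-style regression targets from a GT box and an anchor,
    both in centre format (cx, cy, w, h)."""
    tx = (gt[0] - anchor[0]) / anchor[2]
    ty = (gt[1] - anchor[1]) / anchor[3]
    tw = np.log(gt[2] / anchor[2])
    th = np.log(gt[3] / anchor[3])
    return np.array([tx, ty, tw, th])

gt, anchor = np.array([110.0, 95.0, 140.0, 260.0]), np.array([100.0, 100.0, 128.0, 256.0])
print(encode(gt, anchor))
print(encode(gt * 2, anchor * 2))   # scaling both boxes 2× leaves the targets unchanged
```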