Pose Estimation — Heatmaps, CPM, OpenPose, SMPL
Intuition
A bounding box around a person tells you almost nothing about what they're doing — dancing, fighting, lifting a mug, falling. The answer lives in the body itself, in the geometry of arms, legs, torso, head. Pose estimation predicts joint locations and limbs so downstream tasks (activity recognition, motion capture, gesture interfaces, avatar animation) become tractable. The core architectural shift is from direct keypoint regression (broken for several reasons) to dense prediction via per-joint heatmaps.
Explanation
Four pose representations, in increasing detail. (1) SKELETON / KEYPOINTS — fixed list of K body keypoints, each (x, y) (or (x, y, c) with confidence). COCO uses 17 (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles); MPII 16; Halpe 26+; DWPose includes face + hands. Output: 2K (or 3K) numbers per person. (2) DENSEPOSE — instead of K points, map every visible body pixel to a (u, v) coordinate in a CANONICAL 2D surface parametrisation of the body (UV-mapping a 3D body to image pixels). Per-pixel surface correspondence; useful for clothing transfer, virtual try-on, dense cross-view correspondences. (3) BODY MESH (SMPL) — full 3D: a 6,890-vertex parametric mesh with shape parameters β ∈ R^10 (PCA components — tall/short, slim/wide) and pose parameters θ ∈ R^72 (24 joints × 3 axis-angle). Given (β, θ), SMPL computes the posed mesh deterministically. (4) FOUNDATION-MODEL REPRESENTATIONS (Sapiens, Meta, ECCV 2024) — one ViT-based backbone trained on 300M+ human-centric images that simultaneously predicts pose, segmentation, depth, surface normals from one image via lightweight task heads. The modern direction.
Real-world deployment. Kinect for Xbox 360 (2010) was one of the first commercially deployed CV applications — real-time pose from depth maps using random forests (Shotton et al., CVPR 2011). Today's descendants power motion capture in film, AR fitness, sign-language recognition, MimicMotion-style avatar animation (Tencent 2024: single image + pose sequence → animated video).
The naive baseline and why it fails. Image → CNN backbone → 2048-d feature → linear → 2K numbers (the keypoint coordinates). Train with L2 on the coordinate vector. The lecture asks WHY this fails and lists four reasons: (1) L2 IN PIXEL SPACE IS BRUTAL — a 5-pixel error on a person filling a high-resolution frame is visually small, but L2 penalises it exactly as much as a 5-pixel error on a tiny, distant person where it is huge; poorly conditioned loss landscape. (2) NO SPATIAL REASONING — the CNN compresses the image to a global vector before predicting coordinates; the 2D structure is destroyed in the bottleneck, but pose estimation IS fundamentally spatial. (3) NO UNCERTAINTY — direct regression outputs a single point; if two arm positions are plausible (behind vs in front of the body), the model averages and produces a nonsense midpoint. (4) NO OCCLUSION HANDLING — joints out of view force the model to hallucinate values. The fix: predict HEATMAPS, not coordinates.
Pose as dense prediction — the heatmap revolution. Output K 2D heatmaps H_1, …, H_K, one per joint type. H_k(x, y) is the probability that keypoint k is at pixel (x, y). Ground truth: Gaussian centred at each true keypoint (x_k, y_k): G_k(x, y) = exp(−((x − x_k)^2 + (y − y_k)^2) / (2σ^2)). Loss: L = Σ_k Σ_{x,y} (H_k(x, y) − G_k(x, y))^2 — per-pixel MSE. Inference: (x̂_k, ŷ_k) = argmax_{x,y} H_k(x, y) (or weighted softargmax for differentiability). Optional parabola fit on the max and its two neighbours for sub-pixel accuracy. Spatial structure preserved; multi-peak heatmaps express uncertainty; loss has strong gradient everywhere; occluded joints can predict low confidence (peak height < threshold). This reframing — regression to dense prediction — is what made modern pose estimation work.
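As a concrete sketch of this encode/decode cycle in plain NumPy (σ = 2 and the 0.1 confidence threshold are illustrative choices, not canonical values):

```python
import numpy as np

def make_gt_heatmap(kx, ky, H, W, sigma=2.0):
    """Ground-truth heatmap: unnormalised Gaussian centred on the keypoint."""
    ys, xs = np.mgrid[0:H, 0:W]
    return np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))

def decode_heatmap(hm, threshold=0.1):
    """Argmax decode with a confidence gate: None for weak peaks (occlusion)."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    if hm[y, x] < threshold:
        return None                      # explicit 'don't know' signal
    return (int(x), int(y)), float(hm[y, x])

hm = make_gt_heatmap(12, 20, 64, 64)     # keypoint at (x=12, y=20)
coords, conf = decode_heatmap(hm)        # recovers (12, 20) with conf 1.0
```

An all-zero heatmap decodes to `None`, which is exactly the occlusion behaviour coordinate regression cannot express.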
Convolutional Pose Machines (CPM, Wei et al., CVPR 2016). Treats pose as iterative refinement of 'belief maps' across stages. Stage 1: image → CNN → belief_map_1. Stage t ≥ 2: belief_map_t = CNN_t(image features, belief_map_{t−1}). The crucial idea: each subsequent stage has a LARGER receptive field, so it uses long-range spatial dependencies between joints — 'the head is here; therefore the neck is just below; therefore the shoulders are at these likely locations'. Each stage's belief map is supervised with the ground-truth Gaussian heatmap (INTERMEDIATE SUPERVISION) — combats vanishing gradients in the deep cascade. Final output is the last stage's belief map. Structurally similar to iterative refinement in modern diffusion / VLMs.
Evaluation — PCK (Percentage of Correct Keypoints). A predicted keypoint is 'correct' if its distance to the ground-truth keypoint is below a threshold. Two common normalisations: PCKh@0.5 — threshold = 0.5 × head segment length (the 'head bone') — used on MPII; PCK@0.2 — threshold = 0.2 × torso diameter — used on FLIC. Why normalise? A 5-pixel error on a giant person filling the frame versus a 5-pixel error on a distant person are wildly different in real-body terms. Normalising by body size gives a scale-invariant metric. Higher PCK is better; closer to 100% is the target. The head bone is preferred because the torso changes more under pose articulation.
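A minimal PCK implementation in NumPy (the array shapes and the `ref_lengths` argument name are this sketch's assumptions; pass head-bone lengths for PCKh, torso diameters for PCK):

```python
import numpy as np

def pck(pred, gt, ref_lengths, alpha=0.5):
    """Fraction of keypoints whose error is below alpha * per-person reference length.

    pred, gt: (P, K, 2) keypoint arrays; ref_lengths: (P,) body-size references."""
    dists = np.linalg.norm(pred - gt, axis=-1)        # (P, K) per-keypoint errors
    correct = dists < alpha * ref_lengths[:, None]    # broadcast threshold per person
    return float(correct.mean())

# One person, head bone 50 px, alpha 0.5 -> threshold 25 px.
gt = np.zeros((1, 3, 2))
pred = np.array([[[10.0, 0.0], [30.0, 0.0], [0.0, 24.0]]])
score = pck(pred, gt, ref_lengths=np.array([50.0]), alpha=0.5)   # 2 of 3 correct
```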
Multi-person pose — three challenges the lecture flags. (1) UNKNOWN NUMBER OF PEOPLE in the image — could be 1, could be 50. Networks like fixed-size outputs. (2) INTERACTIONS AND OCCLUSIONS between people mess up predictions — joint candidates get tangled. (3) RUNTIME should ideally not grow with the number of people. Two paradigms emerged: TOP-DOWN (Mask R-CNN keypoints, AlphaPose) and BOTTOM-UP (OpenPose).
OpenPose (Cao et al., CVPR 2017) — the canonical bottom-up approach. Recipe: image → CNN → two branches. Branch 1 predicts keypoint heatmaps (all keypoints across all people in one shot — a 'soup' of candidates: maybe 7 shoulders, 6 elbows, 9 wrists). Branch 2 predicts PART AFFINITY FIELDS — a 2D vector field per limb type (one for right-shoulder-to-right-elbow, one for right-elbow-to-right-wrist, etc.). At each pixel along a limb, the PAF stores a unit vector pointing along the limb direction; elsewhere zero. Assembly: for each candidate pair (d_i, d_j), score by integrating the PAF along the line from d_i to d_j: E = ∫_0^1 PAF(p(u)) · v du, where p(u) = (1 − u) d_i + u d_j is the parametric line and v = (d_j − d_i) / ||d_j − d_i|| is the unit direction. High integral = limb actually connects them. Then Hungarian bipartite matching per limb type assembles consistent skeletons. Output channels: K + 2L (K heatmaps + 2 channels per limb for x/y of the PAF vector); e.g., 18 keypoints + 19 limbs → 18 + 38 = 56 channels.
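In practice the line integral is approximated by sampling N points along the candidate segment. A NumPy sketch (nearest-pixel sampling and the 10-sample default are illustrative simplifications, not the paper's exact scheme):

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, n_samples=10):
    """Average dot product of the PAF with the unit limb direction,
    sampled at n_samples points along the segment p1 -> p2 (pixel coords x, y)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    seg = p2 - p1
    norm = np.linalg.norm(seg)
    if norm < 1e-8:
        return 0.0
    v = seg / norm                                     # unit limb direction
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = p1[None] + ts[:, None] * seg[None]           # sample points on the line
    xs = pts[:, 0].round().astype(int)
    ys = pts[:, 1].round().astype(int)
    dots = paf_x[ys, xs] * v[0] + paf_y[ys, xs] * v[1]
    return float(dots.mean())

# Synthetic PAF: a horizontal limb along row y = 5, pointing +x.
paf_x, paf_y = np.zeros((16, 16)), np.zeros((16, 16))
paf_x[5, 2:13] = 1.0
good = paf_score(paf_x, paf_y, (2, 5), (12, 5))        # candidates on the limb
bad = paf_score(paf_x, paf_y, (2, 10), (12, 10))       # candidates off the limb
```

The pair supported by the PAF scores 1.0; the unsupported pair scores 0.0, so matching picks the right connection.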
Why PAFs are the clever bit. Without them you'd need a geometric heuristic per limb type (length range, angle range). PAFs let the network LEARN the limb-association cue itself — encoded as a direction vector field across the image. The Hungarian matching becomes a clean integer-programming step on learned scores. BOTTOM-UP semantics: detect all keypoints in one pass; runtime is roughly constant in number of people (keypoint detection is fixed-cost; only matching scales).
Top-down approach (Mask R-CNN keypoints). Detect each person first (Faster R-CNN backbone → bounding boxes), then run single-person pose estimation on each cropped region. Add a fourth head to Mask R-CNN: per-RoI keypoint head outputs a 56 × 56 (or larger) heatmap for each of the K joints. For each detected person you get K heatmaps localising joints within their bounding box. Runtime scales LINEARLY with the number of detected people.
Top-down vs Bottom-up — the canonical trade-off. Top-down: HIGHER per-person accuracy (sees one isolated person, no association ambiguity); runtime O(P) for P people; BRITTLE — a missed detection means a missed pose. Bottom-up: LOWER per-person accuracy (must disambiguate associations); runtime roughly constant in P; ROBUST to detector misses. Crowded scenes: bottom-up wins. Few people, accuracy-critical (golf swing analysis on a launch monitor): top-down. Real-time crowd at 30 fps on a smartphone: bottom-up.
2D → 3D via SMPL. Skinned Multi-Person Linear Model (Loper et al., SIGGRAPH Asia 2015): template mesh of 6,890 vertices in a canonical pose + shape blendshapes parameterised by β ∈ R^10 (principal components across body types) + pose blendshapes parameterised by θ ∈ R^72 (24 joints × 3 axis-angle each). Forward: start from template → apply shape deformations → apply pose deformations via LINEAR BLEND SKINNING (each vertex is influenced by nearby joints via learned skinning weights). Output: M(β, θ) ∈ R^{6890 × 3} — a full 3D mesh, deterministic and differentiable in (β, θ).
Human Mesh Recovery (HMR, Kanazawa et al., CVPR 2018). Predict (β, θ) plus a weak-perspective camera from a monocular image. Standard architecture: CNN backbone (ResNet-50) → regress (β, θ, camera) → SMPL forward → 3D mesh + 3D joints. Loss: project 3D joints back to 2D using the predicted weak-perspective camera, compare to GT 2D keypoints: L_2D = Σ_k v_k ||π(X_k) − x_k||^2, with v_k the visibility flag. Optional: 3D supervision when available, adversarial loss on θ to keep predictions plausible (avoid impossible poses). Bridges 2D pose and 3D body modelling.
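A toy version of the reprojection loss under a weak-perspective camera, where projection is just "drop z, scale by s, translate by t" (the function name, array shapes, and numbers below are this sketch's assumptions):

```python
import numpy as np

def reprojection_loss(joints3d, gt2d, vis, s, t):
    """Weak-perspective reprojection: orthographically drop z, scale by s,
    translate by t, then visibility-masked squared error vs GT 2D keypoints.

    joints3d: (K, 3), gt2d: (K, 2), vis: (K,) in {0, 1}, s: scalar, t: (2,)."""
    proj = s * joints3d[:, :2] + np.asarray(t, float)       # (K, 2) projections
    err = np.sum((proj - np.asarray(gt2d, float)) ** 2, axis=-1)
    return float(np.sum(np.asarray(vis, float) * err))

joints3d = np.array([[0.0, 0.0, 1.0], [1.0, 2.0, 3.0]])
gt2d = np.array([[1.0, 1.0], [3.0, 5.0]])    # exact projections for s=2, t=(1,1)
loss = reprojection_loss(joints3d, gt2d, vis=np.ones(2), s=2.0, t=(1.0, 1.0))
```

The visibility mask v_k is what lets occluded joints contribute zero loss instead of forcing the mesh toward hallucinated targets.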
Sapiens (Meta, ECCV 2024) — foundation models for humans. ViT backbone pretrained on 300M+ human-centric images with task-specific heads for pose, segmentation, depth, and surface normals — all share one backbone. The recipe is now standard across vision: pretrain a big backbone on massive domain data, attach lightweight task-specific heads. Pose estimation is being subsumed by general human-understanding models.
Definitions
- Skeleton / keypoint representation — Fixed list of K body keypoints, each (x, y) or (x, y, confidence). COCO uses 17. Simplest and most common pose representation.
- DensePose — Per-pixel mapping from image to canonical 2D body surface — UV-mapping a 3D body to image pixels. Dense correspondence.
- SMPL — Skinned Multi-Person Linear Model (Loper et al., 2015). Template + shape PCA + pose (24 joints × 3 axis-angle) → 6,890-vertex 3D mesh via linear blend skinning.
- Human Mesh Recovery (HMR) — Regress SMPL parameters from a monocular image (Kanazawa et al., CVPR 2018). Trained with 2D reprojection loss to GT keypoints + optional 3D supervision + adversarial pose-plausibility loss.
- Heatmap regression — Predict a 2D Gaussian belief map per joint instead of regressing (x, y) directly. Per-pixel MSE against a Gaussian-centred target. argmax (or softargmax) at inference.
- Convolutional Pose Machine (CPM) — Multi-stage pose architecture (Wei et al., CVPR 2016). Each stage refines belief maps using image features + previous stage's belief maps. Intermediate supervision at every stage combats vanishing gradients.
- Part Affinity Field (PAF) — 2D vector field per limb type. At each pixel along a limb, stores a unit vector along the limb direction; zero elsewhere. Used to group keypoints via line-integral scoring.
- PCK / PCKh — Percentage of Correct Keypoints. A keypoint is correct if its distance to GT is below a threshold normalised by body size. PCKh@0.5 uses 0.5 × head bone length; PCK@0.2 uses 0.2 × torso diameter.
- Bottom-up pose — Detect all keypoints in the image first (one pass), then group into individuals via association cues (PAFs + Hungarian matching). OpenPose paradigm. Runtime constant in number of people.
- Top-down pose — Detect person bounding boxes first, then run single-person pose estimation per box. Mask R-CNN keypoints / AlphaPose. Higher per-person accuracy; runtime O(P); brittle on detection misses.
- Sapiens — Meta's ECCV 2024 foundation model for humans. ViT backbone pretrained on 300M+ human-centric images; lightweight task heads for pose, segmentation, depth, normals. The 'foundation model + task heads' paradigm applied to human understanding.
Formulas
- Heatmap ground truth: G_k(x, y) = exp(−((x − x_k)^2 + (y − y_k)^2) / (2σ^2)).
- Heatmap loss: L = Σ_k Σ_{x,y} (H_k(x, y) − G_k(x, y))^2.
- PAF association score for candidates d_i, d_j: E = ∫_0^1 PAF(p(u)) · v du, with p(u) = (1 − u) d_i + u d_j and v = (d_j − d_i) / ||d_j − d_i||.
- OpenPose output channels: K + 2L (18 + 38 = 56 for 18 keypoints, 19 limbs).
- PCK: keypoint correct iff ||pred − gt|| < α × reference length (head bone for PCKh, torso diameter for PCK).
Derivations
Why heatmap regression beats coordinate regression. (1) Spatial structure preserved: the heatmap output lives on a 2D grid at (a downsampled fraction of) image resolution; coordinate regression collapses to a 2-vector per joint. (2) Uncertainty representable: heatmaps can have multiple peaks (front-of-body vs behind-body arm). (3) Smoother loss landscape: per-pixel MSE on a Gaussian-shaped target provides gradient everywhere (the loss decreases smoothly as the predicted peak moves toward GT), whereas coordinate L2 provides gradient along only a single direction per joint. (4) Occlusion handling: occluded joints can have low-amplitude heatmaps below a confidence threshold — explicit 'don't know' signal.
Parabola fit for sub-pixel argmax. After taking x* = argmax at integer resolution, fit a parabola to the three heatmap values H(x* − 1), H(x*), H(x* + 1). The peak of the parabola is at x* + (H(x* − 1) − H(x* + 1)) / (2 (H(x* − 1) − 2H(x*) + H(x* + 1))). Apply independently in x and y for the sub-pixel refinement. Typical gain: 0.5-1 pixel accuracy improvement at no training cost.
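The refinement in code (1D; apply separately along x and y). The sanity check exploits that a quadratic fit through three samples of a true parabola recovers its peak exactly:

```python
def subpixel_refine(h_minus, h_center, h_plus):
    """Offset of the fitted parabola's peak relative to the integer argmax.

    Returns a value in (-0.5, 0.5) when h_center is a genuine local maximum."""
    denom = h_minus - 2.0 * h_center + h_plus
    if abs(denom) < 1e-12:
        return 0.0                      # flat triple: no refinement possible
    return 0.5 * (h_minus - h_plus) / denom

# Sanity check: a parabola peaked at 0.3, sampled at x = -1, 0, 1.
y = lambda x: -(x - 0.3) ** 2
offset = subpixel_refine(y(-1.0), y(0.0), y(1.0))   # recovers 0.3
```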
Receptive-field growth in CPM. Stage 1 has a limited receptive field covering only a local patch of the input. Stage t sees image features AND the previous stage's belief maps, so its effective receptive field includes information already aggregated by stage t − 1. The cumulative RF grows roughly linearly with stage count. By stage 6 the effective RF covers most of a 368×368 input — enough to use 'head position implies shoulder position' kind of long-range constraints. Without intermediate supervision the gradient through 6 stages would vanish.
Linear blend skinning (SMPL). For each vertex i with skinning weights w_{ij} (summing to 1 over nearby joints j) and joint transforms G_j(θ), the posed position is v_i' = Σ_j w_{ij} G_j(θ) (t_i + s_i(β) + p_i(θ)), where t_i is the template vertex and s_i(β), p_i(θ) are the shape and pose blendshape offsets. Smooth, differentiable in (β, θ). Limitation: plain LBS produces 'candy-wrapper' artifacts at joints (severe twist degenerates the mesh) — mitigated by SMPL's learned pose blendshapes and refined further in successors such as STAR (SCAPE, an earlier model, avoids LBS entirely via per-triangle deformations).
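A minimal LBS sketch in NumPy: rigid 4×4 joint transforms blended per vertex, with the blendshape offsets assumed to have been added to `verts` already (the toy transforms below are this sketch's own example, not SMPL's):

```python
import numpy as np

def lbs(verts, weights, transforms):
    """Linear blend skinning: blend each joint's rigid transform of a vertex
    by that vertex's skinning weights.

    verts: (V, 3) rest-pose vertices (blendshape offsets already applied),
    weights: (V, J) with rows summing to 1, transforms: (J, 4, 4) homogeneous."""
    hom = np.concatenate([verts, np.ones((len(verts), 1))], axis=1)   # (V, 4)
    per_joint = np.einsum('jab,vb->vja', transforms, hom)[..., :3]    # (V, J, 3)
    return np.einsum('vj,vja->va', weights, per_joint)                # (V, 3)

identity = np.eye(4)
shift_x = np.eye(4); shift_x[0, 3] = 2.0            # translate +2 along x
posed = lbs(np.zeros((1, 3)),                       # one vertex at the origin
            np.array([[0.5, 0.5]]),                 # half identity, half shifted
            np.stack([identity, shift_x]))          # -> vertex moves by +1 in x
```

The half/half blend moving the vertex only halfway is the essence of LBS, and also the source of candy-wrapper collapse: blending rotations linearly averages positions, not rotations.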
Examples
- PCKh@0.5 numeric. Head bone length = 50 px. A predicted keypoint within 25 px of GT is correct. The head bone is used because it varies less than torso under pose articulation (torso shortens dramatically when sitting).
- OpenPose inference trace. Encoder → 18 keypoint heatmaps + 38 PAF channels. argmax each heatmap → candidate keypoints per type. For limb 'left elbow → left wrist': for every (elbow_i, wrist_j) candidate pair, sample N = 10 points along the line, compute (PAF vector at point) · (unit line direction), average. High score = elbow_i and wrist_j connected by a real limb. Hungarian matching per limb type → assembled skeletons.
- OpenPose channel count. 18 COCO keypoints + 19 limbs → output is 18 + 2 × 19 = 56 channels. Each PAF is 2 channels (x and y of the unit vector field).
- HMR pipeline. Image → ResNet-50 → (β, θ, camera) regression → SMPL forward → mesh + 3D joints → project via camera to 2D → compare with GT 2D keypoints (L_2D loss). Optional adversarial loss on realism — discriminator says 'is this a real human pose or impossible?'.
- SMPL parameter counts. 10 shape parameters (PCA components). 72 pose parameters (24 joints × 3 axis-angle). Mesh = 6,890 vertices. Standard answer to 'how many parameters does SMPL have?' — 82 learnable inputs (10 + 72) that map to a 6,890-vertex output.
- Heatmap resolution choice. Input at H × W, output heatmap at H/8 × W/8 (8× downsampling). 8× is a sweet spot — fine enough to localise joints to a few input pixels after upsampling, coarse enough to keep training tractable.
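The assembly step from the OpenPose trace above can be sketched as pairing candidates by descending PAF score. The document describes Hungarian bipartite matching; to stay dependency-free this sketch uses a simpler greedy pairing (not the optimal assignment, but it illustrates the score-matrix step; `min_score` is an illustrative threshold):

```python
import numpy as np

def greedy_match(scores, min_score=0.05):
    """Greedily pair candidates of two keypoint types by descending PAF score.

    scores: (n_a, n_b) matrix of limb scores between candidate sets a and b."""
    pairs, used_a, used_b = [], set(), set()
    flat_order = np.argsort(-scores, axis=None)        # best scores first
    for i, j in zip(*np.unravel_index(flat_order, scores.shape)):
        if scores[i, j] < min_score:
            break                                      # remaining scores too weak
        if i not in used_a and j not in used_b:
            pairs.append((int(i), int(j)))             # claim both candidates
            used_a.add(int(i)); used_b.add(int(j))
    return pairs

# Two elbows x two wrists: the diagonal pairs have the strong PAF support.
scores = np.array([[0.9, 0.2],
                   [0.3, 0.8]])
pairs = greedy_match(scores)                           # [(0, 0), (1, 1)]
```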
Diagrams
- OpenPose two-branch architecture: shared backbone → branch 1 outputs K keypoint heatmaps, branch 2 outputs 2L PAF channels; both refined over T stages with intermediate supervision.
- PAF vector field overlay on a person image: arrows along each limb pointing from one keypoint to the other; zero elsewhere.
- SMPL hierarchy: template mesh → shape blendshapes (β) → pose blendshapes (θ) → linear blend skinning → posed 6890-vertex mesh.
- CPM stages diagram: belief maps refining across stages 1-6, with intermediate supervision losses at each stage's output.
- Top-down vs bottom-up flowchart: top-down detects boxes first → per-person pose; bottom-up detects all keypoints + PAFs → Hungarian assembly.
- PCKh diagram: head bone length shown on a person image, with a circle of radius 0.5 × head bone around the GT keypoint — the correctness threshold.
Edge cases
- Top-down breaks when the upstream detector misses a person. Cascading failure — no detection means no pose at all.
- Bottom-up grouping fails in dense crowds. PAF scores can prefer wrong pairings when many people overlap; Hungarian matching is only as good as the score matrix.
- Heatmap argmax is integer-resolution. Use a parabola fit on the max and its two neighbours for sub-pixel accuracy, or softargmax for differentiable refinement.
- SMPL doesn't handle clothes or accessories. It models the body surface only — wearing a coat will cause systematic underestimation of body width.
- Symmetric pose ambiguity. Front-facing vs back-facing person can look similar in silhouette; without depth cues, networks sometimes flip left-right joint assignments. Multi-view or 3D supervision resolves this.
- Severe occlusion in monocular HMR. With the lower body hidden, the SMPL model can still be regressed but with hallucinated leg positions; flag low-confidence predictions.
- Children and atypical body shapes. SMPL's PCA basis was fit on adult bodies — children, very tall/short individuals, or otherwise atypical body types fit poorly. SMPL-H extends the model with articulated hands and SMPL-X with faces + hands; broader population coverage requires retraining the shape basis.
Common mistakes
- **'OpenPose output is 2K channels (x/y per keypoint).'** Wrong. It's K + 2L — one heatmap per keypoint + two channels per limb (the PAF x/y). With 18 keypoints + 19 limbs that's 56, not 36.
- Forgetting PCK's normalisation. Always divide by body size — head bone or torso diameter — not absolute pixel distance. A model can score 99% PCK on one dataset and crash on another with different image sizes if you forget.
- **Conflating SMPL θ (72-d pose) with β (10-d shape).** They have different roles and dimensionalities. Easy slip on a fast-answer exam question.
- Treating top-down as universally more accurate. It's only more accurate WHEN the detector is reliable. In crowds with heavy occlusion, bottom-up wins.
- Computing PAF score as a single dot product. It's an INTEGRAL — average dot product over N sample points along the line. Single-point sampling is too noisy.
- Assuming heatmap argmax is always the joint. It's the argmax of a NOISY map; can be wrong when the peak is shallow. Use confidence thresholds.
- Mixing 2D and 3D coordinate frames in HMR. SMPL is in metres in canonical 3D; the camera projects to 2D pixel coordinates. Sign errors and unit-mismatches are common bugs.
Shortcuts
- Numbers to memorise: 56 channels (OpenPose: 18 + 38); β ∈ R^10, θ ∈ R^72, 6,890 mesh vertices (SMPL); 17 COCO keypoints; PCKh@0.5 on MPII.
- PAF score = LINE INTEGRAL of PAF(p(u)) · v (PAF dotted with the unit limb direction). Bipartite matching via Hungarian algorithm.
- Top-down: O(P) runtime, higher per-person accuracy, brittle on missed detections.
- Bottom-up: O(1) in P, lower per-person accuracy, robust to misses, wins in crowds.
- Heatmap regression beats coordinate regression on ALL four axes: spatial reasoning, uncertainty, loss landscape, occlusion.
- CPM: T stages (T = 6 typical), intermediate supervision at each, RF grows with depth.
- Sapiens = foundation-model approach: one ViT backbone, many task heads.
Proofs / Algorithms
Softargmax differentiability. Standard argmax is non-differentiable. Softargmax: x̂ = Σ_x x · e^{βH(x)} / Σ_{x'} e^{βH(x')}, where β controls peakiness. As β → ∞ this approaches argmax; for finite β it's smooth, differentiable, and recovers the exact integer index when H is a delta. Lets end-to-end gradient flow from a final loss back to the heatmap network — used by integral pose regression methods.
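A 1D softargmax sketch in NumPy, computing the expected index under softmax(βH) (β = 50 is an illustrative peakiness value; the delta-heatmap check matches the claim above):

```python
import numpy as np

def softargmax_1d(h, beta=50.0):
    """Differentiable argmax: expectation of the index under softmax(beta * h)."""
    h = np.asarray(h, float)
    z = np.exp(beta * (h - h.max()))        # max-shifted for numerical stability
    p = z / z.sum()                         # softmax weights over positions
    return float(np.dot(np.arange(len(h)), p))

h = np.zeros(10); h[4] = 1.0                # delta heatmap peaked at index 4
x_hat = softargmax_1d(h)                    # recovers 4.0 (up to tiny softmax mass)
```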
Heatmap variance and loss conditioning. With Gaussian GT of standard deviation σ, the target's spatial gradient — and hence the useful training signal — is concentrated within a few σ of the peak. σ too small gives sharp peaks but flat gradients far from the peak (vanishing signal); σ too large gives wide peaks but blurry localisation. A common choice is σ ≈ 2 pixels at heatmap resolution. Balances signal-to-noise.