3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN
Intuition
A point cloud is a bag of samples — no grid, no ordering, no fixed count. A 2D CNN needs all three to work. So the field forks into two strategies: either force the data onto a grid and pay for it (VoxNet), or invent operators that are *symmetric* (permutation-invariant) by construction so they see a set, not a sequence (PointNet → PointNet++ → DGCNN → MeshCNN).
Explanation
Three properties of point clouds that break CNNs. Memorise these — the foundation of every architecture in this unit. (1) Unstructured — no grid, just floating points; you cannot index a neighbourhood by integer offsets. (2) Irregularly distributed — some regions dense (a car surface), others sparse (sky, holes). (3) Unordered — the same shape can be written as $(p_1, p_2, \dots, p_N)$ or as any of its $N!$ permutations; the network must give the same answer for all. Bonus: labelled 3D data is scarce — there is no ImageNet for point clouds, so methods must be data-efficient.
Five families of 3D representations — the spine of the whole unit. (i) *View-based*: render the 3D object from many camera angles and run a 2D CNN on each view (Multi-view CNN). (ii) *Volumetric*: discretise space into voxels and run a 3D CNN (VoxNet). (iii) *Point-based*: operate directly on the raw points (PointNet, PointNet++). (iv) *Graph-based*: treat points as graph nodes, k-nearest-neighbours as edges (DGCNN). (v) *Mesh-based*: exploit the triangle mesh's intrinsic geometry (MeshCNN). (A sixth — *fuzzy explicit* — is 3D Gaussian Splatting from Unit 5.)
Three tasks the unit targets. *3D classification* — one label per cloud ("this is a chair"), output $K$ class scores. *Part segmentation* — label each point with a part ("this point is a wing-tip"), output an $N \times K$ score matrix. *Semantic segmentation* — label each point with a scene category ("car" vs "tree"), output per-point scores in scene context. The same backbone is reused; only the final head differs.
VoxNet — the sledgehammer. Overlay a 3D grid of cubes on the cloud; each voxel is occupied or empty (an *occupancy grid*); pass the resulting binary tensor through two 3D conv layers + FCs → class probabilities. Four problems — the most-quoted exam target. (1) Memory blows up cubically: a $1024^3$ grid is ~1B voxels, a $256^3$ grid is ~16.8M — both off-budget. (2) Most voxels are *empty* (point clouds live on surfaces) so ~99% of memory is zeros. (3) Resolution is brutally limited; most VoxNet papers use $32^3$ — postage-stamp resolution. (4) *Quantisation artefacts*: snapping continuous coordinates to discrete voxels turns smooth surfaces into staircases. VoxNet wins for dense, naturally volumetric data: medical CT, low-resolution CAD.
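The occupancy-grid step can be sketched in a few lines. A minimal NumPy sketch, assuming unit-cube normalisation and a binary grid (the grid size and normalisation scheme here are illustrative choices, not VoxNet's exact pipeline):

```python
# Minimal occupancy-grid voxelization sketch (NumPy only).
import numpy as np

def voxelize(points, grid=32):
    """Map an (N, 3) point cloud to a (grid, grid, grid) binary occupancy grid."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Normalise to [0, 1), then snap to integer voxel indices (the quantisation step).
    idx = ((points - lo) / (hi - lo + 1e-9) * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)
    occ = np.zeros((grid, grid, grid), dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

pts = np.random.rand(1024, 3)
occ = voxelize(pts, grid=32)
# At most 1024 of the 32768 voxels can be occupied: the grid is mostly zeros.
print(occ.sum(), "of", 32**3, "voxels occupied")
```

Note that 1024 points can fill at most ~3% of a $32^3$ grid, which is the sparsity problem in miniature.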
The naïve PointNet failure. If we just flatten the $N \times 3$ cloud and feed an MLP, reshuffling the points changes the input vector and so changes the output — *permutation variance*. The model learns the ordering of points, not the geometry. It also cannot generalise to a different point count $N$. So a different mechanism is required.
PointNet's symmetric-function trick (Qi et al., CVPR 2017). Write the global feature as $f(\{x_1, \dots, x_N\}) = \gamma\big(\max_{i=1\dots N} h(x_i)\big)$, where $h$ is a *shared* MLP applied independently to every point and $\max$ is element-wise max across all points. Because $\max$ is symmetric, the composition is symmetric — *permutation invariance is achieved by construction*. The three-step recipe to memorise: shared MLP per point → max-pool across points → MLP.
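The three-step recipe fits in a few lines of NumPy. A sketch with random placeholder weights (the layer widths are illustrative, not PointNet's actual sizes), verifying the invariance claim empirically:

```python
# Sketch of PointNet's symmetric-function recipe: shared per-point MLP h,
# element-wise max over points, then a final MLP gamma.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((3, 64)), rng.standard_normal((64, 10))

def h(points):
    # Shared MLP, applied independently to every point -> (N, 64)
    return np.maximum(points @ W1, 0)

def pointnet(points):
    global_feat = h(points).max(axis=0)   # symmetric max-pool -> (64,)
    return global_feat @ W2               # gamma: final MLP -> class scores

pts = rng.standard_normal((1024, 3))
shuffled = pts[rng.permutation(1024)]
# Same output for any ordering of the same point set.
assert np.allclose(pointnet(pts), pointnet(shuffled))
```

The invariance needs no training: it holds for any weights, because only the max-pool mixes information across points.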
Universal Approximation Theorem (Qi et al.) — examinable. Any continuous symmetric (permutation-invariant) function on a compact set can be approximated arbitrarily well by a network of PointNet's form $\gamma \circ \max \circ h$, given enough capacity in $h$ and $\gamma$. So PointNet is not merely permutation-invariant — it is in principle the *universal* permutation-invariant architecture.
PointNet architecture in detail. $N \times 3$ input → a small T-Net predicts a $3 \times 3$ transformation matrix and applies it (input alignment, approximate rotation invariance) → shared MLP lifts each point to 64 dims → another T-Net ($64 \times 64$) aligns features → shared MLP lifts to 1024 dims → max-pool over all $N$ points → 1024-dim global feature. For classification: feed to MLP → $K$ class scores. For segmentation: *concatenate the global feature back onto every per-point feature* (so each point sees both its local descriptor and the scene context), then more shared MLPs → per-point class scores.
Why MAX and not sum/average. Max picks the most salient point per feature dimension, giving a sparse, interpretable global descriptor. The subset of input points whose features survive the max-pool are called critical points and empirically trace the object's silhouette. PointNet is robust to dropping non-critical points but *vulnerable to targeted removal* of critical ones — a standard adversarial attack note.
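The critical points are exactly the argmax winners of the max-pool, which makes them easy to extract. A NumPy sketch with random stand-in weights (one shared layer stands in for the full per-point MLP):

```python
# Which input points survive the max-pool? Per feature channel, the argmax over
# points indexes a "critical point"; dropping all other points changes nothing.
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 64))
pts = rng.standard_normal((1024, 3))

feats = np.maximum(pts @ W, 0)              # shared MLP -> (1024, 64)
critical = np.unique(feats.argmax(axis=0))  # point indices winning >= 1 channel

# The global feature computed from critical points alone is identical.
assert np.allclose(feats.max(axis=0), feats[critical].max(axis=0))
print(len(critical), "critical points out of", len(pts))
```

At most 64 points (one per channel) can be critical here, which is why the global descriptor is sparse and why targeted removal of that small subset is so damaging.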
PointNet's fatal weakness — and PointNet++. Each point is processed independently until the single max-pool. There is no notion of a local neighbourhood — no equivalent of a receptive field. PointNet captures global shape but misses local geometric texture (e.g. the curvature where a chair-leg meets the seat) and ties its global feature to absolute coordinates. PointNet++ patches this: sample anchor points with Farthest Point Sampling (FPS), group neighbours around each anchor via ball query, apply PointNet *inside each group*, repeat hierarchically. This builds a CNN-like receptive-field hierarchy: early layers see local geometry, later layers see global structure.
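The FPS + ball-query machinery above can be sketched directly. A NumPy sketch; the anchor count and ball radius are illustrative choices, not the paper's settings:

```python
# PointNet++-style sampling and grouping sketch.
import numpy as np

def farthest_point_sampling(points, m):
    """Greedily pick m anchors, each maximising its min-distance to the chosen set."""
    chosen = [0]                                    # arbitrary first point
    d = np.linalg.norm(points - points[0], axis=1)  # min-distance to chosen set
    for _ in range(m - 1):
        nxt = int(d.argmax())                       # fills the largest remaining hole
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def ball_query(points, anchor, radius):
    """Indices of all points within `radius` of the anchor: one local PointNet group."""
    return np.where(np.linalg.norm(points - anchor, axis=1) < radius)[0]

pts = np.random.rand(1024, 3)
anchors = farthest_point_sampling(pts, 32)
group = ball_query(pts, pts[anchors[0]], radius=0.2)
print(len(anchors), "anchors; first group has", len(group), "points")
```

A full set-abstraction layer would then run the shared-MLP + max-pool recipe inside each group, producing one feature per anchor, and stack such layers hierarchically.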
DGCNN — the social network of points. A point cloud is really a graph: make every point a node, connect each to its $k$ nearest neighbours. The core operation is EdgeConv: for point $x_i$, find its $k$ nearest neighbours; for each neighbour $x_j$, form an edge feature by concatenating $(x_i, x_j - x_i)$ — the point itself *plus the relative offset* to its neighbour. Pass through an MLP $h_\theta$, then aggregate the $k$ edge features symmetrically (max or sum) into a new feature for point $x_i$. The relative offset $x_j - x_i$ is what gives DGCNN its locality — geometry encoded the way image convolutions encode "this pixel relative to its neighbours". The clever bit: **rebuild the kNN graph in *feature* space, not coordinate space, after each layer** — early layers connect spatial neighbours, late layers connect *semantic* neighbours (all "wing-tip" points group together even if spatially far apart). Hence *Dynamic* GCNN.
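One EdgeConv step can be sketched as follows. A NumPy sketch with a random stand-in weight matrix for $h_\theta$ (the feature width and $k$ are illustrative):

```python
# One EdgeConv step: kNN graph, edge features (x_i, x_j - x_i), shared MLP,
# symmetric max-aggregation over each point's k edges.
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((6, 32))      # MLP weight on concat(x_i, x_j - x_i)

def edge_conv(x, k=8):
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)  # all-pairs (N, N)
    nbrs = d.argsort(axis=1)[:, 1:k + 1]                  # k nearest, excluding self
    out = np.empty((len(x), 32))
    for i, js in enumerate(nbrs):
        # Edge feature: the point itself plus the relative offset to each neighbour.
        edges = np.concatenate([np.tile(x[i], (k, 1)), x[js] - x[i]], axis=1)
        out[i] = np.maximum(edges @ W, 0).max(axis=0)     # MLP + max over edges
    return out

x = rng.standard_normal((256, 3))
f1 = edge_conv(x)   # layer 1: kNN computed in coordinate space
# A second "dynamic" layer would rebuild the kNN graph on f1, not on x.
print(f1.shape)
```

The $N \times N$ distance matrix in the first line is also where the quadratic memory cost discussed below comes from.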
DGCNN limitations. All-pairs distance computation is $O(N^2)$ in time and memory in $d$-dimensional feature space. A fixed $k$ cannot adapt to wildly varying density. There is no spatial downsampling across layers, so it scales poorly to huge clouds.
MeshCNN — when you actually have a mesh. A *mesh* has vertices joined into triangles — strictly richer than a point cloud because it encodes surface topology. MeshCNN's insight: treat edges (not vertices, not faces) as the convolutional unit, because every edge in a triangle mesh sits between exactly two faces, giving a fixed-size neighbourhood of 4 surrounding edges, just like a $3 \times 3$ image patch around a pixel.
MeshCNN's 5-D edge feature. Each edge gets: (1) dihedral angle between its two faces; (2) inner angle opposite the edge in face 1; (3) inner angle opposite in face 2; (4) length ratio of the edge to the opposite edge in face 1; (5) length ratio to the opposite edge in face 2. *Memorise:* dihedral angle, two inner angles, two length ratios. All five are intrinsic (invariant to rigid motion).
MeshCNN convolution — handling the ordering ambiguity. The four neighbouring edges $(a, b, c, d)$ have two valid orderings, $(a, b, c, d)$ or $(c, d, a, b)$, depending on which face you start from. To make the conv invariant to that choice, MeshCNN feeds symmetric combinations: $(|a - c|,\; a + c,\; |b - d|,\; b + d)$ — sums and absolute differences are unchanged when the pair is swapped. The kernel has 5 weights — one for the centre edge, one per symmetric channel.
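The invariance of the symmetric channels is a one-line check. A sketch with arbitrary feature values:

```python
# MeshCNN's symmetric input channels are unchanged by the two valid neighbour
# orderings (a, b, c, d) vs (c, d, a, b).
import numpy as np

def symmetric_channels(e, a, b, c, d):
    # Centre edge feature plus the four order-invariant combinations.
    return np.array([e, abs(a - c), a + c, abs(b - d), b + d])

e, a, b, c, d = 1.0, 0.3, 0.7, 0.2, 0.9
assert np.allclose(symmetric_channels(e, a, b, c, d),
                   symmetric_channels(e, c, d, a, b))
```

Swapping the pair $(a, b)$ with $(c, d)$ permutes the inputs to `abs` and `+`, both of which are commutative, so the 5 numbers the kernel sees are identical.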
MeshCNN pooling = learned edge collapse. Periodically, edges with the smallest learned activations are collapsed (their two endpoints merge, surrounding mesh topology updates, features average). This is *task-driven mesh simplification* — the network keeps edges that matter and discards the rest. *Mesh unpooling* stores the collapse history so the simplification can be reversed for tasks like segmentation that need original resolution.
Definitions
- Symmetric function — A function satisfying $f(x_1, \dots, x_N) = f(x_{\pi(1)}, \dots, x_{\pi(N)})$ for every permutation $\pi$. Permutation-invariant by definition.
- Voxelization / occupancy grid — Discretising a point cloud onto a regular 3D grid; each voxel marks occupied/empty (or stores density). Enables 3D convolution but at cubic memory cost.
- PointNet — Shared per-point MLP $h$ + symmetric max-pool over points + final MLP $\gamma$. Universal approximator for continuous symmetric set functions (Qi et al., 2017).
- Critical points — The subset of input points whose per-point features actually survive PointNet's max-pool. They determine the global feature; perturbing other points has no effect.
- T-Net — A mini-PointNet inside PointNet that predicts a learned alignment matrix ($3 \times 3$ for input, $64 \times 64$ for features) and applies it before subsequent layers.
- PointNet++ — Hierarchical PointNet — sample anchors via Farthest Point Sampling, group neighbours via ball query, apply PointNet locally, then stack. Adds local context that vanilla PointNet lacks.
- Farthest Point Sampling (FPS) — Greedy subsampling: pick first point arbitrarily, then iteratively add the point with maximum minimum-distance to the chosen set. Yields evenly-spaced anchors.
- EdgeConv (DGCNN) — Per-edge feature $h_\theta(x_i, x_j - x_i)$ over a kNN graph, max-aggregated. The graph is dynamic — rebuilt in feature space at each layer.
- Dynamic graph — The kNN edges of DGCNN, recomputed in the current feature space at every layer (early layers: geometric neighbours; late layers: semantic neighbours).
- MeshCNN — Edge-centric mesh operator with a 5-D intrinsic edge feature and a 4-edge fixed neighbourhood; conv uses symmetric input combinations; pooling = task-driven edge collapse.
- Dihedral angle — The angle between the two triangular faces meeting at a mesh edge. One of MeshCNN's five intrinsic features.
Formulas
- PointNet global feature: $f(\{x_1, \dots, x_N\}) = \gamma\big(\max_{i=1\dots N} h(x_i)\big)$.
- EdgeConv: $x_i' = \max_{j \in \mathcal{N}_k(i)} h_\theta(x_i, x_j - x_i)$, with kNN neighbourhood $\mathcal{N}_k(i)$ rebuilt in feature space per layer.
- VoxNet memory: a grid of side $V$ at 4 bytes/voxel costs $4V^3$ bytes.
Derivations
Permutation invariance of PointNet. Let $\pi$ be any permutation of $\{1, \dots, N\}$. Then $\max_i h(x_{\pi(i)}) = \max_i h(x_i)$ because $\max$ over a finite set is invariant under reordering of its arguments. Therefore the network's output $\gamma\big(\max_i h(x_i)\big)$ depends only on the *set* $\{x_1, \dots, x_N\}$, not its arrangement.
Universal approximation sketch. Any continuous Hausdorff-symmetric set function $f$ can be approximated as $f(S) \approx \gamma\big(\max_{x \in S} h(x)\big)$ for sufficiently expressive $h$ and $\gamma$. Intuition: the max-pool produces a sparse "critical-points" subset of the input that determines the output; varying these points moves the output continuously; making $h$'s hidden dimension large enough makes the family dense in the space of continuous symmetric functions. Full proof in Qi et al. (2017).
Why MeshCNN's symmetric inputs are invariant to the 2-way ordering ambiguity. If we swap $(a, b) \leftrightarrow (c, d)$, the channel $a + c$ is unchanged and $|a - c|$ is also unchanged. Same for $b + d$ and $|b - d|$. The kernel sees the same 5 numbers regardless of starting face. Hence the conv is well-defined despite the topological ambiguity.
FPS gives evenly-spaced anchors. At step $t$, we pick the point in $P$ that maximises its minimum distance to the already-chosen set $S_{t-1}$. This greedily fills the largest "hole" left in the cloud, so after $M$ steps the $M$ anchors cover the surface roughly uniformly — better than random sampling, which over-samples dense regions and misses sparse ones.
VoxNet's cubic memory cost. A grid of side $V$ has $V^3$ voxels; storing 4 bytes each gives $4V^3$ bytes. At $V = 64$: 1 MB per object; at $V = 256$: 64 MB; at $V = 1024$: 4 GB. The exponent is what kills you, not the constant.
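The arithmetic in this derivation tabulates directly:

```python
# The cubic memory cost, tabulated at 4 bytes per voxel.
for side in (64, 128, 256, 512, 1024):
    mib = 4 * side**3 / 2**20            # bytes -> MiB
    print(f"{side}^3 grid: {mib:,.0f} MiB")
# 64 -> 1 MiB, 256 -> 64 MiB, 1024 -> 4096 MiB (4 GiB): the exponent dominates.
```

Each doubling of the side length multiplies the cost by exactly 8, which is why no constant-factor optimisation rescues a dense voxel grid at high resolution.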
Examples
- Voxel-grid memory worked example. $64^3$ grid, 4 B/voxel → $2^{20}$ B = 1 MB per object, ~99% zeros. Doubling to $128^3$ multiplies by 8 (8 MB); $256^3$ → 64 MB; $512^3$ → 512 MB. Cubic blow-up makes high-resolution voxel inference infeasible.
- DGCNN dynamic-graph illustration. Take a 1024-point airplane. *Layer 1:* neighbours of a wing-tip point are the other wing-tip points in geometric space (a few cm away). *Layer 4:* the kNN is recomputed in 256-d feature space — the new neighbours of that wing-tip include *all* wing-tip points across both wings, plus the tail, because they share the "thin trailing edge" semantic. The graph has gone from geometric to semantic.
- MeshCNN edge collapse worked example. Edges of a chair-back mesh; the network learns low activations for edges *inside* the flat back panel (uniform geometry, redundant) and high activations for edges along the silhouette and joints. Pooling collapses the interior edges first, halving the mesh size while preserving outline.
- Permutation test. Shuffle the same 1024-point chair point cloud 10 times → feed to PointNet → all 10 logit vectors are identical (up to floating-point error). Repeat with a flattened-MLP baseline → 10 different logit vectors. This is the empirical demonstration of symmetry.
- Three tasks output shapes. Cloud of 1024 points, 13 ScanNet classes. *Classification* head: $1 \times 13$ scores. *Part segmentation* head: $1024 \times 13$ score matrix. *Semantic segmentation* (scene): same $1024 \times 13$, but trained on whole rooms not isolated objects.
- Critical-points attack. Mask out the ~50 critical points of a chair (the points whose features survived max-pool); PointNet's prediction collapses (accuracy drops to roughly chance). Random masking of 200 *non-critical* points barely changes the prediction. The robustness profile is highly asymmetric.
Diagrams
- PointNet architecture diagram. $N \times 3$ input → input T-Net ($3 \times 3$ alignment) → shared MLP (64, 64) → feature T-Net ($64 \times 64$) → shared MLP (64, 128, 1024) → max-pool over $N$ → 1024-d global feature → classification MLP (512, 256, $K$) → logits. Segmentation branch: tile the 1024-d global feature back to every point, concat with the 64-d intermediate per-point features → shared MLP → per-point part scores.
- EdgeConv block. Centre point $x_i$ with $k$ nearest neighbours $x_{j_1}, \dots, x_{j_k}$; each edge produces a feature $h_\theta(x_i, x_j - x_i)$; symmetric (max) aggregation back to a new feature $x_i'$.
- Dynamic-graph evolution. Three side-by-side panels — layer 1, layer 3, layer 5 — showing the neighbour-set of one wing-tip point shrinking spatially but growing semantically.
- MeshCNN edge neighbourhood. Centre edge $e$ between faces $F_1$ and $F_2$; the four neighbour edges $a, b$ (in $F_1$) and $c, d$ (in $F_2$). Annotate the 5-D feature components (dihedral angle, two inner angles, two length ratios).
- Family taxonomy diagram. Tree: 3D representations → {view-based: MVCNN; volumetric: VoxNet; point: PointNet/PointNet++; graph: DGCNN; mesh: MeshCNN; fuzzy explicit: 3DGS}.
Edge cases
- VoxNet at non-axis-aligned objects. Quantisation artefacts worse when the object's principal axis is at 45° to the grid; pre-aligning with PCA helps.
- PointNet vulnerability to adversarial point removal. Targeted removal of *critical points* collapses recognition; random dropout of even 50% of *non-critical* points is harmless.
- **DGCNN's memory.** Recomputing kNN at every layer requires an $N \times N$ distance matrix in feature space; for $N \sim 10^5$ points this is infeasible without farthest-point sub-sampling.
- Density variation. PointNet++ ball-query handles non-uniform density better than plain kNN; DGCNN's fixed $k$ struggles when the cloud has both very dense and very sparse regions.
- Mesh non-manifold edges. MeshCNN assumes every edge has exactly 2 faces — non-manifold meshes (edges with 1 or more than 2 faces) require pre-processing or are silently mishandled.
- Rotation sensitivity. PointNet uses a T-Net to align inputs but is not exactly rotation-equivariant; randomly rotating the test cloud drops accuracy unless rotation augmentation is included in training. Mesh intrinsic features (MeshCNN) are exactly rotation-invariant by construction.
Common mistakes
- Saying PointNet uses *sum* or *average* pooling — it is max-pool (symmetric AND sparse, which gives the critical-points interpretation).
- Confusing PointNet++ (hierarchical PointNet with FPS + ball query) with DGCNN (kNN graph + EdgeConv + dynamic feature-space graph). Different ideas.
- Comparing VoxNet at to PointNet on 1024 points and concluding "PointNet wins" — that comparison is unfair without matching effective resolution.
- Treating mesh edges as vertices when explaining MeshCNN. The operator is edge-centric; that's the whole point of the fixed-size 4-edge neighbourhood.
- Forgetting that DGCNN rebuilds the graph in feature space, not coordinate space — the *dynamic* in Dynamic GCNN is exactly this.
- Citing the *Universal Approximation Theorem* of MLPs in place of the *PointNet universal-symmetric-function theorem* — they're different theorems about different function classes.
Shortcuts
- Permutation invariance = shared MLP + symmetric pool. Memorise this as the single pattern that every point-cloud method instantiates.
- Dynamic graph = kNN in FEATURE space, recomputed per layer (DGCNN's signature).
- MeshCNN intrinsic features are invariant to rigid motion by construction (dihedral and inner angles, length ratios — all intrinsic).
- VoxNet wins iff data is dense + naturally volumetric (medical CT, dense CAD). Otherwise reach for PointNet first.
- PointNet's 3-step recipe: shared MLP → max → MLP. PointNet++'s extra step: FPS + ball query wrapping PointNet locally.
Proofs / Algorithms
PointNet's universal-approximation theorem (Qi et al., 2017). Any continuous function on a compact set of point sets, invariant to permutation, can be approximated arbitrarily well by $\gamma\big(\max_{x \in S} h(x)\big)$. Proof outline: discretise the point space into $\epsilon$-cells; $h$ acts as an indicator function lifted into a high-dimensional feature space; the max-pool extracts a sparse "critical points" subset that uniquely encodes the set up to $\epsilon$; $\gamma$ post-processes; the family is dense in continuous symmetric functions as $\epsilon \to 0$ and $h$'s width grows.
EdgeConv with relative-offset feature is translation-equivariant. Translating the whole cloud by $t$: $x_i \to x_i + t$, while $x_j - x_i$ is unchanged. The edge feature $h_\theta(x_i, x_j - x_i)$ changes only through $h_\theta$'s sensitivity to the first argument — to obtain *translation invariance*, DGCNN papers drop the absolute coordinate $x_i$ inside the edge feature for translation-invariant variants. The relative offset is invariant.
MeshCNN's 5-channel symmetric conv handles the orientation ambiguity. Swapping the labels of the two faces re-labels $(a, b, c, d) \to (c, d, a, b)$. The symmetric channels $|a - c|$, $a + c$, $|b - d|$, $b + d$ are unchanged. Hence the same kernel evaluates to the same value regardless of which face is "first".