NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering
Intuition
Given 100 phone photos of your living room, can a computer reconstruct the 3D scene and render it from a new viewpoint? NeRF (2020) said yes by training a tiny MLP and integrating along camera rays — beautiful but slow (hours to train, 1 fps to render). 3D Gaussian Splatting (2023) kept the *fit-pictures-to-a-scene* objective but threw away the MLP, replacing it with a few million explicit 3D Gaussians that get rasterised in real time. There are no neural network weights and no learning across scenes — every new scene is a fresh per-scene *optimisation*.
Explanation
The taxonomy of 3D representations — memorise. *Explicit* representations enumerate the geometry directly: point clouds, meshes, voxel grids. *Implicit* representations encode the geometry as a function: a Signed Distance Function (SDF) returns the signed distance to the nearest surface (surface = zero-level set), and a NeRF returns colour and density at any point for any viewing direction. **3DGS sits in a sixth slot: explicit (like point clouds) but the primitives are *fuzzy* 3D Gaussians instead of hard points.**
NeRF in one paragraph. Train an MLP $F_\Theta(\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$ that maps a 3D position and viewing direction to colour and density. To render a pixel, trace a ray from the camera through that pixel, sample many points along the ray, query the MLP at each, and integrate colour weighted by density (volumetric rendering). Loss: rendered pixel vs ground-truth pixel from your photos. It works beautifully. Three painful properties motivated 3DGS: training is slow (hours to days per scene); rendering is slow (~1 fps, useless for VR); the geometry is locked inside the MLP — you cannot edit it, you cannot extract a mesh, you cannot easily compose two scenes.
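To make the "integrate colour weighted by density" step concrete, here is a minimal numpy sketch of the discrete quadrature NeRF uses for one ray, assuming the MLP has already been queried at the samples (function and variable names are illustrative, not from any particular implementation).

```python
import numpy as np

def volume_render_ray(sigmas, colors, deltas):
    """Discrete NeRF-style quadrature along one camera ray.

    sigmas: (N,) densities at the N ray samples (MLP outputs)
    colors: (N, 3) RGB at the N ray samples (MLP outputs)
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)         # per-sample opacity
    trans = np.cumprod(1.0 - alphas)                # survival after samples 1..i
    trans = np.concatenate(([1.0], trans[:-1]))     # T_i = prod_{j<i} (1 - alpha_j)
    weights = alphas * trans                        # contribution weight of each sample
    return (weights[:, None] * colors).sum(axis=0)  # integrated pixel colour
```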
3DGS is per-scene optimisation, not learning. Compare parameters: a neural network's weights are trained on a *dataset* to *generalise* across inputs; 3DGS's parameters for each Gaussian are fit to *one specific scene's* photos. There is no concept of generalisation — a 3DGS of your living room only works for your living room. Train it on someone else's living room and you get completely different parameters. Exam line: *3DGS involves no neural networks; the parameters are the Gaussians' positions, shapes, opacities, and colours.*
Pre-processing — COLMAP gives a starting point. Before 3DGS runs, your photos go through COLMAP, a classical Structure-from-Motion (SfM) pipeline. COLMAP outputs (1) camera poses (6-DOF position + orientation for each photo) and (2) a sparse point cloud of 3D points recovered via feature matching. Both are necessary: poses tell 3DGS where each photo was taken from; the sparse cloud initialises one Gaussian per point (without it, optimisation has nothing to start from). Exam note: 3DGS is not end-to-end — pose estimation is done classically by COLMAP, before 3DGS begins.
The Three Pillars of 3DGS — the spine of every answer. (1) Scene modelling — how each Gaussian is parameterised. (2) Image formation — how Gaussians render into a 2D image. (3) Optimisation — how the Gaussians get fit to the photos.
Pillar 1 — Scene modelling. Each Gaussian is a fuzzy 3D ellipsoid (an "egg") with four parameter groups. A 3D Gaussian is $G(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$ — the normalisation is dropped because opacity controls magnitude separately.
**Parameter 1 — Mean $\boldsymbol{\mu}$.** Position in 3D space — three floats.
**Parameter 2 — Covariance $\Sigma$ (the tricky one).** $\Sigma$ is a symmetric $3\times3$ matrix with 6 independent values. The problem: $\Sigma$ must be positive semi-definite (PSD) to be a valid covariance; directly optimising the 6 entries by SGD has no guarantee of preserving PSD-ness — a single bad step and the Gaussian *inverts* or collapses. The fix: factorise $\Sigma = R S S^\top R^\top$, where $S$ is a diagonal scale matrix (positive entries) and $R$ is a rotation matrix. This is PSD by construction. So instead of optimising 6 raw entries, optimise: scale — 3 values $(s_x, s_y, s_z)$, parameterised as $\exp(\cdot)$ to guarantee positivity; rotation — a quaternion $q$, 4 values. Quaternions over Euler angles because Euler angles suffer from gimbal lock and discontinuities; quaternions give smooth, well-conditioned rotations. Total for $\Sigma$: $3 + 4 = 7$ params (one more than the 6 of direct optimisation, but PSD-safe).
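A minimal sketch of this PSD-safe parameterisation, assuming a (w, x, y, z) quaternion convention; function names are illustrative, not taken from the reference code.

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    q = q / np.linalg.norm(q)                      # keep q on the unit sphere
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_from_params(log_scale, quat):
    """Sigma = R S S^T R^T: PSD by construction, whatever SGD does to the raw params."""
    S = np.diag(np.exp(log_scale))                 # exp guarantees positive scales
    R = quat_to_rotmat(quat)
    M = R @ S
    return M @ M.T                                 # = R S S^T R^T

# Sanity check: eigenvalues are non-negative for arbitrary raw parameters.
Sigma = covariance_from_params(np.random.randn(3), np.random.randn(4))
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-9)
```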
**Parameter 3 — Opacity $\alpha$.** One scalar; stored raw and sigmoid'd to map into $(0, 1)$. $\alpha \approx 1$ → solid; $\alpha \approx 0$ → invisible.
**Parameter 4 — Colour via spherical harmonics (48 params).** Plain RGB doesn't work because real surfaces are *view-dependent*: a chrome ball looks white from one angle, blue from another. Solution — spherical harmonics (SH), an orthonormal basis on the unit sphere (Fourier-like). 3DGS uses SH of degree 3, giving $(3+1)^2 = 16$ coefficients per channel, $\times$ 3 channels (R, G, B) = 48 parameters per Gaussian. At render time, the SH series is evaluated at the actual view direction to get the colour for that viewpoint.
**Total per-Gaussian parameter count: $3 + 7 + 1 + 48 = 59$.** A typical scene uses 1M–4M Gaussians → 60M–240M parameters total. Big, but smaller than many neural networks, and rasterised in real time.
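A quick arithmetic check of the counts above (degree-3 SH assumed):

```python
# Per-Gaussian parameter budget (degree-3 spherical harmonics assumed).
mean, scale, quat, opacity = 3, 3, 4, 1
sh_per_channel = (3 + 1) ** 2                 # 16 basis functions up to degree 3
sh_total = 3 * sh_per_channel                 # 48 across R, G, B
per_gaussian = mean + scale + quat + opacity + sh_total
print(per_gaussian)                           # 59

for n_gaussians in (1_000_000, 4_000_000):    # typical scene sizes
    print(n_gaussians * per_gaussian)         # 59_000_000 and 236_000_000
```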
Pillar 2 — Image formation (rendering). Given the Gaussians and a camera pose, the renderer outputs a 2D image in three steps. Step 1: sort Gaussians by depth in camera coordinates (front-to-back). This ordering matters for alpha compositing. Step 2: project each 3D Gaussian onto the camera plane. A 3D Gaussian projects to a 2D Gaussian; the mean projects via standard pinhole projection, the 3D covariance projects to a 2D covariance via the Jacobian of the projection (a linearisation of perspective at that point). Step 3: alpha-composite the 2D Gaussians front-to-back.
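A sketch of Step 2 in the EWA-splatting style that 3DGS follows: rotate the covariance into camera space, then apply the Jacobian of the pinhole projection at the Gaussian's mean. Intrinsics are simplified to focal lengths `fx`, `fy`, and names are illustrative assumptions.

```python
import numpy as np

def project_gaussian(mu_world, Sigma_world, W, t, fx, fy):
    """Project a 3D Gaussian to a 2D image-plane Gaussian (linearised perspective).

    W, t: world-to-camera rotation and translation; fx, fy: focal lengths (pixels).
    Returns the projected 2D mean and 2x2 covariance.
    """
    mu_cam = W @ mu_world + t                       # mean in camera coordinates
    x, y, z = mu_cam
    mu_2d = np.array([fx * x / z, fy * y / z])      # pinhole projection of the mean

    # Jacobian of the pinhole projection, evaluated at the camera-space mean.
    J = np.array([[fx / z, 0.0,    -fx * x / z**2],
                  [0.0,    fy / z, -fy * y / z**2]])

    Sigma_cam = W @ Sigma_world @ W.T               # rotate covariance into camera frame
    Sigma_2d = J @ Sigma_cam @ J.T                  # 2D covariance of the splat
    return mu_2d, Sigma_2d
```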
The rendering equation — memorise verbatim. For each pixel, $C = \sum_{i=1}^{N} c_i\,\alpha_i \prod_{j=1}^{i-1}(1 - \alpha_j)$. Here $c_i$ is the colour of Gaussian $i$ (evaluated from its SH at the current view direction); $\alpha_i$ is its *effective opacity* at this pixel — the stored $\alpha$ modulated by the 2D Gaussian's value at the pixel's location (so the centre of the projection has high effective $\alpha$, the periphery has low); the product $T_i = \prod_{j=1}^{i-1}(1-\alpha_j)$ is the transmittance — the probability that light has passed through everything in front without being absorbed. When transmittance drops below a threshold (e.g. $10^{-4}$), iteration stops — one of the optimisations that keeps rendering fast.
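A minimal per-pixel version of this equation, assuming the Gaussians are already depth-sorted and their effective opacities at this pixel have been evaluated (names are illustrative; the real rasteriser does this per tile in CUDA).

```python
import numpy as np

def composite_pixel(colors, alphas, T_min=1e-4):
    """Front-to-back alpha compositing for one pixel.

    colors: (N, 3) view-dependent colours of depth-sorted Gaussians (nearest first)
    alphas: (N,) effective opacities at this pixel (stored alpha x 2D Gaussian value)
    """
    C = np.zeros(3)
    T = 1.0                                    # transmittance: light budget still unabsorbed
    for c_i, a_i in zip(colors, alphas):
        C += T * a_i * c_i                     # this Gaussian's contribution
        T *= (1.0 - a_i)                       # what survives to the Gaussians behind
        if T < T_min:                          # early termination: nothing left to see
            break
    return C
```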
Differentiable rasterisation — the key engineering. Every operation (projection, sorting, alpha compositing) is differentiable. The 3DGS team built a differentiable tile-based rasteriser in CUDA that does this in parallel on the GPU. Gradients of pixel colour flow back to every Gaussian's parameters — $\partial C / \partial(\boldsymbol{\mu}, q, s, \alpha, \text{SH})$. This is what enables optimisation by gradient descent.
Pillar 3 — Optimisation. We have a render → compare-to-photo → loss → backprop loop, but we don't know *how many* Gaussians the scene needs or *where* they should be. COLMAP gives thousands; the final scene needs millions. So optimisation must adaptively grow and prune the set.
Adaptive Density Control (ADC). Periodically (every few hundred iters), three operations modify the Gaussian set. (1) Densify & clone — for *small* Gaussians in *under-reconstructed* regions (high positional gradient = the optimiser wants to move the Gaussian = one Gaussian isn't enough), clone the Gaussian (duplicate it) and let optimisation pull the copies apart. (2) Densify & split — for *large* Gaussians in *over-reconstructed* regions (also high gradient, but the Gaussian is too big), split into two smaller Gaussians placed at sampled positions inside the original ellipsoid, with reduced scale (typically scale / 1.6). (3) Prune — remove transparent Gaussians ($\alpha$ below a small threshold, e.g. 0.005). They contribute nothing but cost compute.
Why does ADC need gradients? Two reasons. *(a)* To decide *which* Gaussians to densify — gradients tell you which are "trying to do too much" (high positional gradient = unsatisfied at current location/size). *(b)* To decide the *direction* of the new Gaussian in densify-and-clone — clone in the gradient's direction so the copies cover the under-reconstructed region.
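A schematic version of the three ADC decisions. Thresholds, array layout, and function names here are illustrative assumptions for the sketch, not the reference implementation's values.

```python
import numpy as np

def adaptive_density_control(mu, scales, alphas, pos_grad,
                             grad_thresh=2e-4, size_thresh=0.01, alpha_min=0.005):
    """One ADC pass: clone small high-gradient Gaussians, split large ones, prune transparent ones.

    mu: (N, 3) means; scales: (N, 3) per-axis scales; alphas: (N,) opacities;
    pos_grad: (N, 3) accumulated positional gradients per Gaussian.
    """
    grad_mag = np.linalg.norm(pos_grad, axis=1)
    needs_density = grad_mag > grad_thresh             # "unsatisfied" at its current location/size
    is_small = scales.max(axis=1) <= size_thresh

    clone_mask = needs_density & is_small              # under-reconstructed: duplicate
    split_mask = needs_density & ~is_small             # over-reconstructed: two smaller children

    # Clone: copy the Gaussian and nudge the copy along the gradient direction.
    clone_mu = mu[clone_mask] + pos_grad[clone_mask]
    # Split: sample child positions inside the parent ellipsoid, shrink the scale.
    split_mu = mu[split_mask] + np.random.randn(split_mask.sum(), 3) * scales[split_mask]
    split_scales = scales[split_mask] / 1.6

    keep_mask = alphas > alpha_min                     # prune near-transparent Gaussians
    return clone_mask, split_mask, keep_mask, clone_mu, split_mu, split_scales
```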
Loss function. $\mathcal{L} = (1-\lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}}$, with $1-\lambda = 0.8$ and $\lambda = 0.2$. $\mathcal{L}_1$ gives per-pixel absolute difference (crisp details); $\mathcal{L}_{\text{D-SSIM}}$ uses the Structural Similarity Index, which compares local windows accounting for luminance, contrast, and structure (perceptual quality). The mix gives the best of both.
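A minimal sketch of the loss mix, assuming some `ssim_fn` implementation is supplied (the name is a placeholder; any windowed SSIM works).

```python
import numpy as np

def l1_dssim_loss(rendered, target, ssim_fn, lam=0.2):
    """Combined photometric loss: (1 - lam) * L1 + lam * D-SSIM.

    rendered, target: (H, W, 3) images in [0, 1].
    ssim_fn: any SSIM implementation returning a scalar (placeholder here).
    """
    l1 = np.abs(rendered - target).mean()            # crisp per-pixel term
    d_ssim = 1.0 - ssim_fn(rendered, target)         # structural dissimilarity term
    return (1.0 - lam) * l1 + lam * d_ssim
```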
The full algorithm in 8 steps. (1) Initialise Gaussians from COLMAP sparse cloud. (2) Pick a random training photo. (3) Render the scene from that photo's camera pose using the differentiable rasteriser. (4) Compute loss vs ground-truth. (5) Backprop. (6) Update parameters with Adam. (7) Periodically run ADC: densify-clone / densify-split / prune. (8) Repeat for ~30 000 iterations.
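The eight steps as a loop skeleton. The callables (`rasterize`, `loss_fn`, `adam_step`, `run_adc`) are placeholders standing in for the differentiable CUDA rasteriser, the loss above, the Adam update, and the ADC pass; this is a structural outline under those assumptions, not the reference implementation.

```python
import random

def train_3dgs(gaussians, cameras, images, rasterize, loss_fn, adam_step, run_adc,
               n_iters=30_000, adc_every=100):
    """Skeleton of the 3DGS optimisation loop (steps 1-8 above).

    gaussians: parameter set initialised from the COLMAP sparse cloud (step 1)
    cameras, images: training poses and their ground-truth photos
    """
    for it in range(1, n_iters + 1):
        idx = random.randrange(len(images))            # (2) pick a random training photo
        rendered = rasterize(gaussians, cameras[idx])  # (3) differentiable rasterisation
        loss = loss_fn(rendered, images[idx])          # (4) loss vs ground truth
        loss.backward()                                # (5) backprop (assumes an autograd framework)
        adam_step(gaussians)                           # (6) Adam update of all 59 params/Gaussian
        if it % adc_every == 0:
            gaussians = run_adc(gaussians)             # (7) clone / split / prune
    return gaussians                                   # (8) optimised Gaussian set
```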
Evaluation — three metrics. Every 8th training image is held out as a validation view. PSNR (Peak Signal-to-Noise Ratio) $= 10\log_{10}\!\big(\text{MAX}^2/\text{MSE}\big)$, where $\text{MAX}$ = max pixel value (255 for 8-bit). Higher is better. Unbounded; ~30 dB is good. SSIM in $[-1, 1]$, higher better, sliding kernel over the image, captures luminance + contrast + structure. LPIPS (Learned Perceptual Image Patch Similarity), $\geq 0$, lower better: divide both images into patches, run through a pretrained network (AlexNet/VGG), measure distance in deep feature space. Captures "do these look the same to a human" much better than pixel L2.
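PSNR is simple enough to compute directly; SSIM and LPIPS need a windowed implementation and a pretrained network respectively, so only PSNR is sketched here.

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two images of the same shape."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# Same arithmetic as the worked example in the Examples section:
print(10.0 * np.log10(255.0 ** 2 / 25.0))    # ~34.15 dB for MSE = 25
print(10.0 * np.log10(255.0 ** 2 / 100.0))   # ~28.13 dB for MSE = 100
```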
Standard benchmarks. Tanks & Temples (outdoor scenes), Deep Blending (diverse scenes), Mip-NeRF 360 (the standard NeRF benchmark, 360° captures). Newer: DL3DV-10K.
3DGS vs NeRF — the comparison. Representation: implicit MLP vs explicit Gaussians. Neural network? Yes (small MLP) vs none. Training time: hours–days vs tens of minutes. Rendering: ~1 fps vs 100+ fps (real time). Editable? Hard vs yes (edit individual Gaussians). Quality on static scenes: high vs comparable or better. Storage: ~MB (MLP weights) vs ~GB (millions of Gaussians). The trade-off: 3DGS pays more memory for dramatically faster rendering, faster training, and editability.
Definitions
- Novel view synthesis — Given photos of a scene, render the scene from a new camera pose not in the original set.
- Explicit vs implicit 3D representation — Explicit = enumerate primitives directly (points, mesh, voxels, Gaussians). Implicit = encode as a function (SDF, NeRF MLP). 3DGS is explicit-but-fuzzy: explicit primitives that are 3D Gaussians, not hard points.
- Signed Distance Function (SDF) — Implicit representation returning the signed distance to the nearest surface; surface = zero-level set.
- NeRF — Neural Radiance Field: MLP $F_\Theta(\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$; render by volumetric integration along camera rays.
- COLMAP — Open-source Structure-from-Motion + multi-view-stereo pipeline. For 3DGS: provides camera intrinsics, extrinsics (poses), and a sparse point cloud for initialisation. Pose estimation in 3DGS is classical, not learned.
- Spherical harmonics (SH) — Orthonormal angular basis on the unit sphere. Degree $\ell$ has $2\ell + 1$ functions. 3DGS uses degrees 0–3 → $(3+1)^2 = 16$ per channel → 48 for RGB. Captures view-dependent appearance (specular highlights).
- Adaptive Density Control (ADC) — Three operations during optimisation: Clone (small Gaussian, high pos gradient), Split (large Gaussian, high pos gradient → two smaller, scale / 1.6), Prune (low-opacity Gaussian).
- Differentiable rasterisation — Tile-based GPU rasteriser whose every step (sort, project, composite) is differentiable, so pixel gradients flow back to every Gaussian's parameters $(\boldsymbol{\mu}, q, s, \alpha, \text{SH})$.
- Transmittance — $T_i = \prod_{j=1}^{i-1}(1 - \alpha_j)$ — probability that light has passed through everything in front of Gaussian $i$ without being absorbed.
- PSNR — Peak Signal-to-Noise Ratio $= 10\log_{10}(\text{MAX}^2/\text{MSE})$ where $\text{MAX} = 255$ for 8-bit. Higher better; unbounded; ~30 dB good for 8-bit.
- SSIM — Structural Similarity Index, $\in [-1, 1]$ (1 = identical), higher better. Sliding-kernel comparison capturing luminance, contrast, structure.
- LPIPS — Learned Perceptual Image Patch Similarity. Distance in the feature space of a pretrained AlexNet/VGG; lower better; captures human perception.
Formulas
- 3D Gaussian (unnormalised): $G(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$.
- Covariance decomposition: $\Sigma = R S S^\top R^\top$, with $S = \text{diag}(s_x, s_y, s_z)$ and $R$ from quaternion $q$.
- Rendering equation: $C = \sum_{i=1}^{N} c_i\,\alpha_i \prod_{j=1}^{i-1}(1-\alpha_j)$; transmittance $T_i = \prod_{j=1}^{i-1}(1-\alpha_j)$.
- Loss: $\mathcal{L} = (1-\lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}}$, $\lambda = 0.2$.
- PSNR $= 10\log_{10}(\text{MAX}^2/\text{MSE})$.
Derivations
**Why decompose $\Sigma = R S S^\top R^\top$.** A symmetric $3\times3$ matrix has 6 free parameters; for it to be a valid covariance it must be PSD, i.e. all eigenvalues $\geq 0$. SGD on the 6 raw entries can produce negative eigenvalues at any step. The decomposition gives 7 free parameters: 3 in the diagonal $S$ (parameterised as $\exp(\cdot)$ for positivity) and 4 in a unit quaternion for $R$. For any vector $\mathbf{v}$: $\mathbf{v}^\top R S S^\top R^\top \mathbf{v} = \|S^\top R^\top \mathbf{v}\|^2 \geq 0$. PSD by construction.
Why a quaternion (4 params), not Euler angles (3)? Euler angles (roll, pitch, yaw) are 3 params but suffer from gimbal lock (loss of one DOF when two axes align) and are discontinuous (small rotations near a pole produce large parameter jumps). Quaternions are 4 unit-norm parameters describing 3D rotations smoothly and without singularities; the price is one extra parameter and a unit-norm normalisation step.
**Why the normalisation constant of the Gaussian is dropped.** The standard Gaussian density carries $\frac{1}{(2\pi)^{3/2}|\Sigma|^{1/2}}$. In 3DGS, magnitude is controlled separately by opacity $\alpha$, so the constant is redundant; dropping it removes a costly computation per Gaussian per pixel.
SH degree 3 → 16 coefficients per channel. Spherical harmonics of order $\ell$ have $2\ell + 1$ functions; the total up to degree $L$ is $\sum_{\ell=0}^{L}(2\ell+1) = (L+1)^2$. For $L = 3$: $1 + 3 + 5 + 7 = 16$. Times 3 colour channels = 48 SH parameters per Gaussian.
The transmittance term in the rendering equation. Reading $T_i = \prod_{j=1}^{i-1}(1-\alpha_j)$ as the probability that light has passed through Gaussians $1, \dots, i-1$ without being absorbed at each. After $i-1$ semi-transparent layers, the remaining "budget" of light $T_i$ reaches Gaussian $i$; that budget multiplied by $i$'s effective $\alpha_i$ and colour $c_i$ is its contribution. As more opaque Gaussians appear in front, transmittance decays towards zero — discrete-form volumetric rendering.
PSNR is monotone in MSE. PSNR $= 10\log_{10}(\text{MAX}^2/\text{MSE}) = 20\log_{10}\text{MAX} - 10\log_{10}\text{MSE}$. Differentiating with respect to MSE gives $\frac{\partial\,\text{PSNR}}{\partial\,\text{MSE}} = -\frac{10}{\text{MSE}\,\ln 10} < 0$, so PSNR decreases monotonically as MSE rises. Two reconstructions can be ranked by either equivalently.
Examples
- Per-Gaussian parameter breakdown for one Gaussian. Mean $\boldsymbol{\mu}$ → 3. Scale $(s_x, s_y, s_z)$ → 3. Rotation quaternion $q$ → 4. Opacity $\alpha$ → 1. SH: degree 3 → 16 per channel × 3 channels → 48. Total: 3 + 3 + 4 + 1 + 48 = 59.
- Scene parameter count. 1M Gaussians × 59 params/Gaussian = 59M parameters — comparable to a small ConvNet but describing this one scene only. 4M Gaussians (typical Mip-NeRF 360 quality) → 236M params, ~1–2 GB on disk.
- ADC split-vs-clone decision. Iteration 5 000. Gaussian A: small, but the gradient on its position $\boldsymbol{\mu}$ is large → *under-reconstructed region* → clone, nudging the copy in the gradient direction. Gaussian B: large, gradient on $\boldsymbol{\mu}$ also large → *over-reconstructed* → split into two smaller Gaussians with scale / 1.6, placed at sampled points inside B's ellipsoid.
- Pruning step. After iteration 10 000, list all Gaussians whose opacity $\alpha$ (post-sigmoid) is below the pruning threshold. Delete them. Typically removes a few percent per pass; total Gaussian count grows over training with periodic dips at pruning.
- PSNR worked example. Two renderers on the same scene: A gets MSE = 25 → PSNR $= 10\log_{10}(255^2/25) \approx 34.2$ dB. B gets MSE = 100 → PSNR $= 10\log_{10}(255^2/100) \approx 28.1$ dB. A is better (higher PSNR, lower MSE — consistent ordering).
- The three-metric reading. A 3DGS reconstruction of an outdoor scene reports PSNR = 27.5 dB, SSIM = 0.88, LPIPS = 0.18. PSNR is moderate (some pixel error); SSIM is high (local structure preserved); LPIPS is low (perceptually close). For VR / asset reuse, all three should be reported; one alone can mislead.
Diagrams
- 3DGS pipeline. Photos → COLMAP (camera poses + sparse point cloud) → initialise one Gaussian per sparse point → loop: pick random photo → differentiable rasterise → compute $\mathcal{L}$ → backprop → Adam update → every few hundred iters run ADC (clone / split / prune) → after 30 000 iters output the optimised Gaussian set.
- Alpha-compositing illustration. Three sorted Gaussians at depths $d_1 < d_2 < d_3$ along a ray; show transmittance $T_1 = 1$, $T_2 = 1 - \alpha_1$, $T_3 = (1-\alpha_1)(1-\alpha_2)$; pixel colour = $c_1\alpha_1 T_1 + c_2\alpha_2 T_2 + c_3\alpha_3 T_3$.
- PSD parameterisation visual. Axis-aligned ellipsoid from the scale matrix $S$ → rotate by quaternion $q$ (rotation $R$) → oriented ellipsoid (the actual 3D Gaussian shape).
- ADC operations. *Clone*: a small under-reconstructed Gaussian splits into two identical copies, then the gradient pulls them apart. *Split*: a large over-reconstructed Gaussian becomes two smaller Gaussians (scale / 1.6) at sampled positions inside it. *Prune*: a translucent Gaussian (low $\alpha$) is deleted.
- NeRF vs 3DGS rendering. NeRF: ray marching with MLP queries at each sample (expensive). 3DGS: rasterise sorted projected Gaussians (cheap, parallelisable on GPU tiles).
Edge cases
- No COLMAP init. Random init wastes early optimisation; many runs fail to converge or produce blurry geometry. COLMAP is effectively required.
- Over-fitting individual training views when test views are sparse. ADC pruning + periodic opacity reset are the safety valves.
- Very view-dependent materials (mirror finishes, anisotropic BRDFs) saturate SH degree 3. Higher SH degree or learned BRDFs are needed for those.
- Floaters — semi-transparent Gaussians that exist in empty space because they happened to help one training view. ADC pruning and depth regularisation reduce them.
- Reflective / non-Lambertian surfaces look fine from training viewpoints but break under novel views; this is a known failure mode shared with NeRF.
- Quaternion drift. After many gradient steps, $q$ drifts off the unit sphere; renormalisation every step (or every few steps) is required for $R(q)$ to remain a valid rotation.
Common mistakes
- Claiming 3DGS *learns* weights that generalise across scenes — no; it's per-scene optimisation. Each new scene is a fresh fit, parameters are not reusable.
- Forgetting to sort by depth before alpha compositing — composite order matters; unsorted compositing produces incorrect transmittance and colour.
- Optimising $\Sigma$ as a free symmetric matrix — produces invalid (non-PSD) covariances mid-training; always use the $R S S^\top R^\top$ decomposition.
- Stating PSNR is bounded — it isn't. Better reconstructions can have arbitrarily large PSNR (limited only by the numerical precision of MSE).
- Picking 3 quaternion parameters because "3D rotation has 3 DOF" — the unit-norm constraint absorbs one DOF, so quaternions use 4 parameters. Euler angles are 3 but suffer from gimbal lock.
- Counting 3 + 6 + 1 + 48 = 58 by using the 6 direct covariance parameters — 3DGS uses the 7-param $(q, s)$ decomposition, giving 59 per Gaussian.
- Saying "3DGS is end-to-end" — it isn't. Camera-pose estimation is classical (COLMAP) and happens before 3DGS begins.
- Forgetting that the normalisation constant of the Gaussian density is *dropped* in 3DGS because opacity controls magnitude separately.
Shortcuts
- Per-Gaussian param count: $3\,(\boldsymbol{\mu}) + 3\,(s) + 4\,(q) + 1\,(\alpha) + 48\,(\text{SH}) = 59$. Practice writing the breakdown.
- **$\Sigma$ decomposition:** $\Sigma = R S S^\top R^\top$, where $R$ comes from a quaternion $q$ (4 params) and $S = \text{diag}(s_x, s_y, s_z)$ (3 params, positive).
- **SH degree 3 → $(3+1)^2 = 16$ coefficients per channel → 48 total.**
- Loss mix: $\mathcal{L} = 0.8\,\mathcal{L}_1 + 0.2\,\mathcal{L}_{\text{D-SSIM}}$ on rendered vs ground-truth images.
- Metrics direction: PSNR ↑, SSIM ↑, LPIPS ↓. Memorise.
- Three pillars: *Scene modelling → Image formation → Optimisation.*
- Three rendering steps: *Sort by depth → project 3D→2D → alpha-composite.*
Proofs / Algorithms
**$\Sigma = R S S^\top R^\top$ is PSD.** For any $\mathbf{v}$: $\mathbf{v}^\top R S S^\top R^\top \mathbf{v} = (S^\top R^\top \mathbf{v})^\top (S^\top R^\top \mathbf{v}) = \|S^\top R^\top \mathbf{v}\|^2 \geq 0$. Equality iff $S^\top R^\top \mathbf{v} = 0$, i.e. $\mathbf{v}$ is in the null space of $S^\top R^\top$. PSD by construction.
The rendering equation is differentiable end-to-end. Each term $c_i\,\alpha_i\prod_{j<i}(1-\alpha_j)$ is a product of smooth functions of the Gaussian parameters (projection is smooth in $\boldsymbol{\mu}$ and $\Sigma$; effective $\alpha_i$ is smooth in the projected 2D Gaussian and the stored opacity; $c_i$ is linear in the SH coefficients). The sum-of-products preserves smoothness. Therefore $\partial C/\partial(\boldsymbol{\mu}, q, s, \alpha, \text{SH})$ exists and is computable analytically — exactly what the CUDA rasteriser provides.
PSNR is monotone in MSE. $\frac{d\,\text{PSNR}}{d\,\text{MSE}} = -\frac{10}{\text{MSE}\,\ln 10} < 0$. So PSNR strictly decreases as MSE increases, and the two metrics produce identical rankings of reconstructions.