NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering
Splatters of Light
Imagine you have 100 photos of your living room — taken from different angles with your phone. Could a computer reconstruct the entire 3D scene from those photos? And could it then render the scene from any new viewpoint that wasn't in your original set?
This is the novel view synthesis problem, and until 2020 it was hard. Then NeRF (Neural Radiance Fields, Mildenhall et al., ECCV 2020) made it work — but slowly, training for hours per scene and rendering at ~1 fps. In 2023, 3D Gaussian Splatting (Kerbl et al., SIGGRAPH 2023) broke the speed barrier: comparable quality, but real-time rendering (100+ fps) and an order of magnitude faster training.
3DGS is genuinely shocking because it's so *different* from everything else in your course. There is no neural network. There is no learning. It is a per-scene optimisation problem. The "weights" being optimised are the parameters of a few million little 3D blobs that, when rendered, look like the photos. That's it.
The taxonomy
Two ways to represent 3D, plus a third that 3DGS pioneers:
- Explicit — directly enumerate the geometry. Point clouds, meshes (vertices + faces), voxels.
- Implicit — encode the geometry as a function. SDF: $f(\mathbf{x})$ = signed distance from $\mathbf{x}$ to the nearest surface (surface = zero-level set). NeRF: $F_\Theta(x, y, z, \theta, \phi) \to (r, g, b, \sigma)$, a small MLP.
- 3DGS sits in the third slot — explicit (like point clouds), but each "point" is a fuzzy 3D Gaussian rather than a hard point.
Why NeRF was the breakthrough — and why people wanted something else
NeRF works like this. Train an MLP $F_\Theta: (x, y, z, \theta, \phi) \mapsto (r, g, b, \sigma)$ — position and view direction in, colour and density out. To render a pixel: trace a ray from the camera through that pixel, sample many points along the ray, query the MLP at each, and integrate colour weighted by density (volumetric rendering). Loss: rendered pixel vs ground-truth pixel from your photos.
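As a concrete sketch, here is the volume-rendering quadrature for a single ray in PyTorch. The `mlp` callable and the near/far bounds are assumptions for illustration, not the paper's exact interface:

```python
import torch

def render_ray(mlp, origin, direction, n_samples=64, t_near=0.1, t_far=6.0):
    """NeRF-style volume rendering of one ray (sketch; `mlp` is assumed to
    map (positions, view directions) -> (rgb, density))."""
    t = torch.linspace(t_near, t_far, n_samples)           # sample depths along the ray
    pts = origin + t[:, None] * direction                  # (n_samples, 3) query points
    rgb, sigma = mlp(pts, direction.expand(n_samples, 3))  # colour + density per point
    delta = t[1] - t[0]                                    # spacing between samples
    alpha = 1.0 - torch.exp(-sigma * delta)                # per-segment opacity
    # Transmittance: probability light reached sample i unabsorbed
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    return (T[:, None] * alpha[:, None] * rgb).sum(dim=0)  # integrated pixel colour
```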
It works beautifully. But three painful properties:
1. Slow training — hours to days per scene.
2. Slow rendering — ~1 fps. Useless for VR or interactive apps.
3. The geometry is locked inside an MLP — you can't edit it, you can't extract a mesh, you can't easily compose multiple scenes.
3DGS keeps NeRF's *fit-pictures-to-a-scene* objective but throws away the MLP. Real-time. Editable. No neural network.
"Learning or Optimisation?"
The lecturer asks this question explicitly and answers it: 3DGS is per-scene optimisation. There are no neural networks involved.
Compare:
- Parameters in a neural network: weights and biases $\theta = \{W_1, b_1, W_2, b_2, \dots\}$, shared across all inputs.
- Parameters in 3DGS: $(\boldsymbol{\mu}_i, \Sigma_i, \alpha_i, \text{SH}_i)$ for each Gaussian $i$.
When you "train" a CNN, you optimise weights against a dataset of many images and hope they generalise. When you "train" 3DGS, you optimise the parameters of a few million Gaussians against *one specific scene's photos*. There's no concept of generalisation — a 3DGS for your living room only works for your living room. Train again on someone else's living room → totally different parameters.
COLMAP — the pre-step that's not 3DGS
Before 3DGS runs, you feed your photos through COLMAP, a classical Structure-from-Motion pipeline. COLMAP outputs:
1. Camera poses — for each photo, the 6-DOF position and orientation when it was taken.
2. A sparse point cloud — a small set of 3D points recovered from feature matching.
Both are necessary. Camera poses tell 3DGS where each photo was taken from. The sparse cloud seeds the optimisation — one Gaussian per sparse point. Without COLMAP's init, 3DGS has nothing to start from.
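Since the sparse cloud seeds one Gaussian per point, here is a minimal sketch of parsing COLMAP's text-format `points3D.txt` to get the seed positions and colours (the path and helper name are illustrative):

```python
import numpy as np

def load_colmap_points(path="sparse/0/points3D.txt"):
    """Read COLMAP's text-format sparse cloud.

    Each non-comment line: POINT3D_ID X Y Z R G B ERROR TRACK[...].
    We keep only position and colour to initialise the Gaussians.
    """
    xyz, rgb = [], []
    with open(path) as f:
        for line in f:
            if line.startswith("#"):   # skip header comments
                continue
            vals = line.split()
            xyz.append([float(v) for v in vals[1:4]])
            rgb.append([int(v) for v in vals[4:7]])
    return np.array(xyz), np.array(rgb) / 255.0
```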
Exam note: 3DGS is *not end-to-end*. Pose estimation is done classically, before optimisation begins.
The Three Pillars
Memorise these as the spine of every answer.
1. Scene Modelling — how the scene is parameterised (what each Gaussian *is*).
2. Image Formation — how Gaussians get rendered into a 2D image.
3. Optimisation — how the Gaussians get fit to the photos.
Pillar 1 — Scene Modelling
A 1D Gaussian is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$. The multivariate 3D version is

$$G(\mathbf{x}) = e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$

with $\boldsymbol{\mu}$ the centre and $\Sigma$ the covariance. The normalisation constant is dropped in 3DGS because opacity controls magnitude separately.
Visualise each Gaussian as a fuzzy 3D ellipsoid — an "egg" — that fades smoothly from solid centre to transparent edges.
**Parameter 1 — Mean $\boldsymbol{\mu}$ (3 params).** Position. Three floats.
**Parameter 2 — Covariance $\Sigma$ (the tricky one).** Symmetric → 6 independent values. *Problem:* $\Sigma$ must be positive semi-definite to be a valid covariance; SGD on the 6 raw entries can break PSD-ness mid-training. *Fix — factorise:*

$$\Sigma = R\, S\, S^\top R^\top$$

where $S$ is a diagonal positive scale matrix and $R$ is a rotation matrix. PSD by construction. Optimise $S$ and $R$ separately:
- Scale — 3 values $(s_x, s_y, s_z)$. Stored as $\log s$ and passed through $\exp$, so scales are always positive (negative raw values are handled by the exp activation).
- Rotation — a quaternion $q = (q_w, q_x, q_y, q_z)$. 4 values. *Why not 3 (Euler)?* Euler angles suffer from gimbal lock and discontinuities. Quaternions are smooth, well-conditioned 3D rotations; normalise the 4-vector to unit length after each step.
So $\Sigma$ via $S$ (3) and $q$ (4) = 7 params (one more than the 6 of direct optimisation, but always valid).
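A minimal PyTorch sketch of assembling $\Sigma$ from the raw optimisable parameters, assuming log-scales and unnormalised quaternions are what is stored per Gaussian:

```python
import torch

def build_covariance(log_scale: torch.Tensor, quat: torch.Tensor) -> torch.Tensor:
    """Assemble Sigma = R S S^T R^T from raw (optimisable) parameters.

    log_scale: (N, 3) raw scales, exp'd so they are always positive.
    quat:      (N, 4) raw quaternions (w, x, y, z), normalised to unit length.
    Returns:   (N, 3, 3) positive semi-definite covariance matrices.
    """
    s = torch.exp(log_scale)                    # exp activation -> positive scales
    q = quat / quat.norm(dim=-1, keepdim=True)  # unit quaternion -> valid rotation
    w, x, y, z = q.unbind(-1)

    # Standard quaternion -> rotation-matrix conversion
    R = torch.stack([
        1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y),
        2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x),
        2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y),
    ], dim=-1).reshape(-1, 3, 3)

    S = torch.diag_embed(s)                     # diagonal scale matrix
    M = R @ S                                   # Sigma = (RS)(RS)^T, PSD by construction
    return M @ M.transpose(-1, -2)
```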
**Parameter 3 — Opacity $\alpha$ (1 param).** Stored raw, sigmoid'd into $[0, 1]$. $\alpha = 1$: solid; $\alpha = 0$: invisible.
**Parameter 4 — Colour via spherical harmonics (48 params).** Why not plain RGB? Because real surfaces are *view-dependent* — a shiny chrome ball looks white from one angle, blue from another, red from a third. Solution: spherical harmonics, a basis on the unit sphere that approximates any function over viewing directions (Fourier-like, but on the sphere).
3DGS uses SH of degree 3. The number of SH coefficients for degree $\ell$ is $(\ell + 1)^2$, so $(3 + 1)^2 = 16$ per colour channel (R, G, B), giving $16 \times 3 = 48$ parameters total per Gaussian. At render time, evaluate the SH series at the current view direction to get the colour you actually see.
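For intuition, here is a sketch of SH colour evaluation truncated to degree 1 (4 basis functions instead of the full 16). The constants follow the real-SH convention used by common 3DGS implementations; everything else is illustrative:

```python
import torch

# Real SH basis constants for degrees 0 and 1
C0 = 0.28209479177387814
C1 = 0.4886025119029199

def sh_to_rgb(sh: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Evaluate a degree-1 SH colour at given view directions.

    sh:   (N, 4, 3) coefficients -- 4 basis functions x 3 colour channels.
          (Full 3DGS uses degree 3: 16 basis functions, 48 coefficients.)
    dirs: (N, 3) unit view directions from camera to Gaussian centre.
    """
    x, y, z = dirs.unbind(-1)
    rgb = (C0 * sh[:, 0]
           - C1 * y[:, None] * sh[:, 1]
           + C1 * z[:, None] * sh[:, 2]
           - C1 * x[:, None] * sh[:, 3])
    # Shift and clamp, mirroring the reference implementation's convention
    return (rgb + 0.5).clamp(min=0.0)
```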
Total per Gaussian: $3 + 7 + 1 + 48 = 59$. A typical scene uses 1M–4M Gaussians → ~60M–240M parameters. Big, but still smaller than many neural networks, and rendered in real time.
Pillar 2 — Image Formation (Rendering)
Given the Gaussians and a camera pose, render the 2D image in three steps:
1. **Sort Gaussians by depth.** Compute each Gaussian's depth $z$ in camera coordinates. Sort front-to-back. Order matters for alpha compositing.
2. **Project each 3D Gaussian onto the camera plane.** The mean $\boldsymbol{\mu}$ projects via standard pinhole projection. The 3D covariance projects to a 2D covariance via $\Sigma' = J W \Sigma W^\top J^\top$ — $W$ the viewing transform, $J$ the Jacobian of the projection (linearisation of perspective at that point; sketched below).
3. **Alpha-composite** the projected 2D Gaussians, front-to-back.
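Step 2 is one line of code. A sketch, assuming $W$ and $J$ have already been computed for the current camera:

```python
import torch

def project_covariance(Sigma, W, J):
    """EWA-style projection of a 3D covariance to 2D: Sigma' = J W Sigma W^T J^T.

    Sigma: (3, 3) world-space covariance; W: (3, 3) rotation part of the view
    transform; J: (2, 3) Jacobian of the perspective projection at the mean.
    """
    return J @ W @ Sigma @ W.T @ J.T   # (2, 2) image-plane covariance
```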
The rendering equation — memorise verbatim:

$$C = \sum_{i \in \mathcal{N}} c_i\, \alpha_i\, T_i, \qquad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$$
- $c_i$ — colour of Gaussian $i$ (evaluated from its SH at the current view direction).
- $\alpha_i$ — *effective* opacity at this pixel: the Gaussian's stored $\alpha$, modulated by the value of the 2D-projected Gaussian at this pixel's location (centre of projection → high effective $\alpha_i$; periphery → low).
- $T_i$ — transmittance: the probability that light has passed through all the Gaussians in front of $i$ without being absorbed. When transmittance drops below a small threshold (~$10^{-4}$ in the reference rasteriser), the loop stops — further Gaussians can't contribute.
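A minimal sketch of the compositing loop for one pixel, assuming the Gaussians are already depth-sorted and their effective opacities at this pixel are known:

```python
import numpy as np

def composite_pixel(colors, alphas, t_min=1e-4):
    """Front-to-back alpha compositing for a single pixel.

    colors: (N, 3) view-dependent colours c_i of depth-sorted Gaussians.
    alphas: (N,)   effective opacities alpha_i at this pixel (stored opacity
                   times the 2D Gaussian's value at the pixel).
    """
    C = np.zeros(3)
    T = 1.0                     # transmittance starts at 1 (nothing in front)
    for c, a in zip(colors, alphas):
        C += c * a * T          # C = sum_i c_i alpha_i T_i
        T *= (1.0 - a)          # T_{i+1} = T_i (1 - alpha_i)
        if T < t_min:           # early termination: later Gaussians can't contribute
            break
    return C
```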
Differentiable rasterisation — the engineering punchline. Every step (projection, sorting, alpha compositing) is differentiable. The 3DGS team built a differentiable tile-based rasteriser in CUDA. Gradients of pixel colour flow back to *every Gaussian's parameters* — $\boldsymbol{\mu}$, $\Sigma$ (via $S$ and $q$), $\alpha$, SH. This is what enables optimisation by gradient descent.
Pillar 3 — Optimisation
We have the loop: render → compare to photo → loss → backprop. But you don't know *how many* Gaussians the scene needs, or *where* they should be. COLMAP gives thousands of sparse points; the final scene needs millions of Gaussians. So optimisation must adaptively grow and prune the set as it learns.
Adaptive Density Control (ADC)
Periodically (every few hundred iters), three operations modify the Gaussian set:
1. Densify and clone — for *small* Gaussians sitting in *under-reconstructed* regions (high positional gradient = the optimiser wants to move the Gaussian = one Gaussian isn't enough). Clone the Gaussian, then let optimisation pull the copies apart in the gradient direction.
2. Densify and split — for *large* Gaussians in *over-reconstructed* regions (also high gradient, but the Gaussian is too big). Split into two smaller Gaussians at positions sampled inside the original ellipsoid, with scale reduced (typically scale / 1.6).
3. Pruning — remove essentially transparent Gaussians ($\alpha$ below a small threshold). They contribute nothing but cost compute.
Why does ADC need gradients? Two reasons. *To decide which Gaussians to densify* — gradients reveal which Gaussians are "trying to do too much" (high positional gradient = unsatisfied at current location/size). *To decide the direction of the new Gaussian in densify-and-clone* — clone in the gradient direction so the copies cover the under-reconstructed region.
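A sketch of the three ADC decisions as boolean masks; the threshold values are illustrative, loosely following the paper's defaults, and the function name is hypothetical:

```python
import torch

def adc_masks(log_scales, opacities, pos_grads,
              grad_thresh=0.0002, scale_thresh=0.01, alpha_min=0.005):
    """Decide which Gaussians to clone, split, or prune (sketch).

    pos_grads: (N,) accumulated norm of the positional gradient per Gaussian,
    the signal for "this Gaussian is trying to do too much".
    """
    scales = torch.exp(log_scales)                           # (N, 3) positive scales
    max_scale = scales.max(dim=-1).values

    high_grad = pos_grads > grad_thresh                      # unsatisfied Gaussians
    clone_mask = high_grad & (max_scale <= scale_thresh)     # small -> clone
    split_mask = high_grad & (max_scale > scale_thresh)      # large -> split (scale / 1.6)
    prune_mask = torch.sigmoid(opacities) < alpha_min        # transparent -> prune
    return clone_mask, split_mask, prune_mask
```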
The "evolution of Gaussians over iterations" plot shows the count growing over training (often to millions), with periodic dips where pruning kicks in.
Loss

$$\mathcal{L} = (1 - \lambda)\, \mathcal{L}_1 + \lambda\, \mathcal{L}_{\text{D-SSIM}}, \qquad \lambda = 0.2$$

- $\mathcal{L}_1$ — pixel-wise absolute difference. Crisp details.
- $\mathcal{L}_{\text{D-SSIM}}$ — based on SSIM (Structural Similarity Index), comparing local windows for luminance, contrast, and structure. Perceptual fidelity.
The full algorithm in 8 steps
1. Initialise Gaussians from the COLMAP sparse cloud.
2. Pick a random training photo.
3. Render the scene from that photo's camera pose (differentiable rasteriser).
4. Compute loss against the ground-truth photo.
5. Backprop the loss to gradients on all Gaussian parameters.
6. Update parameters with Adam.
7. Periodically run ADC: densify-clone, densify-split, prune.
8. Repeat for ~30 000 iterations.
Coarse-to-fine: start sparse, refine and grow until photos match.
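Putting the eight steps together, a skeleton of the loop; `renderer`, `ssim`, and the `gaussians` container are assumed interfaces for illustration, not the reference implementation:

```python
import torch

def train_3dgs(gaussians, cameras, images, renderer, ssim,
               n_iters=30_000, lambda_dssim=0.2, adc_every=100):
    """Skeleton of the 3DGS optimisation loop (illustrative names throughout).

    `gaussians` holds the optimisable tensors (means, log-scales, quaternions,
    raw opacities, SH coefficients); `renderer` is a differentiable rasteriser.
    """
    opt = torch.optim.Adam(gaussians.parameters(), lr=1e-3)
    for it in range(n_iters):
        k = torch.randint(len(images), ()).item()   # 2. random training photo
        rendered = renderer(gaussians, cameras[k])  # 3. differentiable render
        l1 = (rendered - images[k]).abs().mean()    # 4. L1 term...
        dssim = 1.0 - ssim(rendered, images[k])     #    ...plus D-SSIM term
        loss = (1 - lambda_dssim) * l1 + lambda_dssim * dssim
        opt.zero_grad()
        loss.backward()                             # 5. grads on all Gaussian params
        opt.step()                                  # 6. Adam update
        if it % adc_every == 0:                     # 7. periodic density control
            gaussians.adaptive_density_control()    #    (clone / split / prune)
```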
Evaluation — three metrics
Every 8th training image is held out as a validation view. Three metrics then compare the rendered novel view to the held-out photo.
PSNR (Peak Signal-to-Noise Ratio).

$$\text{PSNR} = 10 \log_{10} \frac{255^2}{\text{MSE}}$$

for 8-bit images; MSE is the mean squared error between rendered and ground-truth pixels. Higher is better. Unbounded; ~30 dB is good. 3DGS typically achieves 25–35 dB depending on the scene.
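PSNR in a few lines, assuming 8-bit-range inputs:

```python
import numpy as np

def psnr(rendered: np.ndarray, target: np.ndarray) -> float:
    """PSNR in dB for images with values in [0, 255]."""
    mse = np.mean((rendered.astype(np.float64) - target.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```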
SSIM (Structural Similarity Index). Range $[-1, 1]$ (effectively $[0, 1]$ for natural images); higher is better; computed with a sliding kernel; captures luminance + contrast + structure. 3DGS achieves ~0.85+ on standard benchmarks.
LPIPS (Learned Perceptual Image Patch Similarity). Patches → pretrained network (AlexNet or VGG) → distance in deep feature space. Lower better. Captures human perception much better than pixel L2.
| Metric | Range | Best | Measures |
| --- | --- | --- | --- |
| PSNR | $[0, \infty)$ dB | higher | pixel-level fidelity |
| SSIM | $[-1, 1]$ | higher | local structure |
| LPIPS | $[0, \infty)$ | lower | perceptual similarity |
Memorise the directions — standard exam question.
Standard datasets
Tanks & Temples (outdoor), Deep Blending (diverse), Mip-NeRF 360 (360° captures — the standard NeRF benchmark). Plus the newer DL3DV-10K.
3DGS vs NeRF — the table you must walk in with
| | NeRF | 3DGS |
| --- | --- | --- |
| Representation | Implicit (MLP) | Explicit (Gaussians) |
| Neural network? | Yes (small MLP) | None |
| Training time | Hours to days | Tens of minutes |
| Rendering | Slow (~1 fps) | Real-time (100+ fps) |
| Editable? | Hard | Yes (edit individual Gaussians) |
| Quality on static scenes | High | Comparable or better |
| Storage | ~MB (MLP weights) | ~GB (millions of Gaussians) |
The trade-off: 3DGS pays in memory (millions of Gaussian parameters) for dramatically faster rendering and training, plus the ability to edit the scene directly.
What you carry into the exam
- Explicit vs implicit 3D representations, with 3DGS as the explicit-but-fuzzy third slot.
- NeRF's three problems that motivated 3DGS.
- 3DGS = per-scene optimisation, no neural network — the parameters *are* the Gaussians.
- COLMAP gives camera poses + sparse cloud; 3DGS is not end-to-end.
- The Three Pillars: scene modelling → image formation → optimisation.
- Each Gaussian has 59 parameters ($3 + 7 + 1 + 48$). $\Sigma = R S S^\top R^\top$ is PSD by construction. Exp activation on scales, quaternion (4) over Euler (3).
- SH degree 3 → 16 coefficients per channel → 48 for RGB.
- The rendering equation with effective-opacity and transmittance terms. The three rendering steps. Differentiable rasterisation makes the whole thing optimisable.
- ADC's three operations (clone / split / prune) and why each needs gradients.
- Loss $\mathcal{L} = (1 - \lambda)\mathcal{L}_1 + \lambda\mathcal{L}_{\text{D-SSIM}}$, $\lambda = 0.2$.
- Three metrics with directions: PSNR ↑, SSIM ↑, LPIPS ↓. Standard datasets.
- The full NeRF-vs-3DGS table.
3DGS is one of the most consequential vision papers of the last three years — and the bridge between classical 3D reconstruction and modern differentiable rendering.