ML — Logistic, NN+Backprop, Ensembles, Density, RNN, Metrics, kNN, Regression, PCA/SVD, Clustering
Intuition
ML in this course is the bag of mathematical tools that turn data into a function. Three task families: supervised (X → Y; classification, regression), unsupervised (X alone; clustering, density, dimensionality reduction), reinforcement (X → action with reward; not the focus here). The key tools — logistic regression, the multilayer perceptron with backprop, PCA, k-means — show up inside every modern CV system as building blocks (a softmax head IS multinomial logistic regression; a CNN backbone IS hierarchical MLP with weight sharing).
Explanation
Train / Val / Test. Train: fit weights. Val: tune hyperparameters (model size, learning rate, regularisation strength). Test: report final performance, ONCE. Never tune on test (otherwise test becomes another val and you have no unbiased estimate of generalisation). Curse of dimensionality: in high dimension $d$, distances concentrate, volumes blow up, and naive methods (kNN, KDE) need exponentially many samples to fill the space.
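Logistic regression — the canonical classifier. Binary: $\hat{y} = \sigma(w^\top x + b)$ where $\sigma(z) = 1/(1 + e^{-z})$. Loss = binary cross-entropy $L = -[\,y \log \hat{y} + (1 - y)\log(1 - \hat{y})\,]$. *Not MSE* — MSE is non-convex under sigmoid and gradients vanish. SGD update: $w \leftarrow w - \eta\,(\hat{y} - y)\,x$ — note the clean form, identical to linear regression's gradient (this is not coincidence: it's the canonical GLM result). Decision boundary is LINEAR in $x$: $w^\top x + b = 0$. Multinomial (softmax) for $K$ classes: $p_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$ with cross-entropy loss $L = -\sum_{k} y_k \log p_k$.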
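The SGD update for binary logistic regression can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy dataset, the learning rate, the epoch count, and the helper names `train_logistic` and `predict` are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(data, lr=0.1, epochs=200, seed=0):
    """Plain SGD on binary cross-entropy; data is a list of (x, y) pairs."""
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                      # shuffle between epochs
        for x, y in data:
            y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y_hat - y                    # dL/dz for sigmoid + cross-entropy
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    # The decision boundary is the line w.x + b = 0.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical, linearly separable toy data (two clusters in 2D).
toy = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0),
       ([2.0, 2.0], 1), ([2.0, 3.0], 1), ([3.0, 2.0], 1)]
w, b = train_logistic(toy)
```

Zero initialisation is harmless here because there is only one unit, so the symmetry problem of multi-neuron layers does not arise.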
Neural networks. One neuron: $a = \phi(w^\top x + b)$. Stack into layers (MLP). Activations: *sigmoid* — saturates both ends, vanishing gradient. *tanh* — zero-centred but still saturates. *ReLU* — fast, gradient = 1 in positive half, sparse activations, 'dying ReLU' risk. *Leaky ReLU* and *GELU* mitigate the death problem by keeping a non-zero gradient for negative inputs.
Backpropagation. Chain rule for gradients through a composition. For each parameter, the gradient is the product of local Jacobians along the path from output back to that parameter (summed over paths when there are several). Implementations: auto-diff frameworks (PyTorch / TF) build a computation graph at forward time, then traverse it backwards.
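To make the chain-rule traversal concrete, here is a hand-written backward pass for a tiny 2-2-1 network (tanh hidden layer, sigmoid output, cross-entropy loss), checked against a finite difference. The architecture, the parameter values, and the input are assumptions chosen only for illustration.

```python
import math

def forward(params, x):
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(W1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(2)]
    z = sum(W2[j] * h[j] for j in range(2)) + b2
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return h, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(params, x, y):
    """Gradients of the loss w.r.t. W1, b1, W2, b2 via the chain rule."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    dz = y_hat - y                                    # sigmoid + CE cancellation
    dW2 = [dz * h[j] for j in range(2)]
    db2 = dz
    dh = [dz * W2[j] for j in range(2)]               # back through the output layer
    da = [dh[j] * (1 - h[j] ** 2) for j in range(2)]  # tanh'(a) = 1 - tanh(a)^2
    dW1 = [[da[j] * x[i] for i in range(2)] for j in range(2)]
    db1 = da
    return dW1, db1, dW2, db2

# Hypothetical parameters and a single training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [1.0, -1.5], 0.05)
x, y = [1.0, 2.0], 1
dW1, db1, dW2, db2 = backward(params, x, y)

# Central finite difference on W1[0][1] as an independent check.
eps = 1e-6
W1, b1, W2, b2 = params
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss((W1_plus, b1, W2, b2), x, y)
           - loss((W1_minus, b1, W2, b2), x, y)) / (2 * eps)
```

Comparing `numeric` with `dW1[0][1]` is the usual gradient check; auto-diff frameworks perform the same backward traversal automatically.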
Why not zero-initialise weights? SYMMETRY. If $W = 0$, every neuron in a layer receives the same input and produces the same output, so backprop drives them with identical gradients → they remain identical forever. The whole layer collapses to one neuron's worth of expressivity. Fix: random init.
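Xavier (Glorot) init: $\mathrm{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})$ (often simplified to $1/n_{\text{in}}$) — for sigmoid/tanh. Kaiming (He) init: $\mathrm{Var}(W) = 2/n_{\text{in}}$ — for ReLU; the factor of 2 compensates for the half of activations zeroed out.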
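A minimal sketch of the two schemes, assuming zero-mean Gaussian sampling and the variances $2/(n_{\text{in}} + n_{\text{out}})$ (Glorot) and $2/n_{\text{in}}$ (He). The function names and the nested-list weight layout are hypothetical; frameworks ship their own initialisers.

```python
import random

def xavier_init(n_in, n_out, rng):
    # Glorot: Var(W) = 2 / (n_in + n_out), intended for sigmoid/tanh layers.
    std = (2.0 / (n_in + n_out)) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def he_init(n_in, n_out, rng):
    # Kaiming/He: Var(W) = 2 / n_in, intended for ReLU layers.
    std = (2.0 / n_in) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def empirical_variance(W):
    # Weights are zero-mean by construction, so the mean square estimates the variance.
    flat = [w for row in W for w in row]
    return sum(w * w for w in flat) / len(flat)

rng = random.Random(0)
var_xavier = empirical_variance(xavier_init(256, 256, rng))   # expected near 2/512
var_he = empirical_variance(he_init(256, 256, rng))           # expected near 2/256
```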
Speed-up tricks (memorise the six). (1) Batch Normalisation — normalise activations to zero-mean unit-variance per mini-batch, then learnable scale-shift. Faster training, higher LR, less init-sensitive, slight regularisation. (2) Higher LR (BN enables this). (3) Dropout — randomly zero a fraction $p$ of activations during training (typically $p \approx 0.5$ in FC layers, smaller in CNNs). Acts as ensemble + regulariser; at inference, scale by $(1 - p)$ or use inverted dropout. (4) Shuffle data between epochs. (5) Less L2 weight decay when BN is present. (6) LR decay / scheduling — start higher, decay over epochs.
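Inverted dropout, mentioned in item (3), can be sketched as follows. The function name and the list-based activations are assumptions for illustration; $p$ is the drop probability, and survivors are rescaled by $1/(1-p)$ during training so that inference needs no change.

```python
import random

def inverted_dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)          # inference: identity, no rescaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 100_000
dropped = inverted_dropout(acts, p=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)   # stays close to the original mean of 1.0
```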
Cross-entropy vs MSE for classification. CE is preferred. (a) CE matches the maximum-likelihood interpretation under Bernoulli/categorical. (b) The sigmoid+CE gradient simplifies to $(\hat{y} - y)\,x$ — no vanishing 'derivative of sigmoid' factor. With MSE+sigmoid, the gradient has an extra $\sigma'(z) = \hat{y}(1 - \hat{y})$ that vanishes at saturation → slow learning.
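The saturation effect is easy to see numerically. The sketch below compares the gradient with respect to the pre-activation $z$ for a confidently wrong prediction; the value $z = -8$ and the $\tfrac{1}{2}(\hat{y} - y)^2$ form of the MSE are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_ce(z, y):
    # d/dz of binary cross-entropy with a sigmoid output.
    return sigmoid(z) - y

def grad_mse(z, y):
    # d/dz of 0.5 * (sigmoid(z) - y)^2: note the extra sigma'(z) factor.
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

z, y = -8.0, 1                 # saturated and wrong: y_hat is about 0.0003
g_ce = grad_ce(z, y)           # close to -1: strong corrective signal
g_mse = grad_mse(z, y)         # thousands of times smaller in magnitude
```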
Ensembles. Bagging = bootstrap aggregating. Sample with replacement, train independent models, average predictions. Reduces VARIANCE. Random Forest is the canonical example. Boosting = sequential, each model focuses on the previous model's errors via reweighting. Reduces BIAS. AdaBoost is the textbook case; Viola-Jones face detection (2001) used AdaBoost over Haar features for real-time detection. Stacking = train a meta-model on the base models' predictions.
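Density estimation. *Histogram* — bin-dependent, not smooth, fails in 2D+. *KDE* — Gaussian kernel per data point, sum and normalise. *Gaussian Mixture Model (GMM)*: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$ with $\pi_k \ge 0$, $\sum_k \pi_k = 1$. Trained via EM.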
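EM for GMM. *E-step*: compute soft responsibilities $\gamma_{ik} = \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \,/\, \sum_j \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$ — posterior that point $x_i$ came from component $k$. *M-step*: update each component by weighted MLE, with $N_k = \sum_i \gamma_{ik}$: $\pi_k = N_k / N$, $\mu_k = \frac{1}{N_k}\sum_i \gamma_{ik}\, x_i$, $\Sigma_k = \frac{1}{N_k}\sum_i \gamma_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^\top$. Iterate. The log-likelihood never decreases, and the iteration converges to a local max.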
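The E-step and M-step above can be sketched for a one-dimensional, two-component mixture. The data, the initial parameters, and the variance floor are assumptions for the example (the floor is a practical guard against collapse, not part of textbook EM).

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, variances))) for x in xs)

def em_step(xs, pis, mus, variances):
    K, N = len(pis), len(xs)
    # E-step: responsibilities gamma[i][k] = posterior of component k for x_i.
    gammas = []
    for x in xs:
        weighted = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
        total = sum(weighted)
        gammas.append([w / total for w in weighted])
    # M-step: weighted maximum-likelihood updates.
    new_pis, new_mus, new_vars = [], [], []
    for k in range(K):
        Nk = sum(g[k] for g in gammas)
        mu = sum(g[k] * x for g, x in zip(gammas, xs)) / Nk
        var = sum(g[k] * (x - mu) ** 2 for g, x in zip(gammas, xs)) / Nk
        new_pis.append(Nk / N)
        new_mus.append(mu)
        new_vars.append(max(var, 1e-6))   # assumed variance floor
    return new_pis, new_mus, new_vars

# Hypothetical 1D data with two well-separated groups.
xs = [-2.1, -1.9, -2.4, -1.6, 2.0, 2.3, 1.7, 2.2]
pis, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(xs, pis, mus, variances)]
for _ in range(20):
    pis, mus, variances = em_step(xs, pis, mus, variances)
    history.append(log_likelihood(xs, pis, mus, variances))
```

Tracking `history` is a cheap sanity check: a decrease in log-likelihood between iterations usually signals an implementation bug.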
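RNNs. Recurrence $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$; same $W_{hh}, W_{xh}$ shared across time. BPTT unrolls through time and backpropagates. Vanishing / exploding gradients: $\partial h_T / \partial h_1 = \prod_{t=2}^{T} \partial h_t / \partial h_{t-1}$, a product of $T - 1$ Jacobians each involving $W_{hh}$; if the largest singular value of $W_{hh}$ is $< 1$ the product vanishes, if it is $> 1$ the product can explode. Fix vanishing with LSTM/GRU (additive cell-state path → constant error carousel). Fix exploding with gradient clipping.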
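Both effects, and the clipping fix, can be illustrated with scalars. The sketch assumes a linear scalar RNN $h_t = w\,h_{t-1}$, for which the backpropagated factor over $T$ steps is exactly $w^T$; the function names are hypothetical.

```python
import math

def backprop_factor(w, steps):
    """For the scalar linear RNN h_t = w * h_{t-1}, d h_T / d h_0 = w ** steps."""
    return w ** steps

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

vanishing = backprop_factor(0.9, 100)    # about 2.7e-5
exploding = backprop_factor(1.1, 100)    # about 1.4e4
clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Clipping preserves the gradient direction and only caps its length, which is why it addresses exploding but not vanishing gradients.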
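LSTM. Three gates (input, forget, output) + cell state $c_t$. $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$, $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$, $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$, $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, $h_t = o_t \odot \tanh(c_t)$. The additive path $c_{t-1} \to c_t$ is what gives the constant error carousel.
GRU. Simpler: update gate $z_t$ (merges forget + input) and reset gate $r_t$. No separate cell state. Often as good as LSTM with fewer parameters.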
RNN I/O shapes. 1-to-1 (image classification), 1-to-many (image captioning), many-to-1 (sentiment / video classification), many-to-many synced (per-frame labels), many-to-many shifted (translation / video captioning).
Metrics. Confusion matrix → TP, FP, TN, FN. Precision = $TP/(TP + FP)$ — of what I retrieved, how much was right. Recall (TPR) = $TP/(TP + FN)$ — of what was right, how much did I retrieve. F1 = $2PR/(P + R)$ — harmonic mean. Specificity (TNR) = $TN/(TN + FP)$. FPR = $FP/(FP + TN) = 1 - \text{TNR}$.
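These definitions translate directly into code. The sketch below uses the counts from the confusion-matrix example in the Examples section (TP=80, FP=20, FN=10, TN=890); the function name is hypothetical, and zero denominators are deliberately not handled.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (no zero-division guards)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

m = binary_metrics(tp=80, fp=20, fn=10, tn=890)
```

Here accuracy is 0.97 while F1 is about 0.84, which shows how a large TN count can flatter accuracy.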
When precision vs recall? Monitor PRECISION when FALSE POSITIVES are costly (spam filter: flagging real mail). Monitor RECALL when FALSE NEGATIVES are costly (cancer screening: missing a real tumour).
AP / mAP. AP = area under the PR curve for one class. mAP = mean over classes (or over queries in retrieval). Compute by sorting by score, walking down the list, marking TP/FP at the IoU threshold, building cumulative P and R, and integrating.
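The walk down the ranked list can be sketched as below, using the uninterpolated definition (mean of the precision values at each true-positive rank, divided by the number of ground truths). Detection benchmarks often use interpolated variants, so treat this as one convention among several; the function name is hypothetical and the list is the one from the Examples section.

```python
def average_precision(is_tp, num_gt):
    """Uninterpolated AP for a score-sorted list of TP/FP flags."""
    tp_count = 0
    precisions_at_hits = []
    for rank, hit in enumerate(is_tp, start=1):
        if hit:
            tp_count += 1
            precisions_at_hits.append(tp_count / rank)   # precision at this recall step
    return sum(precisions_at_hits) / num_gt if num_gt else 0.0

# Ranked detections [TP, TP, FP, FP, TP] with 3 ground-truth objects.
ap = average_precision([True, True, False, False, True], num_gt=3)

def mean_average_precision(per_class_ap):
    # mAP is the plain mean of the per-class AP values.
    return sum(per_class_ap) / len(per_class_ap)
```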
ROC vs PR. ROC plots TPR vs FPR. PR plots Precision vs Recall. For BALANCED data, ROC is informative. For IMBALANCED data with rare positives, ROC looks deceptively good (the many TNs in the FPR denominator keep FPR small even when false positives outnumber true positives); PR is preferred because the precision axis directly tracks rare-positive performance.
kNN. Lazy learner: no training, just store. Predict by majority vote among the $k$ nearest training points. Use odd $k$ to avoid ties in binary problems. **Effect of $k$**: small $k$ → high variance, low bias (overfits, noisy). Large $k$ → low variance, high bias (oversmooths). Distance: Euclidean (L2) — default; Manhattan (L1) — robust; Cosine — directional similarity; Mahalanobis — covariance-aware; Hamming — discrete strings.
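A minimal kNN classifier with Euclidean distance fits in a few lines. The training points and labels are hypothetical, and the brute-force sort is only suitable for small datasets.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to query (Euclidean)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 2D training set with two classes.
train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b")]
label_near_origin = knn_predict(train, (0.5, 0.5), k=3)
label_far = knn_predict(train, (5.5, 5.5), k=3)
```

"Training" is just storing `train`; all the cost is paid at query time, which is what makes kNN a lazy learner.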
Linear regression. $\hat{y} = w^\top x$ (bias absorbed via a constant feature). Least squares minimises SSE. Closed-form via normal equations: $w = (X^\top X)^{-1} X^\top y$. Fails when $X^\top X$ is singular (collinear features) or when $d$ is too large to invert — fall back to gradient descent. Linear in COEFFICIENTS, not variables — that's why polynomial regression is still 'linear' (linear in $w$).
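For a single feature plus an intercept, the normal equations reduce to a 2×2 linear system that can be solved by hand. The sketch below does exactly that; the function name and the data are assumptions, and the determinant check corresponds to the singular $X^\top X$ failure case.

```python
def fit_line_normal_equations(xs, ys):
    """Solve (X^T X) w = X^T y for y = b + w * x, with X = [1, x] per row."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy].
    det = n * sxx - sx * sx
    if abs(det) < 1e-12:
        raise ValueError("X^T X is singular (all x values identical)")
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Hypothetical noiseless data generated from y = 1 + 2x.
intercept, slope = fit_line_normal_equations([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```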
Regularisation. L1 (Lasso) — drives weights to exactly zero, performs feature selection. L2 (Ridge) — shrinks weights smoothly, no exact zeros. Elastic Net combines both.
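PCA goal. Find orthogonal directions of maximum variance. Steps: (1) centre data $\tilde{x}_i = x_i - \bar{x}$. (2) Covariance $C = \frac{1}{n-1} \tilde{X}^\top \tilde{X}$. (3) Eigen-decompose $C = V \Lambda V^\top$. (4) Keep top-$k$ eigenvectors $V_k$ → projection $Z = \tilde{X} V_k$. $\lambda_i$ = variance along $v_i$. Reconstruct via $\hat{X} = Z V_k^\top + \bar{x}$.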
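In two dimensions the eigendecomposition has a closed form, so the four steps can be sketched without a linear-algebra library. The data points are hypothetical and the covariance uses the $1/(n-1)$ convention; for real work a library SVD routine is the usual choice.

```python
import math

def pca_2d(points):
    """Centre, build the 2x2 covariance, and return eigenvalues, top axis, 1D scores."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]
    sxx = sum(x * x for x, _ in centred) / (n - 1)
    syy = sum(y * y for _, y in centred) / (n - 1)
    sxy = sum(x * y for x, y in centred) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector for lam1 (axis-aligned fallback when the features are uncorrelated).
    if abs(sxy) > 1e-12:
        v = (lam1 - syy, sxy)
    elif sxx >= syy:
        v = (1.0, 0.0)
    else:
        v = (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    v1 = (v[0] / norm, v[1] / norm)
    scores = [x * v1[0] + y * v1[1] for x, y in centred]   # 2D -> 1D projection
    return (lam1, lam2), v1, scores

points = [(2.0, 1.0), (3.0, 3.0), (4.0, 3.0), (5.0, 5.0), (6.0, 5.0)]
(lam1, lam2), v1, scores = pca_2d(points)
score_variance = sum(s * s for s in scores) / (len(scores) - 1)
```

The variance of the projected scores equals the top eigenvalue, which is the statement "$\lambda_i$ = variance along $v_i$" in numeric form.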
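SVD. $X = U \Sigma V^\top$ (with $X$ centred). The right singular vectors $V$ = eigenvectors of $X^\top X$ = principal components. The singular values satisfy $\sigma_i = \sqrt{(n - 1)\,\lambda_i}$, where $\lambda_i$ are the eigenvalues of the covariance $\frac{1}{n-1} X^\top X$. Numerically stable (no need to form $X^\top X$, which squares the condition number). Low-rank approximation: keep top-$k$ singular values → best rank-$k$ approximation in Frobenius norm (Eckart–Young).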
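LoRA = low-rank update. $W = W_0 + BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$. Train only $A$ and $B$ ($W_0$ stays frozen) — saves memory / compute when fine-tuning huge models. Spiritually the same trick as SVD/PCA.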
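The saving is simple arithmetic: a full $d \times k$ update has $dk$ parameters, while the factors $B$ and $A$ have $r(d + k)$. The sketch below counts both and applies the update without ever forming $BA$. The layer size 4096×4096, the rank 8, and the helper names are assumptions for illustration, not values from any particular model.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters: full update (d*k) vs low-rank factors B (d*r) and A (r*k)."""
    return d * k, r * (d + k)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W0, B, A, x):
    """Compute (W0 + B A) x as W0 x + B (A x), never materialising B A."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

full, low_rank = lora_param_counts(d=4096, k=4096, r=8)   # 16_777_216 vs 65_536

# Tiny hypothetical example with d = k = 2 and r = 1.
W0 = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
out = lora_forward(W0, B, A, [1.0, 1.0])
```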
Eigenfaces. Apply PCA to face images. Top eigenvectors look like ghostly average faces. Represent a new face as a linear combination of top eigenfaces; classify by nearest-neighbour in this low-dim space.
k-means. (a) Init $k$ centres. (b) Assign each point to nearest centre. (c) Recompute centres as mean of assigned points. (d) Repeat until convergence. Issues (memorise four): HARD assignments, favours SPHERICAL equal-size clusters, sensitive to INIT (use k-means++: spread initial centres by sampling proportionally to squared distance from existing centres), sensitive to OUTLIERS (use k-medoids instead).
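The assign/update loop can be sketched in one dimension. The data and the deliberately poor initial centres are hypothetical; empty clusters simply keep their previous centre, which is one of several possible conventions.

```python
def kmeans_1d(xs, centres, max_iter=100):
    """Lloyd's algorithm in 1D: assign to nearest centre, recompute means, repeat."""
    centres = list(centres)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centres]
        for x in xs:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:      # assignments are stable: converged
            break
        centres = new_centres
    return centres, clusters

# Hypothetical data with two obvious groups and a poor initialisation.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centres, clusters = kmeans_1d(xs, centres=[1.0, 2.0])
```

With this initialisation the centres move from (1, 2) to (1, 7.6) and then to (2, 11), after which the assignments no longer change.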
GMM clustering = soft k-means with shapes. Responsibilities provide fractional membership; covariances allow elongated/rotated clusters.
Hierarchical clustering. Agglomerative (bottom-up): start with each point as a cluster, repeatedly merge closest pair. Divisive (top-down): start with one cluster, repeatedly split. Output: a dendrogram — cut at different heights to get different numbers of clusters. Linkage criteria: single (closest pair), complete (farthest pair), average, Ward (minimum variance).
Definitions
- Sigmoid / softmax — Squash to $(0, 1)$ / probability simplex. Sigmoid for binary, softmax for $K$ classes.
- Cross-entropy loss — Information-theoretic measure of mismatch between distributions (asymmetric, so not a true distance). Categorical: $L = -\sum_k y_k \log p_k$. Gradient on logits = $p - y$ — clean and stable.
- Backpropagation — Chain-rule traversal of a computation graph from loss back to parameters. Each node multiplies the local Jacobian.
- Vanishing / exploding gradients — Gradient norm decays to 0 (or blows up) as it propagates back through many layers / time steps. Mitigations: careful init, ReLU/skip connections (feed-forward), LSTM/clip (recurrent).
- Batch Normalisation — Normalise activations per mini-batch to zero-mean unit-variance; then learnable scale and shift. Faster training, less init-sensitive, slight regularisation. At inference: use running mean/var, NOT batch stats.
- Dropout — Stochastically zero a fraction $p$ of activations during training. Acts as an ensemble + regulariser. Disable (or rescale) at inference.
- Bagging vs Boosting — Bagging: parallel, bootstrap samples, reduces variance (Random Forest). Boosting: sequential, reweight errors, reduces bias (AdaBoost; Viola-Jones face detection).
- Gaussian Mixture Model — $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Trained by EM. Soft clustering with covariance-shaped components.
- EM algorithm — E-step: compute responsibilities given current parameters. M-step: re-estimate parameters by weighted MLE. Monotonically improves log-likelihood; converges to a local maximum.
- LSTM — Gated recurrent cell (not to be confused with the GRU) with three gates (forget, input, output) and an additive cell-state path that mitigates vanishing gradients (constant error carousel).
- Precision / Recall / F1 — P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R) (harmonic mean). Use precision when FP costly; recall when FN costly.
- AP / mAP — Average Precision = area under the PR curve for one class. mAP = mean over classes (or queries).
- ROC vs PR curve — ROC: TPR vs FPR. PR: Precision vs Recall. Prefer PR for imbalanced data with rare positives.
- kNN — Lazy learner; predict by majority vote among k nearest training points. Small k → variance; large k → bias. k usually odd.
- Normal equations — Closed-form least squares: $w = (X^\top X)^{-1} X^\top y$. Fails when $X^\top X$ is singular (collinearity) or too large to invert.
- L1 vs L2 regularisation — L1 = sum of |w|; drives weights to exactly 0 → feature selection. L2 = sum of w²; shrinks smoothly, no exact zeros.
- PCA — Find orthogonal directions of max variance. Steps: centre → covariance → eigendecompose → keep top-k. $\lambda_i$ = variance along $v_i$.
- SVD — $X = U \Sigma V^\top$. Right singular vectors are PCA components. Numerically stable. Best rank-$k$ approximation in Frobenius norm (Eckart–Young).
- LoRA — Low-Rank Adaptation: $W = W_0 + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$. Cheap fine-tuning of huge models. Same spirit as SVD.
- k-means — Hard-assignment clustering. Init → assign → update → repeat. Issues: spherical bias, init-sensitive, outlier-sensitive. Fixes: k-means++, GMM, k-medoids.
- Hierarchical clustering — Agglomerative (bottom-up) or divisive (top-down). Produces a dendrogram; cut at different heights for different cluster counts. Linkage: single / complete / average / Ward.
Formulas
Derivations
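**Logistic regression SGD gradient = $(\hat{y} - y)\,x$.** Loss $L = -[\,y \log \hat{y} + (1 - y)\log(1 - \hat{y})\,]$ with $\hat{y} = \sigma(z)$, $z = w^\top x + b$. Use chain rule: $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$. $\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$. $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$. Cancellation gives $\frac{\partial L}{\partial z} = \hat{y} - y$, hence $\frac{\partial L}{\partial w} = (\hat{y} - y)\,x$. The cancellation is precisely why CE+sigmoid avoids vanishing-gradient saturation issues.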
Why not zero-init. Suppose $W^{(1)} = 0$. Then every neuron in layer 1 outputs the same activation ($a_j = \phi(b_j)$, the same constant when the biases are equal). Layer 2's gradients depend only on the layer-1 output, which is the same for all → the weights attached to each layer-1 neuron update identically → all neurons remain identical. By induction the network is equivalent to a single-neuron-per-layer network forever.
Softmax cross-entropy gradient. With softmax + categorical CE, the gradient on logits is again $p - y$ — no derivative-of-softmax factor survives. Each component of $p - y$ lies in $[-1, 1]$, so the updates stay bounded regardless of the magnitude of the logits.
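The $p - y$ result is easy to verify numerically. The sketch below compares the analytic gradient with a central finite difference; the logits and the target index are arbitrary values chosen for the check.

```python
import math

def softmax(z):
    m = max(z)                               # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(z, target):
    return -math.log(softmax(z)[target])

def ce_grad(z, target):
    # Analytic gradient on the logits: p - y with y one-hot at `target`.
    p = softmax(z)
    return [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]

def numeric_grad(z, target, eps=1e-6):
    grads = []
    for k in range(len(z)):
        z_plus = list(z)
        z_plus[k] += eps
        z_minus = list(z)
        z_minus[k] -= eps
        grads.append((ce_loss(z_plus, target) - ce_loss(z_minus, target)) / (2 * eps))
    return grads

logits, target = [2.0, -1.0, 0.5], 0
analytic = ce_grad(logits, target)
numeric = numeric_grad(logits, target)
```

The components of `analytic` sum to zero, since both $p$ and the one-hot $y$ sum to one.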
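Normal equations. Minimise $J(w) = \|Xw - y\|^2$. Expand: $J(w) = w^\top X^\top X w - 2\, w^\top X^\top y + y^\top y$. Set $\nabla_w J = 2 X^\top X w - 2 X^\top y = 0 \Rightarrow w = (X^\top X)^{-1} X^\top y$.
PCA via eigendecomposition. Project onto unit vector $v$: variance of projection = $v^\top C v$. Maximise $v^\top C v$ subject to $\|v\| = 1$ via Lagrangian $\mathcal{L} = v^\top C v - \lambda\,(v^\top v - 1)$: $\nabla_v \mathcal{L} = 0 \Rightarrow C v = \lambda v$ — $v$ is an eigenvector, $\lambda = v^\top C v$ is the variance (the corresponding eigenvalue). Top-$k$ eigenvectors maximise sequentially.
SVD low-rank optimum (Eckart–Young). For $X = U \Sigma V^\top$ with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$, the rank-$k$ matrix $B$ that minimises $\|X - B\|_F$ is $X_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, with error $\|X - X_k\|_F = \sqrt{\sum_{i > k} \sigma_i^2}$. The same $X_k$ is optimal in the operator norm, where the error is $\sigma_{k+1}$.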
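EM monotonically improves likelihood. E-step computes the variational lower bound $\mathcal{L}(q, \theta) = \mathbb{E}_q[\log p(X, Z \mid \theta)] + H(q)$ and makes it tight at $\theta^{(t)}$ by setting $q(Z) = p(Z \mid X, \theta^{(t)})$. M-step maximises $\mathcal{L}(q, \theta)$ over $\theta$ to get $\theta^{(t+1)}$. By Jensen's inequality, $\log p(X \mid \theta) \ge \mathcal{L}(q, \theta)$ for any $q$, so $\log p(X \mid \theta^{(t+1)}) \ge \mathcal{L}(q, \theta^{(t+1)}) \ge \mathcal{L}(q, \theta^{(t)}) = \log p(X \mid \theta^{(t)})$. Hence likelihood never decreases.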
Examples
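- Logistic regression decision boundary. Two clusters in 2D. After training, weights $w = (w_1, w_2)$ and bias $b$. Boundary: $w_1 x_1 + w_2 x_2 + b = 0$ (a line). Points on the side where $w^\top x + b > 0$ are predicted class 1, the other side class 0.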
- Backprop on a 2-layer MLP. Compute output on a 2D input by hand; then propagate gradient back through . Practise this with , — the canonical exam numerical.
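- Confusion matrix. TP=80, FP=20, FN=10, TN=890. $P = 80/100 = 0.80$. $R = 80/90 \approx 0.889$. $F_1 = 2 \cdot 0.80 \cdot 0.889 / (0.80 + 0.889) \approx 0.842$.
- AP from a ranked list. 3 GTs, 5 ranked detections [TP, TP, FP, FP, TP]. After each step $(P, R)$ = $(1, 1/3)$, $(1, 2/3)$, $(2/3, 2/3)$, $(1/2, 2/3)$, $(3/5, 1)$. AP $= (1 + 1 + 0.6)/3 \approx 0.87$ (area under staircase).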
- PCA on toy 2D data. Centre, compute covariance, eigendecompose. Top eigenvector points along the data's main axis. Projection reduces 2D → 1D with the least variance loss.
- k-means iteration trace. Two clusters in 1D: data . Init centres . Assignment → , . Recompute → . Converged.
- GMM E-step. Two Gaussians with , , equal priors. For : density under very small; under very small. Use ratio for — both responsibilities ≈ 0.5 (point sits between modes).
Diagrams
- MLP block diagram: input → linear → activation → linear → softmax. Annotate forward and backward arrows.
- Sigmoid vs ReLU vs tanh activation curves with gradient curves underneath.
- LSTM cell schematic: three gates, the additive cell-state highway, output gate.
- Confusion matrix template (2×2 grid) with TP/FP/FN/TN labels.
- PR vs ROC curves on the same imbalanced-data toy problem to illustrate why PR is preferred.
- PCA: 2D cloud with two orthogonal arrows (principal axes) overlaid; rotated view = the projection.
- SVD diagram: $X = U \Sigma V^\top$ with $U$, diagonal $\Sigma$, $V^\top$ as labelled rectangles.
- k-means iteration: 3 panels (init, assign, update) on the same 2D point cloud.
- Dendrogram for hierarchical clustering on 6 points; cut at two different heights → 2 vs 3 clusters.
Edge cases
- MSE on logistic regression — non-convex with sigmoid; gradients vanish at saturation. Use cross-entropy.
- Zero-init kills the network via symmetry — all neurons identical forever.
- Dropout at inference must be turned OFF (or scaled by $(1 - p)$ if using vanilla); inverted dropout scales during training so inference is unchanged.
- BatchNorm at inference uses running mean/var from training, NOT mini-batch statistics. Mismatch is a classic deployment bug.
- Curse of dimensionality breaks kNN and KDE in high $d$ — distances concentrate, so the nearest and farthest neighbours end up at almost the same distance.
- LSTM exploding gradients are NOT solved by the gating — clip gradients separately.
- Imbalanced classification + accuracy is meaningless (99% accuracy = always say 'no cancer'). Use Precision/Recall/F1 or AUC-PR.
- Multinomial logistic / softmax is not the same as a stack of binary logistic regressions — softmax models mutual exclusivity, binaries don't.
- PCA assumes linear directions — fails on data lying on curved manifolds (use kernel PCA or t-SNE/UMAP).
- k-means with bad init can converge to a poor local minimum. Use k-means++ initialisation or run multiple restarts.
- Hierarchical clustering with single-linkage can produce 'chains' (long thin clusters); complete or average linkage are more compact.
Common mistakes
- Train/Val/Test confusion. Never tune on test. Val for hyperparams; test ONCE.
- MSE for classification. Use cross-entropy — gradient signal is granular and doesn't vanish.
- Accept H₀ vs fail to reject. (Not in this course but adjacent.) Hypothesis testing never 'accepts' the null.
- Confusing bagging and boosting. Bagging: parallel, independent → variance reduction (RF). Boosting: sequential, reweighting → bias reduction (AdaBoost, Viola-Jones).
- PCA without centring — the first PC will end up pointing at the mean, not the direction of variance. Always centre first.
- SVD vs PCA confusion. PCA = eigendecomposition of covariance. SVD = direct decomposition of . Same answer, SVD is more numerically stable.
- Reporting accuracy on imbalanced data. Useless. Use Precision/Recall/F1 or AP/AUC-PR.
- Wrong mean in F1. $F_1 = 2PR/(P + R)$, NOT $(P + R)/2$ — that's the arithmetic mean. Harmonic mean punishes one bad value.
- k-means with k=1. Trivial. Always pick $k \ge 2$ and use elbow method / silhouette score to choose.
- Polynomial regression is 'non-linear'. Wrong — linear in coefficients (you still solve normal equations). Truly non-linear models need iterative optimisation.
Shortcuts
- Logistic SGD: $w \leftarrow w - \eta\,(\hat{y} - y)\,x$.
- Why not MSE? Non-convex + vanishing gradient with sigmoid.
- Kaiming init for ReLU: factor of 2 to compensate for half-zeroing.
- Speed-up six: BN, higher LR, dropout, shuffle, ↓L2, LR decay.
- LSTM additive cell state = constant error carousel — solves vanishing gradient.
- Bagging reduces variance, Boosting reduces bias. RF = bagging. AdaBoost (Viola-Jones) = boosting.
- Precision when FP costly; Recall when FN costly.
- PR > ROC on imbalanced data.
- **$k$-means issues**: hard assignment, spherical bias, init-sensitive, outlier-sensitive. Fix: $k$-means++, GMM, $k$-medoids.
- PCA = eigenvectors of cov; SVD numerically stable; LoRA = low-rank update.
Proofs / Algorithms
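**Cross-entropy gradient = $(\hat{y} - y)\,x$.** Shown above by chain rule; the cancellation between $\partial L / \partial \hat{y}$ and $\partial \hat{y} / \partial z$ produces a clean, scale-stable gradient — the reason CE is the canonical classification loss.
PCA solution is top eigenvectors. Variance along $v$ = $v^\top C v$; subject to $\|v\| = 1$, Lagrangian gives $C v = \lambda v$ → eigen-equation. Successive directions found by deflating and repeating.
Eckart–Young. For $X = U \Sigma V^\top$, the best rank-$k$ approximation in Frobenius (or operator) norm is $X_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, with error $\sqrt{\sum_{i > k} \sigma_i^2}$ (Frobenius) or $\sigma_{k+1}$ (operator).
EM monotonically improves likelihood. E-step computes tight lower bound; M-step maximises bound; therefore likelihood at $\theta^{(t+1)}$ ≥ likelihood at $\theta^{(t)}$ by Jensen's inequality.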