A Cheatsheet · Volume I № 2026.04

Machine Learning & Deep Learning,
in Brief.

A compact reference for the fundamentals required of machine-learning and quantitative researchers — mathematical foundations, classical methods, neural architectures, and a set of worked exercises with full solutions.

For: ML & Quant Researchers · Sections: VII · Exercises: 20 (with solutions) · Notation: boldface vectors, capital matrices
I
Part One

Mathematical Foundations

Every method below reduces, ultimately, to four pillars: linear algebra (how we represent data and transformations), probability (how we quantify uncertainty), optimization (how we learn from data), and information theory (how we measure signal). A researcher who has internalised these pillars can read any new architecture or paper and reconstruct its logic from first principles.

Linear Algebra

Inner product
$\langle x, y \rangle = x^\top y = \sum_i x_i y_i$. Orthogonality: $\langle x, y \rangle = 0$.
Norms
$\|x\|_1 = \sum_i |x_i|$, $\|x\|_2 = \big(\sum_i x_i^2\big)^{1/2}$, $\|x\|_\infty = \max_i |x_i|$. All $\|\cdot\|_p$ for $p \ge 1$ are norms.
Rank
$\operatorname{rank}(A)$ is the dimension of the column space; $A \in \mathbb{R}^{m \times n}$ has $\operatorname{rank}(A) \le \min(m, n)$.
Positive definite
$x^\top A x > 0$ for all $x \ne 0$; equivalently all eigenvalues are positive.

Eigendecomposition

For symmetric $A \in \mathbb{R}^{n \times n}$ there exist an orthogonal $Q$ and a diagonal $\Lambda$ such that $A = Q \Lambda Q^\top$, with eigenvectors as the columns of $Q$ and eigenvalues on the diagonal of $\Lambda$. Symmetric matrices have real eigenvalues and orthogonal eigenvectors.

Singular Value Decomposition (SVD)

For any $A \in \mathbb{R}^{m \times n}$, $A = U \Sigma V^\top$ with $U$, $V$ orthogonal and $\Sigma$ diagonal with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. Truncating to the top $k$ singular values gives the best rank-$k$ approximation in Frobenius and spectral norms (Eckart–Young): $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, with $\|A - A_k\|_F^2 = \sum_{i > k} \sigma_i^2$.

Why it matters. SVD is the computational core of PCA, matrix completion, low-rank regression, least-squares via the pseudo-inverse $A^+ = V \Sigma^+ U^\top$, and numerically stable solutions to ill-conditioned problems.
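As a numerical check of the Eckart–Young statement, a minimal numpy sketch (the matrix and seed are illustrative):

```python
import numpy as np

# Truncated SVD as the best rank-k approximation: the squared Frobenius
# error equals the sum of the discarded squared singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation

err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(s[k:] ** 2))        # discarded singular values
```

The spectral-norm version of the statement can be checked the same way with `ord=2`.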

Matrix calculus — identities to memorise

Expression | Gradient
$a^\top x$ | $a$
$x^\top A x$ | $(A + A^\top)x$; $2Ax$ if $A$ symmetric
$\|Xw - y\|_2^2$ | $2X^\top(Xw - y)$ (w.r.t. $w$)
$\log \det A$ | $A^{-\top}$ (w.r.t. $A$)

Probability & Statistics

Bayes' rule is the only rational procedure for updating beliefs given evidence; all probabilistic ML is a specialization of it.

Bayes' rule. $p(\theta \mid x) = \dfrac{p(x \mid \theta)\, p(\theta)}{p(x)}$. The denominator is the marginal likelihood, $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$.

Maximum likelihood (MLE). $\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \sum_i \log p(x_i \mid \theta)$. Asymptotically normal with variance given by the inverse Fisher information $I(\theta)^{-1}$.

Maximum a posteriori (MAP). $\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \big[\sum_i \log p(x_i \mid \theta) + \log p(\theta)\big]$. Gaussian prior ≡ $\ell_2$ regularization; Laplace prior ≡ $\ell_1$.

Useful distributions & conjugacies

Likelihood | Conjugate prior | Posterior form
Bernoulli / Binomial | Beta | Beta
Gaussian (known $\sigma^2$) | Gaussian | Gaussian with precision-weighted mean
Gaussian (unknown $\sigma^2$) | Normal–Inverse-Gamma | Normal–Inverse-Gamma
Multinomial | Dirichlet | Dirichlet with added counts
Poisson | Gamma | Gamma with added counts / rate
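The first row of the table can be verified by hand; a sketch with illustrative pseudo-counts:

```python
# Beta-Bernoulli conjugacy: prior Beta(a, b), observe k successes in n
# Bernoulli trials, posterior is Beta(a + k, b + n - k).
a, b = 2.0, 2.0        # illustrative prior pseudo-counts
n, k = 10, 7           # illustrative data: 7 successes in 10 trials

a_post, b_post = a + k, b + (n - k)
post_mean = a_post / (a_post + b_post)   # posterior mean of the success rate
```

The prior acts as extra pseudo-observations: the posterior mean shrinks the raw frequency $7/10$ towards the prior mean $1/2$.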

Expectation, variance, covariance. $\operatorname{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$. For independent $X, Y$: $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$. The covariance matrix $\mathbb{E}\big[(X - \mu)(X - \mu)^\top\big]$ is symmetric PSD.

Law of Large Numbers & CLT. For i.i.d. $X_i$ with mean $\mu$, variance $\sigma^2$: $\bar X_n \to \mu$ a.s. and $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.

Delta method. If $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$, then $\sqrt{n}\big(g(\hat\theta_n) - g(\theta)\big) \xrightarrow{d} \mathcal{N}\big(0,\, g'(\theta)^2 \sigma^2\big)$.

Optimization

Convexity. $f$ is convex iff $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $\lambda \in [0, 1]$. A twice-differentiable $f$ is convex iff $\nabla^2 f \succeq 0$. For convex problems, every local min is global.

Gradient descent. $x_{t+1} = x_t - \eta \nabla f(x_t)$. For $L$-smooth convex $f$ (i.e. $\nabla f$ is $L$-Lipschitz), choosing $\eta = 1/L$ guarantees $f(x_t) - f^\star = O(1/t)$. For $\mu$-strongly convex $f$, linear convergence $O\big((1 - \mu/L)^t\big)$.

Stochastic gradient descent. Replaces $\nabla f$ with an unbiased mini-batch estimator $\hat g_t$, $\mathbb{E}[\hat g_t] = \nabla f(x_t)$. Requires decreasing step sizes $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$ (Robbins–Monro) for convergence of the iterates under convexity and bounded variance.
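A minimal sketch of gradient descent with the $1/L$ step on a convex quadratic ($A$ and $b$ are illustrative):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T A x - b^T x, a convex quadratic.
# The smoothness constant L is the largest eigenvalue of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(A).max()

x = np.zeros(2)
for _ in range(500):
    grad = A @ x - b          # gradient of the quadratic
    x = x - grad / L          # fixed step 1/L

x_star = np.linalg.solve(A, b)  # exact minimiser A x = b
```

Because this quadratic is also strongly convex, the iterates converge linearly and `x` matches `x_star` to high precision.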

Constrained optimization — KKT. For $\min_x f(x)$ s.t. $g_i(x) \le 0$, $h_j(x) = 0$, the Lagrangian is $\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$. At an optimum (under constraint qualifications): stationarity $\nabla_x \mathcal{L} = 0$, primal/dual feasibility ($g_i \le 0$, $h_j = 0$, $\lambda_i \ge 0$), and complementary slackness $\lambda_i\, g_i(x) = 0$.

Researcher's rule. If you can recognise a problem as convex, choose the method that exploits its structure (closed-form, second-order, dual). Non-convex methods should be your last resort, not the default.

Information Theory

Entropy
$H(p) = -\sum_x p(x) \log p(x)$. Concave; maximised by the uniform distribution.
Cross-entropy
$H(p, q) = -\sum_x p(x) \log q(x)$.
KL divergence
$\mathrm{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0$, non-symmetric.
Mutual information
$I(X; Y) = \mathrm{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big) = H(X) - H(X \mid Y)$.

Why cross-entropy is the classification loss. Minimising $H(p_{\text{data}}, q_\theta)$ is equivalent to minimising $\mathrm{KL}(p_{\text{data}} \,\|\, q_\theta)$, since $H(p_{\text{data}})$ is constant in $\theta$. This is maximum likelihood.

Jensen's inequality. For convex $f$: $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. Underlies the ELBO, importance sampling bounds, and most variational arguments.

II
Part Two

Core Machine-Learning Principles

We describe the invariants that apply to every model — the decomposition of error, the shape of regularization, and the vocabulary of evaluation.

The Bias–Variance Decomposition

For a point $x$ with $y = f(x) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$, $\operatorname{Var}(\varepsilon) = \sigma^2$, and a learning algorithm producing $\hat f$ on random training sets:

$\mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$

High bias (underfitting)

  • Training and validation error both high
  • Model class too restrictive or under-trained
  • Fix: richer class, more features, less regularization

High variance (overfitting)

  • Low training error, high validation error
  • Model memorises noise
  • Fix: more data, regularization, simpler model, ensembling
Modern caveat. For very over-parameterized models (large NNs), the classical U-curve can give way to double descent: test error rises at the interpolation threshold, then falls again as capacity grows further. The decomposition still applies — implicit regularization of the optimizer controls variance.

Regularization

Ridge ($\ell_2$). Minimise $\|y - Xw\|_2^2 + \lambda \|w\|_2^2$. Closed form: $\hat w = (X^\top X + \lambda I)^{-1} X^\top y$. Always well-conditioned; shrinks coefficients uniformly.

Lasso ($\ell_1$). Minimise $\|y - Xw\|_2^2 + \lambda \|w\|_1$. No closed form; solved via coordinate descent or proximal gradient. Induces sparsity — exact zeros — because the $\ell_1$ ball has corners along the coordinate axes.

Elastic net. Minimise $\|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$. Retains the sparsity of Lasso while handling correlated features (groups them rather than arbitrarily picking one).

Equivalences. $\ell_2$ penalty ≡ Gaussian prior on weights; $\ell_1$ ≡ Laplace prior; data augmentation ≡ marginalising over an implicit prior.
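The ridge closed form is easy to sanity-check against its first-order condition (a minimal numpy sketch on illustrative synthetic data):

```python
import numpy as np

# Ridge closed form: w = (X^T X + lambda I)^{-1} X^T y.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
lam = 0.5

w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# First-order condition of ||y - Xw||^2 + lam ||w||^2 (up to a factor 2):
grad = X.T @ (X @ w - y) + lam * w
```

The gradient at the closed-form solution is zero to machine precision; with $\lambda > 0$ the system matrix is strictly positive definite, so `solve` never hits a singular matrix.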

Evaluation & Model Selection

Cross-validation

  • k-Fold — default for i.i.d. tabular data; pick $k = 5$ or $10$.
  • Stratified k-Fold — preserves class balance in classification.
  • Leave-one-out (LOO) — low bias, high variance; expensive.
  • Nested CV — outer loop estimates generalisation, inner loop tunes hyperparameters; avoids optimistic bias from reusing the same data for both.
  • Time-series CV — use forward chaining (expanding/rolling window). Never shuffle.
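Forward chaining from the last bullet can be sketched in a few lines (`forward_chain_splits` is a hypothetical helper, not a library API):

```python
# Expanding-window time-series CV: training always precedes testing,
# and nothing is shuffled.
def forward_chain_splits(n, n_folds):
    fold = n // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_idx = list(range(0, i * fold))               # expanding window
        test_idx = list(range(i * fold, (i + 1) * fold))   # next block
        yield train_idx, test_idx

splits = list(forward_chain_splits(12, 3))
# Every training set ends strictly before its test set begins.
assert all(max(tr) < min(te) for tr, te in splits)
```

A rolling window is the same idea with the start of `train_idx` advancing as well.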

Classification metrics

Metric | Definition | When to use
Accuracy | $(TP + TN)/(TP + TN + FP + FN)$ | Balanced classes only
Precision | $TP/(TP + FP)$ | Cost of false positives is high
Recall | $TP/(TP + FN)$ | Cost of false negatives is high
F1 | $2 \cdot \mathrm{Prec} \cdot \mathrm{Rec} / (\mathrm{Prec} + \mathrm{Rec})$ | Imbalanced, single summary
ROC-AUC | Area under the ROC curve | Threshold-free ranking quality
PR-AUC | Area under Precision–Recall | Very imbalanced (rare positives)
LogLoss | $-\frac{1}{n}\sum_i \big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$ | Probabilistic, calibration-sensitive

Regression metrics

MSE (penalises outliers quadratically), MAE (robust), RMSE (same units as $y$), $R^2$, MAPE (scale-free but undefined at $y = 0$), Pearson / Spearman / Kendall correlation — indispensable in quant research where the rank of predictions matters more than absolute values.

Common pitfall. Accuracy is nearly meaningless at class imbalance of, say, 99:1. A constant predictor reaches 99%. Always pair it with precision/recall, ROC-AUC, or PR-AUC depending on the problem.

Feature engineering checklist

  • Scaling: Standardise or min-max for distance-based models (kNN, SVM, k-means) and any gradient-based optimization (faster, more stable convergence). Trees are scale-invariant.
  • Encoding: One-hot for low-cardinality categoricals; target/mean encoding with out-of-fold statistics for high-cardinality (prevents target leakage).
  • Missing data: Explicit missing indicator + imputation beats silent imputation, since "missing" is often informative.
  • Leakage audit: Any feature that would not be available at prediction time is leakage. The most expensive bugs in ML are silent leakages.
III
Part Three

Classical Algorithms

Before neural networks, these are the tools that set the baselines. In quantitative research, linear methods and gradient-boosted trees still dominate tabular tasks — they are compact, interpretable, and hard to beat with modest data.

Linear Models

Ordinary Least Squares

$\hat\beta = (X^\top X)^{-1} X^\top y$. Assumes linearity, no multicollinearity, homoscedastic and uncorrelated errors. The Gauss–Markov theorem: OLS is BLUE (Best Linear Unbiased Estimator) under these assumptions.

Logistic Regression

Models $p(y = 1 \mid x) = \sigma(w^\top x)$ with $\sigma(z) = 1/(1 + e^{-z})$. Negative log-likelihood:

$\mathcal{L}(w) = -\sum_i \big[y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\big(1 - \sigma(w^\top x_i)\big)\big]$

Gradient: $\nabla \mathcal{L} = \sum_i \big(\sigma(w^\top x_i) - y_i\big)\, x_i$. The Hessian $X^\top S X$ with $S_{ii} = \sigma_i(1 - \sigma_i)$ is PSD, so the loss is convex; optimizable by IRLS / L-BFGS / SGD.

Multinomial / softmax. $p(y = k \mid x) = \dfrac{\exp(w_k^\top x)}{\sum_j \exp(w_j^\top x)}$. One coefficient vector is redundant (shift invariance); often fix $w_K = 0$.

Generalised linear models (GLMs)

Exponential-family likelihood with link function $g$: $g(\mathbb{E}[y \mid x]) = w^\top x$. Linear regression (identity link, Gaussian), logistic (logit link, Bernoulli), Poisson regression (log link, Poisson). Inference via IRLS.
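A minimal logistic-regression fit by gradient descent, using the gradient form above (synthetic, illustrative data and learning rate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Generate labels from a known weight vector, then recover it.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
w_true = np.array([1.5, -2.0])
y = (sigmoid(X @ w_true) > rng.uniform(size=200)).astype(float)

w = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)   # gradient of the mean NLL
    w -= 0.5 * grad
```

The fitted `w` recovers the sign pattern (and roughly the direction) of `w_true`; exact magnitudes are noisy at this sample size.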

Trees & Ensembles

Decision trees recursively split on the feature and threshold that maximise impurity reduction. Classification uses Gini $\sum_k p_k(1 - p_k)$ or entropy $-\sum_k p_k \log p_k$; regression uses variance reduction. Single trees overfit; depth and leaf-count are the main regularisers.

Trees are scale-invariant, handle mixed types and missing values natively, and are the only mainstream family that can represent non-smooth, interaction-heavy functions without feature engineering.

Random Forests. Bagging: train trees on bootstrap samples, and at each split consider only a random subset of features. Variance-reduction ensemble: averaging $B$ trees of variance $\sigma^2$ with pairwise correlation $\rho$ gives variance $\rho\sigma^2 + \frac{1 - \rho}{B}\sigma^2$. Feature subsampling lowers $\rho$.

Gradient Boosting. Fit each new tree to the negative gradient of the loss w.r.t. the current prediction $F_{m-1}(x)$: $F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$, with $h_m \approx -\partial \ell / \partial F_{m-1}$ and learning rate $\nu$. With squared loss, the negative gradient is the residual; with logistic loss, it is $y - \sigma(F_{m-1}(x))$. Implementations (XGBoost, LightGBM, CatBoost) add second-order terms, regularized leaves, histogram splits, and native categorical handling.
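The residual-fitting loop can be sketched with hand-rolled stumps (a toy illustration of the idea, not how XGBoost is implemented):

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree on 1-D input: best threshold by SSE."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((r - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z, t=t, lv=lv, rv=rv: np.where(z <= t, lv, rv)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, 100))
y = np.sin(x)

nu, F = 0.3, np.zeros_like(y)   # learning rate and running prediction
stumps = []
for _ in range(100):
    stump = fit_stump(x, y - F)  # squared loss: negative gradient = residual
    F = F + nu * stump(x)
    stumps.append(stump)

mse = ((y - F) ** 2).mean()
```

After 100 rounds the in-sample MSE is a small fraction of the target variance; real libraries add regularization precisely because this loop will otherwise chase noise.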

 | Bagging / RF | Boosting / GBM
Primary effect | Reduces variance | Reduces bias
Base learner | Deep trees | Shallow trees (stumps to depth 6)
Parallelism | Embarrassingly parallel | Sequential in trees
Overfitting risk | Low with enough trees | High; controlled by learning rate $\nu$, early stopping, depth

Kernels, k-NN, Naive Bayes

Support Vector Machines

Maximise the margin $2/\|w\|$ subject to $y_i(w^\top x_i + b) \ge 1$ for all $i$. Primal (soft margin):

$\min_{w, b, \xi}\; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.}\;\; y_i(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$

The dual depends only on inner products $x_i^\top x_j$, enabling the kernel trick: replace $x_i^\top x_j$ with $K(x_i, x_j)$ corresponding to an inner product in some RKHS.

Common kernels: linear $x^\top z$; polynomial $(x^\top z + c)^d$; RBF $\exp(-\|x - z\|^2 / 2\sigma^2)$; Matérn (GP-friendly).

k-Nearest Neighbours

No training; classify by majority vote among the $k$ nearest points. Curse of dimensionality: in high dimension, distances concentrate and all points become roughly equidistant. Useful for small, low-dimensional problems and as a local baseline.

Naive Bayes

Assumes features are conditionally independent given the class: $p(x \mid y) = \prod_j p(x_j \mid y)$. Decision: $\hat y = \arg\max_y\, p(y) \prod_j p(x_j \mid y)$. Despite the crude independence assumption, it is an extraordinarily strong text-classification baseline.

Unsupervised Learning

k-Means

Minimises $\sum_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$ by alternating (i) assign each point to the nearest centroid, (ii) update centroids to cluster means. Non-convex; use k-means++ seeding. Assumes spherical, equal-variance clusters.

Gaussian Mixture Models (EM)

$p(x) = \sum_k \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Fit via EM: E-step compute responsibilities $r_{ik} \propto \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$; M-step update $(\pi_k, \mu_k, \Sigma_k)$ by weighted MLE. EM monotonically increases the log-likelihood.

Principal Component Analysis

Find orthonormal directions of maximum variance. Center $X$ and compute the top-$k$ eigenvectors of the covariance $\frac{1}{n} X^\top X$ — equivalently, the top-$k$ right singular vectors of $X$. The first PC maximises $v^\top X^\top X v$ subject to $\|v\| = 1$, giving the leading eigenvector.
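The PCA–SVD equivalence in a few lines of numpy (illustrative random data):

```python
import numpy as np

# First PC of centred X = top right-singular vector of X
# = leading eigenvector of the covariance matrix.
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)                 # center

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                             # top right-singular vector

cov = Xc.T @ Xc / (len(Xc) - 1)
evals, evecs = np.linalg.eigh(cov)      # ascending eigenvalues
v = evecs[:, -1]                        # leading eigenvector
```

The two directions agree up to sign, so the absolute inner product is 1.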

t-SNE / UMAP. Non-linear embeddings for visualisation. Preserve local neighbourhoods but not global distances; never interpret cluster sizes or between-cluster distances literally.

In quant research. PCA on a return covariance matrix yields statistical factors; the leading component is typically the market, the next few correspond to sector/style factors. PCA-based residualisation is the simplest device for constructing market-neutral signals.
IV
Part Four

Deep Learning: Foundations

A neural network is a parameterised function built by composing affine maps and non-linearities. Everything below — activations, initializations, normalizations, optimizers — exists to make the composition of very many such layers trainable at scale.

Multi-layer perceptron

$h^{(l)} = \phi\big(W^{(l)} h^{(l-1)} + b^{(l)}\big)$, $h^{(0)} = x$. Universal approximation: a single hidden layer with enough units can approximate any continuous function on a compact set. In practice, depth buys far more expressivity per parameter than width.

Activation functions

Name | Formula | Notes
Sigmoid | $\sigma(z) = 1/(1 + e^{-z})$ | Saturates; vanishing gradients. Output layer for binary.
Tanh | $\tanh(z)$ | Zero-centered; still saturates.
ReLU | $\max(0, z)$ | Sparse, non-saturating for $z > 0$. Dying-ReLU risk.
Leaky ReLU / PReLU | $\max(\alpha z, z)$ | Small slope for $z < 0$ keeps the gradient alive.
GELU | $z\,\Phi(z)$ | Smooth ReLU. Standard in Transformers.
Swish / SiLU | $z\,\sigma(z)$ | Smooth, self-gated.
Softmax | $e^{z_k}/\sum_j e^{z_j}$ | Multi-class output only.

Backpropagation

Backprop is reverse-mode automatic differentiation applied to a DAG of differentiable primitives. For each node computing $z = f(u)$, given the upstream gradient $\partial \mathcal{L}/\partial z$, compute $\partial \mathcal{L}/\partial u = \big(\partial z/\partial u\big)^\top \partial \mathcal{L}/\partial z$ by the chain rule and accumulate. For the MLP above, with pre-activations $a^{(l)}$ and $\delta^{(l)} = \partial \mathcal{L}/\partial a^{(l)}$: $\delta^{(l-1)} = \big(W^{(l)}\big)^\top \delta^{(l)} \odot \phi'\big(a^{(l-1)}\big)$, $\partial \mathcal{L}/\partial W^{(l)} = \delta^{(l)} \big(h^{(l-1)}\big)^\top$.

Vanishing & exploding gradients. Products of Jacobian norms either collapse to zero or diverge as depth grows. Mitigations: careful initialization, normalization layers, residual connections, gradient clipping.

Initialization

Xavier / Glorot
$\operatorname{Var}(W_{ij}) = 2/(n_{\text{in}} + n_{\text{out}})$. Suited to tanh / sigmoid.
He / Kaiming
$\operatorname{Var}(W_{ij}) = 2/n_{\text{in}}$. Suited to ReLU.
Orthogonal
$W$ a random orthogonal matrix; preserves activation norms through depth. Good for RNNs.

Optimizers

All practical optimizers for deep learning fit the pattern: maintain a state of first- and second-moment estimates of the gradient, apply scaled updates, optionally decouple weight decay.

Optimizer | Update (schematic) | Notes
SGD | $\theta \leftarrow \theta - \eta\, g$ | Often best for convolutional vision, paired with momentum & schedule.
SGD + Momentum | $v \leftarrow \beta v + g$; $\theta \leftarrow \theta - \eta\, v$ | Smooths the gradient direction; $\beta \approx 0.9$.
Nesterov | Evaluates the gradient at the look-ahead point | Theoretical improvement; modest in practice.
RMSProp | $\theta \leftarrow \theta - \eta\, g / (\sqrt{v} + \epsilon)$, $v$ an EMA of $g^2$ | Per-parameter scaling by gradient magnitude.
Adam | Bias-corrected EMAs of $g$ and $g^2$ | Momentum + RMSProp + bias correction.
AdamW | As Adam, but weight decay applied directly to $\theta$ | Decoupled weight decay; current default for Transformers.

Learning-rate schedules

  • Step / exponential decay — classical, robust.
  • Cosine annealing — smooth, widely used.
  • Warmup — linearly ramp for the first few thousand steps; essential for Transformers, where early updates are otherwise catastrophic.
  • One-cycle — warm up, then cosine down; enables aggressive max LRs.
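A warmup-plus-cosine schedule is a few lines; the hyperparameters below are illustrative:

```python
import math

def lr_at(step, max_lr=1e-3, warmup=1000, total=100_000):
    """Linear warmup to max_lr, then cosine decay to ~0."""
    if step < warmup:
        return max_lr * step / warmup                      # linear warmup
    t = (step - warmup) / (total - warmup)                 # progress in [0, 1]
    return 0.5 * max_lr * (1 + math.cos(math.pi * t))      # cosine decay
```

The learning rate peaks exactly at the end of warmup and decays smoothly to zero at `total` steps.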

Regularization & Normalization

Weight decay
Adds $\frac{\lambda}{2}\|\theta\|_2^2$ to the loss; equivalent to a Gaussian prior. Implement as decoupled (AdamW) for adaptive optimizers.
Dropout
Randomly zero each activation with probability $p$ at training time; scale by $1/(1 - p)$. Approximates averaging over an exponential ensemble of sub-networks.
Early stopping
Halt when validation loss stops improving. Implicit regularization equivalent to an $\ell_2$ constraint in some linear settings.
Data augmentation
Learn invariances (flips, crops, MixUp, CutMix, noise). Essentially the most effective regularizer in computer vision.
Label smoothing
Replace the one-hot target with $1 - \epsilon$ on the true class and $\epsilon/(K - 1)$ elsewhere. Prevents overconfident logits; improves calibration.

Normalization

Layer | Normalizes over | Typical use
BatchNorm | Batch dim, per channel | CNNs; requires large, stable batches.
LayerNorm | Feature dim, per example | Transformers, RNNs; batch-size agnostic.
GroupNorm | Groups of channels, per example | Small batches, detection, segmentation.
RMSNorm | Feature dim, no mean centring | Modern LLMs; cheaper, performs well.
Classical vs modern view. BatchNorm was originally motivated as reducing "internal covariate shift" — later work argued its real benefit is smoothing the loss landscape. Either way: the effect is a significantly higher tolerated learning rate and faster training.
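A LayerNorm forward pass, matching the table's "feature dim, per example" row (minimal numpy sketch):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each example across its last (feature) dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(8)
x = rng.standard_normal((4, 16)) * 3.0 + 2.0          # arbitrary scale/shift
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
```

Each row of `y` has (approximately) zero mean and unit variance regardless of the other examples in the batch — the batch-size-agnostic property the table refers to.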

Loss functions

MSE
$\frac{1}{n}\sum_i (y_i - \hat y_i)^2$. MLE under Gaussian noise.
MAE / Huber
Robust to outliers; Huber is quadratic near zero, linear in the tail.
Binary cross-entropy
$-\frac{1}{n}\sum_i \big[y_i \log \hat p_i + (1 - y_i)\log(1 - \hat p_i)\big]$. MLE for Bernoulli.
Categorical CE
$-\sum_k y_k \log \hat p_k$. MLE for Categorical / softmax.
Contrastive / InfoNCE
$-\log \dfrac{\exp(\operatorname{sim}(z, z^+)/\tau)}{\sum_j \exp(\operatorname{sim}(z, z_j)/\tau)}$. Lower bound on mutual information; core of self-supervised learning.
V
Part Five

Deep Learning: Architectures

Convolutional Networks (CNN)

A convolution layer applies a small shared kernel across spatial positions: $(x * w)[i, j] = \sum_{u, v} x[i + u,\, j + v]\; w[u, v]$. Parameter-sharing + translation equivariance are the inductive biases that make CNNs so efficient on images.

Output spatial size after a conv with kernel $k$, padding $p$, stride $s$ on input size $n$: $\lfloor (n + 2p - k)/s \rfloor + 1$. Receptive field grows linearly with depth for fixed $k$; dilations expand it exponentially.

Pooling (max, average) downsamples and builds local invariance. Modern architectures often replace pooling with strided convolutions.

Key designs

  • ResNet. $h_{l+1} = h_l + F(h_l)$. Residual connections let gradients flow through identity paths — the reason networks can go to hundreds of layers without collapsing.
  • Inception / MobileNet. Factorised / depthwise-separable convolutions: $k \times k$ → ($k \times k$ depthwise) + ($1 \times 1$ pointwise). Comparable expressivity at a fraction of the FLOPs.
  • U-Net. Encoder–decoder with skip connections at each resolution. Dominant for segmentation; the backbone of most diffusion models.

Recurrent Networks (RNN / LSTM / GRU)

Vanilla RNN: $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$. Backprop through time unrolls the recurrence; gradients flow through products of Jacobians, producing vanishing/exploding gradients on long sequences.

LSTM

Gated cell with explicit memory $c_t$:

$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$
$\tilde c_t = \tanh(W_c[h_{t-1}, x_t] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t, \quad h_t = o_t \odot \tanh(c_t)$

The additive cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t$ is the key: gradients propagate through $c_t$ without repeated multiplication by Jacobians, taming vanishing gradients. GRU is a streamlined variant with 2 gates and no separate cell state.

Attention & Transformers

Scaled dot-product attention. Given queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{n \times d_k}$, values $V \in \mathbb{R}^{n \times d_v}$:

$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$

The $\sqrt{d_k}$ keeps the dot products at a scale where the softmax is not saturated. Complexity is $O(n^2)$ in sequence length $n$.

Multi-head attention. Run $h$ heads in parallel with separate projections $W_i^Q, W_i^K, W_i^V$, concatenate the outputs, and project with $W^O$. Each head learns a different relational pattern.
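The attention formula in numpy (single head, no masking; shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # stability shift
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # rows sum to 1
    return w @ V, w

rng = np.random.default_rng(5)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = attention(Q, K, V)
```

Causal masking would set `scores[i, j] = -inf` for `j > i` before the softmax; multi-head attention runs this per head on projected inputs.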

Transformer block. Pre-norm variant, the modern default:

x = x + MHA(LN(x))            # self-attention sub-layer
x = x + MLP(LN(x))            # feed-forward sub-layer, typically 4x width

Positional information. Attention is permutation-equivariant; inject order via sinusoidal embeddings, learned absolute positions, or — in modern LLMs — Rotary Position Embeddings (RoPE) applied to the queries and keys.

Causal vs bidirectional. Decoder-only (GPT-style) masks future positions — used for autoregressive generation. Encoder (BERT-style) sees both directions — used for embedding / classification.

Scaling laws. For compute-optimal training, model size and data should scale roughly proportionally (Chinchilla-style). Loss follows power laws in each with an irreducible floor; an extra order of magnitude of either gives diminishing returns unless both grow.

Generative Models

Autoencoders & VAEs

An autoencoder learns the identity through a bottleneck, $\hat x = g(f(x))$. A Variational AE places a latent prior $p(z)$ and an inference network $q_\phi(z \mid x)$, maximising the ELBO:

$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$

Trained by reparameterisation: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.

Generative Adversarial Networks

Minimax game between generator $G$ and discriminator $D$:

$\min_G \max_D \;\; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$

Non-saturating generator loss, Wasserstein GAN with gradient penalty, spectral normalisation — all are stabilisations around this same objective.

Diffusion models

Forward process adds Gaussian noise: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\big)$. A network $\epsilon_\theta(x_t, t)$ is trained to predict the noise; sampling reverses the chain. Simple loss: $\mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$.

Variants: score-matching / SDE formulation, DDIM for deterministic sampling, classifier-free guidance for conditional generation.

VI
Part Six

Topics for Quantitative Research

Quantitative research inherits the full ML toolkit but operates under constraints foreign to most ML papers: low signal-to-noise, non-stationarity, non-i.i.d. samples, and economic costs for every false positive. The techniques below address these constraints directly.

Time Series

Stationarity
A process is (weakly) stationary if its mean and autocovariance are time-invariant. Most classical methods assume stationarity; in finance, price series are non-stationary, but returns are approximately so.
ARMA(p,q)
. AR: last lags; MA: last innovations.
ARIMA(p,d,q)
Difference the series times to achieve stationarity, then fit ARMA.
GARCH(1,1)
. Captures volatility clustering — the single most robust empirical feature of financial returns.
Cointegration
Non-stationary may have a stationary linear combination . Foundation of pairs / statistical-arbitrage strategies.

Autocorrelation & tests

ACF $\rho_k = \operatorname{Corr}(x_t, x_{t-k})$; PACF is the partial autocorrelation at lag $k$ controlling for intermediate lags. Ljung–Box tests jointly whether the first $m$ autocorrelations are zero. ADF / KPSS test (non-)stationarity.

State-space models & the Kalman filter

Linear Gaussian state-space model: $z_t = A z_{t-1} + w_t$, $w_t \sim \mathcal{N}(0, Q)$ (state equation); $y_t = C z_t + v_t$, $v_t \sim \mathcal{N}(0, R)$ (observation equation).

Kalman recursion alternates predict and update. It is the optimal linear estimator and, under Gaussian noise, the optimal estimator full stop. Used in quant for dynamic hedge ratios, time-varying betas, and signal smoothing.
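A scalar Kalman filter sketch for a random-walk state (noise variances are illustrative):

```python
import numpy as np

def kalman_1d(ys, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Filter a random-walk state x_t = x_{t-1} + w observed as y_t = x_t + v."""
    x, p, est = x0, p0, []
    for y in ys:
        p = p + q                 # predict: state variance grows by q
        k = p / (p + r)           # Kalman gain
        x = x + k * (y - x)       # update with the innovation
        p = (1 - k) * p
        est.append(x)
    return np.array(est)

rng = np.random.default_rng(6)
true = np.cumsum(rng.normal(0, 0.1, 200))     # latent random walk
ys = true + rng.normal(0, 1.0, 200)           # noisy observations

est = kalman_1d(ys, q=0.01, r=1.0)
```

With correctly specified `q` and `r`, the filtered estimate tracks the latent state far better than the raw observations do.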

Portfolio Theory & Risk

Mean–variance optimization

Minimum-variance portfolio with target return $\mu_0$:

$\min_w \; w^\top \Sigma w \quad \text{s.t.}\;\; w^\top \mu = \mu_0,\;\; w^\top \mathbf{1} = 1$

Closed-form solution via Lagrange multipliers. Tangency (max-Sharpe) portfolio: $w^\star \propto \Sigma^{-1} \mu$.
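The tangency weights $w \propto \Sigma^{-1}\mu$ in numpy (illustrative $\mu$ and $\Sigma$):

```python
import numpy as np

# Tangency portfolio: solve Sigma w = mu, then normalise to full investment.
mu = np.array([0.05, 0.08, 0.03])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.02]])

raw = np.linalg.solve(Sigma, mu)
w = raw / raw.sum()                           # budget constraint w'1 = 1

sharpe = (w @ mu) / np.sqrt(w @ Sigma @ w)

# Sanity check: equal weights cannot beat the tangency Sharpe.
w_eq = np.ones(3) / 3
sharpe_eq = (w_eq @ mu) / np.sqrt(w_eq @ Sigma @ w_eq)
```

Because the Sharpe ratio is scale-invariant, any other fully-invested portfolio has a Sharpe no higher than `sharpe`.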

Risk & performance metrics

Metric | Definition | Notes
Sharpe ratio | $(\mathbb{E}[r] - r_f)/\sigma(r)$ | Annualize by $\sqrt{252}$ for daily.
Sortino | Uses downside deviation only | Penalises only negative volatility.
Information ratio | Active return / tracking error | Quality of active return vs benchmark.
Max drawdown | $\max_t \big(\max_{s \le t} V_s - V_t\big) / \max_{s \le t} V_s$ | Peak-to-trough loss.
VaR | $\alpha$-quantile of the loss distribution | Quantile; not subadditive.
CVaR / ES | $\mathbb{E}[\,\text{loss} \mid \text{loss} \ge \mathrm{VaR}_\alpha\,]$ | Coherent; average tail loss.

Covariance estimation

The sample covariance $S$ is noisy when the number of observations is not much larger than the number of assets, and inverting it (as in MV optimization) amplifies that noise. Remedies: Ledoit–Wolf shrinkage $\hat\Sigma = (1 - \delta) S + \delta F$ towards a structured target $F$; factor-model covariance $\Sigma = B \Sigma_f B^\top + D$; minimum-variance with no-short-sale constraints acts as implicit regularization.

Backtesting: Hazards & Hygiene

A backtest is a hypothesis about the past. It becomes a prediction about the future only if every source of look-ahead, selection, and multiple-testing bias has been ruled out. Most "strategies" that fail in production failed the backtest too — readers just did not notice.
  • Look-ahead bias. Using data not yet available at the trade decision time. Easily hidden in resampling, normalization, or target encodings computed on the full sample.
  • Survivorship bias. Universe includes only firms that exist today, not the dead ones. Systematically inflates returns.
  • Selection bias. Data-mining a strategy from a large search space without correcting for multiplicity. See deflated Sharpe ratio and the Bailey–López de Prado corrections.
  • Transaction costs & slippage. A signal that is alpha gross of costs can easily be zero or negative net of them.
  • Regime / non-stationarity. A backtest spanning one regime cannot validate a strategy in another. Always report performance per sub-period.
  • Purged / embargoed cross-validation. Remove observations whose label windows overlap training data; embargo a gap after each test fold to prevent leakage of serial correlation.

Time-series cross-validation

Forward-chaining expanding-window CV respects causality. For overlapping labels (e.g. forward $k$-day returns), apply purging — remove training samples whose label window intersects the test window — and an embargo — a buffer after the test set.

ML for Alpha — What Actually Matters

  • Signal-to-noise is the enemy. Daily returns have SNR orders of magnitude lower than images or text. Capacity of the model must match the information in the data — enormous networks overfit noise.
  • Prefer rank-based and robust losses. Squared error is dominated by outliers. Quantile, Huber, and rank-based losses (Spearman surrogates) typically generalise better for return prediction.
  • Feature stability beats peak accuracy. A signal that works moderately across sub-periods is more valuable than one that is excellent in one and useless in another.
  • Combine linear + non-linear. Linear factors remain the bulk of explanatory power; trees / NNs add interactions. Stack or residualise.
  • Bet sizing is orthogonal to signal. Kelly (or a fractional Kelly) links expected edge and variance to position size. A good signal sized poorly is a loss-making strategy.
  • Evaluate economic utility, not just statistical fit. Out-of-sample Sharpe, turnover, capacity, drawdown, and sensitivity to costs — in that order.
VII
Part Seven

Exercises & Solutions

Each exercise is labelled by topic and difficulty (◆ introductory, ◆◆ intermediate, ◆◆◆ advanced). Attempt each problem before reading the worked solution. The point of these is not the answer itself but the chain of reasoning; they are chosen to exercise canonical techniques you will reuse for the rest of your career.

Exercise 1 · Statistics — Bias–Variance · ◆◆

Derive the bias–variance decomposition.

Let $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, $\varepsilon$ independent of the training data. For a learned predictor $\hat f$ (random over training sets), show that

$\mathbb{E}\big[(y - \hat f(x))^2\big] = \big(f(x) - \mathbb{E}[\hat f(x)]\big)^2 + \operatorname{Var}\big(\hat f(x)\big) + \sigma^2.$

Solution.

Write $\bar f = \mathbb{E}[\hat f(x)]$. Expand: $\mathbb{E}\big[(y - \hat f(x))^2\big] = \mathbb{E}\big[(f(x) + \varepsilon - \hat f(x))^2\big]$. Because $\varepsilon$ is independent of $\hat f(x)$ with zero mean, cross terms vanish: $\mathbb{E}\big[(y - \hat f(x))^2\big] = \mathbb{E}\big[(f(x) - \hat f(x))^2\big] + \sigma^2$. Add and subtract $\bar f$ inside the square: $\mathbb{E}\big[(f(x) - \hat f(x))^2\big] = \big(f(x) - \bar f\big)^2 + \mathbb{E}\big[(\bar f - \hat f(x))^2\big] + 2\big(f(x) - \bar f\big)\,\mathbb{E}\big[\bar f - \hat f(x)\big]$. The cross term vanishes because $\mathbb{E}[\hat f(x)] = \bar f$. The two remaining terms are Bias$^2$ and Variance, giving the claim.

Exercise 2 · Linear Models — Ridge

Derive the closed form for ridge regression.

For $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, $\lambda > 0$, find $w$ minimising $\|y - Xw\|_2^2 + \lambda\|w\|_2^2$. Show the solution is always well-defined when $\lambda > 0$.

Solution.

Expand the objective: $J(w) = w^\top X^\top X w - 2 y^\top X w + y^\top y + \lambda\, w^\top w$. Using the matrix-calculus identities $\nabla_w\, a^\top w = a$ and $\nabla_w\, w^\top A w = 2Aw$ for symmetric $A$: $\nabla J = 2(X^\top X + \lambda I)w - 2X^\top y = 0 \;\Rightarrow\; \hat w = (X^\top X + \lambda I)^{-1} X^\top y$. $X^\top X$ is PSD with eigenvalues $\mu_i \ge 0$; adding $\lambda I$ shifts them to $\mu_i + \lambda > 0$, so the matrix is strictly positive definite and hence invertible. The Hessian $2(X^\top X + \lambda I) \succ 0$ confirms the critical point is a global minimum.

Exercise 3 · Linear Models — Logistic · ◆◆

Gradient and Hessian of the logistic loss; prove convexity.

Let $\sigma(z) = 1/(1 + e^{-z})$ and $\mathcal{L}(w) = -\sum_i \big[y_i \log \sigma(w^\top x_i) + (1 - y_i)\log(1 - \sigma(w^\top x_i))\big]$. Derive $\nabla \mathcal{L}$ and $\nabla^2 \mathcal{L}$ and show the Hessian is PSD.

Solution.

Key fact: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. For a single point with $p_i = \sigma(w^\top x_i)$: $\frac{\partial}{\partial z}\big[y \log \sigma(z) + (1 - y)\log(1 - \sigma(z))\big] = y - \sigma(z)$. Chain through $z_i = w^\top x_i$ and sum: $\nabla \mathcal{L} = \sum_i (p_i - y_i)\, x_i = X^\top(p - y)$. For the Hessian, differentiate again: $\partial p_i / \partial w = p_i(1 - p_i)\, x_i$, giving $\nabla^2 \mathcal{L} = \sum_i p_i(1 - p_i)\, x_i x_i^\top = X^\top S X$ with $S = \operatorname{diag}\big(p_i(1 - p_i)\big)$. Since $S$ has non-negative entries, $X^\top S X$ is PSD: for any $v$, $v^\top X^\top S X v = \|S^{1/2} X v\|^2 \ge 0$. So $\mathcal{L}$ is convex.

Exercise 4 · Deep Learning — Softmax · ◆◆

Derivative of softmax + cross-entropy.

Let $p = \operatorname{softmax}(z)$ and $\mathcal{L} = -\sum_k y_k \log p_k$ with one-hot $y$. Show $\partial \mathcal{L} / \partial z = p - y$.

Solution.

First derive the softmax Jacobian. Writing $p_k = e^{z_k} / \sum_j e^{z_j}$: $\partial p_k / \partial z_j = p_k(\delta_{kj} - p_j)$. Now, $\frac{\partial \mathcal{L}}{\partial z_j} = -\sum_k \frac{y_k}{p_k}\, p_k(\delta_{kj} - p_j) = -y_j + p_j \sum_k y_k$. Since $y$ is one-hot, $\sum_k y_k = 1$, so $\partial \mathcal{L}/\partial z = p - y$. This is why the softmax+CE gradient has such a clean implementation — it is literally the prediction error.

Exercise 5 · Linear Algebra — PCA · ◆◆

PCA as an eigenvalue problem; relate to SVD.

Let $X \in \mathbb{R}^{n \times d}$ be mean-centred. Show that the first principal component — the unit direction $v$ maximising $v^\top X^\top X v$ — is the top right-singular vector of $X$, and that the explained variance equals $\sigma_1^2 / n$.

Solution.

Because $X$ is centred, $\frac{1}{n}\, v^\top X^\top X v$ is the sample variance of the projections $Xv$. The maximisation of the Rayleigh quotient $\frac{v^\top X^\top X v}{v^\top v}$ attains its maximum at the largest eigenvalue of $X^\top X$, at the corresponding eigenvector. Write $X = U \Sigma V^\top$ (SVD). Then $X^\top X = V \Sigma^2 V^\top$, so the top eigenvector of $X^\top X$ is $v_1$ (first column of $V$) with eigenvalue $\sigma_1^2$. Hence the first PC is $v_1$ and its explained variance is $\sigma_1^2 / n$.

Exercise 6 · Deep Learning — Backprop · ◆◆◆

Backpropagation through a 2-layer MLP.

Consider the network $z_1 = W_1 x + b_1$, $h = \mathrm{ReLU}(z_1)$, $z_2 = W_2 h + b_2$, $p = \operatorname{softmax}(z_2)$, with cross-entropy loss $\mathcal{L} = -\sum_k y_k \log p_k$ against one-hot $y$. Write all gradients in closed form.

Solution.

Output layer. By Exercise 4, $\delta_2 = \partial \mathcal{L}/\partial z_2 = p - y$. Then $\partial \mathcal{L}/\partial W_2 = \delta_2 h^\top$ and $\partial \mathcal{L}/\partial b_2 = \delta_2$.

Hidden layer. Propagate back: $\partial \mathcal{L}/\partial h = W_2^\top \delta_2$. Through the ReLU, $\delta_1 = W_2^\top \delta_2 \odot \mathbb{1}[z_1 > 0]$ (elementwise indicator, zero on the inactive units): $\partial \mathcal{L}/\partial W_1 = \delta_1 x^\top$ and $\partial \mathcal{L}/\partial b_1 = \delta_1$. Note the dying-ReLU pathology is visible here: once $z_1$ is negative on a unit, $\mathbb{1}[z_1 > 0] = 0$ there, and it never receives a gradient.

Exercise 7 · Information — KL

KL divergence is not symmetric.

Give concrete $p, q$ on $\{0, 1\}$ such that $\mathrm{KL}(p \| q) \ne \mathrm{KL}(q \| p)$. Then interpret the difference between forward and reverse KL when fitting $q$ to a multi-modal $p$.

Solution.

Take $p = (1/2, 1/2)$ and $q = (3/4, 1/4)$. Then $\mathrm{KL}(p \| q) = \tfrac12 \log\tfrac{1/2}{3/4} + \tfrac12 \log\tfrac{1/2}{1/4} \approx 0.144$ nats, while $\mathrm{KL}(q \| p) = \tfrac34 \log\tfrac{3/4}{1/2} + \tfrac14 \log\tfrac{1/4}{1/2} \approx 0.131$ nats. Not equal. Interpretation. Forward KL $\mathrm{KL}(p \| q)$ is mean-seeking / mass-covering: $q$ must put probability wherever $p$ does, so when $q$ is too simple it spreads across all modes of $p$. Reverse KL $\mathrm{KL}(q \| p)$ is mode-seeking: $q$ collapses onto one mode, because any mass where $p \approx 0$ is catastrophically penalised. Variational inference minimises reverse KL — hence VI's famous tendency to be over-confident.

Exercise 8 · Information — CE ≡ MLE

Minimising cross-entropy is maximum likelihood.

Show that, for a classifier outputting $q_\theta(y \mid x)$, minimising empirical cross-entropy is equivalent to maximum likelihood and to minimising $\mathrm{KL}(\hat p \,\|\, q_\theta)$.

Solution.

Empirical cross-entropy is $-\frac{1}{n}\sum_i \log q_\theta(y_i \mid x_i)$. Minimising this is exactly maximising the log-likelihood $\sum_i \log q_\theta(y_i \mid x_i)$. Next, letting $\hat p$ be the empirical distribution, $\mathrm{KL}(\hat p \| q_\theta) = \sum \hat p \log \hat p - \sum \hat p \log q_\theta$. The first term is $-H(\hat p)$, independent of $\theta$; the second is exactly the cross-entropy loss. So argmin KL = argmin CE = argmax likelihood.

Exercise 9 · Deep Learning — Regularization · ◆◆

Dropout as an implicit ensemble.

A network with $n$ dropout-able units applies independent Bernoulli masks. Explain why dropout approximates averaging over $2^n$ sub-networks, and why the weight rescaling at test time is required.

Solution.

Each training step samples a binary mask $m$ with $m_j \sim \mathrm{Bernoulli}(1 - p)$ independently — one of $2^n$ possible sub-networks. The training objective is $\mathbb{E}_m\big[\mathcal{L}(f_m(x), y)\big]$, and SGD with single mask samples is an unbiased estimator of its gradient. Thus dropout optimizes an ensemble loss.

At test time we want the mean prediction $\mathbb{E}_m[f_m(x)]$. A cheap approximation is to deactivate the mask and scale each activation by $1 - p$ — because each unit was "on" with probability $1 - p$ during training, its expected contribution to downstream linear combinations is $1 - p$ times its weight. In practice this is implemented as "inverted dropout": scale by $1/(1 - p)$ during training, identity at test time. For a single linear layer, this is exact; for non-linear networks, it is a principled approximation to the geometric mean over sub-networks.
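The rescaling argument can be checked empirically (inverted dropout with an illustrative $p$):

```python
import numpy as np

# Inverted dropout: scale surviving activations by 1/(1-p) at train time so
# that the expected activation matches the unmasked test-time value.
rng = np.random.default_rng(9)
p = 0.5
x = np.ones(1_000_000)                      # constant activations of 1
mask = rng.uniform(size=x.shape) >= p       # keep with probability 1 - p
train_out = x * mask / (1 - p)
```

The mean of `train_out` is 1 up to Monte-Carlo noise, matching the test-time (no-mask, no-scale) output exactly in expectation.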

Exercise 10 · Attention · ◆◆

Why does scaled dot-product attention divide by ?

Queries and keys have components with approximately zero mean and unit variance. Show that without the $1/\sqrt{d_k}$ factor, the variance of $q^\top k$ grows with $d_k$, and argue why this hurts softmax training.

Solution.

Assume the components of $q$ and $k$ are independent with mean 0 and variance 1. Then $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and, by independence and $\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$, variance $d_k$. So the standard deviation of the logits grows as $\sqrt{d_k}$.

If logits have large magnitude, softmax concentrates its mass on the single largest entry. The softmax derivative then collapses to near-zero across the board — vanishing gradients. Dividing by $\sqrt{d_k}$ rescales the variance back to 1, keeping the softmax in its well-behaved regime across model sizes.
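A Monte-Carlo check of the variance claim (sample count and dimensions are illustrative):

```python
import numpy as np

# Var(q . k) grows like d_k for unit-variance components;
# dividing by sqrt(d_k) restores unit variance.
rng = np.random.default_rng(7)
n = 50_000
for d in (4, 64):
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    dots = (q * k).sum(axis=1)
    assert abs(dots.var() / d - 1.0) < 0.05              # Var ~ d_k
    assert abs((dots / np.sqrt(d)).var() - 1.0) < 0.05   # rescaled to ~1
```

The same check with larger `d` shows the unscaled logit spread growing without bound, which is exactly what pushes the softmax into saturation.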

Exercise 11 · Deep Learning — BatchNorm · ◆◆

Why does BatchNorm behave differently at train and test time?

Describe the exact computation of BatchNorm during training vs inference, explain why the two must differ, and identify a failure mode at small batch size.

Solution.

Training. For each feature (channel), compute the batch mean $\mu_B$ and variance $\sigma_B^2$, normalise $\hat x = (x - \mu_B)/\sqrt{\sigma_B^2 + \epsilon}$, then affine-transform $y = \gamma \hat x + \beta$ with learned $\gamma, \beta$. The batch statistics couple the examples in a batch and introduce stochasticity (a mild regularizer).

Inference. Use running-average estimates of the mean and variance accumulated during training (typically via exponential moving average). This must differ from training because (i) at inference, batches may be of size 1, making batch statistics meaningless, and (ii) predictions would otherwise depend on which other examples happen to be in the batch — unacceptable.

Failure mode. Small batches (say, 2–8) give high-variance $\mu_B, \sigma_B^2$, so training becomes noisy and the running averages accumulated for inference become unreliable — the classic "BN breaks for detection with large backbones". Remedies: GroupNorm, LayerNorm, or SyncBN across devices.

Exercise 12 · RNN — Vanishing Gradients · ◆◆◆

Why do vanilla RNN gradients vanish, and how does the LSTM cell fix it?

Derive the gradient of the loss at time $T$ w.r.t. a hidden state at time $t < T$ for a vanilla RNN, and contrast the form with the LSTM cell-state path.

Solution.

Vanilla RNN. With $h_t = \tanh(W h_{t-1} + U x_t)$, $\frac{\partial h_s}{\partial h_{s-1}} = \operatorname{diag}\big(1 - h_s^2\big)\, W$. The gradient through $T - t$ steps is the product $\frac{\partial \mathcal{L}_T}{\partial h_t} = \frac{\partial \mathcal{L}_T}{\partial h_T} \prod_{s = t+1}^{T} \operatorname{diag}\big(1 - h_s^2\big)\, W$. Since $|1 - h_s^2| \le 1$ and the spectral radius of $W$ is typically either $< 1$ (vanishing) or $> 1$ (exploding), the norm of this product tends to zero or infinity exponentially in $T - t$. Long-range dependencies cannot be learned.

LSTM cell. The key is the cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t$. Differentiating along the cell path, $\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)$ (plus indirect terms through the gates). Provided the forget gate stays close to 1, the product $\prod_s f_s$ does not vanish, and the gradient path through the cell state is an (almost) straight linear pipe — comparable to a residual connection. This is what enables long-range dependencies.

Exercise 13 · VAE — ELBO · ◆◆◆

Derive the evidence lower bound.

For latent-variable model $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ and any distribution $q(z \mid x)$, show $\log p_\theta(x) = \mathbb{E}_q[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q \| p(z)\big) + \mathrm{KL}\big(q \| p_\theta(z \mid x)\big)$. Conclude the first two terms form a lower bound on $\log p_\theta(x)$.

Solution.

Start from $\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$. Multiply and divide by $q(z \mid x)$: $\log p_\theta(x) = \log \mathbb{E}_q\!\left[\frac{p_\theta(x, z)}{q(z \mid x)}\right] \ge \mathbb{E}_q\!\left[\log \frac{p_\theta(x, z)}{q(z \mid x)}\right]$ by Jensen. Alternatively, without Jensen, write $\log p_\theta(x) = \mathbb{E}_q[\log p_\theta(x)]$ (the inner term is constant in $z$) and use $p_\theta(x) = p_\theta(x, z)/p_\theta(z \mid x)$: $\log p_\theta(x) = \mathbb{E}_q\!\left[\log \frac{p_\theta(x, z)}{q(z \mid x)}\right] + \mathrm{KL}\big(q(z \mid x)\,\|\,p_\theta(z \mid x)\big)$. Insert $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ and regroup: $\log p_\theta(x) = \mathbb{E}_q[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q \| p(z)\big) + \mathrm{KL}\big(q \| p_\theta(z \mid x)\big)$. Since the final KL is non-negative, the first two terms (the ELBO) are a lower bound, tight iff $q = p_\theta(z \mid x)$. Maximising the ELBO jointly in $(\theta, q)$ thus maximises a bound on the data likelihood and drives $q$ towards the true posterior.

Exercise 14 · Optimization — Adam · ◆◆

Why the bias correction in Adam?

In Adam, $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$, with $m_0 = 0$. Show that if $g_t = g$ is constant, $m_t = (1 - \beta_1^t)\, g$. Hence explain why we divide by $1 - \beta_1^t$.

Solution.

Unrolling the recursion: $m_t = (1 - \beta_1) \sum_{s=1}^{t} \beta_1^{t-s} g_s$. With constant $g_s = g$, the geometric sum gives $m_t = (1 - \beta_1)\, g\, \frac{1 - \beta_1^t}{1 - \beta_1} = (1 - \beta_1^t)\, g$. Early in training (small $t$), $1 - \beta_1^t$ is small, so $m_t$ underestimates $g$ — the moment is biased toward zero because of the zero initialisation. Dividing by $1 - \beta_1^t$ produces an unbiased estimate $\hat m_t = m_t / (1 - \beta_1^t)$. The same logic applies to $v_t$ with $\beta_2$. Without this correction Adam would take unnecessarily small steps in its first few hundred iterations.
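The constant-gradient claim is easy to verify numerically:

```python
# With a constant gradient g, the EMA m_t equals (1 - beta1**t) * g,
# so the bias-corrected m_t / (1 - beta1**t) recovers g at every step.
beta1, g = 0.9, 2.0
m = 0.0
history = []
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    history.append(m / (1 - beta1 ** t))   # bias-corrected estimate
```

Every entry of `history` equals `g` (up to float rounding), while the raw `m` is still well below `g` after 10 steps — the bias the correction removes.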

Exercise 15 · Quant — Sharpe Annualization

Annualise a daily Sharpe ratio.

A strategy's daily excess returns have mean $\mu$ and standard deviation $\sigma$. Assuming 252 trading days and i.i.d. returns, compute the daily and annual Sharpe ratios. Then discuss what breaks if returns are autocorrelated.

Solution.

Daily Sharpe $\mathrm{SR}_d = \mu / \sigma$. Under i.i.d. summation, annual mean $= 252\,\mu$ and annual variance $= 252\,\sigma^2$, so

$$\mathrm{SR}_{\text{ann}} = \frac{252\,\mu}{\sqrt{252}\,\sigma} = \sqrt{252}\;\mathrm{SR}_d.$$

With autocorrelation. $\operatorname{Var}\!\big(\sum_{t=1}^{n} r_t\big) = \sigma^2\big(n + 2\sum_{k=1}^{n-1}(n-k)\,\rho_k\big)$. Positive autocorrelation inflates the variance; naive $\sqrt{252}$ scaling overstates the Sharpe. Use an effective sample size or the Newey–West standard error and aggregate at an appropriate horizon. Strategies with strong serial correlation (e.g. trend-following on a single asset) are routinely reported with overstated Sharpes.
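A sketch of the naive annualisation next to a Lo-style autocorrelation adjustment (helper names and the lag cutoff are mine; the adjustment implements the variance formula above):

```python
import numpy as np

def sharpe_annualized(returns, periods=252):
    """Naive i.i.d. annualisation: SR_ann = sqrt(periods) * SR_daily."""
    sr_d = returns.mean() / returns.std(ddof=1)
    return np.sqrt(periods) * sr_d

def sharpe_annualized_autocorr(returns, periods=252, max_lag=10):
    """Autocorrelation-aware annualisation (Lo-2002-style scaling)."""
    r = returns - returns.mean()
    n = len(r)
    var = r @ r / (n - 1)
    rho = [(r[:-k] @ r[k:]) / ((n - k) * var) for k in range(1, max_lag + 1)]
    # Variance of the q-period sum grows by q + 2*sum (q-k) rho_k, not q.
    q = periods
    scale = q + 2 * sum((q - k) * p for k, p in enumerate(rho, start=1))
    sr_d = returns.mean() / returns.std(ddof=1)
    return q * sr_d / np.sqrt(scale)

rng = np.random.default_rng(1)
iid = rng.normal(0.0005, 0.01, 2000)     # synthetic i.i.d. daily returns
sr_naive = sharpe_annualized(iid)
sr_adj = sharpe_annualized_autocorr(iid)
```

For i.i.d. returns the two estimates agree up to estimation noise in $\rho_k$; for positively autocorrelated returns the adjusted figure comes out materially lower.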

Exercise 16 · Quant — Portfolio ◆◆

Derive the tangency (max-Sharpe) portfolio.

Given excess returns $\mu$ and covariance $\Sigma \succ 0$, find $w$ maximising $\mathrm{SR}(w) = \dfrac{w^{\top}\mu}{\sqrt{w^{\top}\Sigma w}}$.

Solution.

$\mathrm{SR}$ is invariant under $w \mapsto c\,w$ for $c > 0$, so maximising $\mathrm{SR}$ is equivalent to maximising $w^{\top}\mu$ subject to $w^{\top}\Sigma w = 1$ (a convenient normalisation). Lagrangian $\mathcal{L} = w^{\top}\mu - \tfrac{\lambda}{2}\,(w^{\top}\Sigma w - 1)$:

$$\nabla_w \mathcal{L} = \mu - \lambda\,\Sigma w = 0 \quad\Rightarrow\quad w \propto \Sigma^{-1}\mu.$$

Applying the normalisation $w^{\top}\Sigma w = 1$: $\lambda = \sqrt{\mu^{\top}\Sigma^{-1}\mu}$, so $w = \Sigma^{-1}\mu / \sqrt{\mu^{\top}\Sigma^{-1}\mu}$. With a budget constraint $\mathbf{1}^{\top}w = 1$:

$$w^{*} = \frac{\Sigma^{-1}\mu}{\mathbf{1}^{\top}\Sigma^{-1}\mu}.$$

The resulting maximum Sharpe is $\sqrt{\mu^{\top}\Sigma^{-1}\mu}$.

Caveat. In practice, $\mu$ and $\Sigma$ must be estimated; estimation error in $\mu$ dominates, and $\Sigma^{-1}$ amplifies noise. This is why out-of-sample MV portfolios routinely underperform equal-weighted — the "Markowitz optimization enigma". Shrink the estimates (e.g. Ledoit–Wolf for $\hat\Sigma$), use a factor-model covariance, or constrain the problem.
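A numerical check of the closed form, with made-up $\mu$ and $\Sigma$ (not from the source):

```python
import numpy as np

def tangency_weights(mu, Sigma):
    """w* = Sigma^{-1} mu / (1' Sigma^{-1} mu)."""
    w = np.linalg.solve(Sigma, mu)
    return w / w.sum()

mu = np.array([0.05, 0.07, 0.03])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.02]])

w = tangency_weights(mu, Sigma)
sr_star = np.sqrt(mu @ np.linalg.solve(Sigma, mu))   # max Sharpe, closed form
sr_w = (w @ mu) / np.sqrt(w @ Sigma @ w)             # Sharpe of w*

ew = np.ones(3) / 3                                  # equal-weight comparison
sr_ew = (ew @ mu) / np.sqrt(ew @ Sigma @ ew)
```

By scale invariance, renormalising to $\mathbf{1}^{\top}w = 1$ leaves the Sharpe unchanged, and any other portfolio (here equal-weight) attains a lower Sharpe.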

Exercise 17 · Quant — Leakage ◆◆

Spot the leakage.

A researcher fits a daily return predictor by (a) standardising each feature using its full-sample mean and variance; (b) shuffling the rows; (c) running 5-fold CV. Performance is stellar in CV and awful live. Identify at least three leakages and propose a corrected pipeline.

Solution.

Leakages.

  1. Full-sample standardisation. The mean and variance use future data. Correct: fit scaler on training fold only, transform test fold with those statistics.
  2. Shuffling. Breaks temporal ordering so that training folds contain days after the test fold. Correct: use forward-chaining (expanding or rolling window) CV.
  3. Label overlap. If the target is a forward $h$-day return, consecutive samples share label windows — a training sample on day $t$ leaks information about a test sample on day $t'$ whenever $|t - t'| < h$. Correct: purge training rows whose label window overlaps the test set, and embargo a gap of at least $h$ days after the test set before resuming training.

Additional checks. Ensure feature engineering (winsorization, rank transforms, target encoding) is performed within fold. Ensure universe selection at time uses only constituents that were actually in the index at time (survivorship). Finally, evaluate sensitivity to transaction costs and report performance per sub-period.
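The corrected split logic can be sketched as follows (function name and default parameters are mine, not from the source):

```python
def purged_kfold_splits(n, n_folds=5, label_horizon=5, embargo=5):
    """Purged k-fold with embargo for overlapping labels — a minimal sketch.

    Sample i's label covers days [i, i + label_horizon); any training row
    whose label window reaches into the test block is purged, and `embargo`
    extra days after the test block are skipped before training resumes.
    """
    fold = n // n_folds
    for k in range(n_folds):
        test_start = k * fold
        test_end = (k + 1) * fold if k < n_folds - 1 else n
        test = list(range(test_start, test_end))
        train = [i for i in range(n)
                 if i + label_horizon <= test_start   # label closes before test
                 or i >= test_end + embargo]          # past the test block + embargo
        yield train, test

splits = list(purged_kfold_splits(100, n_folds=5, label_horizon=5, embargo=5))
```

For a strictly forward-chaining variant, additionally drop the post-test part of each training set; scaler fitting and feature engineering then happen inside each training fold only.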

Exercise 18 · Quant — Kalman ◆◆◆

One-dimensional Kalman filter.

Scalar random walk $x_t = x_{t-1} + w_t$, $w_t \sim \mathcal{N}(0, q)$; observations $y_t = x_t + v_t$, $v_t \sim \mathcal{N}(0, r)$. Write the predict and update steps and show the filter reduces to an exponentially-weighted moving average in steady state.

Solution.

Predict. $\hat x_{t \mid t-1} = \hat x_{t-1 \mid t-1}$, $\quad P_{t \mid t-1} = P_{t-1 \mid t-1} + q$.

Update. Kalman gain $K_t = \dfrac{P_{t \mid t-1}}{P_{t \mid t-1} + r}$. Posterior mean $\hat x_{t \mid t} = \hat x_{t \mid t-1} + K_t\,(y_t - \hat x_{t \mid t-1})$ and variance $P_{t \mid t} = (1 - K_t)\, P_{t \mid t-1}$.

Steady state. Setting the predicted variance to a constant $P$ and imposing the fixed point $P = \dfrac{P r}{P + r} + q$ gives the quadratic $P^2 - qP - qr = 0$, whose positive root is

$$P = \frac{q + \sqrt{q^2 + 4qr}}{2}.$$

Then $\hat x_t = (1 - K)\,\hat x_{t-1} + K\, y_t$ — precisely an EWMA with smoothing parameter $K = P/(P + r)$, governed by the signal-to-noise ratio $q/r$. This is why, in practice, if you cannot calibrate a full state-space model, a well-chosen EWMA already captures most of the benefit.

Exercise 19 · Statistics — Deflated Sharpe ◆◆◆

Multiple testing and the deflated Sharpe.

You tried $N = 1000$ strategies and selected the one with the highest in-sample Sharpe, estimated over $T$ days. Under the null of zero true Sharpe, estimate the expected maximum Sharpe to gauge whether the selected strategy's Sharpe is real.

Solution.

Under the null, an estimated Sharpe over $T$ i.i.d. Gaussian returns is approximately $\mathcal{N}(0, 1/T)$ — standard deviation $1/\sqrt{T}$.

For $N$ independent null Sharpes, the expected maximum is approximately (using the Gaussian extreme-value result $\mathbb{E}[\max_i Z_i] \approx \sqrt{2 \ln N}$ for $Z_i \sim \mathcal{N}(0,1)$):

$$\mathbb{E}\!\left[\max_i \widehat{\mathrm{SR}}_i\right] \approx \sqrt{\frac{2 \ln N}{T}}.$$

For $N = 1000$, $\sqrt{2 \ln N} \approx 3.7$, so under the null we would already expect a sizeable annualised max Sharpe from pure noise — the selected strategy may be no better than what pure noise would produce over 1000 tries. This is the intuition behind the deflated Sharpe ratio (Bailey & López de Prado): subtract the expected maximum-under-null before claiming significance. Always report how many models / hyperparameter combinations were tried.
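The extreme-value estimate is easy to verify by simulation; a sketch (simulation sizes are illustrative):

```python
import numpy as np

def expected_max_sharpe(n_strategies, n_days, n_sims=200, seed=0):
    """Monte-Carlo E[max annualised Sharpe] across null strategies
    (zero-mean i.i.d. Gaussian daily returns)."""
    rng = np.random.default_rng(seed)
    maxes = np.empty(n_sims)
    for s in range(n_sims):
        r = rng.standard_normal((n_strategies, n_days))
        sr_daily = r.mean(axis=1) / r.std(axis=1, ddof=1)
        maxes[s] = sr_daily.max()
    return np.sqrt(252) * maxes.mean()

N, T = 1000, 252
# Crude extreme-value approximation, annualised: sqrt(252) * sqrt(2 ln N / T)
approx = np.sqrt(252) * np.sqrt(2 * np.log(N) / T)
mc = expected_max_sharpe(N, T)
```

The $\sqrt{2 \ln N}$ rule slightly overstates the finite-$N$ mean, so the simulated value lands a bit below the approximation — both, however, are far above zero, which is the point.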

Exercise 20 · Deep Learning — Transformers ◆◆◆

Parameter and FLOP counts for a Transformer layer.

Consider a single decoder-only Transformer layer with model dimension $d$, $h$ attention heads (each of size $d/h$), and feed-forward width $4d$. Count parameters and the per-token FLOPs at context length $n$.

Solution.

Parameters.

  • Projections $W_Q, W_K, W_V, W_O$, each $d \times d$: $4d^2$.
  • MLP: $d \times 4d$ and $4d \times d$: $8d^2$.
  • LayerNorms + biases: $O(d)$, negligible.

Total $\approx 12d^2$ per layer; for $L$ layers, $\approx 12Ld^2$. A 12-layer model thus has $\approx 144d^2$ parameters, plus embeddings.

Per-token FLOPs. Dominant costs, measuring a multiply-add as 2 FLOPs:

  • Q/K/V/O projections: $4 \cdot 2d^2 = 8d^2$ FLOPs per token.
  • Attention scores $QK^{\top}$: $2nd$ per token (each of $n$ keys, a $d$-dim dot product summed across heads).
  • Weighted sum of values: $2nd$ per token.
  • MLP: $2 \cdot (2 \cdot 4d^2) = 16d^2$ per token.

Total $\approx 24d^2 + 4nd$ per token. Attention becomes the bottleneck once $n \gtrsim 6d$ — the rationale for sparse / linear / flash-attention variants and for keeping context tractable. The frequently-quoted "$\approx 6N$ training FLOPs per token" (with $N$ the parameter count) is this same arithmetic, generalised: the forward pass costs $\approx 2N$ and the backward pass roughly twice that.
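These counts fit in a few lines; a minimal sketch (function name and the example dimensions are illustrative, not from the source):

```python
def transformer_layer_counts(d, n_ctx, d_ff=None):
    """Per-layer parameters and per-token forward FLOPs (multiply-add = 2 FLOPs).

    Assumes d_ff = 4d unless given; the head count cancels out of both
    totals. Biases and LayerNorms (O(d)) are ignored.
    """
    d_ff = d_ff if d_ff is not None else 4 * d
    params_attn = 4 * d * d                  # W_Q, W_K, W_V, W_O
    params_mlp = 2 * d * d_ff                # up- and down-projection
    params = params_attn + params_mlp        # 12 d^2 when d_ff = 4d
    flops = (2 * params_attn                 # projections: 8 d^2
             + 4 * n_ctx * d                 # scores + weighted sum: 4 n d
             + 2 * params_mlp)               # MLP: 16 d^2
    return params, flops

params, flops = transformer_layer_counts(d=768, n_ctx=1024)
```

Note that the weight-multiply FLOPs are exactly $2 \times$ the parameter count per token, which is where the forward-pass $\approx 2N$ rule comes from; only the $4nd$ attention term sits outside it.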


That is the end of the cheatsheet. Bring these ideas to your next problem; rebuild them from scratch when you doubt them.