A Cheatsheet · Volume I № 2026.04

Machine Learning & Deep Learning,
in Brief.

A compact reference for the fundamentals required of machine-learning and quantitative researchers — mathematical foundations, classical methods, neural architectures, and a set of worked exercises with full solutions.

For: ML & Quant Researchers · Sections: VII · Exercises: 20 (with solutions) · Notation: boldface vectors, capital matrices
I
Part One

Mathematical Foundations

Every method below reduces, ultimately, to four pillars: linear algebra (how we represent data and transformations), probability (how we quantify uncertainty), optimization (how we learn from data), and information theory (how we measure signal). A researcher who has internalised these pillars can read any new architecture or paper and reconstruct its logic from first principles.

Linear Algebra

Inner product
$\langle x, y \rangle = x^\top y = \sum_i x_i y_i$. Orthogonality: $\langle x, y \rangle = 0$.
Norms
$\|x\|_1 = \sum_i |x_i|$, $\|x\|_2 = \big(\sum_i x_i^2\big)^{1/2}$, $\|x\|_\infty = \max_i |x_i|$. All $\|\cdot\|_p$ for $p \ge 1$ are norms.
Rank
$\operatorname{rank}(A)$ is the dimension of the column space; $A \in \mathbb{R}^{m \times n}$ has $\operatorname{rank}(A) \le \min(m, n)$.
Positive definite
$x^\top A x > 0$ for all $x \ne 0$; equivalently all eigenvalues are positive.

Eigendecomposition

For symmetric $A \in \mathbb{R}^{n \times n}$ there exist an orthogonal $Q$ and a diagonal $\Lambda$ such that $A = Q \Lambda Q^\top$, with eigenvectors as the columns of $Q$ and eigenvalues on the diagonal of $\Lambda$. Symmetric matrices have real eigenvalues and orthogonal eigenvectors.

Singular Value Decomposition (SVD)

For any $A \in \mathbb{R}^{m \times n}$, $A = U \Sigma V^\top$ with $U$, $V$ orthogonal and $\Sigma$ diagonal with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. Truncating to the top $k$ singular values gives the best rank-$k$ approximation in Frobenius and spectral norms (Eckart–Young): $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, with $\|A - A_k\|_F^2 = \sum_{i > k} \sigma_i^2$.

Why it matters. SVD is the computational core of PCA, matrix completion, low-rank regression, least-squares via the pseudo-inverse $A^+ = V \Sigma^+ U^\top$, and numerically stable solutions to ill-conditioned problems.
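As a numerical check of the Eckart–Young statement, a minimal numpy sketch (the matrix and seed are illustrative):

```python
import numpy as np

# Truncated SVD as the best rank-k approximation: the squared Frobenius
# error equals the sum of the discarded squared singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation

err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(s[k:] ** 2))        # discarded singular values
```

The spectral-norm version of the statement can be checked the same way with `ord=2`.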

Matrix calculus — identities to memorise

Expression | Gradient
$a^\top x$ | $a$
$x^\top A x$ | $(A + A^\top)x$; $2Ax$ if $A$ symmetric
$\|Xw - y\|_2^2$ | $2X^\top(Xw - y)$ (w.r.t. $w$)
$\log \det A$ | $A^{-\top}$ (w.r.t. $A$)

Probability & Statistics

Bayes' rule is the only rational procedure for updating beliefs given evidence; all probabilistic ML is a specialization of it.

Bayes' rule. $p(\theta \mid x) = \dfrac{p(x \mid \theta)\, p(\theta)}{p(x)}$. The denominator is the marginal likelihood, $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$.

Maximum likelihood (MLE). $\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \sum_i \log p(x_i \mid \theta)$. Asymptotically normal with variance given by the inverse Fisher information $I(\theta)^{-1}$.

Maximum a posteriori (MAP). $\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \big[\sum_i \log p(x_i \mid \theta) + \log p(\theta)\big]$. Gaussian prior ≡ $\ell_2$ regularization; Laplace prior ≡ $\ell_1$.

Useful distributions & conjugacies

Likelihood | Conjugate prior | Posterior form
Bernoulli / Binomial | Beta | Beta
Gaussian (known $\sigma^2$) | Gaussian | Gaussian with precision-weighted mean
Gaussian (unknown $\sigma^2$) | Normal–Inverse-Gamma | Normal–Inverse-Gamma
Multinomial | Dirichlet | Dirichlet with added counts
Poisson | Gamma | Gamma with added counts / rate
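The first row of the table can be verified by hand; a sketch with illustrative pseudo-counts:

```python
# Beta-Bernoulli conjugacy: prior Beta(a, b), observe k successes in n
# Bernoulli trials, posterior is Beta(a + k, b + n - k).
a, b = 2.0, 2.0        # illustrative prior pseudo-counts
n, k = 10, 7           # illustrative data: 7 successes in 10 trials

a_post, b_post = a + k, b + (n - k)
post_mean = a_post / (a_post + b_post)   # posterior mean of the success rate
```

The prior acts as extra pseudo-observations: the posterior mean shrinks the raw frequency $7/10$ towards the prior mean $1/2$.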

Expectation, variance, covariance. $\operatorname{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$. For independent $X, Y$: $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$. The covariance matrix $\mathbb{E}\big[(X - \mu)(X - \mu)^\top\big]$ is symmetric PSD.

Law of Large Numbers & CLT. For i.i.d. $X_i$ with mean $\mu$, variance $\sigma^2$: $\bar X_n \to \mu$ a.s. and $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.

Delta method. If $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$, then $\sqrt{n}\big(g(\hat\theta_n) - g(\theta)\big) \xrightarrow{d} \mathcal{N}\big(0,\, g'(\theta)^2 \sigma^2\big)$.

Optimization

Convexity. $f$ is convex iff $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $\lambda \in [0, 1]$. A twice-differentiable $f$ is convex iff $\nabla^2 f \succeq 0$. For convex problems, every local min is global.

Gradient descent. $x_{t+1} = x_t - \eta \nabla f(x_t)$. For $L$-smooth convex $f$ (i.e. $\nabla f$ is $L$-Lipschitz), choosing $\eta = 1/L$ guarantees $f(x_t) - f^\star = O(1/t)$. For $\mu$-strongly convex $f$, linear convergence $O\big((1 - \mu/L)^t\big)$.

Stochastic gradient descent. Replaces $\nabla f$ with an unbiased mini-batch estimator $\hat g_t$, $\mathbb{E}[\hat g_t] = \nabla f(x_t)$. Requires decreasing step sizes $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$ (Robbins–Monro) for convergence of the iterates under convexity and bounded variance.
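A minimal sketch of gradient descent with the $1/L$ step on a convex quadratic ($A$ and $b$ are illustrative):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T A x - b^T x, a convex quadratic.
# The smoothness constant L is the largest eigenvalue of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(A).max()

x = np.zeros(2)
for _ in range(500):
    grad = A @ x - b          # gradient of the quadratic
    x = x - grad / L          # fixed step 1/L

x_star = np.linalg.solve(A, b)  # exact minimiser A x = b
```

Because this quadratic is also strongly convex, the iterates converge linearly and `x` matches `x_star` to high precision.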

Constrained optimization — KKT. For $\min_x f(x)$ s.t. $g_i(x) \le 0$, $h_j(x) = 0$, the Lagrangian is $\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$. At an optimum (under constraint qualifications): stationarity $\nabla_x \mathcal{L} = 0$, primal/dual feasibility ($g_i \le 0$, $h_j = 0$, $\lambda_i \ge 0$), and complementary slackness $\lambda_i\, g_i(x) = 0$.

Researcher's rule. If you can recognise a problem as convex, choose the method that exploits its structure (closed-form, second-order, dual). Non-convex methods should be your last resort, not the default.

Information Theory

Entropy
$H(p) = -\sum_x p(x) \log p(x)$. Concave; maximised by the uniform distribution.
Cross-entropy
$H(p, q) = -\sum_x p(x) \log q(x)$.
KL divergence
$\mathrm{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0$, non-symmetric.
Mutual information
$I(X; Y) = \mathrm{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big) = H(X) - H(X \mid Y)$.

Why cross-entropy is the classification loss. Minimising $H(p_{\text{data}}, q_\theta)$ is equivalent to minimising $\mathrm{KL}(p_{\text{data}} \,\|\, q_\theta)$, since $H(p_{\text{data}})$ is constant in $\theta$. This is maximum likelihood.

Jensen's inequality. For convex $f$: $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. Underlies the ELBO, importance sampling bounds, and most variational arguments.

II
Part Two

Core Machine-Learning Principles

We describe the invariants that apply to every model — the decomposition of error, the shape of regularization, and the vocabulary of evaluation.

The Bias–Variance Decomposition

For a point $x$ with $y = f(x) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$, $\operatorname{Var}(\varepsilon) = \sigma^2$, and a learning algorithm producing $\hat f$ on random training sets:

$\mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$

High bias (underfitting)

  • Training and validation error both high
  • Model class too restrictive or under-trained
  • Fix: richer class, more features, less regularization

High variance (overfitting)

  • Low training error, high validation error
  • Model memorises noise
  • Fix: more data, regularization, simpler model, ensembling
Modern caveat. For very over-parameterized models (large NNs), the classical U-curve can give way to double descent: test error rises at the interpolation threshold, then falls again as capacity grows further. The decomposition still applies — implicit regularization of the optimizer controls variance.

Regularization

Ridge ($\ell_2$). Minimise $\|y - Xw\|_2^2 + \lambda \|w\|_2^2$. Closed form: $\hat w = (X^\top X + \lambda I)^{-1} X^\top y$. Always well-conditioned; shrinks coefficients uniformly.

Lasso ($\ell_1$). Minimise $\|y - Xw\|_2^2 + \lambda \|w\|_1$. No closed form; solved via coordinate descent or proximal gradient. Induces sparsity — exact zeros — because the $\ell_1$ ball has corners along the coordinate axes.

Elastic net. Minimise $\|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$. Retains the sparsity of Lasso while handling correlated features (groups them rather than arbitrarily picking one).

Equivalences. $\ell_2$ penalty ≡ Gaussian prior on weights; $\ell_1$ ≡ Laplace prior; data augmentation ≡ marginalising over an implicit prior.
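The ridge closed form is easy to sanity-check against its first-order condition (a minimal numpy sketch on illustrative synthetic data):

```python
import numpy as np

# Ridge closed form: w = (X^T X + lambda I)^{-1} X^T y.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
lam = 0.5

w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# First-order condition of ||y - Xw||^2 + lam ||w||^2 (up to a factor 2):
grad = X.T @ (X @ w - y) + lam * w
```

The gradient at the closed-form solution is zero to machine precision; with $\lambda > 0$ the system matrix is strictly positive definite, so `solve` never hits a singular matrix.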

Evaluation & Model Selection

Cross-validation

  • k-Fold — default for i.i.d. tabular data; pick $k = 5$ or $10$.
  • Stratified k-Fold — preserves class balance in classification.
  • Leave-one-out (LOO) — low bias, high variance; expensive.
  • Nested CV — outer loop estimates generalisation, inner loop tunes hyperparameters; avoids optimistic bias from reusing the same data for both.
  • Time-series CV — use forward chaining (expanding/rolling window). Never shuffle.
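Forward chaining from the last bullet can be sketched in a few lines (`forward_chain_splits` is a hypothetical helper, not a library API):

```python
# Expanding-window time-series CV: training always precedes testing,
# and nothing is shuffled.
def forward_chain_splits(n, n_folds):
    fold = n // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_idx = list(range(0, i * fold))               # expanding window
        test_idx = list(range(i * fold, (i + 1) * fold))   # next block
        yield train_idx, test_idx

splits = list(forward_chain_splits(12, 3))
# Every training set ends strictly before its test set begins.
assert all(max(tr) < min(te) for tr, te in splits)
```

A rolling window is the same idea with the start of `train_idx` advancing as well.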

Classification metrics

Metric | Definition | When to use
Accuracy | $(TP + TN)/(TP + TN + FP + FN)$ | Balanced classes only
Precision | $TP/(TP + FP)$ | Cost of false positives is high
Recall | $TP/(TP + FN)$ | Cost of false negatives is high
F1 | $2 \cdot \mathrm{Prec} \cdot \mathrm{Rec} / (\mathrm{Prec} + \mathrm{Rec})$ | Imbalanced, single summary
ROC-AUC | Area under the ROC curve | Threshold-free ranking quality
PR-AUC | Area under Precision–Recall | Very imbalanced (rare positives)
LogLoss | $-\frac{1}{n}\sum_i \big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$ | Probabilistic, calibration-sensitive

Regression metrics

MSE (penalises outliers quadratically), MAE (robust), RMSE (same units as $y$), $R^2$, MAPE (scale-free but undefined at $y = 0$), Pearson / Spearman / Kendall correlation — indispensable in quant research where the rank of predictions matters more than absolute values.

Common pitfall. Accuracy is nearly meaningless at class imbalance of, say, 99:1. A constant predictor reaches 99%. Always pair it with precision/recall, ROC-AUC, or PR-AUC depending on the problem.

Feature engineering checklist

  • Scaling: Standardise or min-max for distance-based models (kNN, SVM, k-means) and any gradient-based optimization (faster, more stable convergence). Trees are scale-invariant.
  • Encoding: One-hot for low-cardinality categoricals; target/mean encoding with out-of-fold statistics for high-cardinality (prevents target leakage).
  • Missing data: Explicit missing indicator + imputation beats silent imputation, since "missing" is often informative.
  • Leakage audit: Any feature that would not be available at prediction time is leakage. The most expensive bugs in ML are silent leakages.
III
Part Three

Classical Algorithms

Before neural networks, these are the tools that set the baselines. In quantitative research, linear methods and gradient-boosted trees still dominate tabular tasks — they are compact, interpretable, and hard to beat with modest data.

Linear Models

Ordinary Least Squares

$\hat\beta = (X^\top X)^{-1} X^\top y$. Assumes linearity, no multicollinearity, homoscedastic and uncorrelated errors. The Gauss–Markov theorem: OLS is BLUE (Best Linear Unbiased Estimator) under these assumptions.

Logistic Regression

Models $p(y = 1 \mid x) = \sigma(w^\top x)$ with $\sigma(z) = 1/(1 + e^{-z})$. Negative log-likelihood:

$\mathcal{L}(w) = -\sum_i \big[y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\big(1 - \sigma(w^\top x_i)\big)\big]$

Gradient: $\nabla \mathcal{L} = \sum_i \big(\sigma(w^\top x_i) - y_i\big)\, x_i$. The Hessian $X^\top S X$ with $S_{ii} = \sigma_i(1 - \sigma_i)$ is PSD, so the loss is convex; optimizable by IRLS / L-BFGS / SGD.

Multinomial / softmax. $p(y = k \mid x) = \dfrac{\exp(w_k^\top x)}{\sum_j \exp(w_j^\top x)}$. One coefficient vector is redundant (shift invariance); often fix $w_K = 0$.

Generalised linear models (GLMs)

Exponential-family likelihood with link function $g$: $g(\mathbb{E}[y \mid x]) = w^\top x$. Linear regression (identity link, Gaussian), logistic (logit link, Bernoulli), Poisson regression (log link, Poisson). Inference via IRLS.
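A minimal logistic-regression fit by gradient descent, using the gradient form above (synthetic, illustrative data and learning rate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Generate labels from a known weight vector, then recover it.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
w_true = np.array([1.5, -2.0])
y = (sigmoid(X @ w_true) > rng.uniform(size=200)).astype(float)

w = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)   # gradient of the mean NLL
    w -= 0.5 * grad
```

The fitted `w` recovers the sign pattern (and roughly the direction) of `w_true`; exact magnitudes are noisy at this sample size.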

Trees & Ensembles

Decision trees recursively split on the feature and threshold that maximise impurity reduction. Classification uses Gini $\sum_k p_k(1 - p_k)$ or entropy $-\sum_k p_k \log p_k$; regression uses variance reduction. Single trees overfit; depth and leaf-count are the main regularisers.

Trees are scale-invariant, handle mixed types and missing values natively, and are the only mainstream family that can represent non-smooth, interaction-heavy functions without feature engineering.

Random Forests. Bagging: train trees on bootstrap samples, and at each split consider only a random subset of features. Variance-reduction ensemble: averaging $B$ trees of variance $\sigma^2$ with pairwise correlation $\rho$ gives variance $\rho\sigma^2 + \frac{1 - \rho}{B}\sigma^2$. Feature subsampling lowers $\rho$.

Gradient Boosting. Fit each new tree to the negative gradient of the loss w.r.t. the current prediction $F_{m-1}(x)$: $F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$, with $h_m \approx -\partial \ell / \partial F_{m-1}$ and learning rate $\nu$. With squared loss, the negative gradient is the residual; with logistic loss, it is $y - \sigma(F_{m-1}(x))$. Implementations (XGBoost, LightGBM, CatBoost) add second-order terms, regularized leaves, histogram splits, and native categorical handling.
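The residual-fitting loop can be sketched with hand-rolled stumps (a toy illustration of the idea, not how XGBoost is implemented):

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree on 1-D input: best threshold by SSE."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((r - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z, t=t, lv=lv, rv=rv: np.where(z <= t, lv, rv)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, 100))
y = np.sin(x)

nu, F = 0.3, np.zeros_like(y)   # learning rate and running prediction
stumps = []
for _ in range(100):
    stump = fit_stump(x, y - F)  # squared loss: negative gradient = residual
    F = F + nu * stump(x)
    stumps.append(stump)

mse = ((y - F) ** 2).mean()
```

After 100 rounds the in-sample MSE is a small fraction of the target variance; real libraries add regularization precisely because this loop will otherwise chase noise.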

 | Bagging / RF | Boosting / GBM
Primary effect | Reduces variance | Reduces bias
Base learner | Deep trees | Shallow trees (stumps to depth 6)
Parallelism | Embarrassingly parallel | Sequential in trees
Overfitting risk | Low with enough trees | High; controlled by learning rate $\nu$, early stopping, depth

Kernels, k-NN, Naive Bayes

Support Vector Machines

Maximise the margin $2/\|w\|$ subject to $y_i(w^\top x_i + b) \ge 1$ for all $i$. Primal (soft margin):

$\min_{w, b, \xi}\; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.}\;\; y_i(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$

The dual depends only on inner products $x_i^\top x_j$, enabling the kernel trick: replace $x_i^\top x_j$ with $K(x_i, x_j)$ corresponding to an inner product in some RKHS.

Common kernels: linear $x^\top z$; polynomial $(x^\top z + c)^d$; RBF $\exp(-\|x - z\|^2 / 2\sigma^2)$; Matérn (GP-friendly).

k-Nearest Neighbours

No training; classify by majority vote among the $k$ nearest points. Curse of dimensionality: in high dimension, distances concentrate and all points become roughly equidistant. Useful for small, low-dimensional problems and as a local baseline.

Naive Bayes

Assumes features are conditionally independent given the class: $p(x \mid y) = \prod_j p(x_j \mid y)$. Decision: $\hat y = \arg\max_y\, p(y) \prod_j p(x_j \mid y)$. Despite the crude independence assumption, it is an extraordinarily strong text-classification baseline.

Unsupervised Learning

k-Means

Minimises $\sum_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$ by alternating (i) assign each point to the nearest centroid, (ii) update centroids to cluster means. Non-convex; use k-means++ seeding. Assumes spherical, equal-variance clusters.

Gaussian Mixture Models (EM)

$p(x) = \sum_k \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Fit via EM: E-step compute responsibilities $r_{ik} \propto \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$; M-step update $(\pi_k, \mu_k, \Sigma_k)$ by weighted MLE. EM monotonically increases the log-likelihood.

Principal Component Analysis

Find orthonormal directions of maximum variance. Center $X$ and compute the top-$k$ eigenvectors of the covariance $\frac{1}{n} X^\top X$ — equivalently, the top-$k$ right singular vectors of $X$. The first PC maximises $v^\top X^\top X v$ subject to $\|v\| = 1$, giving the leading eigenvector.
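The PCA–SVD equivalence in a few lines of numpy (illustrative random data):

```python
import numpy as np

# First PC of centred X = top right-singular vector of X
# = leading eigenvector of the covariance matrix.
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)                 # center

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                             # top right-singular vector

cov = Xc.T @ Xc / (len(Xc) - 1)
evals, evecs = np.linalg.eigh(cov)      # ascending eigenvalues
v = evecs[:, -1]                        # leading eigenvector
```

The two directions agree up to sign, so the absolute inner product is 1.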

t-SNE / UMAP. Non-linear embeddings for visualisation. Preserve local neighbourhoods but not global distances; never interpret cluster sizes or between-cluster distances literally.

In quant research. PCA on a return covariance matrix yields statistical factors; the leading component is typically the market, the next few correspond to sector/style factors. PCA-based residualisation is the simplest device for constructing market-neutral signals.
IV
Part Four

Deep Learning: Foundations

A neural network is a parameterised function built by composing affine maps and non-linearities. Everything below — activations, initializations, normalizations, optimizers — exists to make the composition of very many such layers trainable at scale.

Multi-layer perceptron

$h^{(l)} = \phi\big(W^{(l)} h^{(l-1)} + b^{(l)}\big)$, $h^{(0)} = x$. Universal approximation: a single hidden layer with enough units can approximate any continuous function on a compact set. In practice, depth buys far more expressivity per parameter than width.

Activation functions

Name | Formula | Notes
Sigmoid | $\sigma(z) = 1/(1 + e^{-z})$ | Saturates; vanishing gradients. Output layer for binary.
Tanh | $\tanh(z)$ | Zero-centered; still saturates.
ReLU | $\max(0, z)$ | Sparse, non-saturating for $z > 0$. Dying-ReLU risk.
Leaky ReLU / PReLU | $\max(\alpha z, z)$ | Small slope for $z < 0$ keeps the gradient alive.
GELU | $z\,\Phi(z)$ | Smooth ReLU. Standard in Transformers.
Swish / SiLU | $z\,\sigma(z)$ | Smooth, self-gated.
Softmax | $e^{z_k}/\sum_j e^{z_j}$ | Multi-class output only.

Backpropagation

Backprop is reverse-mode automatic differentiation applied to a DAG of differentiable primitives. For each node computing $z = f(u)$, given the upstream gradient $\partial \mathcal{L}/\partial z$, compute $\partial \mathcal{L}/\partial u = \big(\partial z/\partial u\big)^\top \partial \mathcal{L}/\partial z$ by the chain rule and accumulate. For the MLP above, with pre-activations $a^{(l)}$ and $\delta^{(l)} = \partial \mathcal{L}/\partial a^{(l)}$: $\delta^{(l-1)} = \big(W^{(l)}\big)^\top \delta^{(l)} \odot \phi'\big(a^{(l-1)}\big)$, $\partial \mathcal{L}/\partial W^{(l)} = \delta^{(l)} \big(h^{(l-1)}\big)^\top$.

Vanishing & exploding gradients. Products of Jacobian norms either collapse to zero or diverge as depth grows. Mitigations: careful initialization, normalization layers, residual connections, gradient clipping.

Initialization

Xavier / Glorot
$\operatorname{Var}(W_{ij}) = 2/(n_{\text{in}} + n_{\text{out}})$. Suited to tanh / sigmoid.
He / Kaiming
$\operatorname{Var}(W_{ij}) = 2/n_{\text{in}}$. Suited to ReLU.
Orthogonal
$W$ a random orthogonal matrix; preserves activation norms through depth. Good for RNNs.

Optimizers

All practical optimizers for deep learning fit the pattern: maintain a state of first- and second-moment estimates of the gradient, apply scaled updates, optionally decouple weight decay.

Optimizer | Update (schematic) | Notes
SGD | $\theta \leftarrow \theta - \eta\, g$ | Often best for convolutional vision, paired with momentum & schedule.
SGD + Momentum | $v \leftarrow \beta v + g$; $\theta \leftarrow \theta - \eta\, v$ | Smooths the gradient direction; $\beta \approx 0.9$.
Nesterov | Evaluates the gradient at the look-ahead point | Theoretical improvement; modest in practice.
RMSProp | $\theta \leftarrow \theta - \eta\, g / (\sqrt{v} + \epsilon)$, $v$ an EMA of $g^2$ | Per-parameter scaling by gradient magnitude.
Adam | Bias-corrected EMAs of $g$ and $g^2$ | Momentum + RMSProp + bias correction.
AdamW | As Adam, but weight decay applied directly to $\theta$ | Decoupled weight decay; current default for Transformers.

Learning-rate schedules

  • Step / exponential decay — classical, robust.
  • Cosine annealing — smooth, widely used.
  • Warmup — linearly ramp for the first few thousand steps; essential for Transformers, where early updates are otherwise catastrophic.
  • One-cycle — warm up, then cosine down; enables aggressive max LRs.
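A warmup-plus-cosine schedule is a few lines; the hyperparameters below are illustrative:

```python
import math

def lr_at(step, max_lr=1e-3, warmup=1000, total=100_000):
    """Linear warmup to max_lr, then cosine decay to ~0."""
    if step < warmup:
        return max_lr * step / warmup                      # linear warmup
    t = (step - warmup) / (total - warmup)                 # progress in [0, 1]
    return 0.5 * max_lr * (1 + math.cos(math.pi * t))      # cosine decay
```

The learning rate peaks exactly at the end of warmup and decays smoothly to zero at `total` steps.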

Regularization & Normalization

Weight decay
Adds $\frac{\lambda}{2}\|\theta\|_2^2$ to the loss; equivalent to a Gaussian prior. Implement as decoupled (AdamW) for adaptive optimizers.
Dropout
Randomly zero each activation with probability $p$ at training time; scale by $1/(1 - p)$. Approximates averaging over an exponential ensemble of sub-networks.
Early stopping
Halt when validation loss stops improving. Implicit regularization equivalent to an $\ell_2$ constraint in some linear settings.
Data augmentation
Learn invariances (flips, crops, MixUp, CutMix, noise). Essentially the most effective regularizer in computer vision.
Label smoothing
Replace the one-hot target with $1 - \epsilon$ on the true class and $\epsilon/(K - 1)$ elsewhere. Prevents overconfident logits; improves calibration.

Normalization

Layer | Normalizes over | Typical use
BatchNorm | Batch dim, per channel | CNNs; requires large, stable batches.
LayerNorm | Feature dim, per example | Transformers, RNNs; batch-size agnostic.
GroupNorm | Groups of channels, per example | Small batches, detection, segmentation.
RMSNorm | Feature dim, no mean centring | Modern LLMs; cheaper, performs well.
Classical vs modern view. BatchNorm was originally motivated as reducing "internal covariate shift" — later work argued its real benefit is smoothing the loss landscape. Either way: the effect is a significantly higher tolerated learning rate and faster training.
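A LayerNorm forward pass, matching the table's "feature dim, per example" row (minimal numpy sketch):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each example across its last (feature) dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(8)
x = rng.standard_normal((4, 16)) * 3.0 + 2.0          # arbitrary scale/shift
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
```

Each row of `y` has (approximately) zero mean and unit variance regardless of the other examples in the batch — the batch-size-agnostic property the table refers to.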

Loss functions

MSE
$\frac{1}{n}\sum_i (y_i - \hat y_i)^2$. MLE under Gaussian noise.
MAE / Huber
Robust to outliers; Huber is quadratic near zero, linear in the tail.
Binary cross-entropy
$-\frac{1}{n}\sum_i \big[y_i \log \hat p_i + (1 - y_i)\log(1 - \hat p_i)\big]$. MLE for Bernoulli.
Categorical CE
$-\sum_k y_k \log \hat p_k$. MLE for Categorical / softmax.
Contrastive / InfoNCE
$-\log \dfrac{\exp(\operatorname{sim}(z, z^+)/\tau)}{\sum_j \exp(\operatorname{sim}(z, z_j)/\tau)}$. Lower bound on mutual information; core of self-supervised learning.
V
Part Five

Deep Learning: Architectures

Convolutional Networks (CNN)

A convolution layer applies a small shared kernel across spatial positions: $(x * w)[i, j] = \sum_{u, v} x[i + u,\, j + v]\; w[u, v]$. Parameter-sharing + translation equivariance are the inductive biases that make CNNs so efficient on images.

Output spatial size after a conv with kernel $k$, padding $p$, stride $s$ on input size $n$: $\lfloor (n + 2p - k)/s \rfloor + 1$. Receptive field grows linearly with depth for fixed $k$; dilations expand it exponentially.

Pooling (max, average) downsamples and builds local invariance. Modern architectures often replace pooling with strided convolutions.

Key designs

  • ResNet. $h_{l+1} = h_l + F(h_l)$. Residual connections let gradients flow through identity paths — the reason networks can go to hundreds of layers without collapsing.
  • Inception / MobileNet. Factorised / depthwise-separable convolutions: $k \times k$ → ($k \times k$ depthwise) + ($1 \times 1$ pointwise). Comparable expressivity at a fraction of the FLOPs.
  • U-Net. Encoder–decoder with skip connections at each resolution. Dominant for segmentation; the backbone of most diffusion models.

Recurrent Networks (RNN / LSTM / GRU)

Vanilla RNN: $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$. Backprop through time unrolls the recurrence; gradients flow through products of Jacobians, producing vanishing/exploding gradients on long sequences.

LSTM

Gated cell with explicit memory $c_t$:

$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$
$\tilde c_t = \tanh(W_c[h_{t-1}, x_t] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t, \quad h_t = o_t \odot \tanh(c_t)$

The additive cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t$ is the key: gradients propagate through $c_t$ without repeated multiplication by Jacobians, taming vanishing gradients. GRU is a streamlined variant with 2 gates and no separate cell state.

Attention & Transformers

Scaled dot-product attention. Given queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{n \times d_k}$, values $V \in \mathbb{R}^{n \times d_v}$:

$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$

The $\sqrt{d_k}$ keeps the dot products at a scale where the softmax is not saturated. Complexity is $O(n^2)$ in sequence length $n$.

Multi-head attention. Run $h$ heads in parallel with separate projections $W_i^Q, W_i^K, W_i^V$, concatenate the outputs, and project with $W^O$. Each head learns a different relational pattern.
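The attention formula in numpy (single head, no masking; shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # stability shift
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # rows sum to 1
    return w @ V, w

rng = np.random.default_rng(5)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = attention(Q, K, V)
```

Causal masking would set `scores[i, j] = -inf` for `j > i` before the softmax; multi-head attention runs this per head on projected inputs.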

Transformer block. Pre-norm variant, the modern default:

x = x + MHA(LN(x))            # self-attention sub-layer
x = x + MLP(LN(x))            # feed-forward sub-layer, typically 4x width

Positional information. Attention is permutation-equivariant; inject order via sinusoidal embeddings, learned absolute positions, or — in modern LLMs — Rotary Position Embeddings (RoPE) applied to the queries and keys.

Causal vs bidirectional. Decoder-only (GPT-style) masks future positions — used for autoregressive generation. Encoder (BERT-style) sees both directions — used for embedding / classification.

Scaling laws. For compute-optimal training, model size and data should scale roughly proportionally (Chinchilla-style). Loss follows power laws in each with an irreducible floor; an extra order of magnitude of either gives diminishing returns unless both grow.

Generative Models

Autoencoders & VAEs

An autoencoder learns the identity through a bottleneck, $\hat x = g(f(x))$. A Variational AE places a latent prior $p(z)$ and an inference network $q_\phi(z \mid x)$, maximising the ELBO:

$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$

Trained by reparameterisation: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.

Generative Adversarial Networks

Minimax game between generator $G$ and discriminator $D$:

$\min_G \max_D \;\; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$

Non-saturating generator loss, Wasserstein GAN with gradient penalty, spectral normalisation — all are stabilisations around this same objective.

Diffusion models

Forward process adds Gaussian noise: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\big)$. A network $\epsilon_\theta(x_t, t)$ is trained to predict the noise; sampling reverses the chain. Simple loss: $\mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$.

Variants: score-matching / SDE formulation, DDIM for deterministic sampling, classifier-free guidance for conditional generation.

VI
Part Six

Topics for Quantitative Research

Quantitative research inherits the full ML toolkit but operates under constraints foreign to most ML papers: low signal-to-noise, non-stationarity, non-i.i.d. samples, and economic costs for every false positive. The techniques below address these constraints directly.

Time Series

Stationarity
A process is (weakly) stationary if its mean and autocovariance are time-invariant. Most classical methods assume stationarity; in finance, price series are non-stationary, but returns are approximately so.
ARMA(p,q)
. AR: last lags; MA: last innovations.
ARIMA(p,d,q)
Difference the series times to achieve stationarity, then fit ARMA.
GARCH(1,1)
. Captures volatility clustering — the single most robust empirical feature of financial returns.
Cointegration
Non-stationary may have a stationary linear combination . Foundation of pairs / statistical-arbitrage strategies.

Autocorrelation & tests

ACF $\rho_k = \operatorname{Corr}(x_t, x_{t-k})$; PACF is the partial autocorrelation at lag $k$ controlling for intermediate lags. Ljung–Box tests jointly whether the first $m$ autocorrelations are zero. ADF / KPSS test (non-)stationarity.

State-space models & the Kalman filter

Linear Gaussian state-space model: $z_t = A z_{t-1} + w_t$, $w_t \sim \mathcal{N}(0, Q)$ (state equation); $y_t = C z_t + v_t$, $v_t \sim \mathcal{N}(0, R)$ (observation equation).

Kalman recursion alternates predict and update. It is the optimal linear estimator and, under Gaussian noise, the optimal estimator full stop. Used in quant for dynamic hedge ratios, time-varying betas, and signal smoothing.
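A scalar Kalman filter sketch for a random-walk state (noise variances are illustrative):

```python
import numpy as np

def kalman_1d(ys, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Filter a random-walk state x_t = x_{t-1} + w observed as y_t = x_t + v."""
    x, p, est = x0, p0, []
    for y in ys:
        p = p + q                 # predict: state variance grows by q
        k = p / (p + r)           # Kalman gain
        x = x + k * (y - x)       # update with the innovation
        p = (1 - k) * p
        est.append(x)
    return np.array(est)

rng = np.random.default_rng(6)
true = np.cumsum(rng.normal(0, 0.1, 200))     # latent random walk
ys = true + rng.normal(0, 1.0, 200)           # noisy observations

est = kalman_1d(ys, q=0.01, r=1.0)
```

With correctly specified `q` and `r`, the filtered estimate tracks the latent state far better than the raw observations do.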

Portfolio Theory & Risk

Mean–variance optimization

Minimum-variance portfolio with target return $\mu_0$:

$\min_w \; w^\top \Sigma w \quad \text{s.t.}\;\; w^\top \mu = \mu_0,\;\; w^\top \mathbf{1} = 1$

Closed-form solution via Lagrange multipliers. Tangency (max-Sharpe) portfolio: $w^\star \propto \Sigma^{-1} \mu$.
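The tangency weights $w \propto \Sigma^{-1}\mu$ in numpy (illustrative $\mu$ and $\Sigma$):

```python
import numpy as np

# Tangency portfolio: solve Sigma w = mu, then normalise to full investment.
mu = np.array([0.05, 0.08, 0.03])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.02]])

raw = np.linalg.solve(Sigma, mu)
w = raw / raw.sum()                           # budget constraint w'1 = 1

sharpe = (w @ mu) / np.sqrt(w @ Sigma @ w)

# Sanity check: equal weights cannot beat the tangency Sharpe.
w_eq = np.ones(3) / 3
sharpe_eq = (w_eq @ mu) / np.sqrt(w_eq @ Sigma @ w_eq)
```

Because the Sharpe ratio is scale-invariant, any other fully-invested portfolio has a Sharpe no higher than `sharpe`.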

Risk & performance metrics

Metric | Definition | Notes
Sharpe ratio | $(\mathbb{E}[r] - r_f)/\sigma(r)$ | Annualize by $\sqrt{252}$ for daily.
Sortino | Uses downside deviation only | Penalises only negative volatility.
Information ratio | Active return / tracking error | Quality of active return vs benchmark.
Max drawdown | $\max_t \big(\max_{s \le t} V_s - V_t\big) / \max_{s \le t} V_s$ | Peak-to-trough loss.
VaR | $\alpha$-quantile of the loss distribution | Quantile; not subadditive.
CVaR / ES | $\mathbb{E}[\,\text{loss} \mid \text{loss} \ge \mathrm{VaR}_\alpha\,]$ | Coherent; average tail loss.

Covariance estimation

The sample covariance $S$ is noisy when the number of observations is not much larger than the number of assets, and inverting it (as in MV optimization) amplifies that noise. Remedies: Ledoit–Wolf shrinkage $\hat\Sigma = (1 - \delta) S + \delta F$ towards a structured target $F$; factor-model covariance $\Sigma = B \Sigma_f B^\top + D$; minimum-variance with no-short-sale constraints acts as implicit regularization.

Backtesting: Hazards & Hygiene

A backtest is a hypothesis about the past. It becomes a prediction about the future only if every source of look-ahead, selection, and multiple-testing bias has been ruled out. Most "strategies" that fail in production failed the backtest too — readers just did not notice.
  • Look-ahead bias. Using data not yet available at the trade decision time. Easily hidden in resampling, normalization, or target encodings computed on the full sample.
  • Survivorship bias. Universe includes only firms that exist today, not the dead ones. Systematically inflates returns.
  • Selection bias. Data-mining a strategy from a large search space without correcting for multiplicity. See deflated Sharpe ratio and the Bailey–López de Prado corrections.
  • Transaction costs & slippage. A signal that is alpha gross of costs can easily be zero or negative net of them.
  • Regime / non-stationarity. A backtest spanning one regime cannot validate a strategy in another. Always report performance per sub-period.
  • Purged / embargoed cross-validation. Remove observations whose label windows overlap training data; embargo a gap after each test fold to prevent leakage of serial correlation.

Time-series cross-validation

Forward-chaining expanding-window CV respects causality. For overlapping labels (e.g. forward $k$-day returns), apply purging — remove training samples whose label window intersects the test window — and an embargo — a buffer after the test set.

ML for Alpha — What Actually Matters

  • Signal-to-noise is the enemy. Daily returns have SNR orders of magnitude lower than images or text. Capacity of the model must match the information in the data — enormous networks overfit noise.
  • Prefer rank-based and robust losses. Squared error is dominated by outliers. Quantile, Huber, and rank-based losses (Spearman surrogates) typically generalise better for return prediction.
  • Feature stability beats peak accuracy. A signal that works moderately across sub-periods is more valuable than one that is excellent in one and useless in another.
  • Combine linear + non-linear. Linear factors remain the bulk of explanatory power; trees / NNs add interactions. Stack or residualise.
  • Bet sizing is orthogonal to signal. Kelly (or a fractional Kelly) links expected edge and variance to position size. A good signal sized poorly is a loss-making strategy.
  • Evaluate economic utility, not just statistical fit. Out-of-sample Sharpe, turnover, capacity, drawdown, and sensitivity to costs — in that order.
VII
Part Seven

Exercises & Solutions

Each exercise is labelled by topic and difficulty (◆ introductory, ◆◆ intermediate, ◆◆◆ advanced). Attempt each problem before reading the worked solution. The point of these is not the answer itself but the chain of reasoning; they are chosen to exercise canonical techniques you will reuse for the rest of your career.

Exercise 1 · Statistics — Bias–Variance · ◆◆

Derive the bias–variance decomposition.

Let $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, $\varepsilon$ independent of the training data. For a learned predictor $\hat f$ (random over training sets), show that

$\mathbb{E}\big[(y - \hat f(x))^2\big] = \big(f(x) - \mathbb{E}[\hat f(x)]\big)^2 + \operatorname{Var}\big(\hat f(x)\big) + \sigma^2.$

Solution.

Write $\bar f = \mathbb{E}[\hat f(x)]$. Expand: $\mathbb{E}\big[(y - \hat f(x))^2\big] = \mathbb{E}\big[(f(x) + \varepsilon - \hat f(x))^2\big]$. Because $\varepsilon$ is independent of $\hat f(x)$ with zero mean, cross terms vanish: $\mathbb{E}\big[(y - \hat f(x))^2\big] = \mathbb{E}\big[(f(x) - \hat f(x))^2\big] + \sigma^2$. Add and subtract $\bar f$ inside the square: $\mathbb{E}\big[(f(x) - \hat f(x))^2\big] = \big(f(x) - \bar f\big)^2 + \mathbb{E}\big[(\bar f - \hat f(x))^2\big] + 2\big(f(x) - \bar f\big)\,\mathbb{E}\big[\bar f - \hat f(x)\big]$. The cross term vanishes because $\mathbb{E}[\hat f(x)] = \bar f$. The two remaining terms are Bias$^2$ and Variance, giving the claim.

Exercise 2 · Linear Models — Ridge

Derive the closed form for ridge regression.

For $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, $\lambda > 0$, find $w$ minimising $\|y - Xw\|_2^2 + \lambda\|w\|_2^2$. Show the solution is always well-defined when $\lambda > 0$.

Solution.

Expand the objective: $J(w) = w^\top X^\top X w - 2 y^\top X w + y^\top y + \lambda\, w^\top w$. Using the matrix-calculus identities $\nabla_w\, a^\top w = a$ and $\nabla_w\, w^\top A w = 2Aw$ for symmetric $A$: $\nabla J = 2(X^\top X + \lambda I)w - 2X^\top y = 0 \;\Rightarrow\; \hat w = (X^\top X + \lambda I)^{-1} X^\top y$. $X^\top X$ is PSD with eigenvalues $\mu_i \ge 0$; adding $\lambda I$ shifts them to $\mu_i + \lambda > 0$, so the matrix is strictly positive definite and hence invertible. The Hessian $2(X^\top X + \lambda I) \succ 0$ confirms the critical point is a global minimum.

Exercise 3 · Linear Models — Logistic · ◆◆

Gradient and Hessian of the logistic loss; prove convexity.

Let $\sigma(z) = 1/(1 + e^{-z})$ and $\mathcal{L}(w) = -\sum_i \big[y_i \log \sigma(w^\top x_i) + (1 - y_i)\log(1 - \sigma(w^\top x_i))\big]$. Derive $\nabla \mathcal{L}$ and $\nabla^2 \mathcal{L}$ and show the Hessian is PSD.

Solution.

Key fact: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. For a single point with $p_i = \sigma(w^\top x_i)$: $\frac{\partial}{\partial z}\big[y \log \sigma(z) + (1 - y)\log(1 - \sigma(z))\big] = y - \sigma(z)$. Chain through $z_i = w^\top x_i$ and sum: $\nabla \mathcal{L} = \sum_i (p_i - y_i)\, x_i = X^\top(p - y)$. For the Hessian, differentiate again: $\partial p_i / \partial w = p_i(1 - p_i)\, x_i$, giving $\nabla^2 \mathcal{L} = \sum_i p_i(1 - p_i)\, x_i x_i^\top = X^\top S X$ with $S = \operatorname{diag}\big(p_i(1 - p_i)\big)$. Since $S$ has non-negative entries, $X^\top S X$ is PSD: for any $v$, $v^\top X^\top S X v = \|S^{1/2} X v\|^2 \ge 0$. So $\mathcal{L}$ is convex.

Exercise 4 · Deep Learning — Softmax · ◆◆

Derivative of softmax + cross-entropy.

Let $p = \operatorname{softmax}(z)$ and $\mathcal{L} = -\sum_k y_k \log p_k$ with one-hot $y$. Show $\partial \mathcal{L} / \partial z = p - y$.

Solution.

First derive the softmax Jacobian. Writing $p_k = e^{z_k} / \sum_j e^{z_j}$: $\partial p_k / \partial z_j = p_k(\delta_{kj} - p_j)$. Now, $\frac{\partial \mathcal{L}}{\partial z_j} = -\sum_k \frac{y_k}{p_k}\, p_k(\delta_{kj} - p_j) = -y_j + p_j \sum_k y_k$. Since $y$ is one-hot, $\sum_k y_k = 1$, so $\partial \mathcal{L}/\partial z = p - y$. This is why the softmax+CE gradient has such a clean implementation — it is literally the prediction error.

Exercise 5 · Linear Algebra — PCA · ◆◆

PCA as an eigenvalue problem; relate to SVD.

Let $X \in \mathbb{R}^{n \times d}$ be mean-centred. Show that the first principal component — the unit direction $v$ maximising $v^\top X^\top X v$ — is the top right-singular vector of $X$, and that the explained variance equals $\sigma_1^2 / n$.

Solution.

Because $X$ is centred, $\frac{1}{n}\, v^\top X^\top X v$ is the sample variance of the projections $Xv$. The maximisation of the Rayleigh quotient $\frac{v^\top X^\top X v}{v^\top v}$ attains its maximum at the largest eigenvalue of $X^\top X$, at the corresponding eigenvector. Write $X = U \Sigma V^\top$ (SVD). Then $X^\top X = V \Sigma^2 V^\top$, so the top eigenvector of $X^\top X$ is $v_1$ (first column of $V$) with eigenvalue $\sigma_1^2$. Hence the first PC is $v_1$ and its explained variance is $\sigma_1^2 / n$.

Exercise 6 · Deep Learning — Backprop · ◆◆◆

Backpropagation through a 2-layer MLP.

Consider the network $z_1 = W_1 x + b_1$, $h = \mathrm{ReLU}(z_1)$, $z_2 = W_2 h + b_2$, $p = \operatorname{softmax}(z_2)$, with cross-entropy loss $\mathcal{L} = -\sum_k y_k \log p_k$ against one-hot $y$. Write all gradients in closed form.

Solution.

Output layer. By Exercise 4, $\delta_2 = \partial \mathcal{L}/\partial z_2 = p - y$. Then $\partial \mathcal{L}/\partial W_2 = \delta_2 h^\top$ and $\partial \mathcal{L}/\partial b_2 = \delta_2$.

Hidden layer. Propagate back: $\partial \mathcal{L}/\partial h = W_2^\top \delta_2$. Through the ReLU, $\delta_1 = W_2^\top \delta_2 \odot \mathbb{1}[z_1 > 0]$ (elementwise indicator, zero on the inactive units): $\partial \mathcal{L}/\partial W_1 = \delta_1 x^\top$ and $\partial \mathcal{L}/\partial b_1 = \delta_1$. Note the dying-ReLU pathology is visible here: once $z_1$ is negative on a unit, $\mathbb{1}[z_1 > 0] = 0$ there, and it never receives a gradient.

Exercise 7 · Information — KL

KL divergence is not symmetric.

Give concrete $p, q$ on $\{0, 1\}$ such that $\mathrm{KL}(p \| q) \ne \mathrm{KL}(q \| p)$. Then interpret the difference between forward and reverse KL when fitting $q$ to a multi-modal $p$.

Solution.

Take $p = (1/2, 1/2)$ and $q = (3/4, 1/4)$. Then $\mathrm{KL}(p \| q) = \tfrac12 \log\tfrac{1/2}{3/4} + \tfrac12 \log\tfrac{1/2}{1/4} \approx 0.144$ nats, while $\mathrm{KL}(q \| p) = \tfrac34 \log\tfrac{3/4}{1/2} + \tfrac14 \log\tfrac{1/4}{1/2} \approx 0.131$ nats. Not equal. Interpretation. Forward KL $\mathrm{KL}(p \| q)$ is mean-seeking / mass-covering: $q$ must put probability wherever $p$ does, so when $q$ is too simple it spreads across all modes of $p$. Reverse KL $\mathrm{KL}(q \| p)$ is mode-seeking: $q$ collapses onto one mode, because any mass where $p \approx 0$ is catastrophically penalised. Variational inference minimises reverse KL — hence VI's famous tendency to be over-confident.

Exercise 8 · Information — CE ≡ MLE

Minimising cross-entropy is maximum likelihood.

Show that, for a classifier outputting $q_\theta(y \mid x)$, minimising empirical cross-entropy is equivalent to maximum likelihood and to minimising $\mathrm{KL}(\hat p \,\|\, q_\theta)$.

Solution.

Empirical cross-entropy is $-\frac{1}{n}\sum_i \log q_\theta(y_i \mid x_i)$. Minimising this is exactly maximising the log-likelihood $\sum_i \log q_\theta(y_i \mid x_i)$. Next, letting $\hat p$ be the empirical distribution, $\mathrm{KL}(\hat p \| q_\theta) = \sum \hat p \log \hat p - \sum \hat p \log q_\theta$. The first term is $-H(\hat p)$, independent of $\theta$; the second is exactly the cross-entropy loss. So argmin KL = argmin CE = argmax likelihood.

Exercise 9 · Deep Learning — Regularization · ◆◆

Dropout as an implicit ensemble.

A network with $n$ dropout-able units applies independent Bernoulli masks. Explain why dropout approximates averaging over $2^n$ sub-networks, and why the weight rescaling at test time is required.

Solution.

Each training step samples a binary mask $m$ with $m_j \sim \mathrm{Bernoulli}(1 - p)$ independently — one of $2^n$ possible sub-networks. The training objective is $\mathbb{E}_m\big[\mathcal{L}(f_m(x), y)\big]$, and SGD with single mask samples is an unbiased estimator of its gradient. Thus dropout optimizes an ensemble loss.

At test time we want the mean prediction $\mathbb{E}_m[f_m(x)]$. A cheap approximation is to deactivate the mask and scale each activation by $1 - p$ — because each unit was "on" with probability $1 - p$ during training, its expected contribution to downstream linear combinations is $1 - p$ times its weight. In practice this is implemented as "inverted dropout": scale by $1/(1 - p)$ during training, identity at test time. For a single linear layer, this is exact; for non-linear networks, it is a principled approximation to the geometric mean over sub-networks.
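The rescaling argument can be checked empirically (inverted dropout with an illustrative $p$):

```python
import numpy as np

# Inverted dropout: scale surviving activations by 1/(1-p) at train time so
# that the expected activation matches the unmasked test-time value.
rng = np.random.default_rng(9)
p = 0.5
x = np.ones(1_000_000)                      # constant activations of 1
mask = rng.uniform(size=x.shape) >= p       # keep with probability 1 - p
train_out = x * mask / (1 - p)
```

The mean of `train_out` is 1 up to Monte-Carlo noise, matching the test-time (no-mask, no-scale) output exactly in expectation.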

Exercise 10 · Attention · ◆◆

Why does scaled dot-product attention divide by ?

Queries and keys have components with approximately zero mean and unit variance. Show that without the $1/\sqrt{d_k}$ factor, the variance of $q^\top k$ grows with $d_k$, and argue why this hurts softmax training.

Solution.

Assume the components of $q$ and $k$ are independent with mean 0 and variance 1. Then $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and, by independence and $\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$, variance $d_k$. So the standard deviation of the logits grows as $\sqrt{d_k}$.

If logits have large magnitude, softmax concentrates its mass on the single largest entry. The softmax derivative then collapses to near-zero across the board — vanishing gradients. Dividing by $\sqrt{d_k}$ rescales the variance back to 1, keeping the softmax in its well-behaved regime across model sizes.
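A Monte-Carlo check of the variance claim (sample count and dimensions are illustrative):

```python
import numpy as np

# Var(q . k) grows like d_k for unit-variance components;
# dividing by sqrt(d_k) restores unit variance.
rng = np.random.default_rng(7)
n = 50_000
for d in (4, 64):
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    dots = (q * k).sum(axis=1)
    assert abs(dots.var() / d - 1.0) < 0.05              # Var ~ d_k
    assert abs((dots / np.sqrt(d)).var() - 1.0) < 0.05   # rescaled to ~1
```

The same check with larger `d` shows the unscaled logit spread growing without bound, which is exactly what pushes the softmax into saturation.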

Exercise 11 · Deep Learning — BatchNorm · ◆◆

Why does BatchNorm behave differently at train and test time?

Describe the exact computation of BatchNorm during training vs inference, explain why the two must differ, and identify a failure mode at small batch size.

Solution.

Training. For each feature (channel), compute the batch mean $\mu_B$ and variance $\sigma_B^2$, normalise $\hat x = (x - \mu_B)/\sqrt{\sigma_B^2 + \epsilon}$, then affine-transform $y = \gamma \hat x + \beta$ with learned $\gamma, \beta$. The batch statistics couple the examples in a batch and introduce stochasticity (a mild regularizer).

Inference. Use running-average estimates of the mean and variance accumulated during training (typically via exponential moving average). This must differ from training because (i) at inference, batches may be of size 1, making batch statistics meaningless, and (ii) predictions would otherwise depend on which other examples happen to be in the batch — unacceptable.

Failure mode. Small batches (say, 2–8) give high-variance $\mu_B, \sigma_B^2$, so training becomes noisy and the running averages accumulated for inference become unreliable — the classic "BN breaks for detection with large backbones". Remedies: GroupNorm, LayerNorm, or SyncBN across devices.

Exercise 12 · RNN — Vanishing Gradients · ◆◆◆

Why do vanilla RNN gradients vanish, and how does the LSTM cell fix it?

Derive the gradient of the loss at time $T$ w.r.t. a hidden state at time $t < T$ for a vanilla RNN, and contrast the form with the LSTM cell-state path.

Solution.

Vanilla RNN. With $h_t = \tanh(W h_{t-1} + U x_t)$, $\frac{\partial h_s}{\partial h_{s-1}} = \operatorname{diag}\big(1 - h_s^2\big)\, W$. The gradient through $T - t$ steps is the product $\frac{\partial \mathcal{L}_T}{\partial h_t} = \frac{\partial \mathcal{L}_T}{\partial h_T} \prod_{s = t+1}^{T} \operatorname{diag}\big(1 - h_s^2\big)\, W$. Since $|1 - h_s^2| \le 1$ and the spectral radius of $W$ is typically either $< 1$ (vanishing) or $> 1$ (exploding), the norm of this product tends to zero or infinity exponentially in $T - t$. Long-range dependencies cannot be learned.

LSTM cell. The key is the cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t$. Differentiating along the cell path, $\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)$ (plus indirect terms through the gates). Provided the forget gate stays close to 1, the product $\prod_s f_s$ does not vanish, and the gradient path through the cell state is an (almost) straight linear pipe — comparable to a residual connection. This is what enables long-range dependencies.

Exercise 13 · VAE — ELBO · ◆◆◆

Derive the evidence lower bound.

For latent-variable model $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ and any distribution $q(z \mid x)$, show $\log p_\theta(x) = \mathbb{E}_q[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q \| p(z)\big) + \mathrm{KL}\big(q \| p_\theta(z \mid x)\big)$. Conclude the first two terms form a lower bound on $\log p_\theta(x)$.

Solution.

Start from $\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$. Multiply and divide by $q(z \mid x)$: $\log p_\theta(x) = \log \mathbb{E}_q\!\left[\frac{p_\theta(x, z)}{q(z \mid x)}\right] \ge \mathbb{E}_q\!\left[\log \frac{p_\theta(x, z)}{q(z \mid x)}\right]$ by Jensen. Alternatively, without Jensen, write $\log p_\theta(x) = \mathbb{E}_q[\log p_\theta(x)]$ (the inner term is constant in $z$) and use $p_\theta(x) = p_\theta(x, z)/p_\theta(z \mid x)$: $\log p_\theta(x) = \mathbb{E}_q\!\left[\log \frac{p_\theta(x, z)}{q(z \mid x)}\right] + \mathrm{KL}\big(q(z \mid x)\,\|\,p_\theta(z \mid x)\big)$. Insert $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ and regroup: $\log p_\theta(x) = \mathbb{E}_q[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q \| p(z)\big) + \mathrm{KL}\big(q \| p_\theta(z \mid x)\big)$. Since the final KL is non-negative, the first two terms (the ELBO) are a lower bound, tight iff $q = p_\theta(z \mid x)$. Maximising the ELBO jointly in $(\theta, q)$ thus maximises a bound on the data likelihood and drives $q$ towards the true posterior.

Exercise 14 · Optimization — Adam · ◆◆

Why the bias correction in Adam?

In Adam, $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$, with $m_0 = 0$. Show that if $g_t = g$ is constant, $m_t = (1 - \beta_1^t)\, g$. Hence explain why we divide by $1 - \beta_1^t$.

Solution.

Unrolling the recursion: $m_t = (1 - \beta_1) \sum_{s=1}^{t} \beta_1^{t-s} g_s$. With constant $g_s = g$, the geometric sum gives $m_t = (1 - \beta_1)\, g\, \frac{1 - \beta_1^t}{1 - \beta_1} = (1 - \beta_1^t)\, g$. Early in training (small $t$), $1 - \beta_1^t$ is small, so $m_t$ underestimates $g$ — the moment is biased toward zero because of the zero initialisation. Dividing by $1 - \beta_1^t$ produces an unbiased estimate $\hat m_t = m_t / (1 - \beta_1^t)$. The same logic applies to $v_t$ with $\beta_2$. Without this correction Adam would take unnecessarily small steps in its first few hundred iterations.
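The constant-gradient claim is easy to verify numerically:

```python
# With a constant gradient g, the EMA m_t equals (1 - beta1**t) * g,
# so the bias-corrected m_t / (1 - beta1**t) recovers g at every step.
beta1, g = 0.9, 2.0
m = 0.0
history = []
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    history.append(m / (1 - beta1 ** t))   # bias-corrected estimate
```

Every entry of `history` equals `g` (up to float rounding), while the raw `m` is still well below `g` after 10 steps — the bias the correction removes.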

Exercise 15 · Quant — Sharpe Annualization

Annualise a daily Sharpe ratio.

A strategy's daily excess returns have mean $\mu$ and standard deviation $\sigma$. Assuming 252 trading days and i.i.d. returns, compute the daily and annual Sharpe ratios. Then discuss what breaks if returns are autocorrelated.

Solution.

Daily Sharpe $\mathrm{SR}_d = \mu / \sigma$. Under i.i.d. summation, annual mean $= 252\,\mu$ and annual variance $= 252\,\sigma^2$, so

$$\mathrm{SR}_{\text{ann}} = \frac{252\,\mu}{\sqrt{252}\,\sigma} = \sqrt{252}\;\mathrm{SR}_d.$$

With autocorrelation. $\operatorname{Var}\!\big(\sum_{t=1}^{n} r_t\big) = \sigma^2\big(n + 2\sum_{k=1}^{n-1}(n-k)\,\rho_k\big)$. Positive autocorrelation inflates the variance; naive $\sqrt{252}$ scaling overstates the Sharpe. Use an effective sample size or the Newey–West standard error and aggregate at an appropriate horizon. Strategies with strong serial correlation (e.g. trend-following on a single asset) are routinely reported with overstated Sharpes.
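A sketch of the naive annualisation next to a Lo-style autocorrelation adjustment (helper names and the lag cutoff are mine; the adjustment implements the variance formula above):

```python
import numpy as np

def sharpe_annualized(returns, periods=252):
    """Naive i.i.d. annualisation: SR_ann = sqrt(periods) * SR_daily."""
    sr_d = returns.mean() / returns.std(ddof=1)
    return np.sqrt(periods) * sr_d

def sharpe_annualized_autocorr(returns, periods=252, max_lag=10):
    """Autocorrelation-aware annualisation (Lo-2002-style scaling)."""
    r = returns - returns.mean()
    n = len(r)
    var = r @ r / (n - 1)
    rho = [(r[:-k] @ r[k:]) / ((n - k) * var) for k in range(1, max_lag + 1)]
    # Variance of the q-period sum grows by q + 2*sum (q-k) rho_k, not q.
    q = periods
    scale = q + 2 * sum((q - k) * p for k, p in enumerate(rho, start=1))
    sr_d = returns.mean() / returns.std(ddof=1)
    return q * sr_d / np.sqrt(scale)

rng = np.random.default_rng(1)
iid = rng.normal(0.0005, 0.01, 2000)     # synthetic i.i.d. daily returns
sr_naive = sharpe_annualized(iid)
sr_adj = sharpe_annualized_autocorr(iid)
```

For i.i.d. returns the two estimates agree up to estimation noise in $\rho_k$; for positively autocorrelated returns the adjusted figure comes out materially lower.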

Exercise 16 · Quant — Portfolio ◆◆

Derive the tangency (max-Sharpe) portfolio.

Given excess returns $\mu$ and covariance $\Sigma \succ 0$, find $w$ maximising $\mathrm{SR}(w) = \dfrac{w^{\top}\mu}{\sqrt{w^{\top}\Sigma w}}$.

Solution.

$\mathrm{SR}$ is invariant under $w \mapsto c\,w$ for $c > 0$, so maximising $\mathrm{SR}$ is equivalent to maximising $w^{\top}\mu$ subject to $w^{\top}\Sigma w = 1$ (a convenient normalisation). Lagrangian $\mathcal{L} = w^{\top}\mu - \tfrac{\lambda}{2}\,(w^{\top}\Sigma w - 1)$:

$$\nabla_w \mathcal{L} = \mu - \lambda\,\Sigma w = 0 \quad\Rightarrow\quad w \propto \Sigma^{-1}\mu.$$

Applying the normalisation $w^{\top}\Sigma w = 1$: $\lambda = \sqrt{\mu^{\top}\Sigma^{-1}\mu}$, so $w = \Sigma^{-1}\mu / \sqrt{\mu^{\top}\Sigma^{-1}\mu}$. With a budget constraint $\mathbf{1}^{\top}w = 1$:

$$w^{*} = \frac{\Sigma^{-1}\mu}{\mathbf{1}^{\top}\Sigma^{-1}\mu}.$$

The resulting maximum Sharpe is $\sqrt{\mu^{\top}\Sigma^{-1}\mu}$.

Caveat. In practice, $\mu$ and $\Sigma$ must be estimated; estimation error in $\mu$ dominates, and $\Sigma^{-1}$ amplifies noise. This is why out-of-sample MV portfolios routinely underperform equal-weighted — the "Markowitz optimization enigma". Shrink the estimates (e.g. Ledoit–Wolf for $\hat\Sigma$), use a factor-model covariance, or constrain the problem.
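A numerical check of the closed form, with made-up $\mu$ and $\Sigma$ (not from the source):

```python
import numpy as np

def tangency_weights(mu, Sigma):
    """w* = Sigma^{-1} mu / (1' Sigma^{-1} mu)."""
    w = np.linalg.solve(Sigma, mu)
    return w / w.sum()

mu = np.array([0.05, 0.07, 0.03])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.02]])

w = tangency_weights(mu, Sigma)
sr_star = np.sqrt(mu @ np.linalg.solve(Sigma, mu))   # max Sharpe, closed form
sr_w = (w @ mu) / np.sqrt(w @ Sigma @ w)             # Sharpe of w*

ew = np.ones(3) / 3                                  # equal-weight comparison
sr_ew = (ew @ mu) / np.sqrt(ew @ Sigma @ ew)
```

By scale invariance, renormalising to $\mathbf{1}^{\top}w = 1$ leaves the Sharpe unchanged, and any other portfolio (here equal-weight) attains a lower Sharpe.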

Exercise 17 · Quant — Leakage ◆◆

Spot the leakage.

A researcher fits a daily return predictor by (a) standardising each feature using its full-sample mean and variance; (b) shuffling the rows; (c) running 5-fold CV. Performance is stellar in CV and awful live. Identify at least three leakages and propose a corrected pipeline.

Solution.

Leakages.

  1. Full-sample standardisation. The mean and variance use future data. Correct: fit scaler on training fold only, transform test fold with those statistics.
  2. Shuffling. Breaks temporal ordering so that training folds contain days after the test fold. Correct: use forward-chaining (expanding or rolling window) CV.
  3. Label overlap. If the target is a forward $h$-day return, consecutive samples share label windows — a training sample on day $t$ leaks information about a test sample on day $t'$ whenever $|t - t'| < h$. Correct: purge training rows whose label window overlaps the test set, and embargo a gap of at least $h$ days after the test set before resuming training.

Additional checks. Ensure feature engineering (winsorization, rank transforms, target encoding) is performed within fold. Ensure universe selection at time uses only constituents that were actually in the index at time (survivorship). Finally, evaluate sensitivity to transaction costs and report performance per sub-period.
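The corrected split logic can be sketched as follows (function name and default parameters are mine, not from the source):

```python
def purged_kfold_splits(n, n_folds=5, label_horizon=5, embargo=5):
    """Purged k-fold with embargo for overlapping labels — a minimal sketch.

    Sample i's label covers days [i, i + label_horizon); any training row
    whose label window reaches into the test block is purged, and `embargo`
    extra days after the test block are skipped before training resumes.
    """
    fold = n // n_folds
    for k in range(n_folds):
        test_start = k * fold
        test_end = (k + 1) * fold if k < n_folds - 1 else n
        test = list(range(test_start, test_end))
        train = [i for i in range(n)
                 if i + label_horizon <= test_start   # label closes before test
                 or i >= test_end + embargo]          # past the test block + embargo
        yield train, test

splits = list(purged_kfold_splits(100, n_folds=5, label_horizon=5, embargo=5))
```

For a strictly forward-chaining variant, additionally drop the post-test part of each training set; scaler fitting and feature engineering then happen inside each training fold only.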

Exercise 18 · Quant — Kalman ◆◆◆

One-dimensional Kalman filter.

Scalar random walk $x_t = x_{t-1} + w_t$, $w_t \sim \mathcal{N}(0, q)$; observations $y_t = x_t + v_t$, $v_t \sim \mathcal{N}(0, r)$. Write the predict and update steps and show the filter reduces to an exponentially-weighted moving average in steady state.

Solution.

Predict. $\hat x_{t \mid t-1} = \hat x_{t-1 \mid t-1}$, $\quad P_{t \mid t-1} = P_{t-1 \mid t-1} + q$.

Update. Kalman gain $K_t = \dfrac{P_{t \mid t-1}}{P_{t \mid t-1} + r}$. Posterior mean $\hat x_{t \mid t} = \hat x_{t \mid t-1} + K_t\,(y_t - \hat x_{t \mid t-1})$ and variance $P_{t \mid t} = (1 - K_t)\, P_{t \mid t-1}$.

Steady state. Setting the predicted variance to a constant $P$ and imposing the fixed point $P = \dfrac{P r}{P + r} + q$ gives the quadratic $P^2 - qP - qr = 0$, whose positive root is

$$P = \frac{q + \sqrt{q^2 + 4qr}}{2}.$$

Then $\hat x_t = (1 - K)\,\hat x_{t-1} + K\, y_t$ — precisely an EWMA with smoothing parameter $K = P/(P + r)$, governed by the signal-to-noise ratio $q/r$. This is why, in practice, if you cannot calibrate a full state-space model, a well-chosen EWMA already captures most of the benefit.

Exercise 19 · Statistics — Deflated Sharpe ◆◆◆

Multiple testing and the deflated Sharpe.

You tried $N = 1000$ strategies and selected the one with the highest in-sample Sharpe, estimated over $T$ days. Under the null of zero true Sharpe, estimate the expected maximum Sharpe to gauge whether the selected strategy's Sharpe is real.

Solution.

Under the null, an estimated Sharpe over $T$ i.i.d. Gaussian returns is approximately $\mathcal{N}(0, 1/T)$ — standard deviation $1/\sqrt{T}$.

For $N$ independent null Sharpes, the expected maximum is approximately (using the Gaussian extreme-value result $\mathbb{E}[\max_i Z_i] \approx \sqrt{2 \ln N}$ for $Z_i \sim \mathcal{N}(0,1)$):

$$\mathbb{E}\!\left[\max_i \widehat{\mathrm{SR}}_i\right] \approx \sqrt{\frac{2 \ln N}{T}}.$$

For $N = 1000$, $\sqrt{2 \ln N} \approx 3.7$, so under the null we would already expect a sizeable annualised max Sharpe from pure noise — the selected strategy may be no better than what pure noise would produce over 1000 tries. This is the intuition behind the deflated Sharpe ratio (Bailey & López de Prado): subtract the expected maximum-under-null before claiming significance. Always report how many models / hyperparameter combinations were tried.
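The extreme-value estimate is easy to verify by simulation; a sketch (simulation sizes are illustrative):

```python
import numpy as np

def expected_max_sharpe(n_strategies, n_days, n_sims=200, seed=0):
    """Monte-Carlo E[max annualised Sharpe] across null strategies
    (zero-mean i.i.d. Gaussian daily returns)."""
    rng = np.random.default_rng(seed)
    maxes = np.empty(n_sims)
    for s in range(n_sims):
        r = rng.standard_normal((n_strategies, n_days))
        sr_daily = r.mean(axis=1) / r.std(axis=1, ddof=1)
        maxes[s] = sr_daily.max()
    return np.sqrt(252) * maxes.mean()

N, T = 1000, 252
# Crude extreme-value approximation, annualised: sqrt(252) * sqrt(2 ln N / T)
approx = np.sqrt(252) * np.sqrt(2 * np.log(N) / T)
mc = expected_max_sharpe(N, T)
```

The $\sqrt{2 \ln N}$ rule slightly overstates the finite-$N$ mean, so the simulated value lands a bit below the approximation — both, however, are far above zero, which is the point.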

Exercise 20 · Deep Learning — Transformers ◆◆◆

Parameter and FLOP counts for a Transformer layer.

Consider a single decoder-only Transformer layer with model dimension $d$, $h$ attention heads (each of size $d/h$), and feed-forward width $4d$. Count parameters and the per-token FLOPs at context length $n$.

Solution.

Parameters.

  • Projections $W_Q, W_K, W_V, W_O$, each $d \times d$: $4d^2$.
  • MLP: $d \times 4d$ and $4d \times d$: $8d^2$.
  • LayerNorms + biases: $O(d)$, negligible.

Total $\approx 12d^2$ per layer; for $L$ layers, $\approx 12Ld^2$. A 12-layer model thus has $\approx 144d^2$ parameters, plus embeddings.

Per-token FLOPs. Dominant costs, measuring a multiply-add as 2 FLOPs:

  • Q/K/V/O projections: $4 \cdot 2d^2 = 8d^2$ FLOPs per token.
  • Attention scores $QK^{\top}$: $2nd$ per token (each of $n$ keys, a $d$-dim dot product summed across heads).
  • Weighted sum of values: $2nd$ per token.
  • MLP: $2 \cdot (2 \cdot 4d^2) = 16d^2$ per token.

Total $\approx 24d^2 + 4nd$ per token. Attention becomes the bottleneck once $n \gtrsim 6d$ — the rationale for sparse / linear / flash-attention variants and for keeping context tractable. The frequently-quoted "$\approx 6N$ training FLOPs per token" (with $N$ the parameter count) is this same arithmetic, generalised: the forward pass costs $\approx 2N$ and the backward pass roughly twice that.
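These counts fit in a few lines; a minimal sketch (function name and the example dimensions are illustrative, not from the source):

```python
def transformer_layer_counts(d, n_ctx, d_ff=None):
    """Per-layer parameters and per-token forward FLOPs (multiply-add = 2 FLOPs).

    Assumes d_ff = 4d unless given; the head count cancels out of both
    totals. Biases and LayerNorms (O(d)) are ignored.
    """
    d_ff = d_ff if d_ff is not None else 4 * d
    params_attn = 4 * d * d                  # W_Q, W_K, W_V, W_O
    params_mlp = 2 * d * d_ff                # up- and down-projection
    params = params_attn + params_mlp        # 12 d^2 when d_ff = 4d
    flops = (2 * params_attn                 # projections: 8 d^2
             + 4 * n_ctx * d                 # scores + weighted sum: 4 n d
             + 2 * params_mlp)               # MLP: 16 d^2
    return params, flops

params, flops = transformer_layer_counts(d=768, n_ctx=1024)
```

Note that the weight-multiply FLOPs are exactly $2 \times$ the parameter count per token, which is where the forward-pass $\approx 2N$ rule comes from; only the $4nd$ attention term sits outside it.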


That is the end of the cheatsheet. Bring these ideas to your next problem; rebuild them from scratch when you doubt them.