Mathematical Foundations
Every method below reduces, ultimately, to four pillars: linear algebra (how we represent data and transformations), probability (how we quantify uncertainty), optimization (how we learn from data), and information theory (how we measure signal). A researcher who has internalised these pillars can read any new architecture or paper and reconstruct its logic from first principles.
Linear Algebra
- Inner product
- $\langle x, y \rangle = x^\top y = \sum_i x_i y_i$. Orthogonality: $x \perp y \iff x^\top y = 0$.
- Norms
- $\|x\|_1 = \sum_i |x_i|$, $\|x\|_2 = \sqrt{x^\top x}$, $\|x\|_\infty = \max_i |x_i|$. All $\ell_p$ for $p \ge 1$ are norms.
- Rank
- $\operatorname{rank}(A)$ is the dimension of the column space; $A \in \mathbb{R}^{m \times n}$ has $\operatorname{rank}(A) \le \min(m, n)$.
- Positive definite
- $A \succ 0$: $x^\top A x > 0$ for all $x \neq 0$; equivalently all eigenvalues are positive.
Eigendecomposition
For symmetric $A \in \mathbb{R}^{n \times n}$ there exists an orthogonal $Q$ and diagonal $\Lambda$ such that $A = Q \Lambda Q^\top$, with eigenvectors as columns of $Q$ and eigenvalues on the diagonal of $\Lambda$. Symmetric matrices have real eigenvalues and orthogonal eigenvectors.
Singular Value Decomposition (SVD)
For any $A \in \mathbb{R}^{m \times n}$, $A = U \Sigma V^\top$ with $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$ orthogonal and $\Sigma$ diagonal with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. The truncation $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ gives the best rank-$k$ approximation in Frobenius and spectral norms (Eckart–Young).
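A quick numpy check of the Eckart–Young statement (synthetic matrix, illustrative only): the Frobenius error of the rank-$k$ truncation equals the norm of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))

# Full SVD: A = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep only the top-k singular triplets.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error = sqrt(sum of squared discarded singular values).
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```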
Matrix calculus — identities to memorise
| Expression | Gradient w.r.t. $x$ (or $X$) |
|---|---|
| $x^\top A x$ | $(A + A^\top) x$; $2Ax$ if $A$ symmetric |
| $\|Ax - b\|_2^2$ (w.r.t. $x$) | $2 A^\top (Ax - b)$ |
| $\log \det X$ (w.r.t. $X$) | $X^{-\top}$ |
Probability & Statistics
Bayes' rule is the only rational procedure for updating beliefs given evidence; all probabilistic ML is a specialization of it.
Bayes' rule. $p(\theta \mid x) = \dfrac{p(x \mid \theta)\, p(\theta)}{p(x)}$. The denominator is the marginal likelihood, $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$.
Maximum likelihood (MLE). $\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \sum_i \log p(x_i \mid \theta)$. Asymptotically normal with variance given by the inverse Fisher information $I(\theta)^{-1}$.
Maximum a posteriori (MAP). $\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta \big[ \sum_i \log p(x_i \mid \theta) + \log p(\theta) \big]$. Gaussian prior $\Rightarrow$ $\ell_2$ regularization; Laplace prior $\Rightarrow$ $\ell_1$.
Useful distributions & conjugacies
| Likelihood | Conjugate prior | Posterior form |
|---|---|---|
| Bernoulli / Binomial | Beta | Beta |
| Gaussian (known $\sigma^2$) | Gaussian | Gaussian with precision-weighted mean |
| Gaussian (unknown $\mu, \sigma^2$) | Normal–Inverse-Gamma | Normal–Inverse-Gamma |
| Multinomial | Dirichlet | Dirichlet with added counts |
| Poisson | Gamma | Gamma with added counts / rate |
Expectation, variance, covariance. $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. For independent $X, Y$: $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$. The covariance matrix $\Sigma = \mathbb{E}[(X - \mu)(X - \mu)^\top]$ is symmetric PSD.
Law of Large Numbers & CLT. For i.i.d. $X_i$ with mean $\mu$, variance $\sigma^2$: $\bar{X}_n \to \mu$ a.s. and $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.
Delta method. If $\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$, then $\sqrt{n}\,\big(g(\hat{\theta}_n) - g(\theta)\big) \xrightarrow{d} \mathcal{N}\big(0, g'(\theta)^2 \sigma^2\big)$.
Optimization
Convexity. $f$ is convex iff $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $\lambda \in [0, 1]$. A twice-differentiable $f$ is convex iff $\nabla^2 f \succeq 0$. For convex problems, every local min is global.
Gradient descent. $x_{t+1} = x_t - \eta \nabla f(x_t)$. For $L$-smooth convex $f$ (i.e. $\nabla f$ is $L$-Lipschitz), choosing $\eta = 1/L$ guarantees $f(x_T) - f^\star = O(1/T)$. For $\mu$-strongly convex $f$, linear convergence $O\big((1 - \mu/L)^T\big)$.
Stochastic gradient descent. Replaces $\nabla f$ with an unbiased mini-batch estimator $\hat{g}_t$, $\mathbb{E}[\hat{g}_t] = \nabla f(x_t)$. Requires decreasing step sizes $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$ (Robbins–Monro) for convergence of the iterates under convexity and bounded variance.
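A minimal sketch of these guarantees on a quadratic objective (chosen here for illustration), using plain gradient descent with the smoothness-based step size $1/L$:

```python
import numpy as np

# Minimise f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite),
# whose unique minimiser is x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(A, b)

L = np.linalg.eigvalsh(A).max()    # smoothness constant = largest eigenvalue
x = np.zeros(2)
for _ in range(500):
    grad = A @ x - b
    x -= (1.0 / L) * grad          # step size 1/L from the smoothness bound

assert np.allclose(x, x_star, atol=1e-6)
```

Because this objective is also strongly convex, the iterates contract linearly, which is why 500 steps suffice to machine precision.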
Constrained optimization — KKT. For $\min_x f(x)$ s.t. $g_i(x) \le 0$, $h_j(x) = 0$, the Lagrangian is $\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$. At an optimum (under constraint qualifications): stationarity $\nabla_x \mathcal{L} = 0$, primal/dual feasibility ($g_i \le 0$, $h_j = 0$, $\lambda_i \ge 0$), and complementary slackness $\lambda_i g_i(x) = 0$.
Information Theory
- Entropy
- $H(p) = -\sum_x p(x) \log p(x)$. Concave; maximised by uniform.
- Cross-entropy
- $H(p, q) = -\sum_x p(x) \log q(x)$.
- KL divergence
- $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0$, non-symmetric.
- Mutual information
- $I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = H(X) - H(X \mid Y)$.
Why cross-entropy is the classification loss. Minimising $H(p_{\text{data}}, q_\theta)$ is equivalent to minimising $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, q_\theta)$, since $H(p_{\text{data}})$ is constant in $\theta$. This is maximum likelihood.
Jensen's inequality. For convex $f$: $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. Underlies the ELBO, importance sampling bounds, and most variational arguments.
Core Machine-Learning Principles
We describe the invariants that apply to every model — the decomposition of error, the shape of regularization, and the vocabulary of evaluation.
The Bias–Variance Decomposition
For a point $x$ and a learning algorithm producing $\hat{f}$ on random training sets:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance}}.$$
High bias (underfitting)
- Training and validation error both high
- Model class too restrictive or under-trained
- Fix: richer class, more features, less regularization
High variance (overfitting)
- Low training error, high validation error
- Model memorises noise
- Fix: more data, regularization, simpler model, ensembling
Regularization
Ridge ($\ell_2$). Minimise $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$. Closed form: $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$. Always well-conditioned; shrinks coefficients uniformly.
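The closed form is a one-liner in numpy; a small check on synthetic data (illustrative only) against the first-order optimality condition of the ridge objective:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 0.5

# Closed form: beta = (X^T X + lam I)^{-1} X^T y
beta = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# The gradient of ||y - X b||^2 + lam ||b||^2 must vanish at the minimiser:
grad = -2 * X.T @ (y - X @ beta) + 2 * lam * beta
assert np.allclose(grad, 0, atol=1e-8)
```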
Lasso ($\ell_1$). Minimise $\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$. No closed form; solved via coordinate descent or proximal gradient. Induces sparsity — exact zeros — because the $\ell_1$ ball has corners along the coordinate axes.
Elastic net. $\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$. Retains the sparsity of Lasso while handling correlated features (groups them rather than arbitrarily picking one).
Equivalences. $\ell_2$ penalty ≡ Gaussian prior on weights; $\ell_1$ ≡ Laplace prior; data augmentation ≡ marginalising over an implicit prior.
Evaluation & Model Selection
Cross-validation
- k-Fold — default for i.i.d. tabular data; pick $k = 5$ or $10$.
- Stratified k-Fold — preserves class balance in classification.
- Leave-one-out (LOO) — low bias, high variance; expensive.
- Nested CV — outer loop estimates generalisation, inner loop tunes hyperparameters; avoids optimistic bias from reusing the same data for both.
- Time-series CV — use forward chaining (expanding/rolling window). Never shuffle.
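The time-series variant can be sketched as follows (hypothetical helper, not a library API; sklearn's `TimeSeriesSplit` plays a similar role):

```python
# Expanding-window (forward-chaining) splits for time-ordered data:
# each fold trains strictly on the past and tests on the next block.
def forward_chaining_splits(n, n_folds, min_train):
    fold = (n - min_train) // n_folds
    for i in range(n_folds):
        end_train = min_train + i * fold
        yield list(range(end_train)), list(range(end_train, end_train + fold))

for train, test in forward_chaining_splits(n=10, n_folds=2, min_train=4):
    assert max(train) < min(test)   # never train on the future
```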
Classification metrics
| Metric | Definition | When to use |
|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Balanced classes only |
| Precision | $\frac{TP}{TP + FP}$ | Cost of false positives is high |
| Recall | $\frac{TP}{TP + FN}$ | Cost of false negatives is high |
| $F_1$ | $\frac{2 \cdot P \cdot R}{P + R}$ | Imbalanced, single summary |
| ROC-AUC | $P(\text{score}_+ > \text{score}_-)$ | Threshold-free ranking quality |
| PR-AUC | Area under Precision–Recall | Very imbalanced (rare positives) |
| LogLoss | $-\frac{1}{n} \sum_i \big[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \big]$ | Probabilistic, calibration-sensitive |
Regression metrics
MSE (penalises outliers quadratically), MAE (robust), RMSE (same units as $y$), $R^2$, MAPE (scale-free but undefined at $y = 0$), Pearson / Spearman / Kendall correlation — indispensable in quant research, where the rank of predictions matters more than absolute values.
Feature engineering checklist
- Scaling: Standardise or min-max for distance-based models (kNN, SVM, k-means) and any gradient-based optimization (faster, more stable convergence). Trees are scale-invariant.
- Encoding: One-hot for low-cardinality categoricals; target/mean encoding with out-of-fold statistics for high-cardinality (prevents target leakage).
- Missing data: Explicit missing indicator + imputation beats silent imputation, since "missing" is often informative.
- Leakage audit: Any feature that would not be available at prediction time is leakage. The most expensive bugs in ML are silent leakages.
Classical Algorithms
Before neural networks, these are the tools that set the baselines. In quantitative research, linear methods and gradient-boosted trees still dominate tabular tasks — they are compact, interpretable, and hard to beat with modest data.
Linear Models
Ordinary Least Squares
$\hat{\beta} = (X^\top X)^{-1} X^\top y$. Assumes linearity, no perfect multicollinearity, homoscedastic and uncorrelated errors. The Gauss–Markov theorem: OLS is BLUE (Best Linear Unbiased Estimator) under these assumptions.
Logistic Regression
Models $p(y = 1 \mid x) = \sigma(w^\top x)$ with $\sigma(z) = \frac{1}{1 + e^{-z}}$. Negative log-likelihood:
$$\mathcal{L}(w) = -\sum_i \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big], \qquad p_i = \sigma(w^\top x_i).$$
Gradient: $\nabla_w \mathcal{L} = \sum_i (p_i - y_i)\, x_i = X^\top (p - y)$. The Hessian $X^\top S X$ is PSD, so the loss is convex; optimizable by IRLS / L-BFGS / SGD.
Multinomial / softmax. $p(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_j e^{w_j^\top x}}$. One coefficient vector is redundant (shift invariance); often fix $w_K = 0$.
Generalised linear models (GLMs)
Exponential-family likelihood with link function $g$: $g\big(\mathbb{E}[y \mid x]\big) = w^\top x$. Linear regression (identity link, Gaussian), logistic (logit link, Bernoulli), Poisson regression (log link, Poisson). Inference via IRLS.
Trees & Ensembles
Decision trees recursively split on the feature and threshold that maximise impurity reduction. Classification uses Gini $\sum_k p_k (1 - p_k)$ or entropy $-\sum_k p_k \log p_k$; regression uses variance reduction. Single trees overfit; depth and leaf-count are the main regularisers.
Trees are scale-invariant, handle mixed types and missing values natively, and are the only mainstream family that can represent non-smooth, interaction-heavy functions without feature engineering.
Random Forests. Bagging: train $B$ trees on bootstrap samples, and at each split consider only a random subset of $\approx \sqrt{p}$ of the $p$ features. Variance-reduction ensemble: $\operatorname{Var} = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$, where $\rho$ is the between-tree correlation. Feature subsampling lowers $\rho$.
Gradient Boosting. Fit tree $h_m$ to the negative gradient of the loss w.r.t. the current prediction $F_{m-1}$: $F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$. With squared loss, the gradient is the residual; with logistic loss, it is $y - p$. Implementations (XGBoost, LightGBM, CatBoost) add second-order terms, regularized leaves, histogram splits, and native categorical handling.
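A toy sketch of the boosting loop under squared loss, with hand-rolled regression stumps fit to residuals (illustrative only; real implementations add the refinements listed above):

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump on a 1-d feature x for target r."""
    best = (np.inf, 0.0, r.mean(), r.mean())
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]   # (threshold, left value, right value)

def predict_stump(stump, x):
    t, lv, rv = stump
    return np.where(x <= t, lv, rv)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)

# Each stump fits the current residual (the negative gradient of MSE),
# then is added with learning rate eta.
eta, F = 0.1, np.full_like(y, y.mean())
for _ in range(200):
    stump = fit_stump(x, y - F)
    F += eta * predict_stump(stump, x)

assert np.mean((y - F) ** 2) < 0.2 * np.var(y)   # large fit improvement
```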
| | Bagging / RF | Boosting / GBM |
|---|---|---|
| Primary effect | Reduces variance | Reduces bias |
| Base learner | Deep trees | Shallow trees (stumps to depth 6) |
| Parallelism | Embarrassingly parallel | Sequential in trees |
| Overfitting risk | Low with enough trees | High; controlled by $\eta$, early stopping, depth |
Kernels, k-NN, Naive Bayes
Support Vector Machines
Maximise the margin $2 / \|w\|$ subject to $y_i (w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. Primal:
$$\min_{w, b, \xi}\ \tfrac{1}{2} \|w\|^2 + C \sum_i \xi_i.$$
The dual depends only on inner products $x_i^\top x_j$, enabling the kernel trick: replace $x_i^\top x_j$ with $K(x_i, x_j)$ corresponding to an inner product in some RKHS.
Common kernels: linear $x^\top x'$; polynomial $(x^\top x' + c)^d$; RBF $\exp\big(-\|x - x'\|^2 / 2\sigma^2\big)$; Matérn (GP-friendly).
k-Nearest Neighbours
No training; classify by majority among the $k$ nearest points. Curse of dimensionality: in high dimension $d$, distances concentrate and all points become roughly equidistant. Useful for small, low-$d$ problems and as a local baseline.
Naive Bayes
Assumes features are conditionally independent given the class: $p(x \mid y) = \prod_j p(x_j \mid y)$. Decision: $\hat{y} = \arg\max_y\, p(y) \prod_j p(x_j \mid y)$. Despite the crude independence assumption, it is an extraordinarily strong text-classification baseline.
Unsupervised Learning
k-Means
Minimises $\sum_k \sum_{i \in C_k} \|x_i - \mu_k\|^2$ by alternating (i) assign each point to its nearest centroid, (ii) update centroids to cluster means. Non-convex; use k-means++ seeding. Assumes spherical, equal-variance clusters.
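A compact numpy sketch of Lloyd's two alternating steps (farthest-point seeding stands in for k-means++ here, for determinism):

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Farthest-point initialisation: a simple stand-in for k-means++ seeding.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # (i) assign each point to its nearest centroid
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # (ii) move each centroid to the mean of its cluster
        centers = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels, centers

# Two well-separated blobs: the iterations should recover them exactly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, 2)
assert len(set(labels[:50])) == 1 and len(set(labels[50:])) == 1
```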
Gaussian Mixture Models (EM)
$p(x) = \sum_k \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Fit via EM: E-step compute responsibilities $\gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$; M-step update $(\pi_k, \mu_k, \Sigma_k)$ by responsibility-weighted MLE. EM monotonically increases the log-likelihood.
Principal Component Analysis
Find orthonormal directions of maximum variance. Center $X$ and compute the top-$k$ eigenvectors of the covariance $\frac{1}{n} X^\top X$ — equivalently, the top-$k$ right singular vectors of $X$. The first PC maximises $w^\top \Sigma w$ subject to $\|w\| = 1$, giving $w = v_1$ — the leading eigenvector.
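A short numpy check on synthetic data (illustrative only) that the top right-singular vector of the centred data matrix recovers the maximum-variance direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Data with most variance along the direction (1, 1)/sqrt(2).
X = rng.standard_normal((500, 2)) @ np.diag([3.0, 0.5])
X = X @ (np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2))   # rotate by 45 degrees

Xc = X - X.mean(0)                 # centre first: PCA assumes mean zero
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                        # first PC = top right-singular vector

# Same direction (up to sign) as the leading covariance eigenvector.
w = np.linalg.eigh(Xc.T @ Xc / len(Xc))[1][:, -1]
assert np.isclose(abs(pc1 @ w), 1.0, atol=1e-6)
assert abs(pc1 @ np.array([1.0, 1.0]) / np.sqrt(2)) > 0.99
```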
t-SNE / UMAP. Non-linear embeddings for visualisation. Preserve local neighbourhoods but not global distances; never interpret cluster sizes or between-cluster distances literally.
Deep Learning: Foundations
A neural network is a parameterised function built by composing affine maps and non-linearities. Everything below — activations, initializations, normalizations, optimizers — exists to make the composition of very many such layers trainable at scale.
Multi-layer perceptron
$h^{(l)} = \phi\big(W^{(l)} h^{(l-1)} + b^{(l)}\big)$, $h^{(0)} = x$. Universal approximation: a single hidden layer with enough units can approximate any continuous function on a compact set. In practice, depth buys far more expressivity per parameter than width.
Activation functions
| Name | Formula | Notes |
|---|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | Saturates; vanishing gradients. Output layer for binary. |
| Tanh | $\tanh(z)$ | Zero-centered; still saturates. |
| ReLU | $\max(0, z)$ | Sparse, non-saturating for $z > 0$. Dying-ReLU risk. |
| Leaky ReLU / PReLU | $\max(\alpha z, z)$ | Small slope for $z < 0$ keeps the gradient alive. |
| GELU | $z\, \Phi(z)$ | Smooth ReLU. Standard in Transformers. |
| Swish / SiLU | $z\, \sigma(z)$ | Smooth, self-gated. |
| Softmax | $e^{z_i} / \sum_j e^{z_j}$ | Multi-class output only. |
Backpropagation
Backprop is reverse-mode automatic differentiation applied to a DAG of differentiable primitives. For each node, given the upstream gradient $\partial \mathcal{L} / \partial(\text{output})$, compute $\partial \mathcal{L} / \partial(\text{inputs})$ and $\partial \mathcal{L} / \partial(\text{parameters})$ by the chain rule. For the MLP above, with $\delta^{(l)} = \partial \mathcal{L} / \partial z^{(l)}$ and $z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}$:
$$\delta^{(l-1)} = \big(W^{(l)}\big)^\top \delta^{(l)} \odot \phi'\big(z^{(l-1)}\big), \qquad \nabla_{W^{(l)}} \mathcal{L} = \delta^{(l)} \big(h^{(l-1)}\big)^\top, \qquad \nabla_{b^{(l)}} \mathcal{L} = \delta^{(l)}.$$
Vanishing & exploding gradients. Products of Jacobian norms either collapse to zero or diverge as depth grows. Mitigations: careful initialization, normalization layers, residual connections, gradient clipping.
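A minimal numpy sketch of backprop for a 2-layer MLP, with one gradient entry checked against a finite difference (synthetic inputs, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = np.array([0.0, 1.0, 0.0])                     # one-hot target
W1, b1 = rng.standard_normal((5, 4)) * 0.5, np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)) * 0.5, np.zeros(3)

def forward(W1):
    z1 = W1 @ x + b1
    h = np.maximum(z1, 0)                          # ReLU
    z2 = W2 @ h + b2
    p = np.exp(z2 - z2.max()); p /= p.sum()        # softmax
    return z1, h, p, -np.log(p[y.argmax()])        # cross-entropy loss

# Backward pass: delta2 = p - y, then chain through the ReLU mask.
z1, h, p, loss = forward(W1)
delta2 = p - y
delta1 = (W2.T @ delta2) * (z1 > 0)
gW1 = np.outer(delta1, x)

# Check one entry against a finite-difference estimate.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (forward(W1p)[3] - loss) / eps
assert abs(num - gW1[0, 0]) < 1e-4
```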
Initialization
- Xavier / Glorot
- $W_{ij} \sim \mathcal{N}\big(0,\ \tfrac{2}{n_{\text{in}} + n_{\text{out}}}\big)$. Suited to tanh / sigmoid.
- He / Kaiming
- $W_{ij} \sim \mathcal{N}\big(0,\ \tfrac{2}{n_{\text{in}}}\big)$. Suited to ReLU.
- Orthogonal
- $W$ a random orthogonal matrix; preserves activation norms through depth. Good for RNNs.
Optimizers
All practical optimizers for deep learning fit the pattern: maintain a state of first- and second-moment estimates of the gradient, apply scaled updates, optionally decouple weight decay.
| Optimizer | Update (schematic) | Notes |
|---|---|---|
| SGD | $\theta \leftarrow \theta - \eta\, g$ | Often best for convolutional vision, paired with momentum & schedule. |
| SGD + Momentum | $v \leftarrow \beta v + g$; $\theta \leftarrow \theta - \eta\, v$ | Smooths gradient direction; $\beta \approx 0.9$. |
| Nesterov | Evaluates $g$ at the look-ahead point | Theoretical improvement; modest in practice. |
| RMSProp | $s \leftarrow \rho s + (1 - \rho)\, g^2$; $\theta \leftarrow \theta - \eta\, g / \sqrt{s + \epsilon}$ | Per-parameter scaling by gradient magnitude. |
| Adam | $\theta \leftarrow \theta - \eta\, \hat{m} / (\sqrt{\hat{v}} + \epsilon)$ | Momentum + RMSProp + bias correction. |
| AdamW | As Adam but $\theta \leftarrow \theta - \eta \lambda \theta$ applied separately | Decoupled weight decay; current default for Transformers. |
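Schematically, one Adam/AdamW-style step in numpy (a sketch under the pattern above, not a framework implementation):

```python
import numpy as np

def adam_step(theta, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One step: first/second moments + bias correction + decoupled decay."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g            # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g        # second moment (per-parameter scale)
    m_hat = m / (1 - b1 ** t)            # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, (m, v, t)

# Minimise f(theta) = theta^2 from theta = 5 (gradient is 2 * theta).
theta, state = 5.0, (0.0, 0.0, 0)
for _ in range(4000):
    theta, state = adam_step(theta, 2 * theta, state, lr=0.01)
assert abs(theta) < 0.5   # settles near the minimum (small limit cycle remains)
```

Note the fixed learning rate leaves a small oscillation around the optimum; schedules (next subsection) shrink it.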
Learning-rate schedules
- Step / exponential decay — classical, robust.
- Cosine annealing — smooth, widely used.
- Warmup — linearly ramp the learning rate for the first few thousand steps; essential for Transformers, where early updates are otherwise catastrophic.
- One-cycle — warm up, then cosine down; enables aggressive max LRs.
Regularization & Normalization
- Weight decay
- Adds $\frac{\lambda}{2} \|\theta\|_2^2$ to the loss; equivalent to a Gaussian prior. Implement as decoupled (AdamW) for adaptive optimizers.
- Dropout
- Randomly zero each activation with probability $p$ at training time, scale by $\frac{1}{1 - p}$. Approximates averaging over an exponential ensemble of sub-networks.
- Early stopping
- Halt when validation loss stops improving. Implicit regularization equivalent to an $\ell_2$ ball in some linear settings.
- Data augmentation
- Learn invariances (flips, crops, MixUp, CutMix, noise). Essentially the most effective regularizer in computer vision.
- Label smoothing
- Replace the one-hot target with $1 - \epsilon$ on the true class and $\frac{\epsilon}{K - 1}$ elsewhere. Prevents overconfident logits; improves calibration.
Normalization
| Layer | Normalizes over | Typical use |
|---|---|---|
| BatchNorm | Batch dim, per channel | CNNs; requires large, stable batches. |
| LayerNorm | Feature dim, per example | Transformers, RNNs; batch-size agnostic. |
| GroupNorm | Groups of channels, per example | Small batches, detection, segmentation. |
| RMSNorm | Feature dim, no mean centring | Modern LLMs; cheaper, performs well. |
Loss functions
- MSE
- $\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$. MLE under Gaussian noise.
- MAE / Huber
- Robust to outliers; Huber is quadratic near zero, linear in the tail.
- Binary cross-entropy
- $-\frac{1}{n} \sum_i \big[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \big]$. MLE for Bernoulli.
- Categorical CE
- $-\sum_k y_k \log \hat{p}_k$. MLE for Categorical / softmax.
- Contrastive / InfoNCE
- $-\log \frac{\exp(s(x, x^+)/\tau)}{\sum_j \exp(s(x, x_j)/\tau)}$. Lower bound on mutual information; core of self-supervised learning.
Deep Learning: Architectures
Convolutional Networks (CNN)
A convolution layer applies a small shared kernel across spatial positions: $(x * w)[i] = \sum_k x[i + k]\, w[k]$. Parameter-sharing + translation equivariance are the inductive biases that make CNNs so efficient on images.
Output spatial size after a conv with kernel $k$, padding $p$, stride $s$ on input size $n$: $\big\lfloor \frac{n + 2p - k}{s} \big\rfloor + 1$. Receptive field grows linearly with depth for fixed $k$; dilations expand it exponentially.
Pooling (max, average) downsamples and builds local invariance. Modern architectures often replace pooling with strided convolutions.
Key designs
- ResNet. $h_{l+1} = h_l + F(h_l)$. Residual connections let gradients flow through identity paths — the reason networks can go to hundreds of layers without collapsing.
- Inception / MobileNet. Factorised / depthwise-separable convolutions: $k \times k$ conv → ($k \times k$ depthwise) + ($1 \times 1$ pointwise). Similar expressivity, a fraction of the FLOPs.
- U-Net. Encoder–decoder with skip connections at each resolution. Dominant for segmentation; the backbone of most diffusion models.
Recurrent Networks (RNN / LSTM / GRU)
Vanilla RNN: $h_t = \tanh\big(W_h h_{t-1} + W_x x_t + b\big)$. Backprop through time unrolls the recurrence; gradients flow through products of Jacobians, producing vanishing/exploding gradients on long sequences.
LSTM
Gated cell with explicit memory $c_t$:
$$f_t = \sigma(W_f [h_{t-1}, x_t]), \quad i_t = \sigma(W_i [h_{t-1}, x_t]), \quad o_t = \sigma(W_o [h_{t-1}, x_t]), \quad \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t]),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t).$$
The additive cell update is the key: gradients propagate through $c_t$ without repeated multiplication by weight Jacobians, taming vanishing gradients. GRU is a streamlined variant with 2 gates and no separate cell state.
Attention & Transformers
Scaled dot-product attention. Given queries $Q$, keys $K$, values $V$:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\Big(\frac{Q K^\top}{\sqrt{d_k}}\Big)\, V.$$
The $\sqrt{d_k}$ keeps the dot products at a scale where the softmax is not saturated. Complexity is $O(n^2)$ in sequence length $n$.
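The formula in a few lines of numpy (single head, no masking, for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k) similarity logits
    return softmax(scores, axis=-1) @ V       # convex combination of V's rows

rng = np.random.default_rng(0)
n, d = 6, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)

# Each output row is a weighted average of V's rows (weights sum to 1),
# so it lies inside V's coordinate-wise range.
assert out.shape == (n, d)
assert (out <= V.max(0) + 1e-9).all() and (out >= V.min(0) - 1e-9).all()
```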
Multi-head attention. Run $h$ heads in parallel with separate projections $W_i^Q, W_i^K, W_i^V$, concatenate the outputs, project with $W^O$. Each head learns a different relational pattern.
Transformer block. Pre-norm variant, the modern default:
```
x = x + MHA(LN(x))   # self-attention sub-layer
x = x + MLP(LN(x))   # feed-forward sub-layer, typically 4x width
```
Positional information. Attention is permutation-equivariant; inject order via sinusoidal embeddings, learned absolute positions, or — in modern LLMs — Rotary Position Embeddings (RoPE) applied to $Q$ and $K$.
Causal vs bidirectional. Decoder-only (GPT-style) masks future positions — used for autoregressive generation. Encoder (BERT-style) sees both directions — used for embedding / classification.
Generative Models
Autoencoders & VAEs
An autoencoder learns an identity map through a bottleneck, $x \to z \to \hat{x}$. A Variational AE places a latent prior $p(z)$ and an inference network $q_\phi(z \mid x)$, maximising the ELBO:
$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big).$$
Trained by reparameterisation: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
Generative Adversarial Networks
Minimax game between generator $G$ and discriminator $D$:
$$\min_G \max_D\ \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big].$$
Non-saturating generator loss, Wasserstein GAN with gradient penalty, spectral normalisation — all are stabilisations around this same objective.
Diffusion models
Forward process adds Gaussian noise: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. A network $\epsilon_\theta(x_t, t)$ is trained to predict the noise; sampling reverses the chain. Simple loss:
$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\big\|\, \epsilon - \epsilon_\theta(x_t, t) \,\big\|^2.$$
Variants: score-matching / SDE formulation, DDIM for deterministic sampling, classifier-free guidance for conditional generation.
Topics for Quantitative Research
Quantitative research inherits the full ML toolkit but operates under constraints foreign to most ML papers: low signal-to-noise, non-stationarity, non-i.i.d. samples, and economic costs for every false positive. The techniques below address these constraints directly.
Time Series
- Stationarity
- A process is (weakly) stationary if its mean and autocovariance are time-invariant. Most classical methods assume stationarity; in finance, price series are non-stationary, but returns are approximately so.
- ARMA(p,q)
- $x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t$. AR: last $p$ lags; MA: last $q$ innovations.
- ARIMA(p,d,q)
- Difference the series $d$ times to achieve stationarity, then fit ARMA.
- GARCH(1,1)
- $\sigma_t^2 = \omega + \alpha\, \epsilon_{t-1}^2 + \beta\, \sigma_{t-1}^2$. Captures volatility clustering — the single most robust empirical feature of financial returns.
- Cointegration
- Non-stationary $x_t, y_t$ may have a stationary linear combination $y_t - \beta x_t$. Foundation of pairs / statistical-arbitrage strategies.
Autocorrelation & tests
ACF $\rho_k = \operatorname{Corr}(x_t, x_{t-k})$; PACF is the partial autocorrelation at lag $k$, controlling for intermediate lags. Ljung–Box tests jointly whether the first $m$ autocorrelations are zero. ADF / KPSS test (non-)stationarity.
State-space models & the Kalman filter
Linear Gaussian state-space model:
$$x_t = A x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q); \qquad y_t = H x_t + v_t, \quad v_t \sim \mathcal{N}(0, R).$$
Kalman recursion alternates predict and update. It is the optimal linear estimator and, under Gaussian noise, the optimal estimator full stop. Used in quant for dynamic hedge ratios, time-varying betas, and signal smoothing.
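A scalar (random-walk) special case of the predict/update recursion in numpy, on synthetic data (illustrative only):

```python
import numpy as np

def kalman_1d(ys, q, r, x0=0.0, p0=1.0):
    """Scalar random-walk Kalman filter: x_t = x_{t-1} + w, y_t = x_t + v."""
    x, p, out = x0, p0, []
    for y in ys:
        p = p + q                     # predict: state variance grows by q
        k = p / (p + r)               # Kalman gain
        x = x + k * (y - x)           # update toward the observation
        p = (1 - k) * p
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
true = np.cumsum(rng.normal(0, 0.1, 500))        # latent random walk
obs = true + rng.normal(0, 1.0, 500)             # noisy observations
est = kalman_1d(obs, q=0.01, r=1.0)

# Filtering should beat using the raw observations.
assert np.mean((est - true) ** 2) < np.mean((obs - true) ** 2)
```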
Portfolio Theory & Risk
Mean–variance optimization
Minimum-variance portfolio with target return $\mu_\ast$:
$$\min_w\ \tfrac{1}{2}\, w^\top \Sigma w \quad \text{s.t.} \quad w^\top \mu = \mu_\ast,\ \ w^\top \mathbf{1} = 1.$$
Closed-form solution via Lagrange multipliers. Tangency (max-Sharpe) portfolio: $w^\star \propto \Sigma^{-1} \mu$.
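A numpy sketch with hypothetical inputs, checking that $w \propto \Sigma^{-1}\mu$ attains the maximum Sharpe $\sqrt{\mu^\top \Sigma^{-1} \mu}$:

```python
import numpy as np

mu = np.array([0.08, 0.05, 0.03])                 # hypothetical excess returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.02, 0.00],
                  [0.00, 0.00, 0.01]])

# Tangency weights are proportional to Sigma^{-1} mu; normalise to sum to 1.
raw = np.linalg.solve(Sigma, mu)
w = raw / raw.sum()

def sharpe(w):
    return (w @ mu) / np.sqrt(w @ Sigma @ w)

# Maximum Sharpe equals sqrt(mu^T Sigma^{-1} mu), and beats e.g. equal weights.
assert np.isclose(sharpe(w), np.sqrt(mu @ np.linalg.solve(Sigma, mu)))
assert sharpe(w) >= sharpe(np.ones(3) / 3)
```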
Risk & performance metrics
| Metric | Definition | Notes |
|---|---|---|
| Sharpe ratio | $\frac{\mathbb{E}[r] - r_f}{\sigma_r}$ | Annualize by $\sqrt{252}$ for daily. |
| Sortino | Uses downside deviation only | Penalises only negative volatility. |
| Information ratio | Active return / tracking error | Quality of active return vs benchmark. |
| Max drawdown | $\max_t \frac{\max_{s \le t} W_s - W_t}{\max_{s \le t} W_s}$ | Peak-to-trough loss. |
| VaR | $\alpha$-quantile of the loss distribution | Quantile; not subadditive. |
| CVaR / ES | $\mathbb{E}[\text{loss} \mid \text{loss} \ge \mathrm{VaR}_\alpha]$ | Coherent; average tail loss. |
Covariance estimation
The sample covariance is noisy when the number of assets $p$ is comparable to the number of observations $n$, and inverting it (as in MV optimization) amplifies that noise. Remedies: Ledoit–Wolf shrinkage $\hat{\Sigma} = \delta F + (1 - \delta) S$ toward a structured target $F$; factor-model covariance $\Sigma = B \Sigma_f B^\top + D$; imposing no-short-sale constraints on the minimum-variance problem acts as implicit regularization.
Backtesting: Hazards & Hygiene
- Look-ahead bias. Using data not yet available at the trade decision time. Easily hidden in resampling, normalization, or target encodings computed on the full sample.
- Survivorship bias. Universe includes only firms that exist today, not the dead ones. Systematically inflates returns.
- Selection bias. Data-mining a strategy from a large search space without correcting for multiplicity. See deflated Sharpe ratio and the Bailey–López de Prado corrections.
- Transaction costs & slippage. A signal that is alpha gross of costs can easily be zero or negative net of them.
- Regime / non-stationarity. A backtest spanning one regime cannot validate a strategy in another. Always report performance per sub-period.
- Purged / embargoed cross-validation. Remove observations whose label windows overlap training data; embargo a gap after each test fold to prevent leakage of serial correlation.
Time-series cross-validation
Forward-chaining expanding-window CV respects causality. For overlapping labels (e.g. forward $h$-day returns), apply purging — remove training samples whose label window intersects the test window — and embargo — a buffer of length $\ge h$ after the test set.
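A sketch of purged, embargoed splits (hypothetical helper, illustrative only; labels here are assumed to span `h` samples forward):

```python
# Drop training samples whose label window [j, j+h) touches the test block,
# and embargo a further h samples after it.
def purged_splits(n, n_folds, h):
    fold = n // n_folds
    for i in range(1, n_folds):
        test = range(i * fold, min((i + 1) * fold, n))
        train = [j for j in range(n)
                 if j + h <= test.start     # label window ends before the test block
                 or j >= test.stop + h]     # starts after the embargo
        yield train, list(test)

for train, test in purged_splits(n=100, n_folds=5, h=5):
    # no training label window overlaps the test window
    assert all(j + 5 <= test[0] or j >= test[-1] + 1 + 5 for j in train)
```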
ML for Alpha — What Actually Matters
- Signal-to-noise is the enemy. Daily returns have SNR orders of magnitude lower than images or text. Capacity of the model must match the information in the data — enormous networks overfit noise.
- Prefer rank-based and robust losses. Squared error is dominated by outliers. Quantile, Huber, and rank-based losses (Spearman surrogates) typically generalise better for return prediction.
- Feature stability beats peak accuracy. A signal that works moderately across sub-periods is more valuable than one that is excellent in one and useless in another.
- Combine linear + non-linear. Linear factors remain the bulk of explanatory power; trees / NNs add interactions. Stack or residualise.
- Bet sizing is orthogonal to signal. Kelly (or a fractional Kelly) links expected edge and variance to position size. A good signal sized poorly is a loss-making strategy.
- Evaluate economic utility, not just statistical fit. Out-of-sample Sharpe, turnover, capacity, drawdown, and sensitivity to costs — in that order.
Exercises & Solutions
Each exercise is labelled by topic and difficulty (◆ introductory, ◆◆ intermediate, ◆◆◆ advanced). Solutions are hidden by default — attempt the problem before revealing the worked argument. The point of these is not the answer itself but the chain of reasoning; they are chosen to exercise canonical techniques you will reuse for the rest of your career.
Derive the bias–variance decomposition.
Let $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$, $\operatorname{Var}(\epsilon) = \sigma^2$, and $\epsilon$ independent of the training set. For a learned predictor $\hat{f}$ (random over training sets), show that
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \sigma^2 + \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \operatorname{Var}\big(\hat{f}(x)\big).$$
Reveal solution
Let $\bar{f}(x) = \mathbb{E}[\hat{f}(x)]$. Expand:
$$\mathbb{E}\big[(y - \hat{f})^2\big] = \mathbb{E}\big[(f + \epsilon - \hat{f})^2\big] = \mathbb{E}\big[(f - \hat{f})^2\big] + \sigma^2.$$
Because $\epsilon$ is independent of $\hat{f}$ with zero mean, the cross terms vanish. Add and subtract $\bar{f}$ inside the square:
$$\mathbb{E}\big[(f - \hat{f})^2\big] = \mathbb{E}\big[(f - \bar{f} + \bar{f} - \hat{f})^2\big] = (f - \bar{f})^2 + \mathbb{E}\big[(\bar{f} - \hat{f})^2\big].$$
The cross term vanishes because $\mathbb{E}[\hat{f} - \bar{f}] = 0$. The two remaining terms are Bias$^2$ and Variance, giving the claim.
Derive the closed form for ridge regression.
For $X \in \mathbb{R}^{n \times p}$, $y \in \mathbb{R}^n$, $\lambda > 0$, find $\beta$ minimising $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$. Show the solution is always well-defined when $\lambda > 0$.
Reveal solution
Expand the objective: $J(\beta) = y^\top y - 2 \beta^\top X^\top y + \beta^\top X^\top X \beta + \lambda \beta^\top \beta$. Using the matrix-calculus identities $\nabla_\beta (\beta^\top a) = a$ and $\nabla_\beta (\beta^\top A \beta) = 2 A \beta$ for symmetric $A$:
$$\nabla J = -2 X^\top y + 2 (X^\top X + \lambda I)\, \beta = 0 \ \Rightarrow\ \hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y.$$
$X^\top X$ is PSD with eigenvalues $\ge 0$; adding $\lambda I$ shifts them to $\ge \lambda > 0$, so the matrix is strictly positive definite and hence invertible. The Hessian $2(X^\top X + \lambda I) \succ 0$ confirms the critical point is a global minimum.
Gradient and Hessian of the logistic loss; prove convexity.
Let $\sigma(z) = \frac{1}{1 + e^{-z}}$ and $\mathcal{L}(w) = -\sum_i \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big]$ with $p_i = \sigma(w^\top x_i)$. Derive $\nabla \mathcal{L}$ and $\nabla^2 \mathcal{L}$ and show the Hessian is PSD.
Reveal solution
Key fact: $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$. For a single point with $z = w^\top x$ and $p = \sigma(z)$:
$$\frac{\partial \ell}{\partial z} = p - y.$$
Chain through $z = w^\top x$ and sum:
$$\nabla_w \mathcal{L} = \sum_i (p_i - y_i)\, x_i = X^\top (p - y).$$
For the Hessian, differentiate again: $\frac{\partial p_i}{\partial w} = p_i (1 - p_i)\, x_i$, giving
$$\nabla^2 \mathcal{L} = \sum_i p_i (1 - p_i)\, x_i x_i^\top = X^\top S X, \qquad S = \operatorname{diag}\big(p_i (1 - p_i)\big).$$
Since $S$ has non-negative entries, $X^\top S X$ is PSD: for any $v$, $v^\top X^\top S X v = \|S^{1/2} X v\|^2 \ge 0$. So $\mathcal{L}$ is convex.
Derivative of softmax + cross-entropy.
Let $p = \operatorname{softmax}(z)$ and $\mathcal{L} = -\sum_k y_k \log p_k$ with one-hot $y$. Show $\frac{\partial \mathcal{L}}{\partial z} = p - y$.
Reveal solution
First derive $\frac{\partial p_i}{\partial z_j} = p_i (\delta_{ij} - p_j)$. Writing $p_i = e^{z_i} / \sum_k e^{z_k}$:
$$\frac{\partial p_i}{\partial z_j} = \frac{\delta_{ij}\, e^{z_i} \sum_k e^{z_k} - e^{z_i} e^{z_j}}{\big(\sum_k e^{z_k}\big)^2} = p_i (\delta_{ij} - p_j).$$
Now,
$$\frac{\partial \mathcal{L}}{\partial z_j} = -\sum_k \frac{y_k}{p_k}\, p_k (\delta_{kj} - p_j) = -y_j + p_j \sum_k y_k.$$
Since $y$ is one-hot, $\sum_k y_k = 1$, so $\frac{\partial \mathcal{L}}{\partial z} = p - y$. This is why the softmax+CE gradient has such a clean implementation — it is literally the prediction error.
PCA as an eigenvalue problem; relate to SVD.
Let $X \in \mathbb{R}^{n \times d}$ be mean-centred. Show that the first principal component — the unit direction $w$ maximising $\operatorname{Var}(X w)$ — is the top right-singular vector $v_1$ of $X$, and the explained variance equals $\sigma_1^2 / n$.
Reveal solution
Because $X$ is centred, $\operatorname{Var}(X w) = \frac{1}{n}\, w^\top X^\top X w$. The maximisation is the Rayleigh quotient problem; its maximum equals the largest eigenvalue of $\frac{1}{n} X^\top X$, attained at the corresponding eigenvector. Write $X = U \Sigma V^\top$ (SVD). Then $X^\top X = V \Sigma^2 V^\top$, so the top eigenvector of $X^\top X$ is $v_1$ (first column of $V$) with eigenvalue $\sigma_1^2$. Hence the first PC is $v_1$ and its explained variance is $\sigma_1^2 / n$.
Backpropagation through a 2-layer MLP.
Consider the network $z_1 = W_1 x + b_1$, $h = \operatorname{ReLU}(z_1)$, $z_2 = W_2 h + b_2$, $p = \operatorname{softmax}(z_2)$, with cross-entropy loss against one-hot $y$. Write all gradients in closed form.
Reveal solution
Output layer. By Exercise 4, $\delta_2 = \frac{\partial \mathcal{L}}{\partial z_2} = p - y$. Then
$$\nabla_{W_2} \mathcal{L} = \delta_2\, h^\top, \qquad \nabla_{b_2} \mathcal{L} = \delta_2.$$
Hidden layer. Propagate back: $\frac{\partial \mathcal{L}}{\partial h} = W_2^\top \delta_2$. Through the ReLU, $\delta_1 = \big(W_2^\top \delta_2\big) \odot \mathbf{1}[z_1 > 0]$ (elementwise indicator, zero on the inactive units):
$$\nabla_{W_1} \mathcal{L} = \delta_1\, x^\top, \qquad \nabla_{b_1} \mathcal{L} = \delta_1.$$
Note the dying-ReLU pathology is visible here: once $z_1$ is negative, $\mathbf{1}[z_1 > 0] = 0$ on that unit, and it never receives a gradient.
KL divergence is not symmetric.
Give concrete $p, q$ on $\{0, 1\}$ such that $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$. Then interpret the difference between forward and reverse KL when fitting $q$ to a multi-modal $p$.
Reveal solution
Take $p = (\tfrac12, \tfrac12)$ and $q = (\tfrac34, \tfrac14)$. Then
$$D_{\mathrm{KL}}(p \,\|\, q) = \tfrac12 \log\tfrac{1/2}{3/4} + \tfrac12 \log\tfrac{1/2}{1/4} = \tfrac12 \log\tfrac{4}{3} \approx 0.144, \qquad D_{\mathrm{KL}}(q \,\|\, p) = \tfrac34 \log\tfrac{3}{2} + \tfrac14 \log\tfrac{1}{2} \approx 0.131.$$
Not equal. Interpretation. Forward KL $D_{\mathrm{KL}}(p \,\|\, q)$ is mean-seeking / mass-covering: $q$ must put probability wherever $p$ does, so when $q$ is too simple it spreads across all modes of $p$. Reverse KL $D_{\mathrm{KL}}(q \,\|\, p)$ is mode-seeking: $q$ collapses onto one mode, because any mass where $p \approx 0$ is catastrophically penalised. Variational inference minimises reverse KL — hence VI's famous tendency to be over-confident.
Minimising cross-entropy is maximum likelihood.
Show that, for a classifier outputting $q_\theta(y \mid x)$, minimising empirical cross-entropy is equivalent to maximum likelihood and to minimising $D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta)$, where $\hat{p}$ is the empirical distribution.
Reveal solution
Empirical cross-entropy is $-\frac{1}{n} \sum_i \log q_\theta(y_i \mid x_i)$. Minimising this is exactly maximising the log-likelihood $\sum_i \log q_\theta(y_i \mid x_i)$. Next, letting $\hat{p}$ be the empirical distribution,
$$D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta) = \sum \hat{p} \log \frac{\hat{p}}{q_\theta} = -H(\hat{p}) - \sum \hat{p} \log q_\theta.$$
The first term is $-H(\hat{p})$, independent of $\theta$; the second is exactly the cross-entropy loss. So argmin KL = argmin CE = argmax likelihood.
Dropout as an implicit ensemble.
A network with $N$ dropout-able units applies independent Bernoulli masks. Explain why dropout approximates averaging over $2^N$ sub-networks, and why the weight rescaling at test time is required.
Reveal solution
Each training step samples a binary mask $m$ with $m_j \sim \operatorname{Bernoulli}(1 - p)$ independently — one of $2^N$ possible sub-networks. The training objective is $\mathbb{E}_m[\mathcal{L}(\theta, m)]$, and SGD with single mask samples is an unbiased estimator of its gradient. Thus dropout optimizes an ensemble loss.
At test time we want the mean prediction $\mathbb{E}_m[f(x; \theta, m)]$. A cheap approximation is to deactivate the mask and scale each activation by $1 - p$ — because each unit was "on" with probability $1 - p$ during training, its expected contribution to downstream linear combinations is $1 - p$ times its weight. In practice this is implemented as "inverted dropout": scale by $\frac{1}{1 - p}$ during training, identity at test time. For a single linear layer, this is exact; for non-linear networks, it is a principled approximation to the geometric mean over sub-networks.
Why does scaled dot-product attention divide by $\sqrt{d_k}$?
Queries and keys have components with approximately zero mean and unit variance. Show that without the factor, the variance of $q^\top k$ grows with $d_k$, and argue why this hurts softmax training.
Reveal solution
Assume components $q_i, k_i$ are independent with mean 0 and variance 1. Then $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and, by independence and $\operatorname{Var}(q_i k_i) = 1$, variance $d_k$. So the standard deviation grows as $\sqrt{d_k}$.
If the logits have large magnitude, softmax concentrates its mass on the single largest entry. The derivative $\frac{\partial p_i}{\partial z_j} = p_i (\delta_{ij} - p_j)$ then collapses to near-zero across the board — vanishing gradients. Dividing by $\sqrt{d_k}$ rescales the variance back to $1$, keeping the softmax in its well-behaved regime across model sizes.
Why does BatchNorm behave differently at train and test time?
Describe the exact computation of BatchNorm during training vs inference, explain why the two must differ, and identify a failure mode at small batch size.
Reveal solution
Training. For each feature (channel), compute the batch mean $\mu_B$ and variance $\sigma_B^2$, normalise $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, then affine-transform with learned $\gamma, \beta$. The batch statistics couple the examples in a batch and introduce stochasticity (a mild regularizer).
Inference. Use running-average estimates $\hat{\mu}, \hat{\sigma}^2$ accumulated during training (typically via exponential moving average). This must differ from training because (i) at inference, batches may be of size 1, making batch statistics meaningless, and (ii) predictions would otherwise depend on which other examples happen to be in the batch — unacceptable.
Failure mode. Small batches (say, 2–8) give high-variance $\mu_B, \sigma_B^2$, so training becomes noisy and the running averages accumulated for inference become unreliable — the classic "BN breaks for detection with large backbones". Remedies: GroupNorm, LayerNorm, or SyncBN across devices.
Why do vanilla RNN gradients vanish, and how does the LSTM cell fix it?
Derive the gradient of the loss at time $T$ w.r.t. a hidden state at an earlier time $t < T$ for a vanilla RNN and contrast the form with the LSTM cell-state path.
Reveal solution
Vanilla RNN. With $h_t = \tanh(W_h h_{t-1} + W_x x_t)$,
$$\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\big(1 - h_t^2\big)\, W_h.$$
The gradient through $k$ steps is the product $\prod_{s=1}^{k} \operatorname{diag}(1 - h_{t+s}^2)\, W_h$. Since $|\tanh'| \le 1$ and typically $< 1$, and $W_h$ has a spectral radius that is either $< 1$ (vanishing) or $> 1$ (exploding), the norm of this product tends to zero or infinity exponentially in $k$. Long-range dependencies cannot be learned.
LSTM cell. The key is the cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. Differentiating,
$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t).$$
Provided the forget gate stays close to 1, the product $\prod_s \operatorname{diag}(f_s)$ does not vanish, and the gradient path through the cell state is an (almost) straight linear pipe — comparable to a residual connection. This is what enables long-range dependencies.
Derive the evidence lower bound.
For latent-variable model $p_\theta(x, z)$ and any distribution $q_\phi(z \mid x)$, show
$$\log p_\theta(x) = \mathbb{E}_{q_\phi}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big).$$
Conclude the first two terms form a lower bound on $\log p_\theta(x)$.
Reveal solution
Start from $\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$. Multiply and divide by $q_\phi(z \mid x)$:
$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi}\Big[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\Big] \ \ge\ \mathbb{E}_{q_\phi}\Big[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\Big] = \mathrm{ELBO},$$
by Jensen. Alternatively, without Jensen, write $\log p_\theta(x) = \mathbb{E}_{q_\phi}[\log p_\theta(x)]$ (the inner term is constant in $z$) and use $p_\theta(x) = \frac{p_\theta(x, z)}{p_\theta(z \mid x)}$. Insert $q_\phi$ and regroup:
$$\log p_\theta(x) = \mathbb{E}_{q_\phi}\Big[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\Big] + D_{\mathrm{KL}}\big(q_\phi \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi \,\|\, p(z)\big) + D_{\mathrm{KL}}\big(q_\phi \,\|\, p_\theta(z \mid x)\big).$$
Since the final KL is non-negative, the ELBO is a lower bound, tight iff $q_\phi = p_\theta(z \mid x)$. Maximising the ELBO jointly in $(\theta, \phi)$ thus maximises a bound on data likelihood and drives $q_\phi$ towards the true posterior.
Why the bias correction in Adam?
In Adam, $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$, with $m_0 = 0$. Show that if $\mathbb{E}[g_t] = g$ is constant, $\mathbb{E}[m_t] = (1 - \beta_1^t)\, g$. Hence explain why we divide by $1 - \beta_1^t$.
Reveal solution
Unrolling the recursion:
$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, g_i.$$
Taking expectation with $\mathbb{E}[g_i] = g$:
$$\mathbb{E}[m_t] = (1 - \beta_1)\, g \sum_{i=1}^{t} \beta_1^{t-i} = (1 - \beta_1)\, g\, \frac{1 - \beta_1^t}{1 - \beta_1} = (1 - \beta_1^t)\, g.$$
Early in training (small $t$), $1 - \beta_1^t$ is small, so $m_t$ underestimates $g$ — the moment is biased toward zero because of the zero initialisation. Dividing by $1 - \beta_1^t$ produces an unbiased estimate $\hat{m}_t$. The same logic applies to $v_t$ with $\beta_2$. Without this correction Adam would take unnecessarily small steps in its first few hundred iterations.
Annualise a daily Sharpe ratio.
A strategy's daily excess returns have mean $\mu_d$ and standard deviation $\sigma_d$. Assuming 252 trading days and i.i.d. returns, compute the daily and annual Sharpe ratios. Then discuss what breaks if returns are autocorrelated.
Reveal solution
Daily Sharpe $SR_d = \mu_d / \sigma_d$. Under i.i.d. summation, annual mean $= 252\, \mu_d$ and annual variance $= 252\, \sigma_d^2$, so annual
$$SR_{\text{ann}} = \frac{252\, \mu_d}{\sqrt{252}\, \sigma_d} = \sqrt{252}\; SR_d.$$
With autocorrelation. $\operatorname{Var}\big(\sum_t r_t\big) = \sum_t \operatorname{Var}(r_t) + 2 \sum_{s < t} \operatorname{Cov}(r_s, r_t)$. Positive autocorrelation inflates the variance; naive $\sqrt{252}$ scaling overstates the Sharpe. Use an effective sample size or the Newey–West standard error and aggregate at an appropriate horizon. Strategies with strong serial correlation (e.g. trend-following on a single asset) are routinely reported with overstated Sharpes.
Derive the tangency (max-Sharpe) portfolio.
Given excess returns $\mu$, covariance $\Sigma$, find $w$ maximising $SR(w) = \frac{w^\top \mu}{\sqrt{w^\top \Sigma w}}$.
Reveal solution
SR is invariant under $w \to c\, w$ for $c > 0$, so maximising SR is equivalent to $\min_w \frac12 w^\top \Sigma w$ subject to $w^\top \mu = 1$ (a convenient normalisation). Lagrangian $\frac12 w^\top \Sigma w - \lambda (w^\top \mu - 1)$:
$$\Sigma w = \lambda \mu \ \Rightarrow\ w \propto \Sigma^{-1} \mu.$$
Applying the normalization $w^\top \mu = 1$: $\lambda = (\mu^\top \Sigma^{-1} \mu)^{-1}$, so $w = \frac{\Sigma^{-1} \mu}{\mu^\top \Sigma^{-1} \mu}$. With a budget constraint $\mathbf{1}^\top w = 1$:
$$w^\star = \frac{\Sigma^{-1} \mu}{\mathbf{1}^\top \Sigma^{-1} \mu}.$$
The resulting maximum Sharpe is $\sqrt{\mu^\top \Sigma^{-1} \mu}$.
Caveat. In practice, $\mu$ and $\Sigma$ must be estimated; estimation error in $\mu$ dominates, and $\Sigma^{-1}$ amplifies noise. This is why out-of-sample MV portfolios routinely underperform equal-weighted — the "Markowitz optimization enigma". Shrink $\hat{\mu}$, use a factor-model $\hat{\Sigma}$, or constrain the problem.
Spot the leakage.
A researcher fits a daily return predictor by (a) standardising each feature using its full-sample mean and variance; (b) shuffling the rows; (c) running 5-fold CV. Performance is stellar in CV and awful live. Identify at least three leakages and propose a corrected pipeline.
Reveal solution
Leakages.
- Full-sample standardisation. The mean and variance use future data. Correct: fit scaler on training fold only, transform test fold with those statistics.
- Shuffling. Breaks temporal ordering so that training folds contain days after the test fold. Correct: use forward-chaining (expanding or rolling window) CV.
- Label overlap. If the target is a forward $h$-day return, consecutive samples share label windows — training samples on day $t$ leak information about test samples within $h$ days of $t$. Correct: purge training rows whose label window overlaps the test set, and embargo a gap of at least $h$ days after the test set before resuming training.
Additional checks. Ensure feature engineering (winsorization, rank transforms, target encoding) is performed within fold. Ensure universe selection at time uses only constituents that were actually in the index at time (survivorship). Finally, evaluate sensitivity to transaction costs and report performance per sub-period.
One-dimensional Kalman filter.
Scalar random walk $x_t = x_{t-1} + w_t$, $w_t \sim \mathcal{N}(0, q)$; observations $y_t = x_t + v_t$, $v_t \sim \mathcal{N}(0, r)$. Write the predict and update steps and show the filter reduces to an exponentially-weighted moving average in steady state.
Reveal solution
Predict. $\hat{x}_{t|t-1} = \hat{x}_{t-1}$, $P_{t|t-1} = P_{t-1} + q$.
Update. Kalman gain $K_t = \frac{P_{t|t-1}}{P_{t|t-1} + r}$. Posterior mean $\hat{x}_t = \hat{x}_{t|t-1} + K_t (y_t - \hat{x}_{t|t-1})$ and variance $P_t = (1 - K_t)\, P_{t|t-1}$.
Steady state. Setting $P_t = P_{t-1} = P$ and imposing $P = \frac{(P + q)\, r}{P + q + r}$ gives the quadratic $P^2 + qP - qr = 0$, whose positive root is
$$P = \frac{-q + \sqrt{q^2 + 4qr}}{2}.$$
Then $\hat{x}_t = (1 - K)\, \hat{x}_{t-1} + K\, y_t$ with constant $K = \frac{P + q}{P + q + r}$ — precisely an EWMA with smoothing parameter $K$, governed by the signal-to-noise ratio $q / r$. This is why, in practice, if you cannot calibrate a full state-space model, a well-chosen EWMA already captures most of the benefit.
Multiple testing and the deflated Sharpe.
You tried $N$ strategies and selected the one with the highest in-sample Sharpe, over $T$ days. Under the null of zero true Sharpe, estimate the expected maximum Sharpe to gauge whether the selected Sharpe is real.
Reveal solution
Under the null, an estimated (daily) Sharpe over $T$ i.i.d. Gaussian returns is approximately $\mathcal{N}(0, 1/T)$ — standard deviation $1/\sqrt{T}$.
For $N$ independent null Sharpes, the expected maximum is approximately (using the Gaussian extreme-value result $\mathbb{E}[\max_N Z] \approx \sqrt{2 \ln N}$ for standard normals):
$$\mathbb{E}\big[\max_i \widehat{SR}_i\big] \approx \sqrt{\frac{2 \ln N}{T}}.$$
For $N = 1000$, $\sqrt{2 \ln N} \approx 3.7$: even pure noise produces an in-sample winner several standard errors above zero, and after annualising ($\times \sqrt{252}$) this can rival or exceed an apparently attractive observed Sharpe. A selected strategy that does not clear this bar is consistent with what pure noise would produce over 1000 tries. This is the intuition behind the deflated Sharpe ratio (Bailey & López de Prado): subtract the expected maximum-under-null before claiming significance. Always report how many models / hyperparameter combinations were tried.
Parameter and FLOP counts for a Transformer layer.
Consider a single decoder-only Transformer layer with model dimension $d$, $h$ attention heads (each of size $d/h$), and feed-forward width $4d$. Count the parameters and the per-token FLOPs at context length $n$.
Reveal solution
Parameters.
- Projections $W^Q, W^K, W^V, W^O$, each $d \times d$: $4d^2$.
- MLP: $d \times 4d$ and $4d \times d$: $8d^2$.
- LayerNorms + biases: $O(d)$, negligible.
Total $\approx 12 d^2$ per layer; for $L$ layers, $\approx 12 L d^2$. A 12-layer model with $d = 768$: $12 \times 12 \times 768^2 \approx 85$M params plus embeddings.
Per-token FLOPs. Dominant costs, measuring a multiply-add as 2 FLOPs:
- Q/K/V/O projections: $4 \times 2 d^2 = 8 d^2$ FLOPs per token.
- Attention scores $Q K^\top$: $2 n d$ per token (each of $n$ keys, a $d$-dim dot product).
- Weighted sum of values: $2 n d$ per token.
- MLP: $16 d^2$ per token.
Total $\approx 24 d^2 + 4 n d$ per token. Attention becomes the bottleneck once $n \gtrsim 6 d$ — the rationale for sparse / linear / flash-attention variants and for keeping context tractable. The frequently-quoted "$6N$ training FLOPs per token" (with $N$ the parameter count) is this same arithmetic, generalised.
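The arithmetic as a quick sanity check in code (GPT-2-small-like sizes assumed for illustration):

```python
# Per-layer parameter and per-token FLOP counts for a decoder-only layer
# with model dim d, FFN width 4d, at context length n.
d, n = 768, 1024

params_attn = 4 * d * d           # W_Q, W_K, W_V, W_O, each d x d
params_mlp = 2 * 4 * d * d        # d -> 4d and 4d -> d
params_layer = params_attn + params_mlp
assert params_layer == 12 * d * d

flops_proj = 2 * params_attn      # multiply-add counted as 2 FLOPs
flops_scores = 2 * n * d          # Q K^T, per token
flops_values = 2 * n * d          # attention-weighted sum of V, per token
flops_mlp = 2 * params_mlp
flops_token = flops_proj + flops_scores + flops_values + flops_mlp
assert flops_token == 24 * d * d + 4 * n * d
```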
That is the end of the cheatsheet. Bring these ideas to your next problem; rebuild them from scratch when you doubt them.