A Companion Volume · Revised Edition

Probability & Statistics
for Machine Learning and Quantitative Research

A compact field manual — the ideas most often invoked in research papers, interviews, and late-night whiteboard sessions — followed by exercises to be worked with pen in hand.

Cheatsheet · 15 Exercises · Worked Solutions
Part the First

The Cheatsheet

§ I · Foundations of Probability

A probability space is a triple $(\Omega, \mathcal{F}, P)$: a sample space $\Omega$ of outcomes, a $\sigma$-algebra $\mathcal{F}$ of admissible events, and a measure $P$ assigning each event a number in $[0, 1]$. The structure looks abstract but it buys us two things: a rigorous notion of "event" (so we can talk about countable unions without contradictions) and the ability to condition cleanly.

Axioms

  • $P(A) \ge 0$ for every $A \in \mathcal{F}$.
  • $P(\Omega) = 1$.
  • For disjoint $A_1, A_2, \ldots$: $P\!\left(\bigcup_i A_i\right) = \sum_i P(A_i)$.

Conditioning and independence

When $P(B) \gt 0$, $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$. Two events are independent iff $P(A \cap B) = P(A)\, P(B)$, equivalently $P(A \mid B) = P(A)$. Independence is an assumption — never an observation.

The two workhorses

Law of total probability
For a partition $\{B_i\}$ of $\Omega$: $\ P(A) = \sum_i P(A \mid B_i)\, P(B_i)$.
Bayes' theorem
$P(B \mid A) = \dfrac{P(A \mid B)\, P(B)}{P(A)} = \dfrac{P(A \mid B)\, P(B)}{\sum_i P(A \mid B_i)\, P(B_i)}$.
Chain rule
$P(A_1 \cap \cdots \cap A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$.
In machine learning, Bayes' theorem is the bloodstream: it converts likelihoods (what the model says about data) into posteriors (what the data says about the model).
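The update is mechanical enough to be worth seeing as code. A minimal Python sketch with a two-event partition standing in for $\{B_i\}$; the base rate and error rates below are invented for illustration:

```python
def posterior(prior, p_data_given_h1, p_data_given_h0):
    """Bayes' theorem, with the law of total probability in the denominator."""
    evidence = p_data_given_h1 * prior + p_data_given_h0 * (1 - prior)
    return p_data_given_h1 * prior / evidence

# Assumed illustration values: 1% prior, 95% hit rate, 5% false-alarm rate.
print(posterior(0.01, 0.95, 0.05))
```

Even a strong likelihood ratio is tamed by a small prior, which is the theme of Exercise 01.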

§ II · Random Variables

A random variable $X: \Omega \to \mathbb{R}$ is a measurable function. In practice we never care about $\Omega$ itself — we work with the induced distribution of $X$ through its CDF $F_X(x) = P(X \le x)$, which determines every probabilistic statement about $X$.

Discrete and continuous random variables in parallel.
Quantity | Discrete | Continuous
Probability function | PMF $p(x) = P(X = x)$ | PDF $f(x)$, with $P(X \in A) = \int_A f$
Normalization | $\sum_x p(x) = 1$ | $\int f(x)\, dx = 1$
Expectation | $\mathbb{E}[X] = \sum_x x\, p(x)$ | $\mathbb{E}[X] = \int x\, f(x)\, dx$
LOTUS | $\mathbb{E}[g(X)] = \sum_x g(x)\, p(x)$ | $\mathbb{E}[g(X)] = \int g(x)\, f(x)\, dx$

Moments and the moment-generating function

The $k$-th moment is $\mu_k = \mathbb{E}[X^k]$; the variance $\sigma^2 = \mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. The moment-generating function $M_X(t) = \mathbb{E}[e^{tX}]$, when it exists in a neighborhood of $0$, determines the distribution uniquely and satisfies $M_X^{(k)}(0) = \mu_k$. For sums of independents, $M_{X+Y}(t) = M_X(t)\, M_Y(t)$.
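Both properties can be checked numerically. A small sketch, assuming the standard normal where $M_X(t) = e^{t^2/2}$: a central second difference at $0$ recovers $\mu_2$, and the product rule for independent sums holds exactly:

```python
import math

def mgf_normal(t, mu=0.0, var=1.0):
    # MGF of N(mu, var): exp(mu*t + var*t^2/2)
    return math.exp(mu * t + 0.5 * var * t * t)

# M''(0) = E[X^2]: approximate with a central second difference.
h = 1e-3
second_moment = (mgf_normal(h) - 2 * mgf_normal(0.0) + mgf_normal(-h)) / h**2
print(second_moment)  # close to 1 = E[X^2] for N(0, 1)

# Independent sum: N(0,1) + N(0,1) = N(0,2), so M_X(t) M_Y(t) = M_{X+Y}(t).
t = 0.7
print(mgf_normal(t) * mgf_normal(t), mgf_normal(t, var=2.0))
```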

Core identities

Linearity
$\mathbb{E}[aX + bY + c] = a\, \mathbb{E}[X] + b\, \mathbb{E}[Y] + c$ — always, no independence needed.
Variance of an affine map
$\mathrm{Var}(aX + b) = a^2\, \mathrm{Var}(X)$.
Variance of a sum
$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.
Change of variables
If $Y = g(X)$ with $g$ smooth and monotone, $f_Y(y) = f_X(g^{-1}(y))\, \bigl|\tfrac{d}{dy} g^{-1}(y)\bigr|$.
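The change-of-variables formula is easy to misapply; a quick numerical cross-check helps. A sketch assuming $X \sim$ Exp$(1)$ and $Y = X^2$ (monotone on $[0, \infty)$), comparing the formula's density against the numerical derivative of the exact CDF $F_Y(y) = 1 - e^{-\sqrt{y}}$:

```python
import math

def f_Y(y):
    # Change of variables: f_X(g^{-1}(y)) * |d g^{-1}/dy| with g(x) = x^2.
    r = math.sqrt(y)
    return math.exp(-r) / (2.0 * r)

def F_Y(y):
    # Direct CDF: P(X^2 <= y) = P(X <= sqrt(y)) = 1 - exp(-sqrt(y)).
    return 1.0 - math.exp(-math.sqrt(y))

y, h = 2.0, 1e-6
numeric = (F_Y(y + h) - F_Y(y - h)) / (2 * h)
print(f_Y(y), numeric)  # the two agree
```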

§ III · A Menagerie of Distributions

Most of applied probability is the fluent recognition of a handful of distributional shapes. The table below lists the ones worth knowing by heart — their parameters, first two moments, and the situations in which they naturally arise.

Discrete distributions.
Name | PMF | Mean | Variance | Arises as
Bernoulli$(p)$ | $p^x (1-p)^{1-x},\ x\in\{0,1\}$ | $p$ | $p(1-p)$ | A single coin flip
Binomial$(n, p)$ | $\binom{n}{x} p^x (1-p)^{n-x}$ | $np$ | $np(1-p)$ | Sum of $n$ Bernoullis
Geometric$(p)$ | $(1-p)^{x-1} p,\ x\ge 1$ | $1/p$ | $(1-p)/p^2$ | Trials until first success
Poisson$(\lambda)$ | $\dfrac{\lambda^x e^{-\lambda}}{x!}$ | $\lambda$ | $\lambda$ | Rare events in fixed interval
Negative Binomial$(r, p)$ | $\binom{x-1}{r-1} p^r (1-p)^{x-r}$ | $r/p$ | $r(1-p)/p^2$ | Trials until $r$-th success
Continuous distributions.
Name | PDF | Mean | Variance | Arises as
Uniform$(a, b)$ | $\dfrac{1}{b-a}$ | $\tfrac{a+b}{2}$ | $\tfrac{(b-a)^2}{12}$ | "No information" on an interval
Normal$(\mu, \sigma^2)$ | $\dfrac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ | Sums of many small effects (CLT)
Exponential$(\lambda)$ | $\lambda e^{-\lambda x},\ x \ge 0$ | $1/\lambda$ | $1/\lambda^2$ | Memoryless waiting time
Gamma$(k, \theta)$ | $\dfrac{x^{k-1} e^{-x/\theta}}{\theta^k \Gamma(k)}$ | $k\theta$ | $k\theta^2$ | Sum of $k$ Exp$(1/\theta)$s
Beta$(\alpha, \beta)$ | $\dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ | $\tfrac{\alpha}{\alpha+\beta}$ | $\tfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | Conjugate prior for Bernoulli
Chi-squared $\chi^2_k$ | Gamma$(k/2, 2)$ density | $k$ | $2k$ | $\sum_{i=1}^k Z_i^2$ with $Z_i \sim N(0,1)$
Student-$t_\nu$ | $\propto \left(1 + \tfrac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$ | $0$ ($\nu\gt 1$) | $\tfrac{\nu}{\nu-2}$ ($\nu\gt 2$) | Heavy-tailed $N/\sqrt{\chi^2/\nu}$

Multivariate Normal

$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with density $f(\mathbf{x}) = (2\pi)^{-d/2}|\boldsymbol{\Sigma}|^{-1/2} \exp\!\bigl(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\!\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\bigr)$. Two key facts carry most of the weight in ML:

  • Affine closure. If $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $A\mathbf{X} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b},\, A\boldsymbol{\Sigma} A^{\!\top})$.
  • Marginals & conditionals are Gaussian — closed forms in § IV.
For jointly Gaussian variables, and only for them, zero correlation implies independence. Don't transport this intuition elsewhere.

§ IV · Joint, Marginal, Conditional

For a pair $(X, Y)$ the joint density $f_{X,Y}(x, y)$ generates the marginals by integration and the conditionals by division:

$$f_X(x) = \int f_{X,Y}(x, y)\, dy, \qquad f_{Y \mid X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

$X$ and $Y$ are independent iff $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$.

Covariance and correlation

Covariance
$\mathrm{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mu_X \mu_Y$.
Correlation
$\rho(X, Y) = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y) \in [-1, 1]$.
Bilinearity
$\mathrm{Cov}(aX + b,\, cY + d) = ac\, \mathrm{Cov}(X, Y)$.
Covariance matrix
$\boldsymbol{\Sigma}_{ij} = \mathrm{Cov}(X_i, X_j)$; always symmetric positive semidefinite.

Correlation measures linear dependence only. Independent $\Rightarrow$ uncorrelated; the converse fails in general (Gaussians excepted).
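The canonical counterexample fits in a few lines of exact arithmetic. Take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$: the covariance vanishes, yet the joint probabilities do not factor:

```python
xs = [-1, 0, 1]
px = 1 / 3  # uniform PMF

EX = sum(x * px for x in xs)        # 0 by symmetry
EY = sum(x * x * px for x in xs)    # 2/3
EXY = sum(x ** 3 * px for x in xs)  # E[X^3] = 0 by symmetry
cov = EXY - EX * EY
print(cov)  # 0.0: uncorrelated

# But dependent: P(X=0, Y=0) = 1/3 while P(X=0) P(Y=0) = 1/9.
print(px, px * px)
```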

Conditional expectation

$\mathbb{E}[Y \mid X]$ is itself a random variable — a function of $X$. Two identities carry most of the analytic weight:

Tower / Adam's law
$\mathbb{E}[Y] = \mathbb{E}\!\bigl[\mathbb{E}[Y \mid X]\bigr]$.
Eve's law (total variance)
$\mathrm{Var}(Y) = \mathbb{E}\!\bigl[\mathrm{Var}(Y \mid X)\bigr] + \mathrm{Var}\!\bigl(\mathbb{E}[Y \mid X]\bigr)$.

The second decomposition is a decomposition of uncertainty: noise within each conditional world, plus variation across worlds.
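Eve's law can be verified by brute-force enumeration on a toy mixture. A sketch with two equally likely "worlds", each holding two equally likely values of $Y$:

```python
# Y | X=0 is ±1 around 0; Y | X=1 is ±1 around 10; X is a fair coin.
worlds = {0: [-1.0, 1.0], 1: [9.0, 11.0]}
px = {0: 0.5, 1: 0.5}

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# Direct variance over the joint (all four outcomes equally likely).
total = var([y for x in worlds for y in worlds[x]])

# Eve's law: within-world noise plus across-world variation.
within = sum(px[x] * var(worlds[x]) for x in worlds)
between = var([mean(worlds[x]) for x in worlds])  # E[Y|X] takes values 0, 10
print(total, within + between)  # both 26.0
```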

Gaussian conditionals

Partition $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2) \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with the obvious block structure. Then:

$$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}\!\bigl(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\bigr).$$
This formula is Kalman filtering, Gaussian processes, and linear regression in a single line. Memorise it.
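In the bivariate case the blocks are scalars and the formula is just the regression line. A sketch with assumed covariance entries:

```python
# X = (X1, X2) ~ N(0, Σ) with Σ11 = 2, Σ12 = 1, Σ22 = 1 (assumed values).
mu1, mu2 = 0.0, 0.0
s11, s12, s22 = 2.0, 1.0, 1.0

def conditional(x2):
    """Mean and variance of X1 | X2 = x2 via the Schur complement."""
    cond_mean = mu1 + (s12 / s22) * (x2 - mu2)
    cond_var = s11 - s12 * s12 / s22
    return cond_mean, cond_var

m, v = conditional(3.0)
print(m, v)  # slope Σ12/Σ22 times x2; residual variance does not depend on x2
```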

§ V · Inequalities & Concentration

Inequalities are the skeleton on which proofs in learning theory hang. The four below cover the vast majority of everyday use; the remaining two are the tools of the high-dimensional trade.

Markov
For $X \ge 0$ and $a \gt 0$: $P(X \ge a) \le \mathbb{E}[X] / a$.
Chebyshev
$P(|X - \mu| \ge k\sigma) \le 1/k^2$.
Jensen
If $\varphi$ is convex, $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$.
Cauchy–Schwarz
$(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2]\, \mathbb{E}[Y^2]$.

High-dimensional tools

Hoeffding
For independent $X_i \in [a_i, b_i]$ and $S = \sum X_i$: $P\bigl(|S - \mathbb{E}S| \ge t\bigr) \le 2 \exp\!\Bigl(\!-\dfrac{2t^2}{\sum_i (b_i - a_i)^2}\Bigr).$
Chernoff
$P(S \ge t) \le \inf_{\lambda \gt 0} e^{-\lambda t} M_S(\lambda)$ — the mother of all exponential tail bounds.
Generalisation bounds (PAC-learning, Rademacher complexity, VC) are exercises in applying concentration to empirical risk. When you see "with probability at least $1 - \delta$", a Hoeffding is usually lurking.
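A seeded simulation makes the slack in Hoeffding visible: the observed tail frequency sits well below the bound, as it must. A sketch for sums of fair $\{0, 1\}$ coins:

```python
import math
import random

random.seed(0)
n, t, trials = 100, 15.0, 5000
hits = 0
for _ in range(trials):
    s = sum(random.randint(0, 1) for _ in range(n))
    if abs(s - n / 2) >= t:                 # |S - E[S]| >= t
        hits += 1

empirical = hits / trials
bound = 2 * math.exp(-2 * t * t / n)        # sum_i (b_i - a_i)^2 = n here
print(empirical, bound)                     # empirical tail <= bound
```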

§ VI · Convergence & Limit Theorems

Four modes of stochastic convergence, from strongest to weakest:

Almost sure
$X_n \xrightarrow{\text{a.s.}} X$ iff $P(\lim_n X_n = X) = 1$.
In probability
$X_n \xrightarrow{P} X$ iff $P(|X_n - X| \gt \varepsilon) \to 0$ for every $\varepsilon \gt 0$.
In $L^p$
$X_n \xrightarrow{L^p} X$ iff $\mathbb{E}[|X_n - X|^p] \to 0$.
In distribution
$X_n \xrightarrow{d} X$ iff $F_{X_n}(x) \to F_X(x)$ at every continuity point of $F_X$.

The implications: a.s. $\Rightarrow$ in probability $\Rightarrow$ in distribution. $L^p$ convergence for $p \ge 1$ implies convergence in probability.

Laws of large numbers

For i.i.d. $X_1, X_2, \ldots$ with finite mean $\mu$, let $\bar{X}_n = \tfrac{1}{n} \sum_{i=1}^n X_i$. Then:

  • Weak LLN. $\bar{X}_n \xrightarrow{P} \mu$.
  • Strong LLN. $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$.

Central Limit Theorem

For i.i.d. $X_i$ with mean $\mu$ and finite variance $\sigma^2$:

$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0,\, \sigma^2).$$

Equivalently, $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2 / n)$ for large $n$. The CLT is the reason Gaussian noise is everywhere: averaging destroys the fingerprints of the underlying distribution.
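A seeded check of the $\mathcal{N}(\mu, \sigma^2/n)$ approximation for averages of Uniform$(0, 1)$ draws, for which $\mu = 1/2$ and $\sigma^2 = 1/12$:

```python
import random
import statistics

random.seed(42)
n, reps = 50, 4000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

# CLT predicts mean 1/2 and variance (1/12)/n for the sample average.
print(statistics.mean(means), statistics.variance(means), (1 / 12) / n)
```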

Delta method

If $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \ne 0$, then:

$$\sqrt{n}\bigl(g(\hat\theta_n) - g(\theta)\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\, \bigl[g'(\theta)\bigr]^2 \sigma^2\bigr).$$

Slutsky

If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ (constant), then $X_n + Y_n \xrightarrow{d} X + c$ and $X_n Y_n \xrightarrow{d} cX$. Essential for constructing asymptotic confidence intervals when nuisance parameters are estimated.

§ VII · Estimation

An estimator $\hat\theta = T(X_1, \ldots, X_n)$ is a function of the data intended to approximate a parameter $\theta$. We judge it by:

Bias
$\mathrm{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta$.
Variance
$\mathrm{Var}(\hat\theta) = \mathbb{E}[(\hat\theta - \mathbb{E}[\hat\theta])^2]$.
Mean squared error
$\mathrm{MSE}(\hat\theta) = \mathbb{E}[(\hat\theta - \theta)^2] = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta)$.
Consistency
$\hat\theta_n \xrightarrow{P} \theta$.
Efficiency
Attains the Cramér–Rao lower bound asymptotically.

Maximum likelihood

Given $X_1, \ldots, X_n \sim p(\cdot \mid \theta)$, the log-likelihood is $\ell(\theta) = \sum_{i=1}^n \log p(X_i \mid \theta)$. The MLE is $\hat\theta_{\text{MLE}} = \arg\max_\theta\, \ell(\theta)$. Under regularity conditions:

$$\sqrt{n}\bigl(\hat\theta_{\text{MLE}} - \theta_0\bigr) \xrightarrow{d} \mathcal{N}\!\bigl(0,\, \mathcal{I}(\theta_0)^{-1}\bigr),$$

where $\mathcal{I}(\theta) = -\mathbb{E}\!\left[\dfrac{\partial^2 \log p(X \mid \theta)}{\partial \theta^2}\right]$ is the Fisher information per observation. The MLE is asymptotically unbiased, consistent, and efficient.

Cramér–Rao bound

For any unbiased estimator $\hat\theta$ of $\theta$:

$$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\,\mathcal{I}(\theta)}.$$

A universal floor on the variance — no cleverness recovers information that the likelihood didn't contain.

Method of moments

Set sample moments equal to theoretical moments and solve for $\theta$. Simpler than MLE, generally less efficient, but always a sanity check.
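For the Gamma$(k, \theta)$ family the recipe solves in closed form: $\bar x = k\theta$ and $s^2 = k\theta^2$ give $\hat k = \bar x^2 / s^2$ and $\hat\theta = s^2 / \bar x$. A seeded sketch with assumed true parameters:

```python
import random
import statistics

random.seed(7)
k_true, theta_true, n = 3.0, 2.0, 20000
data = [random.gammavariate(k_true, theta_true) for _ in range(n)]

# Match the first two sample moments to k*theta and k*theta^2.
xbar = statistics.mean(data)
s2 = statistics.variance(data)
k_hat, theta_hat = xbar * xbar / s2, s2 / xbar
print(k_hat, theta_hat)  # near (3, 2)
```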

§ VIII · Hypothesis Testing & Intervals

Anatomy of a test

A null hypothesis $H_0$ and an alternative $H_1$. A test statistic $T$ and a rejection region $R$. Two species of error:

  • Type I ($\alpha$): reject $H_0$ when it is true. The significance level.
  • Type II ($\beta$): fail to reject $H_0$ when $H_1$ is true. Power is $1 - \beta$.

The $p$-value is the probability, under $H_0$, of observing a statistic at least as extreme as the one observed. It is not the probability that $H_0$ is true — a distinction that has broken more inferences than any other.

The canonical tests

Z-test
Variance known: $Z = (\bar X - \mu_0)/(\sigma/\sqrt n) \sim \mathcal{N}(0,1)$ under $H_0$.
t-test
Variance estimated: $T = (\bar X - \mu_0)/(S/\sqrt n) \sim t_{n-1}$ under $H_0$.
Likelihood ratio
$\Lambda = \sup_{\theta \in \Theta_0} L(\theta) / \sup_{\theta \in \Theta} L(\theta)$. Reject for small $\Lambda$. Wilks: $-2 \log \Lambda \xrightarrow{d} \chi^2_k$ under $H_0$.

The Neyman–Pearson lemma tells us that for simple-versus-simple testing, the likelihood ratio test is the most powerful test at its significance level.

Confidence intervals

A $100(1-\alpha)\%$ confidence interval $[L(X), U(X)]$ for $\theta$ satisfies $P_\theta(L \le \theta \le U) \ge 1 - \alpha$ for all $\theta$. The interval is random; $\theta$ is fixed.

For $\bar X$ from $\mathcal{N}(\mu, \sigma^2)$ (or large $n$ via CLT):

$$\bar X \pm z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}} \qquad \text{or} \qquad \bar X \pm t_{n-1,\, \alpha/2}\, \frac{S}{\sqrt{n}}.$$
There is a duality: the set of $\theta_0$ for which the level-$\alpha$ test fails to reject forms a $100(1-\alpha)\%$ confidence region for $\theta$. Tests and intervals are two views of the same object.
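A sketch of the normal-theory interval using only the standard library; the sample values and the "known" $\sigma$ are invented for illustration:

```python
import statistics
from statistics import NormalDist

data = [9.8, 10.2, 10.1, 9.9, 10.4, 9.7, 10.0, 10.3]  # assumed sample
sigma = 0.25                                           # assumed known sd
alpha = 0.05

n = len(data)
xbar = statistics.mean(data)
z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile, ≈ 1.96
half = z * sigma / n ** 0.5
print((xbar - half, xbar + half))         # 95% z-interval for the mean
```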

§ IX · Bayesian Inference

The Bayesian views $\theta$ as random, encoding belief via a prior $\pi(\theta)$. Data $X$ updates it through the likelihood $p(X \mid \theta)$ into a posterior:

$$\pi(\theta \mid X) = \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'} \propto p(X \mid \theta)\, \pi(\theta).$$

Point estimates from the posterior

  • Posterior mean. $\hat\theta = \mathbb{E}[\theta \mid X]$ — minimises posterior MSE.
  • Posterior median. Minimises posterior absolute error.
  • MAP. $\hat\theta_{\text{MAP}} = \arg\max_\theta\, \pi(\theta \mid X)$ — minimises posterior $0$–$1$ loss.

Conjugate priors — worth knowing cold

Conjugate pairs with their update rules.
Likelihood | Prior | Posterior
Bernoulli / Binomial | Beta$(\alpha, \beta)$ | Beta$(\alpha + \sum x_i,\ \beta + n - \sum x_i)$
Poisson | Gamma$(\alpha, \beta)$ | Gamma$(\alpha + \sum x_i,\ \beta + n)$
Normal ($\sigma^2$ known) | Normal$(\mu_0, \tau_0^2)$ | Normal, precision-weighted mean of $\mu_0$ and $\bar x$
Normal ($\mu$ known) | Inverse-Gamma$(\alpha, \beta)$ | Inverse-Gamma$(\alpha + n/2,\ \beta + \tfrac{1}{2}\sum (x_i - \mu)^2)$
Multinomial | Dirichlet$(\boldsymbol{\alpha})$ | Dirichlet$(\boldsymbol{\alpha} + \mathbf{n})$
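The Bernoulli row in code: a Beta prior updated by coin flips, with posterior mean $(\alpha + \sum x_i)/(\alpha + \beta + n)$ interpolating between prior mean and sample frequency. The flip data is made up for illustration:

```python
def update_beta(alpha, beta, flips):
    """Beta(α, β) prior + Bernoulli data -> Beta(α + Σx, β + n - Σx)."""
    s = sum(flips)
    return alpha + s, beta + len(flips) - s

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# Uniform prior Beta(1, 1), then 6 heads in 8 flips.
a, b = update_beta(1.0, 1.0, [1, 1, 0, 1, 1, 0, 1, 1])
print(a, b, beta_mean(a, b))  # 7.0 3.0 0.7
```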

Credible intervals and predictives

A $(1-\alpha)$ credible interval is any interval of posterior mass $1 - \alpha$. Interpretively cleaner than a confidence interval — here $\theta$ really is the random thing.

The posterior predictive for new data $\tilde x$ is:

$$p(\tilde x \mid X) = \int p(\tilde x \mid \theta)\, \pi(\theta \mid X)\, d\theta,$$

which averages over parameter uncertainty instead of collapsing it to a point.

Regularisation and Bayesian priors are the same thing in different languages — §XI, Exercise 5.

§ X · Information Theory

Modern machine learning is not merely influenced by information theory — it often is information theory in a different hat. Loss functions are divergences; training is a minimum description length problem; attention is mutual information, softly.

Entropy

The Shannon entropy of a discrete $X$ with PMF $p$ is:

$$H(X) = -\sum_x p(x)\, \log p(x).$$

It is the expected surprise, in nats (base $e$) or bits (base $2$). Maximised by the uniform distribution; zero when $X$ is deterministic. For continuous $X$ with PDF $f$, the differential entropy $h(X) = -\int f(x) \log f(x)\, dx$ — useful, but not an entropy in the strict sense (it can be negative).

Joint, conditional, mutual

Joint
$H(X, Y) = -\sum p(x, y) \log p(x, y)$.
Conditional
$H(Y \mid X) = H(X, Y) - H(X) = \mathbb{E}_X\!\bigl[H(Y \mid X = x)\bigr]$.
Mutual information
$I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y \mid X)$.

$I(X; Y) \ge 0$, with equality iff $X \perp Y$. It is the canonical measure of statistical dependence beyond linearity.

KL divergence and cross-entropy

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x)\, \log \frac{p(x)}{q(x)}.$$

Not a metric — asymmetric and does not satisfy triangle inequality — but non-negative (Gibbs' inequality) and zero iff $p = q$ a.e. The cross-entropy $H(p, q) = -\sum p(x) \log q(x)$ satisfies $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$.
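All three quantities, and the identity tying them together, can be checked on a pair of small PMFs (in nats; the PMFs are assumed for illustration):

```python
import math

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

H_p = -sum(pi * math.log(pi) for pi in p)                 # entropy H(p)
H_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))    # cross-entropy H(p, q)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # D_KL(p || q)

print(H_p, kl, H_pq)
print(abs(H_pq - (H_p + kl)))  # H(p, q) = H(p) + KL, up to rounding
```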

Minimising cross-entropy loss with one-hot labels is equivalent to minimising $D_{\mathrm{KL}}$ from the empirical distribution to the model. Variational inference is KL minimisation in the reverse direction. Maximum likelihood itself is cross-entropy minimisation in disguise.

The chain of inequalities

$0 \le H(Y \mid X) \le H(Y) \le \log |\mathcal{Y}|$, and the data-processing inequality: if $X \to Y \to Z$ is a Markov chain, $I(X; Z) \le I(X; Y)$. Processing cannot create information.

§ XI · Linear Models & Bias–Variance

Linear regression models $Y = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon}$ mean-zero and covariance $\sigma^2 I$. The ordinary least squares estimator:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\!\top} \mathbf{X})^{-1} \mathbf{X}^{\!\top} \mathbf{Y}$$

is unbiased with covariance $\sigma^2 (\mathbf{X}^{\!\top} \mathbf{X})^{-1}$. The Gauss–Markov theorem asserts it is BLUE — the best linear unbiased estimator — under the stated conditions.

Regularised regression

Ridge
$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min\,\|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_2^2 = (\mathbf{X}^{\!\top} \mathbf{X} + \lambda I)^{-1} \mathbf{X}^{\!\top} \mathbf{Y}$.
Lasso
$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min\,\|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1$ — sparsity-inducing.

Both are MAP estimators under Gaussian (ridge) or Laplace (lasso) priors on $\boldsymbol{\beta}$. See Exercise 5.

The bias–variance decomposition

For a fixed test point $x_0$, squared-error risk of an estimator $\hat f$ splits as:

$$\mathbb{E}\!\left[\bigl(Y - \hat f(x_0)\bigr)^2\right] = \underbrace{\sigma^2}_{\text{irreducible}} + \underbrace{\bigl(\mathbb{E}[\hat f(x_0)] - f(x_0)\bigr)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}(\hat f(x_0))}_{\text{variance}}.$$

No estimator eliminates the first term; complexity trades the second for the third.

Generalisation risk

Population risk $R(f) = \mathbb{E}_{(X, Y)}[\ell(f(X), Y)]$; empirical risk $\hat R(f) = \tfrac{1}{n} \sum \ell(f(X_i), Y_i)$. Training error underestimates test error systematically; the gap is bounded (with high probability) by the complexity of the function class — VC dimension, Rademacher complexity, metric-entropy arguments.

§ XII · Stochastic Processes

Markov chains

A sequence $X_0, X_1, \ldots$ with the Markov property: $P(X_{n+1} \mid X_n, X_{n-1}, \ldots) = P(X_{n+1} \mid X_n)$. For finite state space, the dynamics live in a transition matrix $P$ with $P_{ij} = P(X_{n+1} = j \mid X_n = i)$.

A distribution $\boldsymbol{\pi}$ is stationary if $\boldsymbol{\pi} = \boldsymbol{\pi} P$. An irreducible aperiodic finite chain has a unique stationary distribution, and $P^n \to \mathbf{1} \boldsymbol{\pi}$ — the ergodic theorem.

Detailed balance: if $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$, then $\boldsymbol{\pi}$ is stationary. The starting point for MCMC: design $P$ so detailed balance holds for a target $\boldsymbol{\pi}$.
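The MCMC logic in miniature: build a Metropolis-style transition matrix for an assumed target $\boldsymbol{\pi}$, then confirm stationarity and ergodic convergence by iteration. With a uniform proposal over the other states, $\pi_i P_{ij} = \tfrac{1}{n-1}\min(\pi_i, \pi_j)$ is symmetric in $i, j$, so detailed balance holds by construction:

```python
pi = [0.5, 0.3, 0.2]  # assumed target distribution

def metropolis_P(pi):
    """Uniform proposal over the other states + Metropolis acceptance."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i][j] = min(1.0, pi[j] / pi[i]) / (n - 1)
        P[i][i] = 1.0 - sum(P[i])  # rejected proposals stay put
    return P

def step(dist, P):
    # One step of the chain: row vector times transition matrix.
    return [sum(dist[i] * P[i][j] for i in range(len(dist)))
            for j in range(len(dist))]

P = metropolis_P(pi)
print(step(pi, P))       # equals pi: stationarity

dist = [1.0, 0.0, 0.0]   # start from a point mass
for _ in range(200):
    dist = step(dist, P)
print(dist)              # converged to pi (ergodic theorem)
```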

Poisson process

A counting process $N(t)$ with:

  • $N(0) = 0$;
  • independent increments;
  • $N(t + s) - N(s) \sim$ Poisson$(\lambda t)$.

Equivalently, inter-arrival times are i.i.d. Exp$(\lambda)$. The memorylessness of the exponential is what makes the Poisson process analytically tractable — and a near-universal model for rare independent events.

Brownian motion

A process $W_t$ with $W_0 = 0$, independent Gaussian increments $W_t - W_s \sim \mathcal{N}(0, t-s)$ for $s \lt t$, and continuous sample paths. The basic object of stochastic calculus; the limit of scaled random walks; the noise term in Black–Scholes. Non-differentiable everywhere with probability one — a warning that intuition built on smooth calculus can mislead here.

Martingales

$\{M_n\}$ is a martingale with respect to a filtration $\{\mathcal{F}_n\}$ if $\mathbb{E}[|M_n|] \lt \infty$, $M_n$ is $\mathcal{F}_n$-measurable, and:

$$\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n.$$

Intuitively: a fair game. The optional stopping theorem (under bounded-stopping-time hypotheses) extends this to random stopping times: $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$. The one-line explanation for why you cannot beat a fair casino — and the workhorse of no-arbitrage pricing.

In quantitative finance, discounted price processes are martingales under the risk-neutral measure. Every derivative price is an expectation with respect to that measure.

§ XIII · Measure-Theoretic Foundations

Everything so far has coasted on an informal reading of $P$ as "a function that assigns probabilities to events". For most working calculations this is enough — but for limit theorems, for infinite-dimensional processes like Brownian motion, and for the no-arbitrage pricing theorems of quantitative finance, we need the proper machinery. The payoff is not mere rigour; it is the ability to swap limits and integrals, change measures, and make sense of integrals against paths that have no derivative anywhere.

$\sigma$-algebras and measurable spaces

A $\sigma$-algebra $\mathcal{F}$ on a set $\Omega$ is a collection of subsets containing $\Omega$ and closed under complementation and countable unions. The pair $(\Omega, \mathcal{F})$ is a measurable space. The canonical example on $\mathbb{R}$ is the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ — the smallest $\sigma$-algebra containing all open sets.

A measure $\mu: \mathcal{F} \to [0, \infty]$ satisfies $\mu(\emptyset) = 0$ and countable additivity: $\mu\!\left(\bigsqcup_i A_i\right) = \sum_i \mu(A_i)$. A probability measure is a measure with $\mu(\Omega) = 1$. Lebesgue measure on $\mathbb{R}$ is the unique translation-invariant Borel measure assigning length $b - a$ to $[a, b]$.

Measurable functions and random variables

$f: (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B})$ is measurable if $f^{-1}(B) \in \mathcal{F}$ for every $B \in \mathcal{B}$. A random variable is precisely a measurable function on a probability space; the definition we gave in §II was an informal cousin.

The Lebesgue integral

Built in three stages:

  1. For simple functions $\varphi = \sum_{i=1}^n c_i \mathbf{1}_{A_i}$: $\int \varphi\, d\mu = \sum_i c_i\, \mu(A_i)$.
  2. For non-negative measurable $f$: $\int f\, d\mu = \sup\!\left\{\int \varphi\, d\mu : 0 \le \varphi \le f,\ \varphi \text{ simple}\right\}$.
  3. For general $f$: split $f = f^+ - f^-$ and set $\int f = \int f^+ - \int f^-$ whenever at least one is finite.

The Lebesgue integral extends the Riemann integral and behaves much better under limits — which is where its power lives.

The three pillars of convergence

Monotone Convergence
If $0 \le f_n \uparrow f$ a.e., then $\int f_n\, d\mu \to \int f\, d\mu$.
Fatou's Lemma
For non-negative measurable $f_n$: $\int \liminf_n f_n\, d\mu \le \liminf_n \int f_n\, d\mu$.
Dominated Convergence
If $f_n \to f$ a.e. and $|f_n| \le g$ with $\int g\, d\mu \lt \infty$, then $\int f_n\, d\mu \to \int f\, d\mu$.

These are the only reliable ways to swap $\lim$ and $\int$. Without them, nearly every probabilistic limit argument stalls.

Fubini–Tonelli

On a product measure space $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2, \mu_1 \otimes \mu_2)$, iterated integration is legal:

  • Tonelli: for non-negative measurable $f$, any order works and all are equal (possibly $+\infty$).
  • Fubini: for $f$ absolutely integrable, any order works and gives the same finite value.

Radon–Nikodym

Given $\sigma$-finite measures $\mu, \nu$ on $(\Omega, \mathcal{F})$ with $\mu \ll \nu$ (absolutely continuous: $\nu(A) = 0 \Rightarrow \mu(A) = 0$), there exists an essentially unique measurable $f \ge 0$ with:

$$\mu(A) = \int_A f\, d\nu \quad \text{for every } A \in \mathcal{F}.$$

We write $f = d\mu / d\nu$, the Radon–Nikodym derivative. A PDF is precisely $dP/d\lambda$ with respect to Lebesgue measure $\lambda$. A likelihood ratio is $dP_1/dP_0$. A pricing kernel is $d\mathbb{Q}/d\mathbb{P}$. Three different dialects for the same object.

Borel–Cantelli

For events $A_1, A_2, \ldots$ let $\{A_n \text{ i.o.}\} = \limsup_n A_n = \bigcap_{N} \bigcup_{n \ge N} A_n$ — the event that infinitely many $A_n$ occur. Then:

  • BC I. If $\sum_n P(A_n) \lt \infty$, then $P(A_n \text{ i.o.}) = 0$.
  • BC II. If the $A_n$ are independent and $\sum_n P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$.

A zero–one dichotomy that undergirds almost-sure convergence proofs.

Modes of convergence, revisited: almost-sure convergence is really pointwise convergence outside a $P$-null set, so it is preserved under measurable transformations. Convergence in probability is convergence in the $L^0$ (Ky Fan) pseudometric. Convergence in distribution is weak-$*$ convergence of measures — the sequence of pushforwards converges when tested against bounded continuous functions.

§ XIV · Filtrations & Martingales

Conditional expectation, rigorously

For a sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ and $X \in L^1(\Omega, \mathcal{F}, P)$, the conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the $P$-a.s. unique $\mathcal{G}$-measurable random variable satisfying:

$$\int_A \mathbb{E}[X \mid \mathcal{G}]\, dP = \int_A X\, dP \quad \text{for every } A \in \mathcal{G}.$$

Existence is guaranteed by Radon–Nikodym applied to the signed measure $A \mapsto \int_A X\, dP$. For $X \in L^2$, $\mathbb{E}[X \mid \mathcal{G}]$ is exactly the orthogonal projection of $X$ onto the closed subspace $L^2(\mathcal{G})$ — conditional expectation is a Hilbert-space projection.

The axioms you will use every day

Linearity
$\mathbb{E}[aX + bY \mid \mathcal{G}] = a\,\mathbb{E}[X \mid \mathcal{G}] + b\,\mathbb{E}[Y \mid \mathcal{G}]$.
Tower
If $\mathcal{H} \subseteq \mathcal{G}$: $\mathbb{E}\!\bigl[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}\bigr] = \mathbb{E}[X \mid \mathcal{H}]$.
Take out what is known
If $Y$ is $\mathcal{G}$-measurable and $XY \in L^1$: $\mathbb{E}[XY \mid \mathcal{G}] = Y\, \mathbb{E}[X \mid \mathcal{G}]$.
Independence
If $X \perp \mathcal{G}$: $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$.
Conditional Jensen
For convex $\varphi$: $\varphi\!\bigl(\mathbb{E}[X \mid \mathcal{G}]\bigr) \le \mathbb{E}[\varphi(X) \mid \mathcal{G}]$.

Filtrations, adaptedness, stopping times

A filtration is an increasing family $\{\mathcal{F}_t\}_{t \ge 0}$ of sub-$\sigma$-algebras of $\mathcal{F}$. Read $\mathcal{F}_t$ as "the information observable by time $t$". A process $X_t$ is adapted if $X_t$ is $\mathcal{F}_t$-measurable for every $t$.

A stopping time is a $[0, \infty]$-valued random variable $\tau$ with $\{\tau \le t\} \in \mathcal{F}_t$ for every $t$. First-entry times into a Borel set are stopping times; last-exit times generally are not — you would need the future to know.

Martingales, sub- and super-

An adapted, integrable process $M_t$ is:

  • a martingale if $\mathbb{E}[M_t \mid \mathcal{F}_s] = M_s$ for $s \le t$;
  • a submartingale if $\mathbb{E}[M_t \mid \mathcal{F}_s] \ge M_s$;
  • a supermartingale if $\mathbb{E}[M_t \mid \mathcal{F}_s] \le M_s$.

Mnemonic: supermartingales go down (in expectation). Your wealth in a casino is a supermartingale; in a genuinely fair game it is a martingale.

Doob's maximal inequality

For a non-negative submartingale $X_t$ and $\lambda \gt 0$:

$$P\!\left(\sup_{s \le t} X_s \ge \lambda\right) \le \frac{\mathbb{E}[X_t]}{\lambda}.$$

The $L^p$ version ($p \gt 1$):

$$\mathbb{E}\!\left[\sup_{s \le t} X_s^p\right] \le \left(\frac{p}{p-1}\right)^{\!p} \mathbb{E}[X_t^p].$$

Remarkably, control of $X_t$ in $L^p$ gives control of the running maximum in $L^p$ — a uniform-in-time bound from a pointwise one.

Optional stopping, with honest hypotheses

For a martingale $M$ and stopping time $\tau$, $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$ provided at least one of the following:

  • $\tau$ is bounded by some constant $T \lt \infty$;
  • $M$ is uniformly bounded and $\tau \lt \infty$ a.s.;
  • $\mathbb{E}[\tau] \lt \infty$ and the increments $|M_n - M_{n-1}|$ are uniformly bounded.

Without some such condition, the theorem fails: a simple random walk is a martingale, and $\tau = \inf\{n : S_n = 1\}$ is finite a.s., yet $\mathbb{E}[S_\tau] = 1 \ne 0 = \mathbb{E}[S_0]$. The failure: $\mathbb{E}[\tau] = \infty$.

Martingale convergence

An $L^1$-bounded martingale — one with $\sup_t \mathbb{E}[|M_t|] \lt \infty$ — converges $P$-a.s. to a limit $M_\infty \in L^1$. In particular, a non-negative supermartingale always converges a.s. These are tools for proving a.s. convergence of algorithms (stochastic approximation, likelihood ratios, posteriors).

In machine learning, likelihood ratios form martingales under the true data-generating measure; the martingale convergence theorem is what ultimately justifies the asymptotic behaviour of Bayes factors and sequential probability ratio tests.

§ XV · Brownian Motion & Itô Calculus

Brownian motion, properly

A standard Brownian motion (Wiener process) $\{W_t\}_{t \ge 0}$ on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, P)$ satisfies:

  • $W_0 = 0$ a.s.;
  • independent increments: for $s \lt t$, $W_t - W_s$ is independent of $\mathcal{F}_s$;
  • Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t - s)$;
  • continuous sample paths a.s.

Existence is a theorem (Wiener 1923): several constructions work — Lévy's piecewise-linear approximation, Kolmogorov's extension via finite-dimensional distributions, or Donsker's invariance principle (random walks, rescaled, converge in distribution to $W$).

Pathological pathwise regularity

With probability one:

  • $t \mapsto W_t$ is continuous but nowhere differentiable;
  • $W$ has infinite total variation on every interval;
  • the quadratic variation is finite and deterministic: $\langle W \rangle_t = t$, meaning $\sum_i (W_{t_{i+1}} - W_{t_i})^2 \xrightarrow{L^2} t$ as the partition mesh shrinks.

The finite quadratic variation is the whole key. It is what prevents a classical Riemann–Stieltjes integral $\int H\, dW$ from existing, and what forces the Itô construction.

Useful symmetries

If $W$ is standard Brownian motion, so are: $-W_t$ (reflection); $W_{t+s} - W_s$ (Markov property); $\sqrt{c}\, W_{t/c}$ (scaling); $t\, W_{1/t}$ for $t \gt 0$ (time inversion). These identities are how closed-form quantities get computed.

The Itô integral

For an adapted process $H$ with $\mathbb{E}\!\int_0^T H_s^2\, ds \lt \infty$, the Itô integral $\int_0^t H_s\, dW_s$ is constructed as the $L^2$-limit of "left-endpoint" Riemann sums $\sum_i H_{t_i}(W_{t_{i+1}} - W_{t_i})$. The choice of left endpoint is not cosmetic: evaluating at the midpoint produces the Stratonovich integral, which behaves like ordinary calculus but destroys the martingale property.

Itô isometry (the identity that makes the whole edifice work):

$$\mathbb{E}\!\left[\left(\int_0^t H_s\, dW_s\right)^{\!2}\right] = \mathbb{E}\!\left[\int_0^t H_s^2\, ds\right].$$

As a consequence, $M_t := \int_0^t H_s\, dW_s$ is a mean-zero martingale with variance $\mathbb{E}\!\int_0^t H_s^2\, ds$.

Itô's formula

Let $X_t$ be an Itô process: $dX_t = \mu_t\, dt + \sigma_t\, dW_t$. For $f \in C^{1,2}([0, T] \times \mathbb{R})$:

$$df(t, X_t) = \frac{\partial f}{\partial t}\, dt + \frac{\partial f}{\partial x}\, dX_t + \frac{1}{2} \frac{\partial^2 f}{\partial x^2}\, \sigma_t^2\, dt.$$

The extra half-second-derivative term — sometimes called the Itô correction — is the mathematical signature of quadratic variation. In an informal calculus: $(dW_t)^2 = dt$.

Geometric Brownian motion

The SDE $dS_t = \mu\, S_t\, dt + \sigma\, S_t\, dW_t$ has closed-form solution:

$$S_t = S_0 \exp\!\left(\bigl(\mu - \tfrac{1}{2}\sigma^2\bigr) t + \sigma\, W_t\right).$$

Apply Itô to $f(x) = \log x$ to see where the $-\tfrac{1}{2}\sigma^2$ comes from. This is the foundational model of log-normal asset prices in Black–Scholes theory.

Girsanov's theorem

Given an adapted process $\theta_t$ satisfying Novikov's condition $\mathbb{E}\exp\!\bigl(\tfrac{1}{2}\int_0^T \theta_s^2\, ds\bigr) \lt \infty$, define a new measure $\mathbb{Q}$ by:

$$\frac{d\mathbb{Q}}{d\mathbb{P}} = \exp\!\left(-\int_0^T \theta_s\, dW_s - \frac{1}{2} \int_0^T \theta_s^2\, ds\right).$$

Then $\widetilde W_t := W_t + \int_0^t \theta_s\, ds$ is a standard Brownian motion under $\mathbb{Q}$. Drift is not intrinsic to the process — it is a feature of the measure.

Girsanov is the engine of risk-neutral pricing. To price a derivative, change measure from the real-world $\mathbb{P}$ to the risk-neutral $\mathbb{Q}$ under which discounted asset prices are martingales; then the price is $\mathbb{E}_{\mathbb{Q}}[\text{discounted payoff}]$. Black–Scholes, interest-rate models, and credit pricing all rest on this single theorem.
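The measure change can be watched at a single horizon. A sketch with a constant $\theta$ of our choosing: reweight draws of $W_T \sim \mathcal{N}(0, T)$ by the density above, and $W_T$ should behave like $\mathcal{N}(-\theta T, T)$ under $\mathbb{Q}$ — drift conjured purely from the weights.

```python
import math, random

# Girsanov with constant theta at horizon T (illustrative parameters):
# under Q, W_T acquires mean -theta*T while its variance stays T.
random.seed(2)
theta, T, n = 0.7, 1.0, 200_000

z_sum = m1 = m2 = 0.0
for _ in range(n):
    w = random.gauss(0.0, math.sqrt(T))
    z = math.exp(-theta * w - 0.5 * theta**2 * T)   # dQ/dP on this outcome
    z_sum += z
    m1 += w * z
    m2 += w * w * z

mean_q = m1 / n                   # E_Q[W_T]; the weights average to 1 under P
var_q = m2 / n - mean_q**2
print(f"E[Z] = {z_sum/n:.3f},  E_Q[W_T] = {mean_q:.3f} (theory {-theta*T}),  "
      f"Var_Q = {var_q:.3f} (theory {T})")
```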
❦ · ❦ · ❦
Part the Second

Exercises & Solutions

Work each problem with pen in hand before opening the solution. The solution is a conversation; the struggle is the lesson.

Exercise 01 Rare Disease, Flawed Test Bayes

A disease has prevalence $0.1\%$ in the population. A test has sensitivity $99\%$ (true positive rate) and specificity $98\%$ (true negative rate). A random individual tests positive. What is the probability they have the disease?

Solution

Let $D$ denote "has disease" and $+$ the positive test. We want $P(D \mid +)$. By Bayes:

$$P(D \mid +) = \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid D^c)\, P(D^c)}.$$

Plug in $P(D) = 0.001$, $P(+ \mid D) = 0.99$, $P(+ \mid D^c) = 0.02$:

$$P(D \mid +) = \frac{0.99 \cdot 0.001}{0.99 \cdot 0.001 + 0.02 \cdot 0.999} = \frac{0.00099}{0.02097} \approx 0.0472.$$

Despite a seemingly accurate test, the posterior is under $5\%$. Base rates dominate when the disease is rare — a result that recurs in imbalanced classification and fraud detection. The false positive rate is small per negative case, but negatives are so numerous they flood the positives.
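The arithmetic fits in a two-line function (a minimal sketch; the function name is ours, the numbers are the exercise's):

```python
# Posterior P(disease | positive) via Bayes' theorem.
def posterior(prior, sens, spec):
    evidence = sens * prior + (1 - spec) * (1 - prior)   # P(+) by total probability
    return sens * prior / evidence

print(f"P(D | +) = {posterior(0.001, 0.99, 0.98):.4f}")   # 0.0472
```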

Exercise 02 Linearity via Indicators Expectation

$n$ balls are thrown independently and uniformly into $n$ bins. Let $X$ be the number of empty bins after all throws. Compute $\mathbb{E}[X]$.

Solution

Let $I_j = \mathbf{1}\{\text{bin } j \text{ is empty}\}$ for $j = 1, \ldots, n$, so that $X = \sum_{j=1}^n I_j$. By linearity of expectation — which holds without independence:

$$\mathbb{E}[X] = \sum_{j=1}^n \mathbb{E}[I_j] = \sum_{j=1}^n P(\text{bin } j \text{ empty}).$$

Each ball independently avoids bin $j$ with probability $1 - 1/n$. Across $n$ balls:

$$P(\text{bin } j \text{ empty}) = \left(1 - \frac{1}{n}\right)^n.$$

Hence $\mathbb{E}[X] = n\bigl(1 - 1/n\bigr)^n \to n/e$ as $n \to \infty$. Around $36.8\%$ of bins sit empty in expectation — an effect responsible for hash collisions, birthday paradoxes, and the coupon collector's friends.

The key move: never compute joint distributions when you can compute marginals. The $I_j$ are dependent, but linearity doesn't care.
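A quick simulation (illustrative, with $n = 50$ of our choosing) against the exact $n(1 - 1/n)^n$:

```python
import math, random

# Throw n balls into n bins, count empties, average over trials.
random.seed(3)

def mean_empty_bins(n, trials=2_000):
    total = 0
    for _ in range(trials):
        occupied = set(random.randrange(n) for _ in range(n))
        total += n - len(occupied)
    return total / trials

n = 50
sim = mean_empty_bins(n)
theory = n * (1 - 1 / n) ** n
print(f"simulated {sim:.2f}  vs  exact {theory:.2f}  vs  n/e = {n / math.e:.2f}")
```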

Exercise 03 Law of Total Variance Conditional Moments

A coin's bias $P \sim \mathrm{Uniform}(0, 1)$. Conditional on $P = p$, we flip the coin $n$ times and let $X$ be the number of heads. Find $\mathbb{E}[X]$ and $\mathrm{Var}(X)$.

Solution

Expectation. Conditional on $P$, $X \sim \mathrm{Binomial}(n, P)$, so $\mathbb{E}[X \mid P] = nP$. By the tower rule:

$$\mathbb{E}[X] = \mathbb{E}\!\bigl[\mathbb{E}[X \mid P]\bigr] = \mathbb{E}[nP] = n \cdot \tfrac{1}{2} = \tfrac{n}{2}.$$

Variance. We need both pieces of Eve's law:

  • $\mathrm{Var}(X \mid P) = nP(1 - P)$, so $\mathbb{E}[\mathrm{Var}(X \mid P)] = n\, \mathbb{E}[P(1-P)] = n\!\left(\tfrac{1}{2} - \tfrac{1}{3}\right) = \tfrac{n}{6}$.
  • $\mathbb{E}[X \mid P] = nP$, so $\mathrm{Var}(\mathbb{E}[X \mid P]) = n^2 \mathrm{Var}(P) = n^2/12$.

Summing:

$$\mathrm{Var}(X) = \frac{n}{6} + \frac{n^2}{12}.$$

Compare with a Binomial$(n, 1/2)$ of variance $n/4$: the uniform prior on $P$ inflates variance by the $n^2/12$ term — the "variation across parameter worlds". Aleatoric plus epistemic uncertainty, in essence.
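A simulation makes the inflation visible (a sketch; $n = 30$ is our choice): sample $P \sim \mathrm{Uniform}(0,1)$, then $X \sim \mathrm{Binomial}(n, P)$, and compare the empirical variance with $n/6 + n^2/12$.

```python
import random

# Mixture simulation: uniform bias, then n flips; check mean n/2 and Eve's law.
random.seed(4)
n, trials = 30, 20_000

xs = []
for _ in range(trials):
    p = random.random()
    xs.append(sum(random.random() < p for _ in range(n)))   # Binomial(n, p) draw

mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / (trials - 1)
print(f"mean = {mean:.2f} (theory {n / 2}),  var = {var:.1f} (theory {n/6 + n*n/12:.1f})")
```

With the uniform prior, $X$ is in fact uniform on $\{0, \ldots, n\}$, whose variance $((n+1)^2 - 1)/12 = 80$ agrees with $n/6 + n^2/12$ at $n = 30$.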

Exercise 04 MLE for the Exponential Estimation

Let $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Exp}(\lambda)$. Derive the maximum likelihood estimator of $\lambda$, determine its bias, and compute its asymptotic variance.

Solution

Likelihood. $L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n \exp\!\left(-\lambda \sum x_i\right)$. The log-likelihood:

$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i.$$

Setting $\ell'(\lambda) = n/\lambda - \sum x_i = 0$ gives:

$$\hat\lambda_{\text{MLE}} = \frac{n}{\sum_{i=1}^n X_i} = \frac{1}{\bar X}.$$

Bias. $\sum X_i \sim \mathrm{Gamma}(n, 1/\lambda)$, so $1/\sum X_i$ has an inverse-Gamma distribution. A direct calculation using $\mathbb{E}[1/Y]$ for $Y \sim \mathrm{Gamma}(n, 1/\lambda)$:

$$\mathbb{E}[\hat\lambda] = \mathbb{E}\!\left[\frac{n}{\sum X_i}\right] = n \cdot \frac{\lambda}{n-1} = \frac{n}{n-1}\, \lambda.$$

Biased upward, by a factor that vanishes as $n \to \infty$. The bias-corrected estimator is $\tilde\lambda = \tfrac{n-1}{n} \hat\lambda$.

Asymptotic variance. The Fisher information is:

$$\mathcal{I}(\lambda) = -\mathbb{E}\!\left[\frac{\partial^2}{\partial \lambda^2} \log f(X; \lambda)\right] = -\mathbb{E}\!\left[-\frac{1}{\lambda^2}\right] = \frac{1}{\lambda^2}.$$

By the MLE asymptotic normality theorem:

$$\sqrt{n}(\hat\lambda - \lambda) \xrightarrow{d} \mathcal{N}(0, \lambda^2).$$
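The bias factor $n/(n-1)$ can be seen empirically at a deliberately small $n$ (an illustrative simulation; $\lambda = 2$, $n = 5$ are our choices):

```python
import random

# Average the exponential MLE 1/xbar over many samples of size n = 5.
random.seed(5)
lam, n, reps = 2.0, 5, 100_000

est_sum = 0.0
for _ in range(reps):
    s = sum(random.expovariate(lam) for _ in range(n))
    est_sum += n / s                      # the MLE 1/xbar

mean_mle = est_sum / reps
print(f"E[lambda_hat] = {mean_mle:.3f}  vs  theory {n / (n - 1) * lam:.3f}  (true {lam})")
```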
Exercise 05 Ridge Regression as MAP Bayesian / ML

Assume $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 I)$ and prior $\boldsymbol{\beta} \sim \mathcal{N}(0, \tau^2 I)$. Show that the MAP estimator of $\boldsymbol{\beta}$ is the ridge-regression solution, and identify the regularisation parameter.

Solution

The posterior density satisfies $\pi(\boldsymbol{\beta} \mid \mathbf{Y}) \propto p(\mathbf{Y} \mid \boldsymbol{\beta})\, \pi(\boldsymbol{\beta})$. Taking logs:

$$\log \pi(\boldsymbol{\beta} \mid \mathbf{Y}) = -\frac{1}{2\sigma^2} \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 - \frac{1}{2\tau^2} \|\boldsymbol{\beta}\|^2 + \text{const}.$$

Maximising is the same as minimising the negative, and multiplying by $2\sigma^2$ preserves the argmin:

$$\hat{\boldsymbol{\beta}}_{\text{MAP}} = \arg\min_{\boldsymbol{\beta}}\ \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\, \|\boldsymbol{\beta}\|^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}.$$

Exactly the ridge objective. The closed-form solution is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\!\top}\mathbf{X} + \lambda I)^{-1} \mathbf{X}^{\!\top}\mathbf{Y}$.

Interpretation. A tight prior ($\tau$ small) shrinks estimates aggressively toward zero; a diffuse prior ($\tau \to \infty$) recovers OLS. Replacing the Gaussian prior with a Laplace prior $\boldsymbol{\beta}_i \stackrel{\text{iid}}{\sim} \mathrm{Laplace}(0, b)$ gives an $\ell_1$ penalty — the lasso — and with it, sparsity.

This exercise is really a worked proof that regularisation and priors are one object dressed in two vocabularies.
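A one-feature sketch (data and parameters are ours), where the ridge solution collapses to a scalar and the shrinkage is visible directly:

```python
import random

# With one feature, the MAP/ridge formula is beta_hat = (sum x*y) / (sum x^2 + lam).
random.seed(6)
beta_true, sigma, n = 3.0, 1.0, 200
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [beta_true * x + random.gauss(0, sigma) for x in xs]

def ridge_1d(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

for lam in (0.0, 10.0, 1000.0):
    print(f"lam={lam:>6}:  beta_hat = {ridge_1d(xs, ys, lam):.3f}")
```

At $\lambda = 0$ the estimate sits near the true $3$; at $\lambda = 1000$ (a very tight prior) it is dragged most of the way to zero.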

Exercise 06 Bias–Variance Decomposition ML Theory

Let $Y = f(x_0) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of the training data. Let $\hat f$ be an estimator trained on i.i.d. data. Prove that the expected squared-error at $x_0$ decomposes as:

$$\mathbb{E}\!\bigl[(Y - \hat f(x_0))^2\bigr] = \sigma^2 + \bigl(\mathbb{E}[\hat f(x_0)] - f(x_0)\bigr)^2 + \mathrm{Var}(\hat f(x_0)).$$
Solution

Write $\hat f \equiv \hat f(x_0)$ and $f \equiv f(x_0)$ for brevity. Expand around $f$:

$$Y - \hat f = (f + \varepsilon) - \hat f = \varepsilon + (f - \hat f).$$

Square and take expectation:

$$\mathbb{E}[(Y - \hat f)^2] = \mathbb{E}[\varepsilon^2] + 2\, \mathbb{E}[\varepsilon (f - \hat f)] + \mathbb{E}[(f - \hat f)^2].$$

The cross term vanishes: $\varepsilon$ is independent of $\hat f$ (which depends only on the training data) and $\mathbb{E}[\varepsilon] = 0$, so $\mathbb{E}[\varepsilon(f - \hat f)] = \mathbb{E}[\varepsilon]\, \mathbb{E}[f - \hat f] = 0$. The first term is $\sigma^2$.

For the third term, add and subtract $\mathbb{E}[\hat f]$:

$$\mathbb{E}[(f - \hat f)^2] = \mathbb{E}\!\bigl[\bigl((f - \mathbb{E}[\hat f]) + (\mathbb{E}[\hat f] - \hat f)\bigr)^2\bigr].$$

Expanding, $(f - \mathbb{E}[\hat f])$ is a constant, and the cross term $2(f - \mathbb{E}[\hat f])\, \mathbb{E}[\mathbb{E}[\hat f] - \hat f] = 0$. Hence:

$$\mathbb{E}[(f - \hat f)^2] = \underbrace{(f - \mathbb{E}[\hat f])^2}_{\text{bias}^2} + \underbrace{\mathbb{E}[(\hat f - \mathbb{E}[\hat f])^2]}_{\text{variance}}.$$

Combining, the claim follows. $\sigma^2$ is irreducible — no choice of model eliminates it; the rest is the eternal tradeoff between approximation error (bias) and estimation error (variance), mediated by model complexity.
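The decomposition can be audited numerically with a deliberately biased estimator (our own toy setup: $f(x_0) = 2$, $\sigma^2 = 1$, and $\hat f = 0.8 \times$ the training mean, whose bias and variance are known in closed form):

```python
import random

# Each rep: draw a fresh training set, form the shrunk mean, score on a new Y.
random.seed(7)
f0, sigma, n_train, c, reps = 2.0, 1.0, 10, 0.8, 100_000

err2 = fhat_sum = fhat_sq = 0.0
for _ in range(reps):
    train_mean = sum(f0 + random.gauss(0, sigma) for _ in range(n_train)) / n_train
    fhat = c * train_mean                       # biased by construction
    y_new = f0 + random.gauss(0, sigma)         # independent test observation
    err2 += (y_new - fhat) ** 2
    fhat_sum += fhat
    fhat_sq += fhat * fhat

mse = err2 / reps
bias2 = (fhat_sum / reps - f0) ** 2
var = fhat_sq / reps - (fhat_sum / reps) ** 2
print(f"MSE = {mse:.3f}  vs  sigma^2 + bias^2 + var = {sigma**2 + bias2 + var:.3f}")
```

The theoretical pieces are $\sigma^2 = 1$, $\text{bias}^2 = 0.16$, $\text{variance} = 0.064$, summing to $1.224$.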

Exercise 07 KL Between Two Gaussians Information

Let $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$. Derive a closed form for $D_{\mathrm{KL}}(p \,\|\, q)$.

Solution

By definition,

$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_p\!\left[\log \frac{p(X)}{q(X)}\right] = \mathbb{E}_p[\log p(X)] - \mathbb{E}_p[\log q(X)].$$

For a Gaussian, $\log p(x) = -\tfrac{1}{2}\log(2\pi\sigma_1^2) - \tfrac{(x-\mu_1)^2}{2\sigma_1^2}$, so:

$$\mathbb{E}_p[\log p(X)] = -\tfrac{1}{2}\log(2\pi\sigma_1^2) - \tfrac{1}{2}.$$

(Since $\mathbb{E}_p[(X - \mu_1)^2] = \sigma_1^2$.) Likewise,

$$\mathbb{E}_p[\log q(X)] = -\tfrac{1}{2}\log(2\pi\sigma_2^2) - \frac{\mathbb{E}_p[(X - \mu_2)^2]}{2\sigma_2^2}.$$

Expand $(X - \mu_2)^2 = (X - \mu_1)^2 + 2(X - \mu_1)(\mu_1 - \mu_2) + (\mu_1 - \mu_2)^2$ and take expectation under $p$:

$$\mathbb{E}_p[(X - \mu_2)^2] = \sigma_1^2 + (\mu_1 - \mu_2)^2.$$

Putting it together:

$$\boxed{\,D_{\mathrm{KL}}(p \,\|\, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.\,}$$

The multivariate analogue, with $\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$ in $d$ dimensions:

$$D_{\mathrm{KL}} = \tfrac{1}{2}\!\left[\log \tfrac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - d + \mathrm{tr}(\boldsymbol{\Sigma}_2^{-1} \boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^{\!\top} \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\right].$$

This formula is the analytic heartbeat of variational autoencoders: the KL term between encoder $q_\phi(z \mid x)$ and standard-normal prior is an instance of it.
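The boxed formula against a direct Monte Carlo of $\mathbb{E}_p[\log p/q]$ (parameters ours):

```python
import math, random

# Closed form vs sampling for p = N(0,1), q = N(1,2^2).
random.seed(8)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

def log_pdf(x, mu, s):
    return -0.5 * math.log(2 * math.pi * s * s) - (x - mu) ** 2 / (2 * s * s)

closed = math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

n = 100_000
mc = sum(log_pdf(x, mu1, s1) - log_pdf(x, mu2, s2)
         for x in (random.gauss(mu1, s1) for _ in range(n))) / n
print(f"closed form {closed:.4f}  vs  Monte Carlo {mc:.4f}")
```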

Exercise 08 Beta–Binomial Conjugacy Bayesian

You place a Beta$(\alpha, \beta)$ prior on a coin's probability of heads $\theta$ and observe $k$ heads in $n$ flips. Derive the posterior distribution and compare the posterior mean, MLE, and prior mean.

Solution

Likelihood: $p(k \mid \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}$. Prior: $\pi(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$. The posterior is their product, up to normalisation:

$$\pi(\theta \mid k) \propto \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1}.$$

This is a Beta density with updated parameters:

$$\theta \mid k \sim \mathrm{Beta}(\alpha + k,\ \beta + n - k).$$

The posterior mean is:

$$\hat\theta_{\text{post}} = \frac{\alpha + k}{\alpha + \beta + n}.$$

Rewrite as a convex combination:

$$\hat\theta_{\text{post}} = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}} + \frac{n}{\alpha + \beta + n} \cdot \underbrace{\frac{k}{n}}_{\text{MLE}}.$$

The posterior mean is a precision-weighted average of the prior mean and the MLE. As $n \to \infty$, data dominates and the estimator converges to $k/n$. For small $n$ or strong prior ($\alpha + \beta$ large), the prior dominates. This is the Bayesian version of shrinkage.

The parameters $\alpha$ and $\beta$ act as pseudo-counts: a Beta$(\alpha, \beta)$ prior is equivalent to having observed $\alpha - 1$ prior heads and $\beta - 1$ prior tails — an interpretation that makes the update rule almost trivial to remember.
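The update and the convex-combination identity, as a numerical check (the values of $\alpha, \beta, k, n$ are ours):

```python
# Beta(2,2) prior, 7 heads in 10 flips: posterior mean computed two ways.
alpha, beta, k, n = 2.0, 2.0, 7, 10

post_mean = (alpha + k) / (alpha + beta + n)
w = n / (alpha + beta + n)                            # weight on the data
blended = (1 - w) * (alpha / (alpha + beta)) + w * (k / n)
print(post_mean, blended)   # both equal 9/14 = 0.642857...
```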

Exercise 09 Sample Size via CLT Asymptotics

A trading strategy's daily P&L has unknown mean $\mu$ and standard deviation $\sigma \approx 2\%$. You want a $95\%$ confidence interval for $\mu$ with half-width at most $0.1\%$. How many trading days do you need?

Open solution

By the CLT, $\bar X_n \approx \mathcal{N}(\mu, \sigma^2 / n)$ for large $n$. A $95\%$ interval has half-width:

$$h = z_{0.025} \cdot \frac{\sigma}{\sqrt{n}} \approx 1.96 \cdot \frac{\sigma}{\sqrt{n}}.$$

Requiring $h \le 0.001$ with $\sigma = 0.02$:

$$\sqrt{n} \ge \frac{1.96 \cdot 0.02}{0.001} = 39.2 \quad\Longrightarrow\quad n \ge 1537.$$

Roughly six trading years — sobering. The dependence $h \propto 1/\sqrt{n}$ is the fundamental drag of statistics: to halve the uncertainty, you need four times the data. To reach a tenth, a hundred times.

Assumptions worth auditing. The CLT requires finite variance and (for this calculation's cleanness) approximate independence. Financial returns show volatility clustering and heavy tails — the effective sample size can be much smaller than the nominal one, and the Gaussian approximation in the tail is unreliable exactly where risk lives.

Exercise 10 Conditional Multivariate Normal Gaussians

Let $(X, Y) \sim \mathcal{N}\!\left(\begin{pmatrix}\mu_X \\ \mu_Y\end{pmatrix},\ \begin{pmatrix}\sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2\end{pmatrix}\right).$ Derive $\mathbb{E}[Y \mid X = x]$ and $\mathrm{Var}(Y \mid X = x)$.

Solution

Apply the general Gaussian conditioning formula from §IV with blocks of size 1:

$$\mathbb{E}[Y \mid X = x] = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (x - \mu_X) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X).$$ $$\mathrm{Var}(Y \mid X = x) = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} = \sigma_Y^2 (1 - \rho^2).$$

Two observations.

  1. The conditional mean is linear in $x$ with slope $\rho \sigma_Y / \sigma_X$. This is exactly the population regression coefficient of $Y$ on $X$. Gaussian conditioning is linear regression.
  2. The conditional variance does not depend on $x$ — Gaussians are homoscedastic. It shrinks by the factor $1 - \rho^2$: larger correlation means more information from $X$ about $Y$.

Sanity check. If $\rho = 0$: $\mathbb{E}[Y \mid X] = \mu_Y$, $\mathrm{Var}(Y \mid X) = \sigma_Y^2$. Independence recovered. If $|\rho| \to 1$: $\mathrm{Var}(Y \mid X) \to 0$; knowing $X$ pins down $Y$ exactly.

This is the entire mathematical content of a Gaussian Process at a single test point, conditional on training observations. Everything else in GP regression is bookkeeping.
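An empirical check (our own accept-reject sketch): sample the bivariate normal, keep pairs with $X$ near $x = 1$, and compare the conditional mean and variance of $Y$ with the formulas.

```python
import math, random

# Bivariate normal via z1, z2; condition on X falling in a thin band around 1.
random.seed(9)
mu_x, mu_y, s_x, s_y, rho = 0.0, 0.0, 1.0, 2.0, 0.8
x0, band, ys = 1.0, 0.05, []

while len(ys) < 10_000:
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = mu_x + s_x * z1
    y = mu_y + s_y * (rho * z1 + math.sqrt(1 - rho**2) * z2)
    if abs(x - x0) < band:
        ys.append(y)

m = sum(ys) / len(ys)
v = sum((y - m) ** 2 for y in ys) / (len(ys) - 1)
print(f"E[Y|X=1] = {m:.3f} (theory {mu_y + rho * s_y / s_x * (x0 - mu_x):.3f}),  "
      f"Var = {v:.3f} (theory {s_y**2 * (1 - rho**2):.3f})")
```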
Exercise 11 Chebyshev in Practice Concentration

Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Using Chebyshev, find an $n$ guaranteeing $P(|\bar X_n - \mu| \ge \varepsilon) \le \delta$. Compare with the Hoeffding bound when $X_i \in [0, 1]$.

Solution

Chebyshev. $\bar X_n$ has mean $\mu$ and variance $\sigma^2 / n$. Applying the inequality:

$$P(|\bar X_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n \varepsilon^2}.$$

For the bound to be at most $\delta$, pick $n \ge \sigma^2 / (\delta \varepsilon^2)$. The dependence is $1/\delta$ — polynomial in the confidence parameter.

Hoeffding. For $X_i \in [0, 1]$:

$$P(|\bar X_n - \mu| \ge \varepsilon) \le 2 e^{-2 n \varepsilon^2}.$$

To make this at most $\delta$: $n \ge \dfrac{1}{2\varepsilon^2} \log(2/\delta)$. Dependence on $\delta$ is $\log(1/\delta)$ — exponentially better.

Concrete comparison. Take $\varepsilon = 0.05$, $\delta = 0.01$, and bounded $X_i \in [0,1]$ so $\sigma^2 \le 1/4$. Chebyshev demands $n \ge 10{,}000$; Hoeffding demands only $n \ge 1{,}060$.

Chebyshev assumes only finite variance and is tight only for pathological distributions; Hoeffding exploits boundedness and tightens the tail dramatically. When you know more, you get more — a general lesson in concentration theory.
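The concrete comparison reproduces in a few lines:

```python
import math

# Sample sizes from the two bounds at eps = 0.05, delta = 0.01, sigma^2 <= 1/4.
eps, delta, var_bound = 0.05, 0.01, 0.25

n_chebyshev = math.ceil(var_bound / (delta * eps**2))
n_hoeffding = math.ceil(math.log(2 / delta) / (2 * eps**2))
print(n_chebyshev, n_hoeffding)   # 10000 1060
```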

Exercise 12 Stationary Distribution Markov Chains

A weather Markov chain has states {Sun, Rain}. Transitions: from Sun, $P(\text{Sun} \to \text{Rain}) = 0.1$. From Rain, $P(\text{Rain} \to \text{Sun}) = 0.5$. Find the stationary distribution and the expected return time to Sun.

Open solution

Transition matrix (rows = from, columns = to), ordering [Sun, Rain]:

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix}.$$

A stationary distribution $\boldsymbol{\pi} = (\pi_S, \pi_R)$ satisfies $\boldsymbol{\pi} P = \boldsymbol{\pi}$ and $\pi_S + \pi_R = 1$. Component-wise:

$$\pi_S = 0.9\, \pi_S + 0.5\, \pi_R, \qquad \pi_R = 0.1\, \pi_S + 0.5\, \pi_R.$$

The first gives $0.1\, \pi_S = 0.5\, \pi_R$, i.e.\ $\pi_S = 5 \pi_R$. With the normalisation $6 \pi_R = 1$:

$$\pi_S = \tfrac{5}{6}, \qquad \pi_R = \tfrac{1}{6}.$$

Expected return time. For an irreducible, positive recurrent chain, the mean return time to state $i$ is $1/\pi_i$ (Kac's lemma). Thus:

$$\mathbb{E}[\text{return time to Sun}] = \frac{1}{\pi_S} = \frac{6}{5} = 1.2 \text{ days}.$$

That return time is small because the chain spends most of its time in Sun anyway; the punchline is the general principle, not the number.

Every MCMC algorithm is an engineered Markov chain whose stationary distribution is the target posterior. Metropolis–Hastings constructs a chain satisfying detailed balance for an arbitrary target — a genuine conjuring trick.
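Kac's lemma can be verified by direct play (our own simulation sketch): solve the two-state balance equation in closed form, then average simulated return times to Sun.

```python
import random

# Two-state chain: flow balance gives pi_R = p_SR / (p_SR + p_RS).
p_sr, p_rs = 0.1, 0.5                    # P(Sun->Rain), P(Rain->Sun)
pi_r = p_sr / (p_sr + p_rs)
pi_s = 1 - pi_r

random.seed(10)
def return_time_to_sun():
    state, t = "sun", 0
    while True:
        t += 1
        if state == "sun":
            state = "rain" if random.random() < p_sr else "sun"
        else:
            state = "sun" if random.random() < p_rs else "rain"
        if state == "sun":
            return t

reps = 50_000
avg = sum(return_time_to_sun() for _ in range(reps)) / reps
print(f"pi = ({pi_s:.3f}, {pi_r:.3f}),  mean return time = {avg:.3f} (theory {1 / pi_s:.1f})")
```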
Exercise 13 Importance Sampling Monte Carlo

To estimate $\mu = \mathbb{E}_p[h(X)]$ when sampling from $p$ is costly, we draw $X_i \stackrel{\text{iid}}{\sim} q$ (with $q(x) \gt 0$ wherever $p(x) h(x) \ne 0$) and use the importance-sampled estimator:

$$\hat\mu = \frac{1}{n} \sum_{i=1}^n h(X_i)\, \frac{p(X_i)}{q(X_i)}.$$

Show that $\hat\mu$ is unbiased, derive its variance, and find the $q$ minimising that variance.

Solution

Unbiasedness. For each $i$,

$$\mathbb{E}_q\!\left[h(X)\, \frac{p(X)}{q(X)}\right] = \int h(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \int h(x)\, p(x)\, dx = \mu.$$

So $\mathbb{E}[\hat\mu] = \mu$ regardless of $q$ (given the support condition).

Variance. Using $\mathrm{Var}(\hat\mu) = \tfrac{1}{n} \mathrm{Var}_q\!\bigl[h(X) p(X)/q(X)\bigr]$:

$$\mathrm{Var}(\hat\mu) = \frac{1}{n}\!\left[\int \frac{h^2(x)\, p^2(x)}{q(x)}\, dx - \mu^2\right].$$

Optimal $q$. We minimise $\int h^2 p^2 / q$ subject to $\int q = 1$. A Lagrangian argument (or Cauchy–Schwarz) gives:

$$q^*(x) \propto |h(x)|\, p(x).$$

If $h \ge 0$, the minimum variance is zero: $q^*(x) = h(x) p(x)/\mu$ makes the importance weight $\mu$ for every sample, so $\hat\mu = \mu$ exactly. The catch is that finding $\mu$ to normalise $q^*$ is the original problem. But the structural lesson survives:

Sample preferentially where $|h(x)|\, p(x)$ is large.

Practical diagnostic. The effective sample size is $n_{\text{eff}} = (\sum w_i)^2 / \sum w_i^2$ where $w_i = p(X_i)/q(X_i)$. When $n_{\text{eff}} \ll n$, the estimate is dominated by a handful of extreme weights — a sign that $q$ is a poor proposal and the variance may even be infinite.
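A worked sketch on the Gaussian tail $\mu = P(X \gt 3)$ under $p = \mathcal{N}(0,1)$ (proposal and rate are our choices, guided by the lesson above: concentrate $q$ where $|h|\,p$ lives, here a shifted exponential on $(3, \infty)$):

```python
import math, random

# Importance sampling for P(X > 3), X ~ N(0,1), with q = 3 + Exp(rate).
random.seed(11)
c, rate, n = 3.0, 3.0, 20_000

half_log_2pi = math.log(2 * math.pi) / 2
w_sum = w_sq = 0.0
for _ in range(n):
    x = c + random.expovariate(rate)               # draw from q, always > 3
    log_p = -half_log_2pi - x * x / 2              # log phi(x)
    log_q = math.log(rate) - rate * (x - c)        # log density of q at x
    w = math.exp(log_p - log_q)                    # h(x) = 1 on q's support
    w_sum += w
    w_sq += w * w

estimate = w_sum / n
n_eff = w_sum ** 2 / w_sq
print(f"P(X>3) = {estimate:.3e} (true = 1.350e-3),  n_eff/n = {n_eff / n:.2f}")
```

Because the weights are bounded and decreasing on $(3, \infty)$, the effective sample size stays close to $n$; a naive proposal (say $q = p$) would waste essentially every draw.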

Exercise 14 The Maximum of Uniforms Order Statistics

Let $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Uniform}(0, 1)$ and $M_n = \max_i X_i$. Find the CDF, PDF, and expectation of $M_n$. Then describe the limiting distribution of $n(1 - M_n)$ as $n \to \infty$.

Solution

CDF. Max is at most $m$ iff all are at most $m$:

$$F_{M_n}(m) = P(M_n \le m) = P(X_1 \le m, \ldots, X_n \le m) = m^n, \quad m \in [0,1].$$

Differentiate: $f_{M_n}(m) = n\, m^{n-1}$ on $[0, 1]$. Expectation:

$$\mathbb{E}[M_n] = \int_0^1 m \cdot n\, m^{n-1}\, dm = \frac{n}{n+1}.$$

The max creeps toward $1$ but never arrives — the gap is $1/(n+1)$, of order $1/n$.

Limiting law. Consider $Y_n = n(1 - M_n)$. Its CDF:

$$P(Y_n \le y) = P\!\left(M_n \ge 1 - \tfrac{y}{n}\right) = 1 - \left(1 - \tfrac{y}{n}\right)^n \to 1 - e^{-y} \quad \text{for } y \ge 0.$$

So $Y_n \xrightarrow{d} \mathrm{Exp}(1)$. The gap between the max and the upper boundary is, on the $1/n$ scale, exponentially distributed.

This is a baby form of extreme value theory. The three possible limiting laws for suitably normalised maxima — Gumbel, Fréchet, Weibull — underlie value-at-risk modelling for tail losses, the statistics of floods, and the distribution of record insurance claims.
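Both the expectation and the exponential limit show up in a short simulation (illustrative, $n = 100$ of our choosing):

```python
import random

# Simulate M_n = max of n uniforms; check E[M_n] = n/(n+1) and the Exp(1) gap law.
random.seed(12)
n, reps = 100, 20_000

m_sum, gaps = 0.0, []
for _ in range(reps):
    m = max(random.random() for _ in range(n))
    m_sum += m
    gaps.append(n * (1 - m))

m_mean = m_sum / reps
frac_le_1 = sum(g <= 1 for g in gaps) / reps
print(f"E[M_n] = {m_mean:.4f} (theory {n / (n + 1):.4f})")
print(f"P(n(1-M_n) <= 1) = {frac_le_1:.3f} (Exp(1): 1 - e^-1 = 0.632)")
```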
Exercise 15 Gambler's Ruin Martingales

A gambler starts with capital $k$ and plays a fair game: on each turn her wealth moves by $\pm 1$ with probability $\tfrac{1}{2}$ each, independently. She stops when her wealth reaches either $0$ (ruin) or $N$ (target), with $0 \lt k \lt N$. Find the probability she reaches $N$ before $0$, using a martingale argument.

Solution

Let $S_n$ be her fortune after $n$ turns; $S_0 = k$. Each step $S_{n+1} - S_n$ is $\pm 1$ with equal probability, independent of the past. Hence $\{S_n\}$ is a martingale:

$$\mathbb{E}[S_{n+1} \mid S_1, \ldots, S_n] = S_n + \mathbb{E}[\pm 1] = S_n.$$

Let $\tau = \inf\{n : S_n \in \{0, N\}\}$. The chain moves on a finite state space and cannot avoid the boundaries forever, so $\tau \lt \infty$ a.s. with $\mathbb{E}[\tau] \lt \infty$; moreover the stopped process is bounded by $N$. These conditions let us apply the optional stopping theorem:

$$\mathbb{E}[S_\tau] = \mathbb{E}[S_0] = k.$$

But $S_\tau$ takes only the values $0$ or $N$:

$$\mathbb{E}[S_\tau] = 0 \cdot P(\text{ruin}) + N \cdot P(\text{reach } N).$$

Equating:

$$\boxed{\,P(\text{reach } N \text{ before } 0) = \frac{k}{N}.\,}$$

Remark on ruin. Start from $k = 100$ aiming for $N = 200$: you succeed with probability $1/2$. Aim for $N = 1000$ from the same starting capital $k = 100$, and the probability drops to $1/10$. A fair game at the individual step is a losing proposition against a wealthier counterparty — the institutional edge in a casino.

The biased case. If $P(\text{win}) = p \ne 1/2$ with $q = 1-p$, $S_n$ is no longer a martingale, but $(q/p)^{S_n}$ is. The same optional-stopping logic yields:

$$P(\text{reach } N) = \frac{1 - (q/p)^k}{1 - (q/p)^N}.$$
This is the foundation of martingale pricing: discount a derivative's payoff under the risk-neutral measure, and its price is an expectation of a stopped martingale. Gambler's ruin is Black–Scholes in miniature.
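The fair-game answer $k/N$ is trivial to confirm by direct play (a sketch; $k = 3$, $N = 10$ are our choices):

```python
import random

# Play the fair +-1 game until the fortune hits 0 or N.
random.seed(13)

def reaches_target(k, N):
    s = k
    while 0 < s < N:
        s += 1 if random.random() < 0.5 else -1
    return s == N

k, N, reps = 3, 10, 20_000
p_hat = sum(reaches_target(k, N) for _ in range(reps)) / reps
print(f"P(reach {N} before 0) = {p_hat:.3f} (theory {k / N})")
```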
Exercise 16 Borel–Cantelli Measure Theory

Let $X_1, X_2, \ldots$ be i.i.d.\ standard normal. Show that $\limsup_n X_n / \sqrt{2 \log n} = 1$ almost surely. (Just prove the $\le 1$ direction — it is the one that uses BC I.)

Open solution

Fix $\varepsilon \gt 0$ and let $a_n = (1 + \varepsilon)\sqrt{2 \log n}$. Define $A_n = \{X_n \ge a_n\}$. The Mills ratio gives the standard Gaussian tail bound:

$$P(X_n \ge a) \le \frac{1}{a \sqrt{2\pi}}\, e^{-a^2/2}, \qquad a \gt 0.$$

Substituting $a = a_n$:

$$P(A_n) \le \frac{1}{(1+\varepsilon)\sqrt{2\pi \cdot 2 \log n}}\, n^{-(1+\varepsilon)^2} = O\!\left(\frac{1}{n^{(1+\varepsilon)^2} \sqrt{\log n}}\right).$$

Since $(1 + \varepsilon)^2 \gt 1$, the sum $\sum_n P(A_n)$ converges. The first Borel–Cantelli lemma then gives:

$$P(A_n \text{ i.o.}) = 0,$$

i.e.\ $X_n \lt (1 + \varepsilon)\sqrt{2 \log n}$ eventually, a.s. Hence $\limsup_n X_n/\sqrt{2 \log n} \le 1 + \varepsilon$ a.s., and taking $\varepsilon \downarrow 0$ along rationals gives the stated upper bound.

The matching $\ge 1$ bound uses BC II (the independence half) applied to $A_n = \{X_n \ge (1 - \varepsilon)\sqrt{2 \log n}\}$: the sum now diverges, independence applies, and infinitely many $A_n$ occur.

Takeaway. $\sqrt{2 \log n}$ is the right scale for the maximum of $n$ i.i.d.\ Gaussians — the foundation of extreme-value results for light-tailed distributions.

Exercise 17 Change of Measure Radon–Nikodym

Let $\mathbb{P}$ make $X \sim \mathcal{N}(0, 1)$ and $\mathbb{Q}$ make $X \sim \mathcal{N}(\mu, 1)$, on the same measurable space. Find $d\mathbb{Q}/d\mathbb{P}$ and use it to compute $\mathbb{E}_{\mathbb{P}}[X \cdot \mathbf{1}_{\{X \gt c\}}]$ in terms of $\mathbb{Q}$-probabilities.

Solution

Both measures have densities with respect to Lebesgue, call them $\varphi$ (standard normal) and $\varphi_\mu$ (shifted). The Radon–Nikodym derivative is their ratio:

$$\frac{d\mathbb{Q}}{d\mathbb{P}}(x) = \frac{\varphi_\mu(x)}{\varphi(x)} = \exp\!\left(\mu x - \tfrac{1}{2}\mu^2\right).$$

The rule: for any integrable $g$, $\mathbb{E}_{\mathbb{Q}}[g(X)] = \mathbb{E}_{\mathbb{P}}\!\bigl[g(X)\, (d\mathbb{Q}/d\mathbb{P})(X)\bigr]$. Running it in reverse: with $g(x) = x \cdot \mathbf{1}_{\{x \gt c\}}$, we want something of the form $\mathbb{E}_{\mathbb{P}}[g(X)] = \mathbb{E}_{\mathbb{Q}}[g(X)\, (d\mathbb{P}/d\mathbb{Q})(X)]$. Inverting the RN derivative:

$$\frac{d\mathbb{P}}{d\mathbb{Q}}(x) = \exp\!\left(-\mu x + \tfrac{1}{2}\mu^2\right).$$

Pick $\mu = 1$ (any nonzero $\mu$ works; this one cancels the $x$ neatly). Then:

$$\mathbb{E}_{\mathbb{P}}[X \cdot \mathbf{1}_{\{X \gt c\}}] = \mathbb{E}_{\mathbb{Q}}\!\left[X\, e^{-X + 1/2} \cdot \mathbf{1}_{\{X \gt c\}}\right].$$

Under $\mathbb{Q}$, $X = 1 + Z$ for $Z \sim \mathcal{N}(0,1)$, so $X\, e^{-X + 1/2} = (1 + Z)\, e^{-Z - 1/2}$. A direct Gaussian integral — after completing the square — yields:

$$\mathbb{E}_{\mathbb{P}}[X \cdot \mathbf{1}_{\{X \gt c\}}] = \varphi(c),$$

the standard normal PDF at $c$. (A one-line alternative: note that $X\,\varphi(X) = -\varphi'(X)$ and integrate.)

Moral. Changing measure is a tool for evaluating awkward integrals by shifting the "centre of gravity" of the distribution. In finance, this is exactly what the risk-neutral measure does: under $\mathbb{Q}$, stocks drift at the riskless rate, turning pricing into a cleaner expectation.

Exercise 18 Doob's Maximal Inequality Martingale Tools

Let $S_n = X_1 + \cdots + X_n$ be a sum of i.i.d.\ mean-zero random variables with $\mathrm{Var}(X_1) = \sigma^2 \lt \infty$. Show that for any $\lambda \gt 0$:

$$P\!\left(\max_{1 \le k \le n} |S_k| \ge \lambda\right) \le \frac{n \sigma^2}{\lambda^2}.$$
Solution

$S_n$ is a martingale with respect to its natural filtration. Since $x \mapsto x^2$ is convex, conditional Jensen gives:

$$\mathbb{E}[S_{k+1}^2 \mid \mathcal{F}_k] \ge (\mathbb{E}[S_{k+1} \mid \mathcal{F}_k])^2 = S_k^2,$$

so $\{S_k^2\}$ is a non-negative submartingale. Apply Doob's maximal inequality to it at level $\lambda^2$:

$$P\!\left(\max_{1 \le k \le n} S_k^2 \ge \lambda^2\right) \le \frac{\mathbb{E}[S_n^2]}{\lambda^2}.$$

The event $\{\max_k S_k^2 \ge \lambda^2\}$ equals $\{\max_k |S_k| \ge \lambda\}$, and $\mathbb{E}[S_n^2] = n \sigma^2$ (variance of sum of independents). The claim follows.

Comparison with Chebyshev. The naïve Chebyshev bound $P(|S_n| \ge \lambda) \le n\sigma^2 / \lambda^2$ controls only the final position. Doob upgrades this to control over the entire trajectory, at no cost. That jump — from a pointwise bound to a uniform-in-time bound — is the signature superpower of martingale techniques, and appears throughout online learning regret analyses.

Exercise 19 Itô's Formula in Action Stochastic Calculus

Using Itô's formula, compute $d(W_t^2)$ and use the result to show $\mathbb{E}[W_t^2] = t$. Then compute $\mathbb{E}[W_t^4]$.

Solution

Square. Apply Itô to $f(x) = x^2$ with $X_t = W_t$, so $\mu_t = 0$, $\sigma_t = 1$. Then $f'(x) = 2x$, $f''(x) = 2$:

$$d(W_t^2) = 2 W_t\, dW_t + \tfrac{1}{2} \cdot 2 \cdot 1 \cdot dt = 2 W_t\, dW_t + dt.$$

Integrating from $0$ to $t$:

$$W_t^2 = 2 \int_0^t W_s\, dW_s + t.$$

Take expectations. The Itô integral $\int_0^t W_s\, dW_s$ is a martingale starting at $0$, so its expectation vanishes, and:

$$\mathbb{E}[W_t^2] = t. \quad \checkmark$$

We have just recovered a fact we already knew from the Gaussian definition, but through a method that doesn't need the Gaussian structure — it works for any Itô process.

Fourth moment. Apply Itô to $g(x) = x^4$: $g'(x) = 4x^3$, $g''(x) = 12 x^2$. With $X_t = W_t$:

$$d(W_t^4) = 4 W_t^3\, dW_t + \tfrac{1}{2} \cdot 12\, W_t^2\, dt = 4 W_t^3\, dW_t + 6 W_t^2\, dt.$$

Integrate and take expectations. The Itô integral contributes zero:

$$\mathbb{E}[W_t^4] = 6 \int_0^t \mathbb{E}[W_s^2]\, ds = 6 \int_0^t s\, ds = 3 t^2.$$

Consistent with $W_t \sim \mathcal{N}(0, t)$ (fourth moment of a Gaussian is $3\sigma^4$). Itô's formula has effectively converted a moment computation into an ODE.

Takeaway. Itô's formula systematically reduces the moments of Itô processes to deterministic integrals — and this is the reason stochastic calculus can price options in closed form.
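The two moments are quick to confirm by direct sampling of $W_t \sim \mathcal{N}(0, t)$ (a consistency check of the results, not of the Itô derivation itself; $t = 2$ is our choice):

```python
import math, random

# Sample W_t ~ N(0, t) and check E[W_t^2] = t, E[W_t^4] = 3 t^2 at t = 2.
random.seed(14)
t, n = 2.0, 200_000

m2 = m4 = 0.0
for _ in range(n):
    w = random.gauss(0.0, math.sqrt(t))
    m2 += w * w
    m4 += w ** 4

print(f"E[W_t^2] = {m2 / n:.3f} (theory {t}),  E[W_t^4] = {m4 / n:.2f} (theory {3 * t * t})")
```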

Exercise 20 Geometric Brownian Motion SDE · Quant

Verify by Itô's formula that $S_t = S_0 \exp\!\bigl((\mu - \tfrac{1}{2}\sigma^2) t + \sigma W_t\bigr)$ solves the SDE $dS_t = \mu S_t\, dt + \sigma S_t\, dW_t$. Then compute $\mathbb{E}[S_t]$ and $\mathrm{Var}(S_t)$.

Solution

Verification. Set $X_t = (\mu - \tfrac{1}{2}\sigma^2) t + \sigma W_t$, so $dX_t = (\mu - \tfrac{1}{2}\sigma^2)\, dt + \sigma\, dW_t$. Apply Itô to $f(x) = S_0 e^x$, noting $f'(x) = f(x) = S_t$ and $f''(x) = S_t$, with diffusion coefficient of $X_t$ equal to $\sigma$:

$$dS_t = S_t\, dX_t + \tfrac{1}{2} S_t\, \sigma^2\, dt = S_t\bigl[(\mu - \tfrac{1}{2}\sigma^2)\, dt + \sigma\, dW_t\bigr] + \tfrac{1}{2} \sigma^2 S_t\, dt.$$

The $-\tfrac{1}{2}\sigma^2$ inside $dX_t$ cancels against the $+\tfrac{1}{2}\sigma^2$ from the Itô correction. What remains:

$$dS_t = \mu S_t\, dt + \sigma S_t\, dW_t. \quad \checkmark$$

The exponent's strange-looking $-\tfrac{1}{2}\sigma^2$ is exactly what the Itô correction demands.

Moments. $S_t = S_0 \exp(Y_t)$ where $Y_t \sim \mathcal{N}\bigl((\mu - \tfrac{1}{2}\sigma^2)t,\ \sigma^2 t\bigr)$. Use the log-normal MGF $\mathbb{E}[e^{aY}] = \exp\!\bigl(a\, m + \tfrac{1}{2} a^2 v\bigr)$ for $Y \sim \mathcal{N}(m, v)$:

$$\mathbb{E}[S_t] = S_0 \exp\!\bigl((\mu - \tfrac{1}{2}\sigma^2) t + \tfrac{1}{2} \sigma^2 t\bigr) = S_0\, e^{\mu t}.$$

A clean and reassuring result: the geometric-Brownian-motion asset grows at the deterministic rate $\mu$ in expectation, even though each path is noisy and the median grows slower (at $\mu - \tfrac{1}{2}\sigma^2$, the Itô-corrected drift). The gap $\tfrac{1}{2}\sigma^2$ is called the volatility drag.

For the second moment, apply the MGF again with $a = 2$:

$$\mathbb{E}[S_t^2] = S_0^2 \exp\!\bigl(2(\mu - \tfrac{1}{2}\sigma^2) t + 2 \sigma^2 t\bigr) = S_0^2\, e^{(2\mu + \sigma^2) t}.$$

Subtracting:

$$\mathrm{Var}(S_t) = S_0^2\, e^{2 \mu t}\bigl(e^{\sigma^2 t} - 1\bigr).$$
Under the risk-neutral measure $\mathbb{Q}$, $\mu$ is replaced by the riskless rate $r$, and Black–Scholes prices a European call as $\mathbb{E}_{\mathbb{Q}}[e^{-rT}(S_T - K)^+]$. Computing that expectation against the log-normal distribution derived here produces the Black–Scholes formula in about a page of algebra.
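Both moments can be checked by simulating the exact solution (parameter values are ours for illustration):

```python
import math, random

# GBM via its closed form; check E[S_t] = S0 e^{mu t} and the variance formula.
random.seed(15)
s0, mu, sigma, t, n = 1.0, 0.05, 0.2, 1.0, 200_000

total = total_sq = 0.0
for _ in range(n):
    w = random.gauss(0.0, math.sqrt(t))
    s = s0 * math.exp((mu - 0.5 * sigma**2) * t + sigma * w)   # exact solution
    total += s
    total_sq += s * s

mean = total / n
var = total_sq / n - mean**2
theory_mean = s0 * math.exp(mu * t)
theory_var = s0**2 * math.exp(2 * mu * t) * (math.exp(sigma**2 * t) - 1)
print(f"E[S_t] = {mean:.4f} (theory {theory_mean:.4f}),  "
      f"Var = {var:.4f} (theory {theory_var:.4f})")
```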
Exercise 21 Two-Sample Z-Test & Power Testing

An A/B test compares two conversion rates. Group $A$ has $n_A$ users with sample mean $\bar X_A$; group $B$ has $n_B$ users with sample mean $\bar X_B$. Assume known variances $\sigma_A^2, \sigma_B^2$ for simplicity. Derive the test statistic for $H_0: \mu_A = \mu_B$ against $H_1: \mu_A \gt \mu_B$, and compute the sample size needed for power $1 - \beta$ at a specified effect size $\Delta = \mu_A - \mu_B$.

Solution.

Test statistic. By the CLT (or exactly, if data are Gaussian), $\bar X_A - \bar X_B$ is approximately normal with mean $\mu_A - \mu_B$ and variance $\sigma_A^2/n_A + \sigma_B^2/n_B$. Under $H_0$:

$$Z = \frac{\bar X_A - \bar X_B}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}} \sim \mathcal{N}(0, 1).$$

Reject $H_0$ at level $\alpha$ when $Z \gt z_\alpha$.

Power calculation. Under $H_1$ with true difference $\Delta$, $Z$ shifts:

$$Z \sim \mathcal{N}\!\left(\frac{\Delta}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}},\ 1\right).$$

Power is:

$$1 - \beta = P(Z \gt z_\alpha \mid H_1) = \Phi\!\left(\frac{\Delta}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}} - z_\alpha\right).$$

Sample size. Take $n_A = n_B = n$ and $\sigma_A = \sigma_B = \sigma$. Writing $z_\beta = \Phi^{-1}(1 - \beta)$, the power condition becomes $\Delta / \sqrt{2\sigma^2 / n} = z_\alpha + z_\beta$; solving for $n$:

$$n = \frac{2 \sigma^2 (z_\alpha + z_\beta)^2}{\Delta^2}.$$

Detecting half the effect requires four times the sample — the same square-root law that haunted Exercise 9. Shrinking $\sigma$ (e.g. by better instrumentation or stratification) is the most efficient lever, because $n$ depends on $\sigma^2$.

The peeking problem: running this test repeatedly and stopping the first time $Z$ crosses $z_\alpha$ inflates the Type I error dramatically. Proper sequential testing (SPRT, always-valid p-values, or group-sequential designs) is the fix; naïve repetition is the industry-wide bug.
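The sample-size formula checks out end-to-end with only the standard library's `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$ (the helper names and the example numbers are mine):

```python
import math
from statistics import NormalDist

def required_n(sigma, delta, alpha=0.05, power=0.8):
    """Per-group n for the one-sided two-sample z-test: 2 sigma^2 (z_a + z_b)^2 / delta^2."""
    z = NormalDist().inv_cdf
    return 2 * sigma**2 * (z(1 - alpha) + z(power)) ** 2 / delta**2

def achieved_power(n, sigma, delta, alpha=0.05):
    """Power at per-group size n: Phi(delta / sqrt(2 sigma^2 / n) - z_alpha)."""
    nd = NormalDist()
    return nd.cdf(delta / math.sqrt(2 * sigma**2 / n) - nd.inv_cdf(1 - alpha))

n = math.ceil(required_n(sigma=1.0, delta=0.2))    # -> 310
print(n, achieved_power(n, sigma=1.0, delta=0.2))  # power just above 0.8
```

Halving `delta` to 0.1 quadruples the requirement, as the $\Delta^2$ in the denominator dictates.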
Exercise 22 · Fisher Information for Gaussians · Estimation

Let $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with both parameters unknown. Compute the $2 \times 2$ Fisher information matrix $\mathcal{I}(\mu, \sigma^2)$. Show the sample mean achieves the Cramér–Rao bound for $\mu$; find the CRB for any unbiased estimator of $\sigma^2$.

Solution.

Per-observation log-density:

$$\log f(x; \mu, \sigma^2) = -\tfrac{1}{2} \log(2\pi \sigma^2) - \frac{(x - \mu)^2}{2 \sigma^2}.$$

First derivatives (the score):

$$\frac{\partial \log f}{\partial \mu} = \frac{x - \mu}{\sigma^2}, \qquad \frac{\partial \log f}{\partial \sigma^2} = -\frac{1}{2\sigma^2} + \frac{(x - \mu)^2}{2 \sigma^4}.$$

Second derivatives:

$$\frac{\partial^2 \log f}{\partial \mu^2} = -\frac{1}{\sigma^2}, \qquad \frac{\partial^2 \log f}{\partial \mu\, \partial \sigma^2} = -\frac{x - \mu}{\sigma^4}, \qquad \frac{\partial^2 \log f}{\partial (\sigma^2)^2} = \frac{1}{2\sigma^4} - \frac{(x - \mu)^2}{\sigma^6}.$$

Taking $-\mathbb{E}$ kills the off-diagonal (since $\mathbb{E}[X - \mu] = 0$) and simplifies the diagonal using $\mathbb{E}[(X-\mu)^2] = \sigma^2$:

$$\mathcal{I}(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$

The orthogonality of $\mu$ and $\sigma^2$ in the Fisher metric is what makes Gaussian inference so clean — the MLEs of $\mu$ and $\sigma^2$ are asymptotically independent.

CRBs. With $n$ observations the Fisher information scales by $n$. For any unbiased estimator:

$$\mathrm{Var}(\hat\mu) \ge \frac{\sigma^2}{n}, \qquad \mathrm{Var}(\widehat{\sigma^2}) \ge \frac{2 \sigma^4}{n}.$$

The sample mean has variance exactly $\sigma^2/n$ — efficient. The unbiased sample variance $S^2 = \tfrac{1}{n-1}\sum(X_i - \bar X)^2$ has variance $2\sigma^4/(n-1)$, slightly exceeding the CRB; no unbiased estimator achieves it exactly.
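Both bounds, and the small gap for $S^2$, show up in a Monte Carlo experiment: estimate the sampling variance of $\bar X$ and of $S^2$ over many replications (function name and parameters are mine; `statistics.variance` is the unbiased $/(n-1)$ estimator):

```python
import random
import statistics

def sampling_variances(n, mu=0.0, sigma=2.0, reps=20_000, seed=1):
    """Monte Carlo variance of the sample mean and of the unbiased sample variance."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(xs))
        variances.append(statistics.variance(xs))  # divides by n - 1
    return statistics.variance(means), statistics.variance(variances)

var_mean, var_s2 = sampling_variances(n=10)
# CRB for mu: sigma^2/n = 0.4 (attained by the sample mean)
# CRB for sigma^2: 2 sigma^4/n = 3.2; S^2 actually has 2 sigma^4/(n-1) ≈ 3.56
print(var_mean, var_s2)
```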

Exercise 23 · Change of Variables · Transformations

Let $X \sim \mathcal{N}(\mu, \sigma^2)$ and define $Y = e^X$. Derive the density of $Y$ (the log-normal), then compute $\mathbb{E}[Y]$ two different ways: by direct integration and by recognising the normal MGF.

Solution.

Density. The map $g(x) = e^x$ is a smooth, strictly-increasing bijection $\mathbb{R} \to (0, \infty)$, with inverse $g^{-1}(y) = \log y$ and Jacobian $\tfrac{d}{dy} \log y = 1/y$. Applying the change-of-variables formula:

$$f_Y(y) = f_X(\log y)\, \left|\frac{1}{y}\right| = \frac{1}{y\, \sigma \sqrt{2\pi}}\, \exp\!\left(-\frac{(\log y - \mu)^2}{2\sigma^2}\right), \quad y \gt 0.$$

Expectation via MGF. By definition, $\mathbb{E}[Y] = \mathbb{E}[e^X] = M_X(1)$, where the MGF of $\mathcal{N}(\mu, \sigma^2)$ is $M_X(t) = \exp(\mu t + \tfrac{1}{2}\sigma^2 t^2)$. Substituting $t = 1$:

$$\mathbb{E}[Y] = \exp\!\left(\mu + \tfrac{1}{2}\sigma^2\right).$$

Expectation by direct integration. Complete the square:

$$\mathbb{E}[Y] = \int_{-\infty}^\infty \frac{e^x}{\sigma \sqrt{2\pi}}\, \exp\!\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) dx.$$

Combining exponents: $-\tfrac{(x - \mu)^2}{2\sigma^2} + x = -\tfrac{1}{2\sigma^2}\!\left[x - (\mu + \sigma^2)\right]^2 + \mu + \tfrac{1}{2}\sigma^2$. The remaining integrand is a Gaussian density centred at $\mu + \sigma^2$ and integrates to $1$:

$$\mathbb{E}[Y] = \exp\!\left(\mu + \tfrac{1}{2}\sigma^2\right). \checkmark$$

The $+\tfrac{1}{2}\sigma^2$ is the same quantity that appeared with opposite sign in the GBM solution (Exercise 20). Both are fingerprints of Itô / Jensen working in a world where $\log$ is concave: you cannot freely move a log across an expectation.
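The two routes to $\mathbb{E}[Y]$ can be compared numerically: integrate $e^x f_X(x)$ with a plain trapezoidal rule over a wide window and set the result against the MGF value. A sketch under my own choice of window and step count:

```python
import math

mu, sigma = 0.3, 0.5

def integrand(x):
    """e^x times the N(mu, sigma^2) density."""
    pdf = math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
    return math.exp(x) * pdf

# Trapezoidal rule; after completing the square the integrand is a
# Gaussian bump centred at mu + sigma^2, so a wide window suffices.
lo, hi, steps = mu - 12 * sigma, mu + 12 * sigma, 20_000
h = (hi - lo) / steps
direct = h * (0.5 * (integrand(lo) + integrand(hi))
              + sum(integrand(lo + i * h) for i in range(1, steps)))

via_mgf = math.exp(mu + 0.5 * sigma**2)
print(direct, via_mgf)  # both ≈ 1.5296
```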

Exercise 24 · Mutual Information of a Channel · Information

A binary symmetric channel transmits $X \in \{0, 1\}$ and outputs $Y = X \oplus N$ where $\oplus$ is XOR and $N \sim \mathrm{Bernoulli}(\varepsilon)$ is an independent noise bit ($0 \le \varepsilon \le 1/2$). Compute $I(X; Y)$ when $X \sim \mathrm{Bernoulli}(p)$, and find the channel capacity $C = \max_p I(X; Y)$.

Solution.

Write $H_2(q) = -q \log q - (1-q) \log(1-q)$ for the binary entropy function (base 2 gives bits; use natural log for nats — all formulas scale).

Conditional entropy. Given $X = x$, $Y = x \oplus N$, so $Y \mid X \sim \mathrm{Bernoulli}(\varepsilon)$ regardless of $x$. Hence:

$$H(Y \mid X) = H_2(\varepsilon).$$

Marginal entropy. $P(Y = 1) = P(X = 1) P(N = 0) + P(X = 0) P(N = 1) = p(1 - \varepsilon) + (1 - p) \varepsilon$. Call this $q$. Then:

$$H(Y) = H_2(q), \qquad q = p(1 - \varepsilon) + (1 - p) \varepsilon.$$

Mutual information.

$$I(X; Y) = H(Y) - H(Y \mid X) = H_2(q) - H_2(\varepsilon).$$

Capacity. $H_2(q)$ is maximised at $q = 1/2$, which occurs when $p = 1/2$ (and any $\varepsilon$). So:

$$C = 1 - H_2(\varepsilon) \quad \text{(bits per channel use)}.$$

Limits:

  • $\varepsilon = 0$ (perfect channel): $C = 1$ bit.
  • $\varepsilon = 1/2$ (output independent of input): $C = 0$.
  • $\varepsilon = 1$ (bit-flipped deterministically): $C = 1$ again — just invert the receiver.
Shannon's noisy-channel coding theorem promises: any rate $R \lt C$ is achievable with vanishing error probability, and any $R \gt C$ is not. Modern codes (turbo, LDPC, polar) approach capacity in practice. In ML, this formalism underlies the information bottleneck principle for representation learning.
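A quick numerical check of the capacity claim: sweep the input bias $p$ over a grid and confirm that $I(X;Y)$ peaks at $p = 1/2$ with value $1 - H_2(\varepsilon)$ (the function names below are mine):

```python
import math

def h2(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def mutual_information(p, eps):
    """I(X;Y) for a BSC with crossover eps and input X ~ Bernoulli(p)."""
    q = p * (1 - eps) + (1 - p) * eps
    return h2(q) - h2(eps)

eps = 0.11
capacity = 1 - h2(eps)  # ≈ 0.500 bits per use for eps = 0.11
best = max(mutual_information(k / 1000, eps) for k in range(1001))
print(best, capacity)
```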
Exercise 25 · Poisson Superposition & Thinning · Stochastic Processes

Two independent Poisson processes $N_1, N_2$ on $[0, \infty)$ have rates $\lambda_1, \lambda_2$. (a) Show that the superposition $N = N_1 + N_2$ is a Poisson process and find its rate. (b) Given an event from $N$ occurs, what is the probability it came from process $1$? (c) Consider thinning: in a rate-$\lambda$ Poisson process, each event is independently kept with probability $p$. Show the kept events form a Poisson process with rate $\lambda p$.

Solution.

(a) Superposition. $N$ starts at $0$. For disjoint intervals, the increments of $N_1$ and $N_2$ are independent of each other and across intervals, so $N$ has independent increments too. For any interval of length $t$:

$$N_1(t) + N_2(t) \sim \mathrm{Poisson}(\lambda_1 t) + \mathrm{Poisson}(\lambda_2 t) \stackrel{d}{=} \mathrm{Poisson}((\lambda_1 + \lambda_2) t),$$

using the fact that independent Poissons sum to a Poisson (check via MGFs). So $N$ is Poisson with rate $\lambda_1 + \lambda_2$.

(b) Colouring. Condition on the superposition having an event in an infinitesimal interval $[t, t + dt)$: it came from process $1$ with probability $\dfrac{\lambda_1\, dt}{(\lambda_1 + \lambda_2)\, dt} = \dfrac{\lambda_1}{\lambda_1 + \lambda_2}$, and by independence of increments this holds independently for each event. Each arrival's origin is an i.i.d. categorical draw.

(c) Thinning. Let $M(t)$ be the number of kept events in $[0, t]$. Conditional on $N(t) = n$, $M(t) \sim \mathrm{Binomial}(n, p)$. Unconditioning:

$$P(M(t) = k) = \sum_{n = k}^\infty \frac{(\lambda t)^n e^{-\lambda t}}{n!} \binom{n}{k} p^k (1 - p)^{n - k}.$$

Factor out $k$-dependent terms and reindex with $m = n - k$:

$$= \frac{(\lambda t p)^k e^{-\lambda t}}{k!} \sum_{m = 0}^\infty \frac{(\lambda t (1 - p))^m}{m!} = \frac{(\lambda p t)^k e^{-\lambda p t}}{k!}.$$

That is Poisson$(\lambda p t)$. Independence of increments is inherited from $N$, so the kept process is itself Poisson with rate $\lambda p$. The discarded events form an independent Poisson process with rate $\lambda(1 - p)$.

Thinning is how credit-risk models combine defaults across a portfolio: each loan is a Bernoulli thinning of a common hazard process. Superposition is how merged order flow aggregates across venues in market-microstructure modelling.
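Superposition and thinning are both one-liners to simulate via i.i.d. exponential inter-arrival gaps. A sketch with my own helper name and parameters, checking the thinned rate $\lambda p$ empirically:

```python
import random

def poisson_process(rate, horizon, rng):
    """Arrival times on [0, horizon] generated from i.i.d. Exp(rate) gaps."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return times
        times.append(t)

rng = random.Random(7)
lam, p, horizon = 3.0, 0.4, 10_000.0
arrivals = poisson_process(lam, horizon, rng)
kept = [t for t in arrivals if rng.random() < p]  # independent thinning

print(len(arrivals) / horizon)  # ≈ lam = 3.0
print(len(kept) / horizon)      # ≈ lam * p = 1.2
```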
Exercise 26 · Exponential Family & Sufficient Statistics · Estimation

Cast the Bernoulli distribution in exponential-family form $p(x \mid \eta) = h(x)\, \exp\!\bigl(\eta\, T(x) - A(\eta)\bigr)$. Identify the natural parameter $\eta$, sufficient statistic $T(x)$, log-partition $A(\eta)$, and verify the identity $A'(\eta) = \mathbb{E}[T]$.

Solution.

Start from $p(x \mid \theta) = \theta^x (1 - \theta)^{1-x}$ for $x \in \{0, 1\}$. Take logs and re-arrange:

$$\log p(x \mid \theta) = x \log \theta + (1 - x) \log (1 - \theta) = x\, \log\!\frac{\theta}{1 - \theta} + \log(1 - \theta).$$

Identify:

  • Natural parameter: $\eta = \log \frac{\theta}{1 - \theta}$ — the logit of $\theta$.
  • Sufficient statistic: $T(x) = x$.
  • Base measure: $h(x) = 1$.
  • Log-partition: $A(\eta) = -\log(1 - \theta) = \log(1 + e^\eta)$ — the softplus.

So $p(x \mid \eta) = \exp(\eta x - \log(1 + e^\eta))$ — a compact form in which the natural parameter, sufficient statistic, and log-partition are all explicit.

The cumulant identity. For any exponential family, $A(\eta)$ is convex in $\eta$ and its derivatives generate cumulants of $T$. In particular:

$$A'(\eta) = \mathbb{E}[T(X)], \qquad A''(\eta) = \mathrm{Var}(T(X)).$$

Check for Bernoulli: $A'(\eta) = \dfrac{e^\eta}{1 + e^\eta} = \sigma(\eta)$, the logistic sigmoid. And $\sigma(\eta) = \theta = \mathbb{E}[X]$. ✓ The second derivative: $A''(\eta) = \sigma(\eta)(1 - \sigma(\eta)) = \theta(1 - \theta) = \mathrm{Var}(X)$. ✓

Consequences. For $n$ i.i.d. observations, $\sum_i T(X_i) = \sum_i X_i$ is sufficient (Fisher–Neyman factorisation). The MLE in terms of $\eta$ satisfies $A'(\hat\eta) = \bar T = \bar X$, i.e. $\hat\theta = \bar X$ — the logit of the sample mean is the MLE of the natural parameter. Fisher information in natural parameters is $A''(\eta)$.

Exponential-family structure organises most of applied statistics: linear regression (Gaussian), logistic regression (Bernoulli), Poisson regression, gamma GLMs, softmax/multinomial — all share this spine. Gradient descent on log-likelihood is elegant precisely because $A$ is convex.
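The cumulant identities admit a direct finite-difference check: the first derivative of the softplus should match the sigmoid (the Bernoulli mean) and the second should match $\theta(1 - \theta)$ (the variance). Step sizes below are my own choices:

```python
import math

def softplus(eta):
    """A(eta) = log(1 + e^eta), the Bernoulli log-partition."""
    return math.log1p(math.exp(eta))

def sigmoid(eta):
    """A'(eta): the mean of T(X) = X."""
    return 1 / (1 + math.exp(-eta))

eta, h = 0.7, 1e-5
first = (softplus(eta + h) - softplus(eta - h)) / (2 * h)
second = (softplus(eta + h) - 2 * softplus(eta) + softplus(eta - h)) / h**2

theta = sigmoid(eta)
print(first, theta)                  # A'(eta)  = E[T]
print(second, theta * (1 - theta))   # A''(eta) = Var(T)
```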
Exercise 27 · The Delta Method · Asymptotics

Let $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Exp}(\lambda)$. In Exercise 4 we showed the MLE is $\hat\lambda = 1/\bar X_n$. Derive its asymptotic distribution via the delta method, starting from the CLT for $\bar X_n$.

Solution.

Step 1 — CLT for the mean. $X_i$ has mean $1/\lambda$ and variance $1/\lambda^2$. By the CLT:

$$\sqrt{n}\!\left(\bar X_n - \frac{1}{\lambda}\right) \xrightarrow{d} \mathcal{N}\!\left(0,\ \frac{1}{\lambda^2}\right).$$

Step 2 — Delta method. Apply $g(x) = 1/x$, which is differentiable at $x = 1/\lambda$ with $g'(x) = -1/x^2$, so $g'(1/\lambda) = -\lambda^2$. The delta method gives:

$$\sqrt{n}\!\left(\frac{1}{\bar X_n} - \lambda\right) \xrightarrow{d} \mathcal{N}\!\left(0,\ [g'(1/\lambda)]^2 \cdot \frac{1}{\lambda^2}\right) = \mathcal{N}(0,\ \lambda^2).$$

This matches precisely the MLE asymptotic variance we computed in Exercise 4 from Fisher information. Two paths to the same destination: the delta method (first-order Taylor expansion plus Slutsky) and the Fisher-information formula. They agree because MLEs are asymptotically normal with variance equal to the inverse Fisher information, and Fisher information transforms covariantly under the reparameterisation $g$.

Constructing a CI. The delta-method form gives an asymptotic $(1 - \alpha)$ confidence interval:

$$\hat\lambda \pm z_{\alpha/2} \cdot \frac{\hat\lambda}{\sqrt n}.$$

We substitute the unknown $\lambda$ by $\hat\lambda$ in the standard error — a step justified by Slutsky's theorem.

When $g$ is monotone, an alternative (often tighter) interval comes from transforming a symmetric CI for $\bar X$: compute it on the natural scale, then apply $g$. This is the pivot method and often outperforms the delta-method interval in small samples.
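The prediction $\mathrm{Var}(\hat\lambda) \approx \lambda^2/n$ is easy to confirm by replicating the MLE many times (helper name and parameters are mine):

```python
import random
import statistics

def mle_draws(lam, n, reps, seed=3):
    """Replications of the MLE: lambda_hat = 1 / sample mean = n / sum(X_i)."""
    rng = random.Random(seed)
    return [n / sum(rng.expovariate(lam) for _ in range(n)) for _ in range(reps)]

lam, n = 2.0, 400
draws = mle_draws(lam, n, reps=10_000)
print(statistics.fmean(draws))     # ≈ lam (small O(1/n) bias)
print(statistics.variance(draws))  # delta method predicts lam^2 / n = 0.01
```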
Exercise 28 · Bootstrap Confidence Interval · Resampling

Given i.i.d. observations $X_1, \ldots, X_n \sim F$ with unknown mean $\mu = \mathbb{E}_F[X]$, the percentile bootstrap CI for $\mu$ is: resample with replacement $B$ times to form datasets $\mathbf{X}^{(b)}$; compute $\bar X^{(b)}$ for each; report the empirical $(\alpha/2, 1 - \alpha/2)$ quantiles of $\{\bar X^{(b)}\}_{b=1}^B$. Justify why this procedure is (approximately) a valid $(1 - \alpha)$ CI, and give one setting where it fails catastrophically.

Solution.

The plug-in principle. The unknown sampling distribution of $\bar X - \mu$ under $F$ is approximated by the conditional distribution of $\bar X^* - \bar X_n$ under the empirical $F_n$ (with $\bar X_n$ the original sample mean). As $n \to \infty$, $F_n \to F$ uniformly (Glivenko–Cantelli), so the approximation is sound for smooth functionals.

Why quantiles work. Write $\hat\theta = \bar X_n$ and $\hat\theta^* = \bar X^*$. If the distribution of $\hat\theta - \mu$ is well approximated by that of $\hat\theta^* - \hat\theta$, then for quantile $q_\alpha$ of the bootstrap distribution of $\hat\theta^* - \hat\theta$:

$$P\!\bigl(q_{\alpha/2} \le \hat\theta - \mu \le q_{1 - \alpha/2}\bigr) \approx 1 - \alpha.$$

Rearranging gives $\hat\theta - q_{1 - \alpha/2} \le \mu \le \hat\theta - q_{\alpha/2}$ — the basic bootstrap. The percentile bootstrap uses quantiles of $\hat\theta^*$ directly; the two agree when the distribution is symmetric around $\hat\theta$.

Edgeworth refinement. One-sided studentised (bootstrap-$t$) intervals have $O(1/n)$ coverage error, one order better than the normal approximation's $O(1/\sqrt{n})$, because the bootstrap captures skewness automatically. This is one reason it is the default in many applied fields.

Where it fails.

  • Extremes. For $\hat\theta = \max X_i$, the bootstrap is inconsistent: resamples can never exceed the observed maximum, so bootstrap variance is systematically wrong. Subsampling or the $m$-out-of-$n$ bootstrap fixes this.
  • Heavy tails. If $\mathbb{E}[X^2] = \infty$, the CLT itself fails, and so does the bootstrap for the sample mean.
  • Dependent data. Naïve bootstrap destroys temporal or spatial correlation. Use the block bootstrap for time series.
In ML, bootstrapping provides model uncertainty for test-set metrics (AUC, accuracy) without strong parametric assumptions, and bagging — a foundational ensemble technique — is the bootstrap applied to learners.
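The percentile procedure described in the statement fits in a few lines of stdlib Python (the function name and the skewed test data are mine):

```python
import random
import statistics

def percentile_bootstrap_ci(data, b=2000, alpha=0.10, seed=5):
    """Percentile bootstrap CI for the mean: quantiles of resampled means."""
    rng = random.Random(seed)
    n = len(data)
    means = sorted(statistics.fmean(rng.choices(data, k=n)) for _ in range(b))
    return means[int((alpha / 2) * b)], means[int((1 - alpha / 2) * b) - 1]

rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(200)]  # skewed; true mean is 1
lo, hi = percentile_bootstrap_ci(data)
print(lo, hi)  # a 90% interval straddling the sample mean
```

Swapping `statistics.fmean` for `max` reproduces the extreme-value failure mode: no resample can exceed the observed maximum.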
Exercise 29 · EM for a Gaussian Mixture · ML

A two-component Gaussian mixture has density $p(x) = \pi\, \mathcal{N}(x; \mu_1, 1) + (1 - \pi)\, \mathcal{N}(x; \mu_2, 1)$ with known variances ($=1$) and mixing weight $\pi$. Given i.i.d.\ data $x_1, \ldots, x_n$, derive the EM updates for $(\mu_1, \mu_2)$.

Solution.

Introduce latent variables $z_i \in \{1, 2\}$ indicating which component generated $x_i$. If we knew $z_i$, the complete-data log-likelihood would be trivially separable into two Gaussian likelihoods, each giving $\mu_k = $ mean of points assigned to component $k$.

E-step. Given current parameters, compute the posterior responsibility:

$$\gamma_i = P(z_i = 1 \mid x_i) = \frac{\pi\, \mathcal{N}(x_i; \mu_1, 1)}{\pi\, \mathcal{N}(x_i; \mu_1, 1) + (1 - \pi)\, \mathcal{N}(x_i; \mu_2, 1)}.$$

This is the soft assignment.

M-step. Maximise the expected complete-data log-likelihood:

$$Q(\mu_1, \mu_2) = \sum_i \gamma_i \left[-\tfrac{1}{2}(x_i - \mu_1)^2\right] + \sum_i (1 - \gamma_i) \left[-\tfrac{1}{2}(x_i - \mu_2)^2\right] + \text{const}.$$

Setting $\partial Q/\partial \mu_k = 0$:

$$\mu_1^{\text{new}} = \frac{\sum_i \gamma_i\, x_i}{\sum_i \gamma_i}, \qquad \mu_2^{\text{new}} = \frac{\sum_i (1 - \gamma_i)\, x_i}{\sum_i (1 - \gamma_i)}.$$

Each mean is a responsibility-weighted average — a soft version of $k$-means.

Why EM works. Let $\ell(\theta) = \log p(x \mid \theta)$ be the observed-data log-likelihood. For any $q(z)$:

$$\ell(\theta) = \underbrace{\mathbb{E}_q[\log p(x, z \mid \theta)] - \mathbb{E}_q[\log q(z)]}_{\text{ELBO}} + \underbrace{D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x, \theta))}_{\ge 0}.$$

E-step minimises the KL term by setting $q(z) = p(z \mid x, \theta^{\text{old}})$ — the ELBO becomes tight. M-step maximises the ELBO in $\theta$. Both steps increase (or preserve) $\ell$, so EM is monotone — it cannot decrease the likelihood, and converges to a stationary point (not necessarily the global maximum; initialisation matters).
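The two updates translate directly into code. A minimal sketch of this exercise's setting (unit variances, $\pi$ held fixed; the function name, initialisation, and test data are mine):

```python
import math
import random

def em_two_gaussians(xs, pi=0.5, mu1=-1.0, mu2=1.0, iters=50):
    """EM for pi*N(mu1,1) + (1-pi)*N(mu2,1); pi fixed, means updated."""
    for _ in range(iters):
        # E-step: responsibilities gamma_i = P(z_i = 1 | x_i)
        gammas = []
        for x in xs:
            a = pi * math.exp(-0.5 * (x - mu1) ** 2)
            b = (1 - pi) * math.exp(-0.5 * (x - mu2) ** 2)
            gammas.append(a / (a + b))
        # M-step: responsibility-weighted means
        g = sum(gammas)
        mu1 = sum(gi * x for gi, x in zip(gammas, xs)) / g
        mu2 = sum((1 - gi) * x for gi, x in zip(gammas, xs)) / (len(xs) - g)
    return mu1, mu2

rng = random.Random(11)
xs = [rng.gauss(-2.0, 1) if rng.random() < 0.5 else rng.gauss(2.0, 1)
      for _ in range(2000)]
m1, m2 = em_two_gaussians(xs)
print(m1, m2)  # ≈ -2 and ≈ 2
```

Replacing each $\gamma_i$ with its hard $0/1$ rounding recovers $k$-means with two centroids.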

Exercise 30 · The Variational Lower Bound · Variational Inference

Given a latent-variable model $p(x, z \mid \theta)$ with marginal likelihood $p(x \mid \theta) = \int p(x, z \mid \theta)\, dz$, and any distribution $q(z)$ on the latents, derive the evidence lower bound (ELBO) on $\log p(x \mid \theta)$. Characterise exactly when the bound is tight.

Solution.

Derivation via Jensen. Multiply and divide by $q(z)$ inside the integral, then apply Jensen's inequality (log is concave):

$$\log p(x) = \log \int q(z)\, \frac{p(x, z)}{q(z)}\, dz \ge \int q(z) \log \frac{p(x, z)}{q(z)}\, dz =: \mathcal{L}(q, \theta).$$

The right-hand side is the ELBO. Equivalently:

$$\mathcal{L}(q, \theta) = \mathbb{E}_q[\log p(x \mid z, \theta)] - D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid \theta)).$$

The first term is a reconstruction-like likelihood under the latent posterior $q$; the second penalises departure of $q$ from the prior on $z$. This decomposition is what the VAE optimises.

The gap. A short manipulation:

$$\log p(x) - \mathcal{L}(q, \theta) = \int q(z)\, \log \frac{q(z)}{p(z \mid x, \theta)}\, dz = D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x, \theta)).$$

So the bound is tight if and only if $q(z) = p(z \mid x, \theta)$ — i.e. $q$ is the true posterior. When the true posterior is intractable, we restrict $q$ to a tractable family and minimise the KL, equivalently maximising the ELBO.

Two classical regimes.

  • EM (Exercise 29): we can compute $p(z \mid x, \theta)$ exactly. Set $q = p(z \mid x, \theta^{\text{old}})$ (E-step), then maximise $\mathcal{L}$ in $\theta$ (M-step). Alternating on a tight bound.
  • Variational Bayes / VAEs: parametrise $q_\phi(z \mid x)$ by a neural network and jointly maximise $\mathcal{L}$ in both $\theta$ and $\phi$ via stochastic gradients (the reparameterisation trick makes this differentiable).
The same decomposition of $\log p(x)$ into (lower bound) $+$ (KL gap) underlies modern diffusion models, normalising flows, contrastive learning ($\log$ partition approximations), and even large-language-model training objectives where tractable surrogates stand in for intractable maximum-likelihood problems. Variational reasoning is one of the most productive ideas modern ML has inherited from classical statistics.
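The decomposition $\log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}(q \,\|\, p(z \mid x))$ can be verified exactly on a toy model with one binary latent, where every quantity is a finite sum (the joint probabilities below are made-up numbers):

```python
import math

# Tiny latent-variable model: z in {0, 1}, joint p(x, z) for one observed x.
p_joint = {0: 0.12, 1: 0.28}
p_x = sum(p_joint.values())                       # evidence p(x) = 0.4
posterior = {z: pz / p_x for z, pz in p_joint.items()}

def elbo(q):
    """E_q[log p(x, z)] - E_q[log q(z)]."""
    return sum(q[z] * math.log(p_joint[z] / q[z]) for z in q if q[z] > 0)

def kl(q, p):
    """D_KL(q || p) over the binary latent."""
    return sum(q[z] * math.log(q[z] / p[z]) for z in q if q[z] > 0)

q = {0: 0.5, 1: 0.5}  # an arbitrary variational distribution
# Identity: log p(x) = ELBO(q) + KL(q || posterior); tight at q = posterior.
print(math.log(p_x), elbo(q) + kl(q, posterior), elbo(posterior))
```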