The Cheatsheet
§ I. Foundations of Probability
A probability space is a triple $(\Omega, \mathcal{F}, P)$: a sample space $\Omega$ of outcomes, a $\sigma$-algebra $\mathcal{F}$ of admissible events, and a measure $P$ assigning each event a number in $[0, 1]$. The structure looks abstract but it buys us two things: a rigorous notion of "event" (so we can talk about countable unions without contradictions) and the ability to condition cleanly.
Axioms
- $P(A) \ge 0$ for every $A \in \mathcal{F}$.
- $P(\Omega) = 1$.
- For disjoint $A_1, A_2, \ldots$: $P\!\left(\bigcup_i A_i\right) = \sum_i P(A_i)$.
Conditioning and independence
When $P(B) \gt 0$, $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$. Two events are independent iff $P(A \cap B) = P(A)\, P(B)$, equivalently $P(A \mid B) = P(A)$. Independence is an assumption — never an observation.
The two workhorses
- Law of total probability
- For a partition $\{B_i\}$ of $\Omega$: $\ P(A) = \sum_i P(A \mid B_i)\, P(B_i)$.
- Bayes' theorem
- $P(B \mid A) = \dfrac{P(A \mid B)\, P(B)}{P(A)} = \dfrac{P(A \mid B)\, P(B)}{\sum_i P(A \mid B_i)\, P(B_i)}$.
- Chain rule
- $P(A_1 \cap \cdots \cap A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$.
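The three workhorses compose mechanically. A minimal numeric sketch, using a hypothetical spam-filter example (all rates below are made up for illustration):

```python
# Hypothetical rates: a message is spam with prior 0.2;
# the word "free" appears in 60% of spam and 5% of ham.
p_spam = 0.2
p_word_given_spam = 0.60
p_word_given_ham = 0.05

# Law of total probability over the partition {spam, ham}:
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: posterior probability of spam given the word.
p_spam_given_word = p_word_given_spam * p_spam / p_word   # ~0.75
```

The denominator is exactly the law of total probability; Bayes' theorem is nothing more than the definition of conditioning applied twice.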
§ II. Random Variables
A random variable $X: \Omega \to \mathbb{R}$ is a measurable function. In practice we never care about $\Omega$ itself — we work with the induced distribution of $X$ through its CDF $F_X(x) = P(X \le x)$, which determines every probabilistic statement about $X$.
| Quantity | Discrete | Continuous |
|---|---|---|
| Probability function | PMF $p(x) = P(X = x)$ | PDF $f(x)$, with $P(X \in A) = \int_A f$ |
| Normalization | $\sum_x p(x) = 1$ | $\int f(x)\, dx = 1$ |
| Expectation | $\mathbb{E}[X] = \sum_x x\, p(x)$ | $\mathbb{E}[X] = \int x\, f(x)\, dx$ |
| LOTUS | $\mathbb{E}[g(X)] = \sum_x g(x)\, p(x)$ | $\mathbb{E}[g(X)] = \int g(x)\, f(x)\, dx$ |
Moments and the moment-generating function
The $k$-th moment is $\mu_k = \mathbb{E}[X^k]$; the variance $\sigma^2 = \mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. The moment-generating function $M_X(t) = \mathbb{E}[e^{tX}]$, when it exists in a neighborhood of $0$, determines the distribution uniquely and satisfies $M_X^{(k)}(0) = \mu_k$. For sums of independents, $M_{X+Y}(t) = M_X(t)\, M_Y(t)$.
Core identities
- Linearity
- $\mathbb{E}[aX + bY + c] = a\, \mathbb{E}[X] + b\, \mathbb{E}[Y] + c$ — always, no independence needed.
- Variance of an affine map
- $\mathrm{Var}(aX + b) = a^2\, \mathrm{Var}(X)$.
- Variance of a sum
- $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.
- Change of variables
- If $Y = g(X)$ with $g$ smooth and monotone, $f_Y(y) = f_X(g^{-1}(y))\, \bigl|\tfrac{d}{dy} g^{-1}(y)\bigr|$.
§ III. A Menagerie of Distributions
Most of applied probability is the fluent recognition of a handful of distributional shapes. The table below lists the ones worth knowing by heart — their parameters, first two moments, and the situations in which they naturally arise.
| Name | PMF | Mean | Variance | Arises as |
|---|---|---|---|---|
| Bernoulli$(p)$ | $p^x (1-p)^{1-x},\ x\in\{0,1\}$ | $p$ | $p(1-p)$ | A single coin flip |
| Binomial$(n, p)$ | $\binom{n}{x} p^x (1-p)^{n-x}$ | $np$ | $np(1-p)$ | Sum of $n$ Bernoullis |
| Geometric$(p)$ | $(1-p)^{x-1} p,\ x\ge 1$ | $1/p$ | $(1-p)/p^2$ | Trials until first success |
| Poisson$(\lambda)$ | $\dfrac{\lambda^x e^{-\lambda}}{x!}$ | $\lambda$ | $\lambda$ | Rare events in fixed interval |
| Negative Binomial$(r, p)$ | $\binom{x-1}{r-1} p^r (1-p)^{x-r}$ | $r/p$ | $r(1-p)/p^2$ | Trials until $r$-th success |
| Name | PDF | Mean | Variance | Arises as |
|---|---|---|---|---|
| Uniform$(a, b)$ | $\dfrac{1}{b-a}$ | $\tfrac{a+b}{2}$ | $\tfrac{(b-a)^2}{12}$ | "No information" on an interval |
| Normal$(\mu, \sigma^2)$ | $\dfrac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ | Sums of many small effects (CLT) |
| Exponential$(\lambda)$ | $\lambda e^{-\lambda x},\ x \ge 0$ | $1/\lambda$ | $1/\lambda^2$ | Memoryless waiting time |
| Gamma$(k, \theta)$ | $\dfrac{x^{k-1} e^{-x/\theta}}{\theta^k \Gamma(k)}$ | $k\theta$ | $k\theta^2$ | Sum of $k$ Exp$(1/\theta)$s |
| Beta$(\alpha, \beta)$ | $\dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ | $\tfrac{\alpha}{\alpha+\beta}$ | $\tfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | Conjugate prior for Bernoulli |
| Chi-squared $\chi^2_k$ | Gamma$(k/2, 2)$ density | $k$ | $2k$ | $\sum_{i=1}^k Z_i^2$ with $Z_i \sim N(0,1)$ |
| Student-$t_\nu$ | $\propto \left(1 + \tfrac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$ | $0$ ($\nu\gt 1$) | $\tfrac{\nu}{\nu-2}$ ($\nu\gt 2$) | Heavy-tailed $N/\sqrt{\chi^2/\nu}$ |
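A quick way to internalise the tables is to simulate against them. The sketch below (parameters chosen arbitrarily) draws Geometric$(p)$ variates by running Bernoulli trials until the first success, then checks the tabulated mean $1/p$ and variance $(1-p)/p^2$:

```python
import random
import statistics

random.seed(42)
p = 0.3
samples = []
for _ in range(200_000):
    k = 1
    while random.random() >= p:   # failure with probability 1 - p
        k += 1
    samples.append(k)             # trials until the first success

m = statistics.fmean(samples)     # table says 1/p ~ 3.333
v = statistics.pvariance(samples) # table says (1-p)/p^2 ~ 7.778
```

The same pattern, simulate then compare moments, works as a sanity check for every row of the tables.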
Multivariate Normal
$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with density $f(\mathbf{x}) = (2\pi)^{-d/2}|\boldsymbol{\Sigma}|^{-1/2} \exp\!\bigl(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\!\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\bigr)$. Two key facts carry most of the weight in ML:
- Affine closure. If $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $A\mathbf{X} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b},\, A\boldsymbol{\Sigma} A^{\!\top})$.
- Marginals & conditionals are Gaussian — closed forms in §4.
§ IV. Joint, Marginal, Conditional
For a pair $(X, Y)$ the joint density $f_{X,Y}(x, y)$ generates the marginals by integration and the conditionals by division:
$$f_X(x) = \int f_{X,Y}(x, y)\, dy, \qquad f_{Y \mid X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$
$X$ and $Y$ are independent iff $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$.
Covariance and correlation
- Covariance
- $\mathrm{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mu_X \mu_Y$.
- Correlation
- $\rho(X, Y) = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y) \in [-1, 1]$.
- Bilinearity
- $\mathrm{Cov}(aX + b,\, cY + d) = ac\, \mathrm{Cov}(X, Y)$.
- Covariance matrix
- $\boldsymbol{\Sigma}_{ij} = \mathrm{Cov}(X_i, X_j)$; always symmetric positive semidefinite.
Correlation measures linear dependence only. Independent $\Rightarrow$ uncorrelated; the converse fails in general (jointly Gaussian vectors are the notable exception).
Conditional expectation
$\mathbb{E}[Y \mid X]$ is itself a random variable — a function of $X$. Two identities carry most of the analytic weight:
- Tower / Adam's law
- $\mathbb{E}[Y] = \mathbb{E}\!\bigl[\mathbb{E}[Y \mid X]\bigr]$.
- Eve's law (total variance)
- $\mathrm{Var}(Y) = \mathbb{E}\!\bigl[\mathrm{Var}(Y \mid X)\bigr] + \mathrm{Var}\!\bigl(\mathbb{E}[Y \mid X]\bigr)$.
The second decomposition is a decomposition of uncertainty: noise within each conditional world, plus variation across worlds.
Gaussian conditionals
Partition $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2) \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with the obvious block structure. Then:
$$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}\!\bigl(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\bigr).$$

§ V. Inequalities & Concentration
Inequalities are the skeleton on which proofs in learning theory hang. The four below cover the vast majority of everyday use; the remaining two are the tools of the high-dimensional trade.
- Markov
- For $X \ge 0$ and $a \gt 0$: $P(X \ge a) \le \mathbb{E}[X] / a$.
- Chebyshev
- $P(|X - \mu| \ge k\sigma) \le 1/k^2$.
- Jensen
- If $\varphi$ is convex, $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$.
- Cauchy–Schwarz
- $(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2]\, \mathbb{E}[Y^2]$.
High-dimensional tools
- Hoeffding
- For independent $X_i \in [a_i, b_i]$ and $S = \sum X_i$: $P\bigl(|S - \mathbb{E}S| \ge t\bigr) \le 2 \exp\!\Bigl(\!-\dfrac{2t^2}{\sum_i (b_i - a_i)^2}\Bigr).$
- Chernoff
- $P(S \ge t) \le \inf_{\lambda \gt 0} e^{-\lambda t} M_S(\lambda)$ — the mother of all exponential tail bounds.
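The bounds are easy to see live. A small simulation (sample sizes arbitrary) compares the empirical tail of a sum of uniforms against the Chebyshev and Hoeffding bounds; both hold, and both are loose, which is typical:

```python
import random
import math

random.seed(1)
n, trials, t = 100, 20_000, 10.0
exceed = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))  # X_i ~ Uniform[0,1], E[S] = n/2
    if abs(s - n / 2) >= t:
        exceed += 1

empirical = exceed / trials
chebyshev = (n / 12) / t**2              # Var(S) = n/12, Chebyshev on |S - ES|
hoeffding = 2 * math.exp(-2 * t**2 / n)  # sum_i (b_i - a_i)^2 = n here
```

The true tail is tiny (the CLT puts $t = 10$ at about $3.5$ standard deviations), so the bounds overshoot by orders of magnitude, but they require no distributional knowledge beyond boundedness.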
§ VI. Convergence & Limit Theorems
Four modes of stochastic convergence, from strongest to weakest:
- Almost sure
- $X_n \xrightarrow{\text{a.s.}} X$ iff $P(\lim_n X_n = X) = 1$.
- In probability
- $X_n \xrightarrow{P} X$ iff $P(|X_n - X| \gt \varepsilon) \to 0$ for every $\varepsilon \gt 0$.
- In $L^p$
- $X_n \xrightarrow{L^p} X$ iff $\mathbb{E}[|X_n - X|^p] \to 0$.
- In distribution
- $X_n \xrightarrow{d} X$ iff $F_{X_n}(x) \to F_X(x)$ at every continuity point of $F_X$.
The implications: a.s. $\Rightarrow$ in probability $\Rightarrow$ in distribution. $L^p$ convergence for $p \ge 1$ implies convergence in probability.
Laws of large numbers
For i.i.d. $X_1, X_2, \ldots$ with finite mean $\mu$, let $\bar{X}_n = \tfrac{1}{n} \sum_{i=1}^n X_i$. Then:
- Weak LLN. $\bar{X}_n \xrightarrow{P} \mu$.
- Strong LLN. $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$.
Central Limit Theorem
For i.i.d. $X_i$ with mean $\mu$ and finite variance $\sigma^2$:
$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0,\, \sigma^2).$$
Equivalently, $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2 / n)$ for large $n$. The CLT is the reason Gaussian noise is everywhere: averaging destroys the fingerprints of the underlying distribution.
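A simulation makes the statement concrete: standardised means of exponentials (a deliberately skewed choice, parameters arbitrary) land inside $\pm 1.96$ about $95\%$ of the time even at moderate $n$:

```python
import random
import math

random.seed(7)
lam, n, reps = 1.0, 50, 20_000
inside = 0
for _ in range(reps):
    xbar = sum(random.expovariate(lam) for _ in range(n)) / n
    # Standardise: Exp(lam) has mean 1/lam and sd 1/lam.
    z = math.sqrt(n) * (xbar - 1 / lam) * lam
    if abs(z) <= 1.96:
        inside += 1

coverage = inside / reps  # should be close to 0.95
```

The first-order skewness error cancels in a symmetric interval, which is why the coverage is so good despite $n = 50$ draws from a distribution with skewness $2$.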
Delta method
If $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \ne 0$, then:
$$\sqrt{n}\bigl(g(\hat\theta_n) - g(\theta)\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\, \bigl[g'(\theta)\bigr]^2 \sigma^2\bigr).$$

Slutsky
If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ (constant), then $X_n + Y_n \xrightarrow{d} X + c$ and $X_n Y_n \xrightarrow{d} cX$. Essential for constructing asymptotic confidence intervals when nuisance parameters are estimated.
§ VII. Estimation
An estimator $\hat\theta = T(X_1, \ldots, X_n)$ is a function of the data intended to approximate a parameter $\theta$. We judge it by:
- Bias
- $\mathrm{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta$.
- Variance
- $\mathrm{Var}(\hat\theta) = \mathbb{E}[(\hat\theta - \mathbb{E}[\hat\theta])^2]$.
- Mean squared error
- $\mathrm{MSE}(\hat\theta) = \mathbb{E}[(\hat\theta - \theta)^2] = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta)$.
- Consistency
- $\hat\theta_n \xrightarrow{P} \theta$.
- Efficiency
- Attains the Cramér–Rao lower bound asymptotically.
Maximum likelihood
Given $X_1, \ldots, X_n \sim p(\cdot \mid \theta)$, the log-likelihood is $\ell(\theta) = \sum_{i=1}^n \log p(X_i \mid \theta)$. The MLE is $\hat\theta_{\text{MLE}} = \arg\max_\theta\, \ell(\theta)$. Under regularity conditions:
$$\sqrt{n}\bigl(\hat\theta_{\text{MLE}} - \theta_0\bigr) \xrightarrow{d} \mathcal{N}\!\bigl(0,\, \mathcal{I}(\theta_0)^{-1}\bigr),$$
where $\mathcal{I}(\theta) = -\mathbb{E}\!\left[\dfrac{\partial^2 \log p(X \mid \theta)}{\partial \theta^2}\right]$ is the Fisher information per observation. The MLE is asymptotically unbiased, consistent, and efficient.
Cramér–Rao bound
For any unbiased estimator $\hat\theta$ of $\theta$:
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\,\mathcal{I}(\theta)}.$$
A universal floor on the variance — no cleverness recovers information that the likelihood didn't contain.
Method of moments
Set sample moments equal to theoretical moments and solve for $\theta$. Simpler than MLE, generally less efficient, but always a sanity check.
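For a concrete instance of the whole estimation toolkit, take the Bernoulli model: the MLE is the sample mean, its bias is zero, and its variance should sit at the Cramér–Rao floor $p(1-p)/n$ since $\mathcal{I}(p) = 1/(p(1-p))$. A sketch with arbitrary parameters:

```python
import random
import statistics

random.seed(3)
p_true, n, reps = 0.3, 400, 5_000
estimates = []
for _ in range(reps):
    heads = sum(1 for _ in range(n) if random.random() < p_true)
    estimates.append(heads / n)   # MLE for Bernoulli(p) is the sample mean

bias = statistics.fmean(estimates) - p_true     # should be ~0
crlb = p_true * (1 - p_true) / n                # Cramér–Rao lower bound
spread = statistics.pvariance(estimates)        # should match the bound
```

Here the MLE attains the bound exactly at finite $n$ because the sample mean of an exponential family's sufficient statistic is efficient, not merely asymptotically efficient.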
§ VIII. Hypothesis Testing & Intervals
Anatomy of a test
A null hypothesis $H_0$ and an alternative $H_1$. A test statistic $T$ and a rejection region $R$. Two species of error:
- Type I ($\alpha$): reject $H_0$ when it is true. The significance level.
- Type II ($\beta$): fail to reject $H_0$ when $H_1$ is true. Power is $1 - \beta$.
The $p$-value is the probability, under $H_0$, of observing a statistic at least as extreme as the one observed. It is not the probability that $H_0$ is true — a distinction that has broken more inferences than any other.
The canonical tests
- Z-test
- Variance known: $Z = (\bar X - \mu_0)/(\sigma/\sqrt n) \sim \mathcal{N}(0,1)$ under $H_0$.
- t-test
- Variance estimated: $T = (\bar X - \mu_0)/(S/\sqrt n) \sim t_{n-1}$ under $H_0$.
- Likelihood ratio
- $\Lambda = \sup_{\theta \in \Theta_0} L(\theta) / \sup_{\theta \in \Theta} L(\theta)$. Reject for small $\Lambda$. Wilks: $-2 \log \Lambda \xrightarrow{d} \chi^2_k$ under $H_0$.
The Neyman–Pearson lemma tells us that for simple-versus-simple testing, the likelihood ratio test is the most powerful test at its significance level.
Confidence intervals
A $100(1-\alpha)\%$ confidence interval $[L(X), U(X)]$ for $\theta$ satisfies $P_\theta(L \le \theta \le U) \ge 1 - \alpha$ for all $\theta$. The interval is random; $\theta$ is fixed.
For $\bar X$ from $\mathcal{N}(\mu, \sigma^2)$ (or large $n$ via CLT):
$$\bar X \pm z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}} \qquad \text{or} \qquad \bar X \pm t_{n-1,\, \alpha/2}\, \frac{S}{\sqrt{n}}.$$

§ IX. Bayesian Inference
The Bayesian views $\theta$ as random, encoding belief via a prior $\pi(\theta)$. Data $X$ updates it through the likelihood $p(X \mid \theta)$ into a posterior:
$$\pi(\theta \mid X) = \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'} \propto p(X \mid \theta)\, \pi(\theta).$$

Point estimates from the posterior
- Posterior mean. $\hat\theta = \mathbb{E}[\theta \mid X]$ — minimises posterior MSE.
- Posterior median. Minimises posterior absolute error.
- MAP. $\hat\theta_{\text{MAP}} = \arg\max_\theta\, \pi(\theta \mid X)$ — minimises posterior $0$–$1$ loss.
Conjugate priors — worth knowing cold
| Likelihood | Prior | Posterior |
|---|---|---|
| Bernoulli / Binomial | Beta$(\alpha, \beta)$ | Beta$(\alpha + \sum x_i,\ \beta + n - \sum x_i)$ |
| Poisson | Gamma$(\alpha, \beta)$ | Gamma$(\alpha + \sum x_i,\ \beta + n)$ |
| Normal ($\sigma^2$ known) | Normal$(\mu_0, \tau_0^2)$ | Normal, precision-weighted mean of $\mu_0$ and $\bar x$ |
| Normal ($\mu$ known) | Inverse-Gamma$(\alpha, \beta)$ | Inverse-Gamma$(\alpha + n/2,\ \beta + \tfrac{1}{2}\sum (x_i - \mu)^2)$ |
| Multinomial | Dirichlet$(\boldsymbol{\alpha})$ | Dirichlet$(\boldsymbol{\alpha} + \mathbf{n})$ |
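The first row of the table, executed. Conjugacy means the update is pure bookkeeping on pseudo-counts (the prior and data below are hypothetical):

```python
# Beta(a, b) prior on a coin's bias; observe Bernoulli flips.
a0, b0 = 2.0, 2.0        # hypothetical prior pseudo-counts (2 heads, 2 tails)
heads, tails = 7, 3      # hypothetical data: 10 flips

# Posterior from the table: Beta(alpha + sum x_i, beta + n - sum x_i).
a_post, b_post = a0 + heads, b0 + tails

posterior_mean = a_post / (a_post + b_post)  # 9/14, shrunk toward the prior
mle = heads / (heads + tails)                # 0.7, the data-only answer
```

The posterior mean is a precision-weighted compromise between the prior mean ($1/2$) and the MLE; as data accumulate, the pseudo-counts are swamped and the two agree.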
Credible intervals and predictives
A $(1-\alpha)$ credible interval is any interval of posterior mass $1 - \alpha$. Interpretively cleaner than a confidence interval — here $\theta$ really is the random thing.
The posterior predictive for new data $\tilde x$ is:
$$p(\tilde x \mid X) = \int p(\tilde x \mid \theta)\, \pi(\theta \mid X)\, d\theta,$$
which averages over parameter uncertainty instead of collapsing it to a point.
§ X. Information Theory
Modern machine learning is not merely influenced by information theory — it often is information theory in a different hat. Loss functions are divergences; training is a minimum description length problem; attention is mutual information, softly.
Entropy
The Shannon entropy of a discrete $X$ with PMF $p$ is:
$$H(X) = -\sum_x p(x)\, \log p(x).$$
It is the expected surprise, in nats (base $e$) or bits (base $2$). Maximised by the uniform distribution; zero when $X$ is deterministic. For continuous $X$ with PDF $f$, the differential entropy $h(X) = -\int f(x) \log f(x)\, dx$ — useful, but not an entropy in the strict sense (it can be negative).
Joint, conditional, mutual
- Joint
- $H(X, Y) = -\sum p(x, y) \log p(x, y)$.
- Conditional
- $H(Y \mid X) = H(X, Y) - H(X) = \mathbb{E}_X\!\bigl[H(Y \mid X = x)\bigr]$.
- Mutual information
- $I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y \mid X)$.
$I(X; Y) \ge 0$, with equality iff $X \perp Y$. It is the canonical measure of statistical dependence beyond linearity.
KL divergence and cross-entropy
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x)\, \log \frac{p(x)}{q(x)}.$$
Not a metric — asymmetric and does not satisfy triangle inequality — but non-negative (Gibbs' inequality) and zero iff $p = q$ a.e. The cross-entropy $H(p, q) = -\sum p(x) \log q(x)$ satisfies $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$.
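The identity $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ is worth verifying once by hand; a three-outcome sketch in nats (distributions chosen arbitrarily):

```python
import math

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

entropy = -sum(pi * math.log(pi) for pi in p)                  # H(p)
cross = -sum(pi * math.log(qi) for pi, qi in zip(p, q))        # H(p, q)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))       # D_KL(p || q)
# Check: cross-entropy = entropy + KL divergence, and KL >= 0 (Gibbs).
```

This is why minimising cross-entropy loss against fixed labels is the same optimisation as minimising KL divergence: the $H(p)$ term is a constant in $q$.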
The chain of inequalities
$0 \le H(Y \mid X) \le H(Y) \le \log |\mathcal{Y}|$, and the data-processing inequality: if $X \to Y \to Z$ is a Markov chain, $I(X; Z) \le I(X; Y)$. Processing cannot create information.
§ XI. Linear Models & Bias–Variance
Linear regression models $Y = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon}$ mean-zero and variance $\sigma^2 I$. The ordinary least squares estimator:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\!\top} \mathbf{X})^{-1} \mathbf{X}^{\!\top} \mathbf{Y}$$
It is unbiased with covariance $\sigma^2 (\mathbf{X}^{\!\top} \mathbf{X})^{-1}$. The Gauss–Markov theorem asserts it is BLUE — the best linear unbiased estimator — under the stated conditions.
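In the single-covariate case the normal equations collapse to scalar formulas. The sketch below (synthetic data, arbitrary coefficients) also checks the defining geometric property of OLS: residuals are orthogonal to the design, i.e. $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$.

```python
import random
import statistics

random.seed(5)
beta0, beta1, sigma, n = 1.0, 2.0, 0.5, 500
xs = [random.uniform(0, 1) for _ in range(n)]
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]

# Closed-form OLS for one covariate (the normal equations in scalar form):
xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
```

Orthogonality holds to floating-point precision by construction; recovering $\beta_1$ itself is only approximate, with sampling error $\sigma/\sqrt{S_{xx}}$.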
Regularised regression
- Ridge
- $\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min\,\|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_2^2 = (\mathbf{X}^{\!\top} \mathbf{X} + \lambda I)^{-1} \mathbf{X}^{\!\top} \mathbf{Y}$.
- Lasso
- $\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min\,\|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1$ — sparsity-inducing.
Both are MAP estimators under Gaussian (ridge) or Laplace (lasso) priors on $\boldsymbol{\beta}$. See Exercise 5.
The bias–variance decomposition
For a fixed test point $x_0$, squared-error risk of an estimator $\hat f$ splits as:
$$\mathbb{E}\!\left[\bigl(Y - \hat f(x_0)\bigr)^2\right] = \underbrace{\sigma^2}_{\text{irreducible}} + \underbrace{\bigl(\mathbb{E}[\hat f(x_0)] - f(x_0)\bigr)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}(\hat f(x_0))}_{\text{variance}}.$$
No estimator eliminates the first term; complexity trades the second for the third.
Generalisation risk
Population risk $R(f) = \mathbb{E}_{(X, Y)}[\ell(f(X), Y)]$; empirical risk $\hat R(f) = \tfrac{1}{n} \sum \ell(f(X_i), Y_i)$. Training error underestimates test error systematically; the gap is bounded (with high probability) by the complexity of the function class — VC dimension, Rademacher complexity, metric-entropy arguments.
§ XII. Stochastic Processes
Markov chains
A sequence $X_0, X_1, \ldots$ with the Markov property: $P(X_{n+1} \mid X_n, X_{n-1}, \ldots) = P(X_{n+1} \mid X_n)$. For finite state space, the dynamics live in a transition matrix $P$ with $P_{ij} = P(X_{n+1} = j \mid X_n = i)$.
A distribution $\boldsymbol{\pi}$ is stationary if $\boldsymbol{\pi} = \boldsymbol{\pi} P$. An irreducible aperiodic finite chain has a unique stationary distribution, and $P^n \to \mathbf{1} \boldsymbol{\pi}$ — the ergodic theorem.
Detailed balance: if $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$, then $\boldsymbol{\pi}$ is stationary. The starting point for MCMC: design $P$ so detailed balance holds for a target $\boldsymbol{\pi}$.
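A minimal sketch with a hypothetical two-state chain: power iteration finds the stationary distribution, and detailed balance can be checked directly (here it holds, so the chain is reversible).

```python
# Hypothetical transition matrix of a 2-state chain (rows sum to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [0.5, 0.5]
for _ in range(200):  # power iteration: pi <- pi P
    pi = [pi[0] * P[0][j] + pi[1] * P[1][j] for j in range(2)]

# Analytic stationary distribution for this chain: (5/6, 1/6).
# Detailed balance: pi_0 * P[0][1] == pi_1 * P[1][0]  (both 1/12 here).
```

Convergence is geometric at the rate of the second eigenvalue ($0.4$ for this chain), which is the quantitative content of the ergodic statement $P^n \to \mathbf{1}\boldsymbol{\pi}$.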
Poisson process
A counting process $N(t)$ with:
- $N(0) = 0$;
- independent increments;
- $N(t + s) - N(s) \sim$ Poisson$(\lambda t)$.
Equivalently, inter-arrival times are i.i.d. Exp$(\lambda)$. The memorylessness of the exponential is what makes the Poisson process analytically tractable — and a near-universal model for rare independent events.
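Simulating from the inter-arrival characterisation and counting events recovers the Poisson marginal, including its signature mean-equals-variance property (rate and horizon chosen arbitrarily):

```python
import random
import statistics

random.seed(11)
lam, t_end, reps = 2.0, 10.0, 4_000
counts = []
for _ in range(reps):
    t, n = 0.0, 0
    while True:
        t += random.expovariate(lam)  # i.i.d. Exp(lam) inter-arrival times
        if t > t_end:
            break
        n += 1
    counts.append(n)

mean_count = statistics.fmean(counts)     # should approach lam * t_end = 20
var_count = statistics.pvariance(counts)  # Poisson: variance equals mean
```

That the two constructions (Poisson increments versus exponential gaps) agree is exactly the equivalence stated above.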
Brownian motion
A process $W_t$ with $W_0 = 0$, independent Gaussian increments $W_t - W_s \sim \mathcal{N}(0, t-s)$ for $s \lt t$, and continuous sample paths. The basic object of stochastic calculus; the limit of scaled random walks; the noise term in Black–Scholes. Non-differentiable everywhere with probability one — a warning that intuition built on smooth calculus can mislead here.
Martingales
$\{M_n\}$ is a martingale with respect to a filtration $\{\mathcal{F}_n\}$ if $\mathbb{E}[|M_n|] \lt \infty$, $M_n$ is $\mathcal{F}_n$-measurable, and:
$$\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n.$$
Intuitively: a fair game. The optional stopping theorem (under bounded-stopping-time hypotheses) extends this to random stopping times: $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$. The one-line explanation for why you cannot beat a fair casino — and the workhorse of no-arbitrage pricing.
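Optional stopping can be checked numerically when its hypotheses hold. Below, a fair $\pm 1$ walk (a martingale) is stopped at a first-exit time capped at $T = 100$ (thresholds arbitrary); because the stopping time is bounded, $\mathbb{E}[M_\tau] = 0$ exactly:

```python
import random
import statistics

random.seed(13)

def stopped_value():
    # Fair +-1 random walk, stopped on first hit of {-3, +5},
    # with the stopping time capped at 100 steps (so tau is bounded).
    s = 0
    for _ in range(100):
        s += random.choice((-1, 1))
        if s in (-3, 5):
            break
    return s

vals = [stopped_value() for _ in range(50_000)]
mean_stopped = statistics.fmean(vals)   # E[M_tau] = E[M_0] = 0
```

Remove the cap and stop only at $+5$, and the empirical mean drifts toward $5$: the classic illustration of why the theorem needs its hypotheses.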
§ XIII. Measure-Theoretic Foundations
Everything so far has coasted on an informal reading of $P$ as "a function that assigns probabilities to events". For most working calculations this is enough — but for limit theorems, for infinite-dimensional processes like Brownian motion, and for the no-arbitrage pricing theorems of quantitative finance, we need the proper machinery. The payoff is not mere rigour; it is the ability to swap limits and integrals, change measures, and make sense of integrals against paths that have no derivative anywhere.
$\sigma$-algebras and measurable spaces
A $\sigma$-algebra $\mathcal{F}$ on a set $\Omega$ is a collection of subsets containing $\Omega$ and closed under complementation and countable unions. The pair $(\Omega, \mathcal{F})$ is a measurable space. The canonical example on $\mathbb{R}$ is the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ — the smallest $\sigma$-algebra containing all open sets.
A measure $\mu: \mathcal{F} \to [0, \infty]$ satisfies $\mu(\emptyset) = 0$ and countable additivity: $\mu\!\left(\bigsqcup_i A_i\right) = \sum_i \mu(A_i)$. A probability measure is a measure with $\mu(\Omega) = 1$. Lebesgue measure on $\mathbb{R}$ is the unique translation-invariant Borel measure assigning length $b - a$ to $[a, b]$.
Measurable functions and random variables
$f: (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B})$ is measurable if $f^{-1}(B) \in \mathcal{F}$ for every $B \in \mathcal{B}$. A random variable is precisely a measurable function on a probability space; the definition we gave in §II was an informal cousin.
The Lebesgue integral
Built in three stages:
- For simple functions $\varphi = \sum_{i=1}^n c_i \mathbf{1}_{A_i}$: $\int \varphi\, d\mu = \sum_i c_i\, \mu(A_i)$.
- For non-negative measurable $f$: $\int f\, d\mu = \sup\!\left\{\int \varphi\, d\mu : 0 \le \varphi \le f,\ \varphi \text{ simple}\right\}$.
- For general $f$: split $f = f^+ - f^-$ and set $\int f = \int f^+ - \int f^-$ whenever at least one is finite.
The Lebesgue integral extends the Riemann integral and behaves much better under limits — which is where its power lives.
The three pillars of convergence
- Monotone Convergence
- If $0 \le f_n \uparrow f$ a.e., then $\int f_n\, d\mu \to \int f\, d\mu$.
- Fatou's Lemma
- For non-negative measurable $f_n$: $\int \liminf_n f_n\, d\mu \le \liminf_n \int f_n\, d\mu$.
- Dominated Convergence
- If $f_n \to f$ a.e.\ and $|f_n| \le g$ with $\int g\, d\mu \lt \infty$, then $\int f_n\, d\mu \to \int f\, d\mu$.
These are the only reliable ways to swap $\lim$ and $\int$. Without them, nearly every probabilistic limit argument stalls.
Fubini–Tonelli
On a product measure space $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2, \mu_1 \otimes \mu_2)$, iterated integration is legal:
- Tonelli: for non-negative measurable $f$, any order works and all are equal (possibly $+\infty$).
- Fubini: for $f$ absolutely integrable, any order works and gives the same finite value.
Radon–Nikodym
Given $\sigma$-finite measures $\mu, \nu$ on $(\Omega, \mathcal{F})$ with $\mu \ll \nu$ (absolutely continuous: $\nu(A) = 0 \Rightarrow \mu(A) = 0$), there exists an essentially unique measurable $f \ge 0$ with:
$$\mu(A) = \int_A f\, d\nu \quad \text{for every } A \in \mathcal{F}.$$
We write $f = d\mu / d\nu$, the Radon–Nikodym derivative. A PDF is precisely $dP/d\lambda$ with respect to Lebesgue measure $\lambda$. A likelihood ratio is $dP_1/dP_0$. A pricing kernel is $d\mathbb{Q}/d\mathbb{P}$. Three different dialects for the same object.
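The likelihood-ratio reading of $d\mu/d\nu$ is exactly what importance sampling exploits: integrate against one measure using samples from another, reweighted by the derivative. A Gaussian sketch (the mean shift is arbitrary):

```python
import random
import math

random.seed(17)
mu = 2.0      # estimate E_P[X] for P = N(mu, 1) using samples from Q = N(0, 1)
n = 200_000
total = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)             # sample from Q
    w = math.exp(mu * x - mu * mu / 2.0)   # dP/dQ at x: ratio of the densities
    total += x * w

estimate = total / n   # E_Q[X dP/dQ] = E_P[X] = mu
```

Girsanov's theorem (§ XV) is the continuous-time version of this same change of measure, with the exponential weight playing the role of $d\mathbb{Q}/d\mathbb{P}$.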
Borel–Cantelli
For events $A_1, A_2, \ldots$ let $\{A_n \text{ i.o.}\} = \limsup_n A_n = \bigcap_{N} \bigcup_{n \ge N} A_n$ — the event that infinitely many $A_n$ occur. Then:
- BC I. If $\sum_n P(A_n) \lt \infty$, then $P(A_n \text{ i.o.}) = 0$.
- BC II. If the $A_n$ are independent and $\sum_n P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$.
A zero–one dichotomy that undergirds almost-sure convergence proofs.
§ XIV. Filtrations & Martingales
Conditional expectation, rigorously
For a sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ and $X \in L^1(\Omega, \mathcal{F}, P)$, the conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the $P$-a.s.\ unique $\mathcal{G}$-measurable random variable satisfying:
$$\int_A \mathbb{E}[X \mid \mathcal{G}]\, dP = \int_A X\, dP \quad \text{for every } A \in \mathcal{G}.$$
Existence is guaranteed by Radon–Nikodym applied to the signed measure $A \mapsto \int_A X\, dP$. For $X \in L^2$, $\mathbb{E}[X \mid \mathcal{G}]$ is exactly the orthogonal projection of $X$ onto the closed subspace $L^2(\mathcal{G})$ — conditional expectation is a Hilbert-space projection.
The axioms you will use every day
- Linearity
- $\mathbb{E}[aX + bY \mid \mathcal{G}] = a\,\mathbb{E}[X \mid \mathcal{G}] + b\,\mathbb{E}[Y \mid \mathcal{G}]$.
- Tower
- If $\mathcal{H} \subseteq \mathcal{G}$: $\mathbb{E}\!\bigl[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}\bigr] = \mathbb{E}[X \mid \mathcal{H}]$.
- Take out what is known
- If $Y$ is $\mathcal{G}$-measurable and $XY \in L^1$: $\mathbb{E}[XY \mid \mathcal{G}] = Y\, \mathbb{E}[X \mid \mathcal{G}]$.
- Independence
- If $X \perp \mathcal{G}$: $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$.
- Conditional Jensen
- For convex $\varphi$: $\varphi\!\bigl(\mathbb{E}[X \mid \mathcal{G}]\bigr) \le \mathbb{E}[\varphi(X) \mid \mathcal{G}]$.
Filtrations, adaptedness, stopping times
A filtration is an increasing family $\{\mathcal{F}_t\}_{t \ge 0}$ of sub-$\sigma$-algebras of $\mathcal{F}$. Read $\mathcal{F}_t$ as "the information observable by time $t$". A process $X_t$ is adapted if $X_t$ is $\mathcal{F}_t$-measurable for every $t$.
A stopping time is a $[0, \infty]$-valued random variable $\tau$ with $\{\tau \le t\} \in \mathcal{F}_t$ for every $t$. First-entry times into a Borel set are stopping times; last-exit times generally are not — you would need the future to know.
Martingales, sub- and super-
An adapted, integrable process $M_t$ is:
- a martingale if $\mathbb{E}[M_t \mid \mathcal{F}_s] = M_s$ for $s \le t$;
- a submartingale if $\mathbb{E}[M_t \mid \mathcal{F}_s] \ge M_s$;
- a supermartingale if $\mathbb{E}[M_t \mid \mathcal{F}_s] \le M_s$.
Mnemonic: supermartingales go down (in expectation). Your wealth in a casino is a supermartingale; in a perfectly fair game it is a martingale.
Doob's maximal inequality
For a non-negative submartingale $X_t$ and $\lambda \gt 0$:
$$P\!\left(\sup_{s \le t} X_s \ge \lambda\right) \le \frac{\mathbb{E}[X_t]}{\lambda}.$$
The $L^p$ version ($p \gt 1$):
$$\mathbb{E}\!\left[\sup_{s \le t} X_s^p\right] \le \left(\frac{p}{p-1}\right)^{\!p} \mathbb{E}[X_t^p].$$
Remarkably, control of $X_t$ in $L^p$ gives control of the running maximum in $L^p$ — a uniform-in-time bound from a pointwise one.
Optional stopping, with honest hypotheses
For a martingale $M$ and stopping time $\tau$, $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$ provided at least one of the following:
- $\tau$ is bounded by some constant $T \lt \infty$;
- $M$ is uniformly bounded and $\tau \lt \infty$ a.s.;
- $\mathbb{E}[\tau] \lt \infty$ and the increments $|M_n - M_{n-1}|$ are uniformly bounded.
Without some such condition, the theorem fails: a simple random walk is a martingale, and $\tau = \inf\{n : S_n = 1\}$ is finite a.s., yet $\mathbb{E}[S_\tau] = 1 \ne 0 = \mathbb{E}[S_0]$. The failure: $\mathbb{E}[\tau] = \infty$.
Martingale convergence
An $L^1$-bounded martingale — one with $\sup_t \mathbb{E}[|M_t|] \lt \infty$ — converges $P$-a.s.\ to a limit $M_\infty \in L^1$. In particular, a non-negative supermartingale always converges a.s. These are tools for proving a.s.\ convergence of algorithms (stochastic approximation, likelihood ratios, posteriors).
§ XV. Brownian Motion & Itô Calculus
Brownian motion, properly
A standard Brownian motion (Wiener process) $\{W_t\}_{t \ge 0}$ on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, P)$ satisfies:
- $W_0 = 0$ a.s.;
- independent increments: for $s \lt t$, $W_t - W_s$ is independent of $\mathcal{F}_s$;
- Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t - s)$;
- continuous sample paths a.s.
Existence is a theorem (Wiener 1923): several constructions work — Lévy's piecewise-linear approximation, Kolmogorov's extension via finite-dimensional distributions, or Donsker's invariance principle (random walks, rescaled, converge in distribution to $W$).
Pathological pathwise regularity
With probability one:
- $t \mapsto W_t$ is continuous but nowhere differentiable;
- $W$ has infinite total variation on every interval;
- the quadratic variation is finite and deterministic: $\langle W \rangle_t = t$, meaning $\sum_i (W_{t_{i+1}} - W_{t_i})^2 \xrightarrow{L^2} t$ as the partition mesh shrinks.
The finite quadratic variation is the whole key. It is what prevents a classical Riemann–Stieltjes integral $\int H\, dW$ from existing, and what forces the Itô construction.
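Both variation claims can be seen in simulation (mesh size arbitrary): the squared increments sum to $t$, while the absolute increments sum to roughly $\sqrt{2n/\pi}$, diverging as the mesh shrinks:

```python
import random
import math

random.seed(19)
n = 200_000          # number of increments on [0, 1]
dt = 1.0 / n
qv = 0.0
tv = 0.0
for _ in range(n):
    dw = random.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
    qv += dw * dw    # quadratic variation: should concentrate at t = 1
    tv += abs(dw)    # total variation: grows like sqrt(2n/pi), diverges
```

The informal rule $(dW_t)^2 = dt$ of Itô calculus is precisely the statement that `qv` is deterministic in the limit.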
Useful symmetries
If $W$ is standard Brownian motion, so are: $-W_t$ (reflection); $W_{t+s} - W_s$ (Markov property); $\sqrt{c}\, W_{t/c}$ (scaling); $t\, W_{1/t}$ for $t \gt 0$ (time inversion). These identities are how closed-form quantities get computed.
The Itô integral
For an adapted process $H$ with $\mathbb{E}\!\int_0^T H_s^2\, ds \lt \infty$, the Itô integral $\int_0^t H_s\, dW_s$ is constructed as the $L^2$-limit of "left-endpoint" Riemann sums $\sum_i H_{t_i}(W_{t_{i+1}} - W_{t_i})$. The choice of left endpoint is not cosmetic: evaluating at the midpoint produces the Stratonovich integral, which behaves like ordinary calculus but destroys the martingale property.
Itô isometry (the identity that makes the whole edifice work):
$$\mathbb{E}\!\left[\left(\int_0^t H_s\, dW_s\right)^{\!2}\right] = \mathbb{E}\!\left[\int_0^t H_s^2\, ds\right].$$
As a consequence, $M_t := \int_0^t H_s\, dW_s$ is a mean-zero martingale with variance $\mathbb{E}\!\int_0^t H_s^2\, ds$.
Itô's formula
Let $X_t$ be an Itô process: $dX_t = \mu_t\, dt + \sigma_t\, dW_t$. For $f \in C^{1,2}([0, T] \times \mathbb{R})$:
$$df(t, X_t) = \frac{\partial f}{\partial t}\, dt + \frac{\partial f}{\partial x}\, dX_t + \frac{1}{2} \frac{\partial^2 f}{\partial x^2}\, \sigma_t^2\, dt.$$
The extra half-second-derivative term — sometimes called the Itô correction — is the mathematical signature of quadratic variation. In an informal calculus: $(dW_t)^2 = dt$.
Geometric Brownian motion
The SDE $dS_t = \mu\, S_t\, dt + \sigma\, S_t\, dW_t$ has closed-form solution:
$$S_t = S_0 \exp\!\left(\bigl(\mu - \tfrac{1}{2}\sigma^2\bigr) t + \sigma\, W_t\right).$$
Apply Itô to $f(x) = \log x$ to see where the $-\tfrac{1}{2}\sigma^2$ comes from. This is the foundational model of log-normal asset prices in Black–Scholes theory.
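A simulation from the closed form (parameters arbitrary) confirms $\mathbb{E}[S_t] = S_0 e^{\mu t}$: the drift survives in expectation even though the exponent carries the $-\tfrac{1}{2}\sigma^2$ correction.

```python
import random
import math
import statistics

random.seed(23)
mu, sigma, t, s0, reps = 0.05, 0.2, 1.0, 100.0, 200_000
# Exact solution: S_t = S_0 exp((mu - sigma^2/2) t + sigma W_t), W_t ~ N(0, t).
vals = [
    s0 * math.exp((mu - 0.5 * sigma**2) * t
                  + sigma * random.gauss(0.0, math.sqrt(t)))
    for _ in range(reps)
]
mean_s = statistics.fmean(vals)   # should approach S_0 * exp(mu * t)
```

Dropping the $-\tfrac{1}{2}\sigma^2$ term from the exponent inflates the mean to $S_0 e^{(\mu + \sigma^2/2)t}$, a quick way to convince yourself the Itô correction is not optional.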
Girsanov's theorem
Given an adapted process $\theta_t$ satisfying Novikov's condition $\mathbb{E}\exp\!\bigl(\tfrac{1}{2}\int_0^T \theta_s^2\, ds\bigr) \lt \infty$, define a new measure $\mathbb{Q}$ by:
$$\frac{d\mathbb{Q}}{d\mathbb{P}} = \exp\!\left(-\int_0^T \theta_s\, dW_s - \frac{1}{2} \int_0^T \theta_s^2\, ds\right).$$
Then $\widetilde W_t := W_t + \int_0^t \theta_s\, ds$ is a standard Brownian motion under $\mathbb{Q}$. Drift is not intrinsic to the process — it is a feature of the measure.
Exercises & Solutions
Work each problem with pen in hand before opening the solution. The solution is a conversation; the struggle is the lesson.
Exercise 1. A disease has prevalence $0.1\%$ in the population. A test has sensitivity $99\%$ (true positive rate) and specificity $98\%$ (true negative rate). A random individual tests positive. What is the probability they have the disease?
Solution
Let $D$ denote "has disease" and $+$ the positive test. We want $P(D \mid +)$. By Bayes:
$$P(D \mid +) = \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid D^c)\, P(D^c)}.$$
Plug in $P(D) = 0.001$, $P(+ \mid D) = 0.99$, $P(+ \mid D^c) = 0.02$:
$$P(D \mid +) = \frac{0.99 \cdot 0.001}{0.99 \cdot 0.001 + 0.02 \cdot 0.999} = \frac{0.00099}{0.02097} \approx 0.0472.$$
Despite a seemingly accurate test, the posterior is under $5\%$. Base rates dominate when the disease is rare — a result that recurs in imbalanced classification and fraud detection. The false positive rate is small per negative case, but negatives are so numerous they flood the positives.
$n$ balls are thrown independently and uniformly into $n$ bins. Let $X$ be the number of empty bins after all throws. Compute $\mathbb{E}[X]$.
Solution
Let $I_j = \mathbf{1}\{\text{bin } j \text{ is empty}\}$ for $j = 1, \ldots, n$, so that $X = \sum_{j=1}^n I_j$. By linearity of expectation — which holds without independence:
$$\mathbb{E}[X] = \sum_{j=1}^n \mathbb{E}[I_j] = \sum_{j=1}^n P(\text{bin } j \text{ empty}).$$Each ball independently avoids bin $j$ with probability $1 - 1/n$. Across $n$ balls:
$$P(\text{bin } j \text{ empty}) = \left(1 - \frac{1}{n}\right)^n.$$Hence $\mathbb{E}[X] = n\bigl(1 - 1/n\bigr)^n \to n/e$ as $n \to \infty$. Around $36.8\%$ of bins sit empty in expectation — an effect responsible for hash collisions, birthday paradoxes, and the coupon collector's friends.
The key move: never compute joint distributions when you can compute marginals. The $I_j$ are dependent, but linearity doesn't care.
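Both the exact expectation and a quick simulation are easy to check (a Python sketch; $n = 1000$, the trial count, and the seed are arbitrary choices):

```python
import math, random

n = 1000
exact = n * (1 - 1 / n) ** n      # E[X] from the indicator argument
print(exact, n / math.e)          # close to the n/e limit

rng = random.Random(1)
trials = 2000
total = 0
for _ in range(trials):
    bins = [0] * n
    for _ in range(n):
        bins[rng.randrange(n)] += 1
    total += bins.count(0)
sim = total / trials
print(sim)                        # simulated mean number of empty bins
```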
A coin's bias $P \sim \mathrm{Uniform}(0, 1)$. Conditional on $P = p$, we flip the coin $n$ times and let $X$ be the number of heads. Find $\mathbb{E}[X]$ and $\mathrm{Var}(X)$.
Solution
Expectation. Conditional on $P$, $X \sim \mathrm{Binomial}(n, P)$, so $\mathbb{E}[X \mid P] = nP$. By the tower rule:
$$\mathbb{E}[X] = \mathbb{E}\!\bigl[\mathbb{E}[X \mid P]\bigr] = \mathbb{E}[nP] = n \cdot \tfrac{1}{2} = \tfrac{n}{2}.$$Variance. We need both pieces of Eve's law:
- $\mathrm{Var}(X \mid P) = nP(1 - P)$, so $\mathbb{E}[\mathrm{Var}(X \mid P)] = n\, \mathbb{E}[P(1-P)] = n\!\left(\tfrac{1}{2} - \tfrac{1}{3}\right) = \tfrac{n}{6}$.
- $\mathbb{E}[X \mid P] = nP$, so $\mathrm{Var}(\mathbb{E}[X \mid P]) = n^2 \mathrm{Var}(P) = n^2/12$.
Summing:
$$\mathrm{Var}(X) = \frac{n}{6} + \frac{n^2}{12}.$$Compare with a Binomial$(n, 1/2)$ of variance $n/4$: the uniform prior on $P$ inflates variance by the $n^2/12$ term — the "variation across parameter worlds". Aleatoric plus epistemic uncertainty, in essence.
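A seeded Monte Carlo check of both moments (Python sketch; $n = 20$ and the trial count are illustrative). A useful cross-check: with a uniform prior the marginal of $X$ is uniform on $\{0, \ldots, n\}$, whose variance $((n+1)^2 - 1)/12$ indeed equals $n/6 + n^2/12$.

```python
import random

n, trials = 20, 100_000
rng = random.Random(2)
xs = []
for _ in range(trials):
    p = rng.random()                                    # P ~ Uniform(0, 1)
    xs.append(sum(rng.random() < p for _ in range(n)))  # X | P ~ Bin(n, P)
mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials
print(mean, n / 2)               # tower rule: 10
print(var, n / 6 + n**2 / 12)    # Eve's law: 20/6 + 400/12 ≈ 36.67
```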
Let $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Exp}(\lambda)$. Derive the maximum likelihood estimator of $\lambda$, determine its bias, and compute its asymptotic variance.
Solution
Likelihood. $L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n \exp\!\left(-\lambda \sum x_i\right)$. The log-likelihood:
$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i.$$Setting $\ell'(\lambda) = n/\lambda - \sum x_i = 0$ gives:
$$\hat\lambda_{\text{MLE}} = \frac{n}{\sum_{i=1}^n X_i} = \frac{1}{\bar X}.$$Bias. $\sum X_i \sim \mathrm{Gamma}(n, 1/\lambda)$, so $1/\sum X_i$ has an inverse-Gamma distribution. A direct calculation using $\mathbb{E}[1/Y]$ for $Y \sim \mathrm{Gamma}(n, 1/\lambda)$:
$$\mathbb{E}[\hat\lambda] = \mathbb{E}\!\left[\frac{n}{\sum X_i}\right] = n \cdot \frac{\lambda}{n-1} = \frac{n}{n-1}\, \lambda.$$Biased upward by the factor $n/(n-1)$, which tends to $1$ as $n \to \infty$, so the bias vanishes asymptotically. The bias-corrected estimator is $\tilde\lambda = \tfrac{n-1}{n} \hat\lambda$.
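The upward bias is visible in simulation even at tiny sample sizes (Python sketch; $\lambda = 2$, $n = 5$, and the seed are illustrative choices):

```python
import random

lam, n, trials = 2.0, 5, 100_000   # illustrative rate and a tiny sample size
rng = random.Random(3)
est = 0.0
for _ in range(trials):
    s = sum(rng.expovariate(lam) for _ in range(n))
    est += n / s                   # the MLE on this sample
est /= trials
print(est, lam * n / (n - 1))      # expected value of the MLE: 2.5
```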
Asymptotic variance. The Fisher information is:
$$\mathcal{I}(\lambda) = -\mathbb{E}\!\left[\frac{\partial^2}{\partial \lambda^2} \log f(X; \lambda)\right] = -\mathbb{E}\!\left[-\frac{1}{\lambda^2}\right] = \frac{1}{\lambda^2}.$$By the MLE asymptotic normality theorem:
$$\sqrt{n}(\hat\lambda - \lambda) \xrightarrow{d} \mathcal{N}(0, \lambda^2).$$Assume $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 I)$ and prior $\boldsymbol{\beta} \sim \mathcal{N}(0, \tau^2 I)$. Show that the MAP estimator of $\boldsymbol{\beta}$ is the ridge-regression solution, and identify the regularisation parameter.
Solution
The posterior density satisfies $\pi(\boldsymbol{\beta} \mid \mathbf{Y}) \propto p(\mathbf{Y} \mid \boldsymbol{\beta})\, \pi(\boldsymbol{\beta})$. Taking logs:
$$\log \pi(\boldsymbol{\beta} \mid \mathbf{Y}) = -\frac{1}{2\sigma^2} \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 - \frac{1}{2\tau^2} \|\boldsymbol{\beta}\|^2 + \text{const}.$$Maximising is the same as minimising the negative, and multiplying by $2\sigma^2$ preserves the argmin:
$$\hat{\boldsymbol{\beta}}_{\text{MAP}} = \arg\min_{\boldsymbol{\beta}}\ \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\, \|\boldsymbol{\beta}\|^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}.$$Exactly the ridge objective. The closed-form solution is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\!\top}\mathbf{X} + \lambda I)^{-1} \mathbf{X}^{\!\top}\mathbf{Y}$.
Interpretation. A tight prior ($\tau$ small) shrinks estimates aggressively toward zero; a diffuse prior ($\tau \to \infty$) recovers OLS. Replacing the Gaussian prior with a Laplace prior $\boldsymbol{\beta}_i \stackrel{\text{iid}}{\sim} \mathrm{Laplace}(0, b)$ gives an $\ell_1$ penalty — the lasso — and with it, sparsity.
This exercise is really a worked proof that regularisation and priors are one object dressed in two vocabularies.
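With one feature the ridge solution collapses to a scalar, which makes the correspondence easy to verify numerically (Python sketch; the generative parameters and seed are illustrative):

```python
import random

rng = random.Random(4)
beta_true, sigma, tau = 1.5, 0.5, 1.0     # illustrative generative settings
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [beta_true * x + rng.gauss(0, sigma) for x in xs]

lam = sigma**2 / tau**2                   # MAP regularisation parameter
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
beta_map = sxy / (sxx + lam)              # scalar ridge closed form

def obj(b):
    # Ridge objective: squared error plus lam * b^2
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) + lam * b * b

# The closed form should minimise the (convex) objective
assert obj(beta_map) <= obj(beta_map + 1e-3)
assert obj(beta_map) <= obj(beta_map - 1e-3)
print(beta_map)
```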
Let $Y = f(x_0) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of the training data. Let $\hat f$ be an estimator trained on i.i.d. data. Prove that the expected squared-error at $x_0$ decomposes as:
$$\mathbb{E}\!\bigl[(Y - \hat f(x_0))^2\bigr] = \sigma^2 + \bigl(\mathbb{E}[\hat f(x_0)] - f(x_0)\bigr)^2 + \mathrm{Var}(\hat f(x_0)).$$Solution
Write $\hat f \equiv \hat f(x_0)$ and $f \equiv f(x_0)$ for brevity. Expand around $f$:
$$Y - \hat f = (f + \varepsilon) - \hat f = \varepsilon + (f - \hat f).$$Square and take expectation:
$$\mathbb{E}[(Y - \hat f)^2] = \mathbb{E}[\varepsilon^2] + 2\, \mathbb{E}[\varepsilon (f - \hat f)] + \mathbb{E}[(f - \hat f)^2].$$The cross term vanishes: $\varepsilon$ is independent of $\hat f$ (which depends only on training data), and $\mathbb{E}[\varepsilon] = 0$, so $\mathbb{E}[\varepsilon(f - \hat f)] = (f - \mathbb{E}[\hat f])\cdot 0 = 0$. The first term is $\sigma^2$.
For the third term, add and subtract $\mathbb{E}[\hat f]$:
$$\mathbb{E}[(f - \hat f)^2] = \mathbb{E}\!\bigl[\bigl((f - \mathbb{E}[\hat f]) + (\mathbb{E}[\hat f] - \hat f)\bigr)^2\bigr].$$Expanding, $(f - \mathbb{E}[\hat f])$ is a constant, and the cross term $2(f - \mathbb{E}[\hat f])\, \mathbb{E}[\mathbb{E}[\hat f] - \hat f] = 0$. Hence:
$$\mathbb{E}[(f - \hat f)^2] = \underbrace{(f - \mathbb{E}[\hat f])^2}_{\text{bias}^2} + \underbrace{\mathbb{E}[(\hat f - \mathbb{E}[\hat f])^2]}_{\text{variance}}.$$Combining, the claim follows. $\sigma^2$ is irreducible — no choice of model eliminates it; the rest is the eternal tradeoff between approximation error (bias) and estimation error (variance), mediated by model complexity.
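The decomposition can be watched numerically. A sketch with a deliberately biased estimator, here $\hat f$ shrinking the mean of $m$ training observations by a factor $c$, so that all three terms are non-trivially present (all parameters and the seed are illustrative):

```python
import random

rng = random.Random(5)
f0, sigma = 2.0, 1.0
m, c = 10, 0.8              # training size and shrinkage factor, illustrative
trials = 100_000

lhs = 0.0
fhats = []
for _ in range(trials):
    train = [f0 + rng.gauss(0, sigma) for _ in range(m)]
    fhat = c * sum(train) / m        # a deliberately biased estimator
    y = f0 + rng.gauss(0, sigma)     # independent test observation
    lhs += (y - fhat) ** 2
    fhats.append(fhat)
lhs /= trials

mean_fhat = sum(fhats) / trials
bias2 = (mean_fhat - f0) ** 2        # theoretical value: ((c-1) f0)^2 = 0.16
var = sum((fh - mean_fhat) ** 2 for fh in fhats) / trials  # ~ c^2 sigma^2 / m
print(lhs, sigma**2 + bias2 + var)   # the two sides should agree closely
```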
Let $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$. Derive a closed form for $D_{\mathrm{KL}}(p \,\|\, q)$.
Solution
By definition,
$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_p\!\left[\log \frac{p(X)}{q(X)}\right] = \mathbb{E}_p[\log p(X)] - \mathbb{E}_p[\log q(X)].$$For a Gaussian, $\log p(x) = -\tfrac{1}{2}\log(2\pi\sigma_1^2) - \tfrac{(x-\mu_1)^2}{2\sigma_1^2}$, so:
$$\mathbb{E}_p[\log p(X)] = -\tfrac{1}{2}\log(2\pi\sigma_1^2) - \tfrac{1}{2}.$$(Since $\mathbb{E}_p[(X - \mu_1)^2] = \sigma_1^2$.) Likewise,
$$\mathbb{E}_p[\log q(X)] = -\tfrac{1}{2}\log(2\pi\sigma_2^2) - \frac{\mathbb{E}_p[(X - \mu_2)^2]}{2\sigma_2^2}.$$Expand $(X - \mu_2)^2 = (X - \mu_1)^2 + 2(X - \mu_1)(\mu_1 - \mu_2) + (\mu_1 - \mu_2)^2$ and take expectation under $p$:
$$\mathbb{E}_p[(X - \mu_2)^2] = \sigma_1^2 + (\mu_1 - \mu_2)^2.$$Putting it together:
$$\boxed{\,D_{\mathrm{KL}}(p \,\|\, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.\,}$$The multivariate analogue, with $\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$ in $d$ dimensions:
$$D_{\mathrm{KL}} = \tfrac{1}{2}\!\left[\log \tfrac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - d + \mathrm{tr}(\boldsymbol{\Sigma}_2^{-1} \boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^{\!\top} \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\right].$$This formula is the analytic heartbeat of variational autoencoders: the KL term between encoder $q_\phi(z \mid x)$ and standard-normal prior is an instance of it.
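The univariate closed form is easy to cross-check against a Monte Carlo estimate of $\mathbb{E}_p[\log p - \log q]$ (Python sketch, in nats; parameters and seed are illustrative):

```python
import math, random

def kl_gauss(mu1, s1, mu2, s2):
    # Closed form for KL(N(mu1, s1^2) || N(mu2, s2^2)), in nats
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

def logpdf(x, mu, s):
    return -0.5 * math.log(2 * math.pi * s * s) - (x - mu) ** 2 / (2 * s * s)

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0   # illustrative parameters
closed = kl_gauss(mu1, s1, mu2, s2)

rng = random.Random(6)
n = 200_000
mc = sum(
    logpdf(x, mu1, s1) - logpdf(x, mu2, s2)
    for x in (rng.gauss(mu1, s1) for _ in range(n))
) / n
print(closed, mc)   # closed form vs Monte Carlo
```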
You place a Beta$(\alpha, \beta)$ prior on a coin's probability of heads $\theta$ and observe $k$ heads in $n$ flips. Derive the posterior distribution and compare the posterior mean, MLE, and prior mean.
Solution
Likelihood: $p(k \mid \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}$. Prior: $\pi(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$. The posterior is their product, up to normalisation:
$$\pi(\theta \mid k) \propto \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1}.$$This is a Beta density with updated parameters:
$$\theta \mid k \sim \mathrm{Beta}(\alpha + k,\ \beta + n - k).$$The posterior mean is:
$$\hat\theta_{\text{post}} = \frac{\alpha + k}{\alpha + \beta + n}.$$Rewrite as a convex combination:
$$\hat\theta_{\text{post}} = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}} + \frac{n}{\alpha + \beta + n} \cdot \underbrace{\frac{k}{n}}_{\text{MLE}}.$$The posterior mean is a precision-weighted average of the prior mean and the MLE. As $n \to \infty$, data dominates and the estimator converges to $k/n$. For small $n$ or strong prior ($\alpha + \beta$ large), the prior dominates. This is the Bayesian version of shrinkage.
The parameters $\alpha$ and $\beta$ act as pseudo-counts: a Beta$(\alpha, \beta)$ prior is equivalent to having observed $\alpha - 1$ prior heads and $\beta - 1$ prior tails — an interpretation that makes the update rule almost trivial to remember.
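The convex-combination identity is exact, so it can be verified with rational arithmetic rather than floating point (Python sketch; the prior and data values are illustrative):

```python
from fractions import Fraction

alpha, beta, k, n = 2, 2, 7, 10      # illustrative prior and data
post_mean = Fraction(alpha + k, alpha + beta + n)

prior_mean = Fraction(alpha, alpha + beta)
mle = Fraction(k, n)
w = Fraction(alpha + beta, alpha + beta + n)   # weight on the prior
combo = w * prior_mean + (1 - w) * mle
print(post_mean, combo)
assert post_mean == combo            # the identity holds exactly
```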
A trading strategy's daily P&L has unknown mean $\mu$ and standard deviation $\sigma \approx 2\%$. You want a $95\%$ confidence interval for $\mu$ with half-width at most $0.1\%$. How many trading days do you need?
Solution
By the CLT, $\bar X_n \approx \mathcal{N}(\mu, \sigma^2 / n)$ for large $n$. A $95\%$ interval has half-width:
$$h = z_{0.025} \cdot \frac{\sigma}{\sqrt{n}} \approx 1.96 \cdot \frac{\sigma}{\sqrt{n}}.$$Requiring $h \le 0.001$ with $\sigma = 0.02$:
$$\sqrt{n} \ge \frac{1.96 \cdot 0.02}{0.001} = 39.2 \quad\Longrightarrow\quad n \ge 1537.$$Roughly six trading years — sobering. The dependence $h \propto 1/\sqrt{n}$ is the fundamental drag of statistics: to halve the uncertainty, you need four times the data. To reach a tenth, a hundred times.
Assumptions worth auditing. The CLT requires finite variance and (for this calculation's cleanness) approximate independence. Financial returns show volatility clustering and heavy tails — the effective sample size can be much smaller than the nominal one, and the Gaussian approximation in the tail is unreliable exactly where risk lives.
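The calculation itself, in code (Python; 252 trading days per year is the usual convention, used here only for the year conversion):

```python
import math

sigma, half_width, z = 0.02, 0.001, 1.96
n = math.ceil((z * sigma / half_width) ** 2)
print(n)          # 39.2^2 = 1536.64, so 1537 trading days
print(n / 252)    # ≈ 6.1 years at 252 trading days per year
```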
Let $(X, Y) \sim \mathcal{N}\!\left(\begin{pmatrix}\mu_X \\ \mu_Y\end{pmatrix},\ \begin{pmatrix}\sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2\end{pmatrix}\right).$ Derive $\mathbb{E}[Y \mid X = x]$ and $\mathrm{Var}(Y \mid X = x)$.
Solution
Apply the general Gaussian conditioning formula from §IV with blocks of size 1:
$$\mathbb{E}[Y \mid X = x] = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (x - \mu_X) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X).$$ $$\mathrm{Var}(Y \mid X = x) = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} = \sigma_Y^2 (1 - \rho^2).$$Two observations.
- The conditional mean is linear in $x$ with slope $\rho \sigma_Y / \sigma_X$. This is exactly the population regression coefficient of $Y$ on $X$. Gaussian conditioning is linear regression.
- The conditional variance does not depend on $x$ — Gaussians are homoscedastic. It shrinks by the factor $1 - \rho^2$: larger correlation means more information from $X$ about $Y$.
Sanity check. If $\rho = 0$: $\mathbb{E}[Y \mid X] = \mu_Y$, $\mathrm{Var}(Y \mid X) = \sigma_Y^2$. Independence recovered. If $|\rho| \to 1$: $\mathrm{Var}(Y \mid X) \to 0$; knowing $X$ pins down $Y$ exactly.
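"Gaussian conditioning is linear regression" can be seen directly: simulate the pair, regress $Y$ on $X$, and compare the slope and residual variance with the formulas (Python sketch; parameters and seed are illustrative):

```python
import math, random

rng = random.Random(7)
muX, muY, sX, sY, rho = 1.0, -1.0, 2.0, 1.5, 0.6   # illustrative
n = 200_000
xs, ys = [], []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append(muX + sX * z1)
    ys.append(muY + sY * (rho * z1 + math.sqrt(1 - rho**2) * z2))

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
vx = sum((x - mx) ** 2 for x in xs) / n
slope = cov / vx                  # should approach rho * sY / sX = 0.45
resid = sum((y - my - slope * (x - mx)) ** 2 for x, y in zip(xs, ys)) / n
print(slope, rho * sY / sX)
print(resid, sY**2 * (1 - rho**2))   # 2.25 * 0.64 = 1.44
```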
Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Using Chebyshev, find an $n$ guaranteeing $P(|\bar X_n - \mu| \ge \varepsilon) \le \delta$. Compare with the Hoeffding bound when $X_i \in [0, 1]$.
Solution
Chebyshev. $\bar X_n$ has mean $\mu$ and variance $\sigma^2 / n$. Applying the inequality:
$$P(|\bar X_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n \varepsilon^2}.$$For the bound to be at most $\delta$, pick $n \ge \sigma^2 / (\delta \varepsilon^2)$. The dependence is $1/\delta$ — polynomial in the confidence parameter.
Hoeffding. For $X_i \in [0, 1]$:
$$P(|\bar X_n - \mu| \ge \varepsilon) \le 2 e^{-2 n \varepsilon^2}.$$To make this at most $\delta$: $n \ge \dfrac{1}{2\varepsilon^2} \log(2/\delta)$. Dependence on $\delta$ is $\log(1/\delta)$ — exponentially better.
Concrete comparison. Take $\varepsilon = 0.05$, $\delta = 0.01$, and bounded $X_i \in [0,1]$ so $\sigma^2 \le 1/4$. Chebyshev demands $n \ge 10{,}000$; Hoeffding demands only $n \ge 1{,}060$.
Chebyshev assumes only finite variance and is tight only for pathological distributions; Hoeffding exploits boundedness and tightens the tail dramatically. When you know more, you get more — a general lesson in concentration theory.
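The concrete comparison, computed exactly (Python; rational arithmetic for the Chebyshev bound avoids a floating-point off-by-one in the ceiling):

```python
from fractions import Fraction
import math

eps, delta = Fraction(1, 20), Fraction(1, 100)      # 0.05 and 0.01, exactly
var_bound = Fraction(1, 4)                          # sigma^2 <= 1/4 on [0, 1]
n_cheb = math.ceil(var_bound / (delta * eps**2))    # exact rational ceiling
n_hoeff = math.ceil(math.log(2 / float(delta)) / (2 * float(eps) ** 2))
print(n_cheb, n_hoeff)   # 10000 vs 1060
```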
A weather Markov chain has states {Sun, Rain}. Transitions: from Sun, $P(\text{Sun} \to \text{Rain}) = 0.1$. From Rain, $P(\text{Rain} \to \text{Sun}) = 0.5$. Find the stationary distribution and the expected return time to Sun.
Solution
Transition matrix (rows = from, columns = to), ordering [Sun, Rain]:
$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix}.$$A stationary distribution $\boldsymbol{\pi} = (\pi_S, \pi_R)$ satisfies $\boldsymbol{\pi} P = \boldsymbol{\pi}$ and $\pi_S + \pi_R = 1$. Component-wise:
$$\pi_S = 0.9\, \pi_S + 0.5\, \pi_R, \qquad \pi_R = 0.1\, \pi_S + 0.5\, \pi_R.$$The first gives $0.1\, \pi_S = 0.5\, \pi_R$, i.e.\ $\pi_S = 5 \pi_R$. With the normalisation $6 \pi_R = 1$:
$$\pi_S = \tfrac{5}{6}, \qquad \pi_R = \tfrac{1}{6}.$$Expected return time. For an irreducible, positive recurrent chain, the mean return time to state $i$ is $1/\pi_i$ (Kac's lemma). Thus:
$$\mathbb{E}[\text{return time to Sun}] = \frac{1}{\pi_S} = \frac{6}{5} = 1.2 \text{ days}.$$That return time is small because the chain spends most of its time in Sun anyway; the punchline is the general principle, not the number.
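For a two-state chain the stationary distribution has a closed form, since the flow-balance equation alone fixes the ratio, and iterating the distribution confirms convergence (Python sketch):

```python
p_sr, p_rs = 0.1, 0.5          # Sun -> Rain and Rain -> Sun
pi_S = p_rs / (p_sr + p_rs)    # flow balance: pi_S * p_sr = pi_R * p_rs
pi_R = p_sr / (p_sr + p_rs)
print(pi_S, pi_R)              # 5/6 and 1/6
print(1 / pi_S)                # Kac: mean return time to Sun = 1.2 days

# Cross-check: push any starting distribution through the chain
dist = (1.0, 0.0)
for _ in range(200):
    dist = (dist[0] * 0.9 + dist[1] * 0.5, dist[0] * 0.1 + dist[1] * 0.5)
```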
To estimate $\mu = \mathbb{E}_p[h(X)]$ when sampling from $p$ is costly, we draw $X_i \stackrel{\text{iid}}{\sim} q$ (with $q(x) \gt 0$ wherever $p(x) h(x) \ne 0$) and use the importance-sampled estimator:
$$\hat\mu = \frac{1}{n} \sum_{i=1}^n h(X_i)\, \frac{p(X_i)}{q(X_i)}.$$Show that $\hat\mu$ is unbiased, derive its variance, and find the $q$ minimising that variance.
Solution
Unbiasedness. For each $i$,
$$\mathbb{E}_q\!\left[h(X)\, \frac{p(X)}{q(X)}\right] = \int h(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \int h(x)\, p(x)\, dx = \mu.$$So $\mathbb{E}[\hat\mu] = \mu$ regardless of $q$ (given the support condition).
Variance. Using $\mathrm{Var}(\hat\mu) = \tfrac{1}{n} \mathrm{Var}_q\!\bigl[h(X) p(X)/q(X)\bigr]$:
$$\mathrm{Var}(\hat\mu) = \frac{1}{n}\!\left[\int \frac{h^2(x)\, p^2(x)}{q(x)}\, dx - \mu^2\right].$$Optimal $q$. We minimise $\int h^2 p^2 / q$ subject to $\int q = 1$. A Lagrangian argument (or Cauchy–Schwarz) gives:
$$q^*(x) \propto |h(x)|\, p(x).$$If $h \ge 0$, the minimum variance is zero: $q^*(x) = h(x) p(x)/\mu$ makes the importance weight $\mu$ for every sample, so $\hat\mu = \mu$ exactly. The catch is that finding $\mu$ to normalise $q^*$ is the original problem. But the structural lesson survives:
Sample preferentially where $|h(x)|\, p(x)$ is large.
Practical diagnostic. The effective sample size is $n_{\text{eff}} = (\sum w_i)^2 / \sum w_i^2$ where $w_i = p(X_i)/q(X_i)$. When $n_{\text{eff}} \ll n$, the estimate is dominated by a handful of extreme weights — a sign that $q$ is a poor proposal and the variance may even be infinite.
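A sketch of the estimator together with the ESS diagnostic (Python; the target is $p = \mathcal{N}(0,1)$ with $h(x) = x^2$, so the true value is $1$; the proposal $\mathcal{N}(0, 1.5^2)$ and the seed are illustrative):

```python
import math, random

def logpdf(x, mu, s):
    return -0.5 * math.log(2 * math.pi * s * s) - (x - mu) ** 2 / (2 * s * s)

rng = random.Random(8)
n = 100_000
ws, hs = [], []
for _ in range(n):
    x = rng.gauss(0, 1.5)                                     # draw from q
    ws.append(math.exp(logpdf(x, 0, 1) - logpdf(x, 0, 1.5)))  # p/q weight
    hs.append(x * x)                                          # h(x) = x^2
mu_hat = sum(w * h for w, h in zip(ws, hs)) / n
n_eff = sum(ws) ** 2 / sum(w * w for w in ws)
print(mu_hat)       # true value is E_p[X^2] = 1
print(n_eff / n)    # weight-degeneracy diagnostic
```

A heavier-tailed proposal than the target, as here, keeps the weights bounded; the reverse choice can make the variance infinite.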
Let $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Uniform}(0, 1)$ and $M_n = \max_i X_i$. Find the CDF, PDF, and expectation of $M_n$. Then describe the limiting distribution of $n(1 - M_n)$ as $n \to \infty$.
Solution
CDF. Max is at most $m$ iff all are at most $m$:
$$F_{M_n}(m) = P(M_n \le m) = P(X_1 \le m, \ldots, X_n \le m) = m^n, \quad m \in [0,1].$$Differentiate: $f_{M_n}(m) = n\, m^{n-1}$ on $[0, 1]$. Expectation:
$$\mathbb{E}[M_n] = \int_0^1 m \cdot n\, m^{n-1}\, dm = \frac{n}{n+1}.$$The max creeps toward $1$ but never arrives — the gap is $1/(n+1)$, of order $1/n$.
Limiting law. Consider $Y_n = n(1 - M_n)$. Its CDF:
$$P(Y_n \le y) = P\!\left(M_n \ge 1 - \tfrac{y}{n}\right) = 1 - \left(1 - \tfrac{y}{n}\right)^n \to 1 - e^{-y} \quad \text{for } y \ge 0.$$So $Y_n \xrightarrow{d} \mathrm{Exp}(1)$. The gap between the max and the upper boundary is, on the $1/n$ scale, exponentially distributed.
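Both facts, the $n/(n+1)$ mean and the exponential limit of the rescaled gap, show up cleanly in simulation (Python sketch; $n = 50$, the trial count, and the seed are illustrative):

```python
import random

rng = random.Random(9)
n, trials = 50, 100_000
sum_max = sum_gap = 0.0
for _ in range(trials):
    m = max(rng.random() for _ in range(n))
    sum_max += m
    sum_gap += n * (1 - m)
mean_max, mean_gap = sum_max / trials, sum_gap / trials
print(mean_max, n / (n + 1))   # E[M_n] = 50/51
print(mean_gap)                # approaches E[Exp(1)] = 1 as n grows
```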
A gambler starts with capital $k$ and plays a fair game: on each turn her wealth moves by $\pm 1$ with probability $\tfrac{1}{2}$ each, independently. She stops when her wealth reaches either $0$ (ruin) or $N$ (target), with $0 \lt k \lt N$. Find the probability she reaches $N$ before $0$, using a martingale argument.
Solution
Let $S_n$ be her fortune after $n$ turns; $S_0 = k$. Each step $S_{n+1} - S_n$ is $\pm 1$ with equal probability, independent of the past. Hence $\{S_n\}$ is a martingale:
$$\mathbb{E}[S_{n+1} \mid S_1, \ldots, S_n] = S_n + \mathbb{E}[\pm 1] = S_n.$$Let $\tau = \inf\{n : S_n \in \{0, N\}\}$. From any interior state the walk can reach a boundary within $N$ consecutive steps with probability at least $2^{-N}$, so $\tau \lt \infty$ a.s. and $\mathbb{E}[\tau] \lt \infty$; moreover the stopped process is bounded by $N$. These conditions let us apply the optional stopping theorem:
$$\mathbb{E}[S_\tau] = \mathbb{E}[S_0] = k.$$But $S_\tau$ takes only the values $0$ or $N$:
$$\mathbb{E}[S_\tau] = 0 \cdot P(\text{ruin}) + N \cdot P(\text{reach } N).$$Equating:
$$\boxed{\,P(\text{reach } N \text{ before } 0) = \frac{k}{N}.\,}$$Remark on ruin. Start from $k = 100$ aiming for $N = 200$: you succeed with probability $1/2$. Aim for $N = 1000$ from the same starting capital $k = 100$, and the probability drops to $1/10$. A fair game at the individual step is a losing proposition against a wealthier counterparty — the institutional edge in a casino.
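The boundary-value recursion $p_k = \tfrac{1}{2}(p_{k-1} + p_{k+1})$ with $p_0 = 0$, $p_N = 1$ gives the same answer and is trivial to solve numerically (Python sketch; $N = 20$ and the sweep count are illustrative, the latter generous rather than tuned):

```python
# Boundary-value recursion for ruin probabilities, solved by relaxation.
N = 20
p = [0.0] * (N + 1)
p[N] = 1.0                     # absorbing boundaries: p_0 = 0, p_N = 1
for _ in range(20_000):        # repeated sweeps converge for this system
    for k in range(1, N):
        p[k] = 0.5 * (p[k - 1] + p[k + 1])
print(p[10])                   # martingale answer: k/N = 0.5
```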
The biased case. If $P(\text{win}) = p \ne 1/2$ with $q = 1-p$, $S_n$ is no longer a martingale, but $(q/p)^{S_n}$ is. The same optional-stopping logic yields:
$$P(\text{reach } N) = \frac{1 - (q/p)^k}{1 - (q/p)^N}.$$Let $X_1, X_2, \ldots$ be i.i.d.\ standard normal. Show that $\limsup_n X_n / \sqrt{2 \log n} = 1$ almost surely. (Just prove the $\le 1$ direction — it is the one that uses BC I.)
Solution
Fix $\varepsilon \gt 0$ and let $a_n = (1 + \varepsilon)\sqrt{2 \log n}$. Define $A_n = \{X_n \ge a_n\}$. The Mills ratio gives the standard Gaussian tail bound:
$$P(X_n \ge a) \le \frac{1}{a \sqrt{2\pi}}\, e^{-a^2/2}, \qquad a \gt 0.$$Substituting $a = a_n$:
$$P(A_n) \le \frac{1}{(1+\varepsilon)\sqrt{2\pi \cdot 2 \log n}}\, n^{-(1+\varepsilon)^2} = O\!\left(\frac{1}{n^{(1+\varepsilon)^2} \sqrt{\log n}}\right).$$Since $(1 + \varepsilon)^2 \gt 1$, the sum $\sum_n P(A_n)$ converges. The first Borel–Cantelli lemma then gives:
$$P(A_n \text{ i.o.}) = 0,$$i.e.\ $X_n \lt (1 + \varepsilon)\sqrt{2 \log n}$ eventually, a.s. Hence $\limsup_n X_n/\sqrt{2 \log n} \le 1 + \varepsilon$ a.s., and taking $\varepsilon \downarrow 0$ along rationals gives the stated upper bound.
The matching $\ge 1$ bound uses BC II (the independence half) applied to $A_n = \{X_n \ge (1 - \varepsilon)\sqrt{2 \log n}\}$: the sum now diverges, independence applies, and infinitely many $A_n$ occur.
Takeaway. $\sqrt{2 \log n}$ is the right scale for the maximum of $n$ i.i.d.\ Gaussians — the foundation of extreme-value results for light-tailed distributions.
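The $\sqrt{2\log n}$ scale is easy to eyeball in one draw (Python sketch; $n$ and the seed are arbitrary, and the ratio fluctuates from seed to seed):

```python
import math, random

rng = random.Random(14)
n = 100_000
mx = max(rng.gauss(0.0, 1.0) for _ in range(n))
scale = math.sqrt(2 * math.log(n))
print(mx, scale, mx / scale)   # the ratio hovers a little below 1
```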
Let $\mathbb{P}$ make $X \sim \mathcal{N}(0, 1)$ and $\mathbb{Q}$ make $X \sim \mathcal{N}(\mu, 1)$, on the same measurable space. Find $d\mathbb{Q}/d\mathbb{P}$ and use it to compute $\mathbb{E}_{\mathbb{P}}[X \cdot \mathbf{1}_{\{X \gt c\}}]$ in terms of $\mathbb{Q}$-probabilities.
Solution
Both measures have densities with respect to Lebesgue, call them $\varphi$ (standard normal) and $\varphi_\mu$ (shifted). The Radon–Nikodym derivative is their ratio:
$$\frac{d\mathbb{Q}}{d\mathbb{P}}(x) = \frac{\varphi_\mu(x)}{\varphi(x)} = \exp\!\left(\mu x - \tfrac{1}{2}\mu^2\right).$$The rule: for any integrable $g$, $\mathbb{E}_{\mathbb{Q}}[g(X)] = \mathbb{E}_{\mathbb{P}}\!\bigl[g(X)\, (d\mathbb{Q}/d\mathbb{P})(X)\bigr]$. Running it in reverse with $g(x) = x \cdot \mathbf{1}_{\{x \gt c\}}$, we want something of the form $\mathbb{E}_{\mathbb{P}}[g(X)] = \mathbb{E}_{\mathbb{Q}}[g(X)\, (d\mathbb{P}/d\mathbb{Q})(X)]$. Inverting the RN derivative:
$$\frac{d\mathbb{P}}{d\mathbb{Q}}(x) = \exp\!\left(-\mu x + \tfrac{1}{2}\mu^2\right).$$Pick $\mu = 1$ (any nonzero $\mu$ works; this one cancels the $x$ neatly). Then:
$$\mathbb{E}_{\mathbb{P}}[X \cdot \mathbf{1}_{\{X \gt c\}}] = \mathbb{E}_{\mathbb{Q}}\!\left[X\, e^{-X + 1/2} \cdot \mathbf{1}_{\{X \gt c\}}\right].$$Under $\mathbb{Q}$, $X = 1 + Z$ for $Z \sim \mathcal{N}(0,1)$, so $X\, e^{-X + 1/2} = (1 + Z)\, e^{-Z - 1/2}$. A direct Gaussian integral — after completing the square — yields:
$$\mathbb{E}_{\mathbb{P}}[X \cdot \mathbf{1}_{\{X \gt c\}}] = \varphi(c),$$the standard normal PDF at $c$. (A one-line alternative: note that $X\,\varphi(X) = -\varphi'(X)$ and integrate.)
Moral. Changing measure is a tool for evaluating awkward integrals by shifting the "centre of gravity" of the distribution. In finance, this is exactly what the risk-neutral measure does: under $\mathbb{Q}$, stocks drift at the riskless rate, turning pricing into a cleaner expectation.
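The identity $\mathbb{E}_{\mathbb{P}}[X \mathbf{1}_{\{X \gt c\}}] = \varphi(c)$ is also checkable by plain numerical integration (Python sketch; $c = 0.7$ is an arbitrary choice, and the tail beyond $10$ is numerically negligible):

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

c = 0.7                          # illustrative threshold
a, b, steps = c, 10.0, 200_000   # truncate the tail at 10
h = (b - a) / steps
total = 0.5 * (a * phi(a) + b * phi(b))   # trapezoid rule endpoints
total += sum((a + i * h) * phi(a + i * h) for i in range(1, steps))
total *= h
print(total, phi(c))             # both ≈ 0.3123
```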
Let $S_n = X_1 + \cdots + X_n$ be a sum of i.i.d.\ mean-zero random variables with $\mathrm{Var}(X_1) = \sigma^2 \lt \infty$. Show that for any $\lambda \gt 0$:
$$P\!\left(\max_{1 \le k \le n} |S_k| \ge \lambda\right) \le \frac{n \sigma^2}{\lambda^2}.$$Solution
$S_n$ is a martingale with respect to its natural filtration. Since $x \mapsto x^2$ is convex, conditional Jensen gives:
$$\mathbb{E}[S_{k+1}^2 \mid \mathcal{F}_k] \ge (\mathbb{E}[S_{k+1} \mid \mathcal{F}_k])^2 = S_k^2,$$so $\{S_k^2\}$ is a non-negative submartingale. Apply Doob's maximal inequality to it at level $\lambda^2$:
$$P\!\left(\max_{1 \le k \le n} S_k^2 \ge \lambda^2\right) \le \frac{\mathbb{E}[S_n^2]}{\lambda^2}.$$The event $\{\max_k S_k^2 \ge \lambda^2\}$ equals $\{\max_k |S_k| \ge \lambda\}$, and $\mathbb{E}[S_n^2] = n \sigma^2$ (variance of sum of independents). The claim follows.
Comparison with Chebyshev. The naïve Chebyshev bound $P(|S_n| \ge \lambda) \le n\sigma^2 / \lambda^2$ controls only the final position. Doob upgrades this to control over the entire trajectory, at no cost. That jump — from a pointwise bound to a uniform-in-time bound — is the signature superpower of martingale techniques, and appears throughout online learning regret analyses.
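A seeded simulation confirms the bound holds over the whole path, comfortably, since Doob's inequality is far from tight here (Python sketch; $n = 100$, $\lambda = 25$, and the trial count are illustrative):

```python
import random

rng = random.Random(10)
n, lam, trials = 100, 25.0, 20_000
exceed = 0
for _ in range(trials):
    s = 0.0
    for _ in range(n):
        s += rng.gauss(0, 1)          # mean-zero steps with sigma^2 = 1
        if abs(s) >= lam:
            exceed += 1
            break
emp = exceed / trials
bound = n * 1.0 / lam**2              # n sigma^2 / lambda^2 = 0.16
print(emp, bound)                     # empirical frequency vs Doob bound
```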
Using Itô's formula, compute $d(W_t^2)$ and use the result to show $\mathbb{E}[W_t^2] = t$. Then compute $\mathbb{E}[W_t^4]$.
Solution
Square. Apply Itô to $f(x) = x^2$ with $X_t = W_t$, so $\mu_t = 0$, $\sigma_t = 1$. Then $f'(x) = 2x$, $f''(x) = 2$:
$$d(W_t^2) = 2 W_t\, dW_t + \tfrac{1}{2} \cdot 2 \cdot 1 \cdot dt = 2 W_t\, dW_t + dt.$$Integrating from $0$ to $t$:
$$W_t^2 = 2 \int_0^t W_s\, dW_s + t.$$Take expectations. The Itô integral $\int_0^t W_s\, dW_s$ is a martingale starting at $0$, so its expectation vanishes, and:
$$\mathbb{E}[W_t^2] = t. \quad \checkmark$$We have just recovered a fact we already knew from the Gaussian definition, but through a method that doesn't need the Gaussian structure — it works for any Itô process.
Fourth moment. Apply Itô to $g(x) = x^4$: $g'(x) = 4x^3$, $g''(x) = 12 x^2$. With $X_t = W_t$:
$$d(W_t^4) = 4 W_t^3\, dW_t + \tfrac{1}{2} \cdot 12\, W_t^2\, dt = 4 W_t^3\, dW_t + 6 W_t^2\, dt.$$Integrate and take expectations. The Itô integral contributes zero:
$$\mathbb{E}[W_t^4] = 6 \int_0^t \mathbb{E}[W_s^2]\, ds = 6 \int_0^t s\, ds = 3 t^2.$$Consistent with $W_t \sim \mathcal{N}(0, t)$ (fourth moment of a Gaussian is $3\sigma^4$). Itô's formula has effectively converted a moment computation into an ODE.
Takeaway. Itô's formula systematically reduces the moments of Itô processes to deterministic integrals — and this is the reason stochastic calculus can price options in closed form.
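A Monte Carlo check of both moments straight from $W_t \sim \mathcal{N}(0, t)$ (Python sketch; $t = 2$, the sample count, and the seed are illustrative):

```python
import math, random

rng = random.Random(11)
t, trials = 2.0, 200_000
m2 = m4 = 0.0
for _ in range(trials):
    w = rng.gauss(0.0, math.sqrt(t))   # W_t ~ N(0, t)
    m2 += w * w
    m4 += w**4
m2 /= trials
m4 /= trials
print(m2, t)           # E[W_t^2] = t = 2
print(m4, 3 * t**2)    # E[W_t^4] = 3 t^2 = 12
```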
Verify by Itô's formula that $S_t = S_0 \exp\!\bigl((\mu - \tfrac{1}{2}\sigma^2) t + \sigma W_t\bigr)$ solves the SDE $dS_t = \mu S_t\, dt + \sigma S_t\, dW_t$. Then compute $\mathbb{E}[S_t]$ and $\mathrm{Var}(S_t)$.
Solution
Verification. Set $X_t = (\mu - \tfrac{1}{2}\sigma^2) t + \sigma W_t$, so $dX_t = (\mu - \tfrac{1}{2}\sigma^2)\, dt + \sigma\, dW_t$. Apply Itô to $f(x) = S_0 e^x$, noting $f' = f'' = f$, so both derivatives evaluate to $f(X_t) = S_t$, with the diffusion coefficient of $X_t$ equal to $\sigma$:
$$dS_t = S_t\, dX_t + \tfrac{1}{2} S_t\, \sigma^2\, dt = S_t\bigl[(\mu - \tfrac{1}{2}\sigma^2)\, dt + \sigma\, dW_t\bigr] + \tfrac{1}{2} \sigma^2 S_t\, dt.$$The $-\tfrac{1}{2}\sigma^2$ inside $dX_t$ cancels against the $+\tfrac{1}{2}\sigma^2$ from the Itô correction. What remains:
$$dS_t = \mu S_t\, dt + \sigma S_t\, dW_t. \quad \checkmark$$The exponent's strange-looking $-\tfrac{1}{2}\sigma^2$ is exactly what the Itô correction demands.
Moments. $S_t = S_0 \exp(Y_t)$ where $Y_t \sim \mathcal{N}\bigl((\mu - \tfrac{1}{2}\sigma^2)t,\ \sigma^2 t\bigr)$. Use the log-normal MGF $\mathbb{E}[e^{aY}] = \exp\!\bigl(a\, m + \tfrac{1}{2} a^2 v\bigr)$ for $Y \sim \mathcal{N}(m, v)$:
$$\mathbb{E}[S_t] = S_0 \exp\!\bigl((\mu - \tfrac{1}{2}\sigma^2) t + \tfrac{1}{2} \sigma^2 t\bigr) = S_0\, e^{\mu t}.$$A clean and reassuring result: the geometric-Brownian-motion asset grows at the deterministic rate $\mu$ in expectation, even though each path is noisy and the median grows slower (at $\mu - \tfrac{1}{2}\sigma^2$, the Itô-corrected drift). The gap $\tfrac{1}{2}\sigma^2$ is called the volatility drag.
For the second moment, apply the MGF again with $a = 2$:
$$\mathbb{E}[S_t^2] = S_0^2 \exp\!\bigl(2(\mu - \tfrac{1}{2}\sigma^2) t + 2 \sigma^2 t\bigr) = S_0^2\, e^{(2\mu + \sigma^2) t}.$$Subtracting:
$$\mathrm{Var}(S_t) = S_0^2\, e^{2 \mu t}\bigl(e^{\sigma^2 t} - 1\bigr).$$An A/B test compares two conversion rates. Group $A$ has $n_A$ users with sample mean $\bar X_A$; group $B$ has $n_B$ users with sample mean $\bar X_B$. Assume known variances $\sigma_A^2, \sigma_B^2$ for simplicity. Derive the test statistic for $H_0: \mu_A = \mu_B$ against $H_1: \mu_A \gt \mu_B$, and compute the sample size needed for power $1 - \beta$ at a specified effect size $\Delta = \mu_A - \mu_B$.
Solution
Test statistic. By the CLT (or exactly, if data are Gaussian), $\bar X_A - \bar X_B$ is approximately normal with mean $\mu_A - \mu_B$ and variance $\sigma_A^2/n_A + \sigma_B^2/n_B$. Under $H_0$:
$$Z = \frac{\bar X_A - \bar X_B}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}} \sim \mathcal{N}(0, 1).$$Reject $H_0$ at level $\alpha$ when $Z \gt z_\alpha$.
Power calculation. Under $H_1$ with true difference $\Delta$, $Z$ shifts:
$$Z \sim \mathcal{N}\!\left(\frac{\Delta}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}},\ 1\right).$$Power is:
$$1 - \beta = P(Z \gt z_\alpha \mid H_1) = \Phi\!\left(\frac{\Delta}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}} - z_\alpha\right).$$Sample size. Take $n_A = n_B = n$ and $\sigma_A = \sigma_B = \sigma$. Setting $\Phi^{-1}(1 - \beta) = z_\beta$ and rearranging:
$$n = \frac{2 \sigma^2 (z_\alpha + z_\beta)^2}{\Delta^2}.$$Detecting half the effect requires four times the sample — the same square-root law that haunted Exercise 9. Shrinking $\sigma$ (e.g.\ by better instrumentation or stratification) is the most efficient lever, because $n$ depends on $\sigma^2$.
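Plugging in typical choices (Python sketch; $\alpha = 0.05$ one-sided, $80\%$ power, and the $\sigma$, $\Delta$ values are illustrative; `statistics.NormalDist` from the standard library supplies the quantiles):

```python
import math
from statistics import NormalDist

alpha, beta_err = 0.05, 0.20      # one-sided 5% level, 80% power
sigma, delta = 0.1, 0.01          # illustrative noise scale and effect size
z_a = NormalDist().inv_cdf(1 - alpha)
z_b = NormalDist().inv_cdf(1 - beta_err)
n = math.ceil(2 * sigma**2 * (z_a + z_b) ** 2 / delta**2)
print(round(z_a, 3), round(z_b, 3))   # ≈ 1.645 and 0.842
print(n)                              # users per arm
```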
Let $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with both parameters unknown. Compute the $2 \times 2$ Fisher information matrix $\mathcal{I}(\mu, \sigma^2)$. Show the sample mean achieves the Cramér–Rao bound for $\mu$; find the CRB for any unbiased estimator of $\sigma^2$.
Solution
Per-observation log-density:
$$\log f(x; \mu, \sigma^2) = -\tfrac{1}{2} \log(2\pi \sigma^2) - \frac{(x - \mu)^2}{2 \sigma^2}.$$First derivatives (the score):
$$\frac{\partial \log f}{\partial \mu} = \frac{x - \mu}{\sigma^2}, \qquad \frac{\partial \log f}{\partial \sigma^2} = -\frac{1}{2\sigma^2} + \frac{(x - \mu)^2}{2 \sigma^4}.$$Second derivatives:
$$\frac{\partial^2 \log f}{\partial \mu^2} = -\frac{1}{\sigma^2}, \qquad \frac{\partial^2 \log f}{\partial \mu\, \partial \sigma^2} = -\frac{x - \mu}{\sigma^4}, \qquad \frac{\partial^2 \log f}{\partial (\sigma^2)^2} = \frac{1}{2\sigma^4} - \frac{(x - \mu)^2}{\sigma^6}.$$Taking $-\mathbb{E}$ kills the off-diagonal (since $\mathbb{E}[X - \mu] = 0$) and simplifies the diagonal using $\mathbb{E}[(X-\mu)^2] = \sigma^2$:
$$\mathcal{I}(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$The orthogonality of $\mu$ and $\sigma^2$ in the Fisher metric is what makes Gaussian inference so clean — the MLEs of $\mu$ and $\sigma^2$ are asymptotically independent.
CRBs. With $n$ observations the Fisher information scales by $n$. For any unbiased estimator:
$$\mathrm{Var}(\hat\mu) \ge \frac{\sigma^2}{n}, \qquad \mathrm{Var}(\widehat{\sigma^2}) \ge \frac{2 \sigma^4}{n}.$$The sample mean has variance exactly $\sigma^2/n$ — efficient. The unbiased sample variance $S^2 = \tfrac{1}{n-1}\sum(X_i - \bar X)^2$ has variance $2\sigma^4/(n-1)$, slightly exceeding the CRB; no unbiased estimator achieves it exactly.
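Both claims are visible in simulation (Python sketch; parameters, trial count, and seed are illustrative): the sample mean sits on the bound, while $S^2$ lands at $2\sigma^4/(n-1)$, just above it.

```python
import random, statistics

rng = random.Random(12)
mu, sigma, n, trials = 0.0, 2.0, 10, 50_000   # illustrative
means, s2s = [], []
for _ in range(trials):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(xs) / n)
    s2s.append(statistics.variance(xs))       # unbiased S^2 (n - 1 divisor)
var_mean = statistics.pvariance(means)
var_s2 = statistics.pvariance(s2s)
print(var_mean, sigma**2 / n)                 # CRB for mu: 0.4, attained
print(var_s2, 2 * sigma**4 / (n - 1))         # 32/9, above the CRB 32/10
```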
Let $X \sim \mathcal{N}(\mu, \sigma^2)$ and define $Y = e^X$. Derive the density of $Y$ (the log-normal), then compute $\mathbb{E}[Y]$ two different ways: by direct integration and by recognising the normal MGF.
Solution
Density. The map $g(x) = e^x$ is a smooth, strictly-increasing bijection $\mathbb{R} \to (0, \infty)$, with inverse $g^{-1}(y) = \log y$ and Jacobian $\tfrac{d}{dy} \log y = 1/y$. Applying the change-of-variables formula:
$$f_Y(y) = f_X(\log y)\, \left|\frac{1}{y}\right| = \frac{1}{y\, \sigma \sqrt{2\pi}}\, \exp\!\left(-\frac{(\log y - \mu)^2}{2\sigma^2}\right), \quad y \gt 0.$$Expectation via MGF. By definition, $\mathbb{E}[Y] = \mathbb{E}[e^X] = M_X(1)$, where the MGF of $\mathcal{N}(\mu, \sigma^2)$ is $M_X(t) = \exp(\mu t + \tfrac{1}{2}\sigma^2 t^2)$. Substituting $t = 1$:
$$\mathbb{E}[Y] = \exp\!\left(\mu + \tfrac{1}{2}\sigma^2\right).$$Expectation by direct integration. Complete the square:
$$\mathbb{E}[Y] = \int_{-\infty}^\infty \frac{e^x}{\sigma \sqrt{2\pi}}\, \exp\!\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) dx.$$Combining exponents: $-\tfrac{(x - \mu)^2}{2\sigma^2} + x = -\tfrac{1}{2\sigma^2}\!\left[x - (\mu + \sigma^2)\right]^2 + \mu + \tfrac{1}{2}\sigma^2$. The remaining integrand is a Gaussian density centred at $\mu + \sigma^2$ and integrates to $1$:
$$\mathbb{E}[Y] = \exp\!\left(\mu + \tfrac{1}{2}\sigma^2\right). \checkmark$$The $+\tfrac{1}{2}\sigma^2$ is the same quantity that appeared with opposite sign in the GBM solution (Exercise 20). Both are fingerprints of Itô / Jensen working in a world where $\log$ is concave: you cannot freely move a log across an expectation.
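The two routes agree with a third, brute-force Monte Carlo (Python sketch; $\mu$, $\sigma$, the sample count, and the seed are illustrative):

```python
import math, random

mu, sigma = 0.3, 0.8                      # illustrative parameters
closed = math.exp(mu + 0.5 * sigma**2)    # normal MGF at t = 1

rng = random.Random(13)
n = 200_000
mc = sum(math.exp(rng.gauss(mu, sigma)) for _ in range(n)) / n
print(closed, mc)   # both ≈ e^0.62 ≈ 1.859
```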
A binary symmetric channel transmits $X \in \{0, 1\}$ and outputs $Y = X \oplus N$ where $\oplus$ is XOR and $N \sim \mathrm{Bernoulli}(\varepsilon)$ is an independent noise bit ($0 \le \varepsilon \le 1/2$). Compute $I(X; Y)$ when $X \sim \mathrm{Bernoulli}(p)$, and find the channel capacity $C = \max_p I(X; Y)$.
Solution
Write $H_2(q) = -q \log q - (1-q) \log(1-q)$ for the binary entropy function (base 2 gives bits; use natural log for nats — all formulas scale).
Conditional entropy. Given $X = x$, $Y = x \oplus N$, so $Y \mid X \sim \mathrm{Bernoulli}(\varepsilon)$ regardless of $x$. Hence:
$$H(Y \mid X) = H_2(\varepsilon).$$Marginal entropy. $P(Y = 1) = P(X = 1) P(N = 0) + P(X = 0) P(N = 1) = p(1 - \varepsilon) + (1 - p) \varepsilon$. Call this $q$. Then:
$$H(Y) = H_2(q), \qquad q = p(1 - \varepsilon) + (1 - p) \varepsilon.$$Mutual information.
$$I(X; Y) = H(Y) - H(Y \mid X) = H_2(q) - H_2(\varepsilon).$$Capacity. $H_2(q)$ is maximised at $q = 1/2$, which occurs when $p = 1/2$ (and any $\varepsilon$). So:
$$C = 1 - H_2(\varepsilon) \quad \text{(bits per channel use)}.$$Limits:
- $\varepsilon = 0$ (perfect channel): $C = 1$ bit.
- $\varepsilon = 1/2$ (output independent of input): $C = 0$.
- $\varepsilon = 1$ (bit-flipped deterministically): $C = 1$ again — just invert the receiver.
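The capacity formula takes a few lines to evaluate numerically. A sketch (`h2` and `mutual_information` are illustrative names; the $\varepsilon = 0.11$ below is chosen only because it puts $C$ near $1/2$):

```python
import math

def h2(q: float) -> float:
    """Binary entropy in bits; h2(0) = h2(1) = 0 by convention."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def mutual_information(p: float, eps: float) -> float:
    """I(X;Y) = H2(q) - H2(eps) with q = p(1-eps) + (1-p)eps."""
    q = p * (1 - eps) + (1 - p) * eps
    return h2(q) - h2(eps)

eps = 0.11
capacity = 1 - h2(eps)  # attained at p = 1/2

print(capacity, mutual_information(0.5, eps))  # equal
print(mutual_information(0.3, eps))            # strictly below capacity
```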
Two independent Poisson processes $N_1, N_2$ on $[0, \infty)$ have rates $\lambda_1, \lambda_2$. (a) Show that the superposition $N = N_1 + N_2$ is a Poisson process and find its rate. (b) Given an event from $N$ occurs, what is the probability it came from process $1$? (c) Consider thinning: in a rate-$\lambda$ Poisson process, each event is independently kept with probability $p$. Show the kept events form a Poisson process with rate $\lambda p$.
Open solution
(a) Superposition. $N$ starts at $0$. For disjoint intervals, the increments of $N_1$ and $N_2$ are independent of each other and across intervals, so $N$ has independent increments too. For any interval of length $t$:
$$N_1(t) + N_2(t) \sim \mathrm{Poisson}(\lambda_1 t) + \mathrm{Poisson}(\lambda_2 t) \stackrel{d}{=} \mathrm{Poisson}((\lambda_1 + \lambda_2) t),$$using the fact that independent Poissons sum to a Poisson (check via MGFs). So $N$ is Poisson with rate $\lambda_1 + \lambda_2$.
(b) Colouring. Condition on an arrival of the superposition. By the memoryless property, the waiting times to the next events of the two processes are independent $E_1 \sim \mathrm{Exp}(\lambda_1)$ and $E_2 \sim \mathrm{Exp}(\lambda_2)$, and $P(E_1 < E_2) = \lambda_1 / (\lambda_1 + \lambda_2)$. So each arrival comes from process $1$ with probability $\lambda_1 / (\lambda_1 + \lambda_2)$, independently of all other arrivals. Each arrival's origin is an i.i.d.\ categorical draw.
(c) Thinning. Let $M(t)$ be the number of kept events in $[0, t]$. Conditional on $N(t) = n$, $M(t) \sim \mathrm{Binomial}(n, p)$. Unconditioning:
$$P(M(t) = k) = \sum_{n = k}^\infty \frac{(\lambda t)^n e^{-\lambda t}}{n!} \binom{n}{k} p^k (1 - p)^{n - k}.$$Factor out $k$-dependent terms and reindex with $m = n - k$:
$$= \frac{(\lambda t p)^k e^{-\lambda t}}{k!} \sum_{m = 0}^\infty \frac{(\lambda t (1 - p))^m}{m!} = \frac{(\lambda p t)^k e^{-\lambda p t}}{k!}.$$That is Poisson$(\lambda p t)$. Independence of increments is inherited from $N$, so the kept process is itself Poisson with rate $\lambda p$. The discarded events form an independent Poisson process with rate $\lambda(1 - p)$.
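Thinning is easy to check by simulation: count kept events over many runs and compare mean and variance against $\lambda p t$ (for a Poisson they coincide). A sketch with arbitrary rate, keep-probability, and horizon:

```python
import random

random.seed(1)

lam, p, t, trials = 4.0, 0.3, 2.0, 100_000

kept_counts = []
for _ in range(trials):
    # Sample N(t) ~ Poisson(lam * t) via exponential inter-arrival times.
    n, clock = 0, random.expovariate(lam)
    while clock <= t:
        n += 1
        clock += random.expovariate(lam)
    # Thin: keep each event independently with probability p.
    kept_counts.append(sum(random.random() < p for _ in range(n)))

mean = sum(kept_counts) / trials
var = sum((k - mean) ** 2 for k in kept_counts) / trials
print(mean, var)  # both close to lam * p * t = 2.4
```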
Cast the Bernoulli distribution in exponential-family form $p(x \mid \eta) = h(x)\, \exp\!\bigl(\eta\, T(x) - A(\eta)\bigr)$. Identify the natural parameter $\eta$, sufficient statistic $T(x)$, log-partition $A(\eta)$, and verify the identity $A'(\eta) = \mathbb{E}[T]$.
Open solution
Start from $p(x \mid \theta) = \theta^x (1 - \theta)^{1-x}$ for $x \in \{0, 1\}$. Take logs and re-arrange:
$$\log p(x \mid \theta) = x \log \theta + (1 - x) \log (1 - \theta) = x\, \log\!\frac{\theta}{1 - \theta} + \log(1 - \theta).$$Identify:
- Natural parameter: $\eta = \log \frac{\theta}{1 - \theta}$ — the logit of $\theta$.
- Sufficient statistic: $T(x) = x$.
- Base measure: $h(x) = 1$.
- Log-partition: $A(\eta) = -\log(1 - \theta) = \log(1 + e^\eta)$ — the softplus.
So $p(x \mid \eta) = \exp(\eta x - \log(1 + e^\eta))$ — a compact form that immediately tells us something useful.
The cumulant identity. For any exponential family, $A(\eta)$ is convex in $\eta$ and its derivatives generate cumulants of $T$. In particular:
$$A'(\eta) = \mathbb{E}[T(X)], \qquad A''(\eta) = \mathrm{Var}(T(X)).$$Check for Bernoulli: $A'(\eta) = \dfrac{e^\eta}{1 + e^\eta} = \sigma(\eta)$, the logistic sigmoid. And $\sigma(\eta) = \theta = \mathbb{E}[X]$. ✓ The second derivative: $A''(\eta) = \sigma(\eta)(1 - \sigma(\eta)) = \theta(1 - \theta) = \mathrm{Var}(X)$. ✓
Consequences. For $n$ i.i.d.\ observations, $\sum_i T(X_i) = \sum_i X_i$ is sufficient (Fisher–Neyman factorisation). The MLE in terms of $\eta$ satisfies $A'(\hat\eta) = \bar T = \bar X$, i.e.\ $\hat\theta = \bar X$ and $\hat\eta = \operatorname{logit}(\bar X)$ — the MLE of the natural parameter is the logit of the sample mean. Fisher information in natural parameters is $A''(\eta)$.
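The cumulant identities $A'(\eta) = \mathbb{E}[T]$ and $A''(\eta) = \mathrm{Var}(T)$ can be verified numerically with central finite differences (a quick check, not part of the derivation; the test point $\eta = 0.7$ is arbitrary):

```python
import math

def A(eta: float) -> float:
    """Log-partition of the Bernoulli in natural form: the softplus."""
    return math.log1p(math.exp(eta))

def sigmoid(eta: float) -> float:
    return 1 / (1 + math.exp(-eta))

eta, h = 0.7, 1e-5
# Central finite differences for A'(eta) and A''(eta).
d1 = (A(eta + h) - A(eta - h)) / (2 * h)
d2 = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2

theta = sigmoid(eta)
print(d1, theta)                 # A'(eta)  = sigma(eta)        = E[X]
print(d2, theta * (1 - theta))   # A''(eta) = sigma(1 - sigma)  = Var(X)
```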
Let $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Exp}(\lambda)$. In Exercise 4 we showed the MLE is $\hat\lambda = 1/\bar X_n$. Derive its asymptotic distribution via the delta method, starting from the CLT for $\bar X_n$.
Open solution
Step 1 — CLT for the mean. $X_i$ has mean $1/\lambda$ and variance $1/\lambda^2$. By the CLT:
$$\sqrt{n}\!\left(\bar X_n - \frac{1}{\lambda}\right) \xrightarrow{d} \mathcal{N}\!\left(0,\ \frac{1}{\lambda^2}\right).$$Step 2 — Delta method. Apply $g(x) = 1/x$, which is differentiable at $x = 1/\lambda$ with $g'(x) = -1/x^2$, so $g'(1/\lambda) = -\lambda^2$. The delta method gives:
$$\sqrt{n}\!\left(\frac{1}{\bar X_n} - \lambda\right) \xrightarrow{d} \mathcal{N}\!\left(0,\ [g'(1/\lambda)]^2 \cdot \frac{1}{\lambda^2}\right) = \mathcal{N}(0,\ \lambda^2).$$This matches precisely the MLE asymptotic variance we computed in Exercise 4 from Fisher information. Two paths to the same destination: the delta method (first-order Taylor expansion plus Slutsky) and the Fisher-information formula. They agree because MLEs are asymptotically normal with variance equal to the inverse Fisher information, and Fisher information transforms covariantly under a reparameterisation such as $g$.
Constructing a CI. The delta-method form gives an asymptotic $(1 - \alpha)$ confidence interval:
$$\hat\lambda \pm z_{\alpha/2} \cdot \frac{\hat\lambda}{\sqrt n}.$$We substitute the unknown $\lambda$ by $\hat\lambda$ in the standard error — a step justified by Slutsky's theorem.
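The claimed $1 - \alpha$ coverage can itself be checked by simulation: draw many samples, build the interval each time, and count how often it contains the true rate. A sketch with arbitrary $\lambda$ and $n$:

```python
import math
import random

random.seed(2)

lam_true, n, alpha, trials = 2.0, 200, 0.05, 5_000
z = 1.959963984540054  # z_{alpha/2} for alpha = 0.05

covered = 0
for _ in range(trials):
    xbar = sum(random.expovariate(lam_true) for _ in range(n)) / n
    lam_hat = 1 / xbar                      # MLE
    half = z * lam_hat / math.sqrt(n)       # delta-method standard error
    covered += lam_hat - half <= lam_true <= lam_hat + half

print(covered / trials)  # close to 0.95 for n = 200
```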
Given i.i.d.\ observations $X_1, \ldots, X_n \sim F$ with unknown mean $\mu = \mathbb{E}_F[X]$, the percentile bootstrap CI for $\mu$ is: resample with replacement $B$ times to form datasets $\mathbf{X}^{(b)}$; compute $\bar X^{(b)}$ for each; report the empirical $(\alpha/2, 1 - \alpha/2)$ quantiles of $\{\bar X^{(b)}\}_{b=1}^B$. Justify why this procedure is (approximately) a valid $(1 - \alpha)$ CI, and give one setting where it fails catastrophically.
Open solution
The plug-in principle. The unknown sampling distribution of $\bar X - \mu$ under $F$ is approximated by the conditional distribution of $\bar X^* - \bar X_n$ under the empirical $F_n$ (with $\bar X_n$ the original sample mean). As $n \to \infty$, $F_n \to F$ uniformly (Glivenko–Cantelli), so the approximation is sound for smooth functionals.
Why quantiles work. Write $\hat\theta = \bar X_n$ and $\hat\theta^* = \bar X^*$. If the distribution of $\hat\theta - \mu$ is well approximated by that of $\hat\theta^* - \hat\theta$, then for quantile $q_\alpha$ of the bootstrap distribution of $\hat\theta^* - \hat\theta$:
$$P\!\bigl(q_{\alpha/2} \le \hat\theta - \mu \le q_{1 - \alpha/2}\bigr) \approx 1 - \alpha.$$Rearranging gives $\hat\theta - q_{1 - \alpha/2} \le \mu \le \hat\theta - q_{\alpha/2}$ — the basic bootstrap. The percentile bootstrap uses quantiles of $\hat\theta^*$ directly; the two agree when the distribution is symmetric around $\hat\theta$.
Edgeworth refinement. Refined bootstrap intervals (bootstrap-$t$, BCa) attain $O(1/n)$ one-sided coverage error, one order better than the $O(1/\sqrt{n})$ of normal-approximation CIs, because the bootstrap captures skewness automatically. This is one reason it is the default in many applied fields.
Where it fails.
- Extremes. For $\hat\theta = \max X_i$, the bootstrap is inconsistent: resamples can never exceed the observed maximum, so bootstrap variance is systematically wrong. Subsampling or the $m$-out-of-$n$ bootstrap fixes this.
- Heavy tails. If $\mathbb{E}[X^2] = \infty$, the CLT itself fails, and so does the bootstrap for the sample mean.
- Dependent data. Naïve bootstrap destroys temporal or spatial correlation. Use the block bootstrap for time series.
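The procedure itself is only a few lines. A minimal sketch of the percentile bootstrap for the mean (`percentile_bootstrap_ci` is an illustrative name; the skewed test data are arbitrary):

```python
import random
import statistics

random.seed(3)

def percentile_bootstrap_ci(data, B=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean: resample with replacement,
    recompute the statistic, report empirical quantiles."""
    n = len(data)
    boot_means = sorted(
        statistics.fmean(random.choices(data, k=n)) for _ in range(B)
    )
    lo = boot_means[int((alpha / 2) * B)]
    hi = boot_means[int((1 - alpha / 2) * B) - 1]
    return lo, hi

# A skewed sample: exponential with true mean 1.
data = [random.expovariate(1.0) for _ in range(100)]
lo, hi = percentile_bootstrap_ci(data)
print(lo, hi)  # an interval straddling the sample mean
```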
A two-component Gaussian mixture has density $p(x) = \pi\, \mathcal{N}(x; \mu_1, 1) + (1 - \pi)\, \mathcal{N}(x; \mu_2, 1)$ with known variances ($=1$) and mixing weight $\pi$. Given i.i.d.\ data $x_1, \ldots, x_n$, derive the EM updates for $(\mu_1, \mu_2)$.
Open solution
Introduce latent variables $z_i \in \{1, 2\}$ indicating which component generated $x_i$. If we knew $z_i$, the complete-data log-likelihood would be trivially separable into two Gaussian likelihoods, each giving $\mu_k = $ mean of points assigned to component $k$.
E-step. Given current parameters, compute the posterior responsibility:
$$\gamma_i = P(z_i = 1 \mid x_i) = \frac{\pi\, \mathcal{N}(x_i; \mu_1, 1)}{\pi\, \mathcal{N}(x_i; \mu_1, 1) + (1 - \pi)\, \mathcal{N}(x_i; \mu_2, 1)}.$$This is the soft assignment.
M-step. Maximise the expected complete-data log-likelihood:
$$Q(\mu_1, \mu_2) = \sum_i \gamma_i \left[-\tfrac{1}{2}(x_i - \mu_1)^2\right] + \sum_i (1 - \gamma_i) \left[-\tfrac{1}{2}(x_i - \mu_2)^2\right] + \text{const}.$$Setting $\partial Q/\partial \mu_k = 0$:
$$\mu_1^{\text{new}} = \frac{\sum_i \gamma_i\, x_i}{\sum_i \gamma_i}, \qquad \mu_2^{\text{new}} = \frac{\sum_i (1 - \gamma_i)\, x_i}{\sum_i (1 - \gamma_i)}.$$Each mean is a responsibility-weighted average — a soft version of $k$-means.
Why EM works. Let $\ell(\theta) = \log p(x \mid \theta)$ be the observed-data log-likelihood. For any $q(z)$:
$$\ell(\theta) = \underbrace{\mathbb{E}_q[\log p(x, z \mid \theta)] - \mathbb{E}_q[\log q(z)]}_{\text{ELBO}} + \underbrace{D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x, \theta))}_{\ge 0}.$$E-step minimises the KL term by setting $q(z) = p(z \mid x, \theta^{\text{old}})$ — the ELBO becomes tight. M-step maximises the ELBO in $\theta$. Both steps increase (or preserve) $\ell$, so EM is monotone — it cannot decrease the likelihood, and converges to a stationary point (not necessarily the global maximum; initialisation matters).
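The two updates translate directly into code. A minimal sketch for the known-$\pi$, unit-variance case of this exercise (synthetic data and the deliberately poor initialisation are arbitrary choices):

```python
import math
import random

random.seed(4)

# Synthetic data from pi * N(-2, 1) + (1 - pi) * N(3, 1).
pi = 0.4
data = [random.gauss(-2, 1) if random.random() < pi else random.gauss(3, 1)
        for _ in range(2000)]

def normal_pdf(x: float, mu: float) -> float:
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

mu1, mu2 = -1.0, 1.0  # deliberately poor initialisation
for _ in range(50):
    # E-step: responsibilities gamma_i = P(z_i = 1 | x_i).
    gammas = []
    for x in data:
        a = pi * normal_pdf(x, mu1)
        b = (1 - pi) * normal_pdf(x, mu2)
        gammas.append(a / (a + b))
    # M-step: responsibility-weighted means.
    s1 = sum(gammas)
    mu1 = sum(g * x for g, x in zip(gammas, data)) / s1
    mu2 = sum((1 - g) * x for g, x in zip(gammas, data)) / (len(data) - s1)

print(mu1, mu2)  # near the true component means -2 and 3
```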
Given a latent-variable model $p(x, z \mid \theta)$ with marginal likelihood $p(x \mid \theta) = \int p(x, z \mid \theta)\, dz$, and any distribution $q(z)$ on the latents, derive the evidence lower bound (ELBO) on $\log p(x \mid \theta)$. Characterise exactly when the bound is tight.
Open solution
Derivation via Jensen. Multiply and divide by $q(z)$ inside the integral, then apply Jensen's inequality (log is concave):
$$\log p(x) = \log \int q(z)\, \frac{p(x, z)}{q(z)}\, dz \ge \int q(z) \log \frac{p(x, z)}{q(z)}\, dz =: \mathcal{L}(q, \theta).$$The right-hand side is the ELBO. Equivalently:
$$\mathcal{L}(q, \theta) = \mathbb{E}_q[\log p(x \mid z, \theta)] - D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid \theta)).$$The first term is a reconstruction-like likelihood under the latent posterior $q$; the second penalises departure of $q$ from the prior on $z$. This decomposition is what the VAE optimises.
The gap. A short manipulation:
$$\log p(x) - \mathcal{L}(q, \theta) = \int q(z)\, \log \frac{q(z)}{p(z \mid x, \theta)}\, dz = D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x, \theta)).$$So the bound is tight if and only if $q(z) = p(z \mid x, \theta)$ — i.e.\ $q$ is the true posterior. When the true posterior is intractable, we restrict $q$ to a tractable family and minimise the KL, equivalently maximising the ELBO.
Two classical regimes.
- EM (Exercise 29): we can compute $p(z \mid x, \theta)$ exactly. Set $q = p(z \mid x, \theta^{\text{old}})$ (E-step), then maximise $\mathcal{L}$ in $\theta$ (M-step). Alternating on a tight bound.
- Variational Bayes / VAEs: parametrise $q_\phi(z \mid x)$ by a neural network and jointly maximise $\mathcal{L}$ in both $\theta$ and $\phi$ via stochastic gradients (the reparameterisation trick makes this differentiable).
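For a model small enough to enumerate, the identity $\log p(x) - \mathcal{L}(q, \theta) = D_{\mathrm{KL}}(q \,\|\, p(z \mid x))$ can be verified exactly. A toy sketch with a single binary latent (the numbers are arbitrary):

```python
import math

# Toy model: z in {0, 1}; p(z) is the prior, p(x | z) the likelihood
# of the single observed x. Small enough to enumerate everything.
p_z = [0.6, 0.4]
p_x_given_z = [0.2, 0.9]

# Exact evidence and posterior.
joint = [p_z[z] * p_x_given_z[z] for z in (0, 1)]
log_px = math.log(sum(joint))
posterior = [j / sum(joint) for j in joint]

def elbo(q):
    """ELBO = E_q[log p(x, z)] - E_q[log q(z)]."""
    return sum(q[z] * (math.log(joint[z]) - math.log(q[z])) for z in (0, 1))

def kl(q, p):
    return sum(q[z] * math.log(q[z] / p[z]) for z in (0, 1))

q = [0.5, 0.5]  # an arbitrary variational distribution
print(log_px - elbo(q), kl(q, posterior))  # equal: gap = KL(q || posterior)
print(log_px - elbo(posterior))            # 0: tight at the true posterior
```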