The mathematical core every machine-learning and quantitative researcher returns to — distilled, annotated, and paired with worked exercises.
The grammar of machine learning — where every model is a map between vector spaces.
A vector $\mathbf{x}\in\mathbb{R}^n$ is an ordered tuple $\mathbf{x}=(x_1,\dots,x_n)^\top$. Vectors live in a vector space: a set closed under addition and scalar multiplication with the usual axioms. In ML, data points, gradients, and parameters are all vectors; in quant finance, asset returns, portfolio weights, and yield-curve shifts are vectors.
The Euclidean inner product of $\mathbf{x},\mathbf{y}\in\mathbb{R}^n$ is $$\langle\mathbf{x},\mathbf{y}\rangle \;=\; \mathbf{x}^\top\mathbf{y} \;=\; \sum_{i=1}^n x_i y_i \;=\; \|\mathbf{x}\|\,\|\mathbf{y}\|\cos\theta.$$ It measures alignment: orthogonality $\iff \langle\mathbf{x},\mathbf{y}\rangle=0$. The Cauchy–Schwarz inequality states $|\langle\mathbf{x},\mathbf{y}\rangle|\le\|\mathbf{x}\|\,\|\mathbf{y}\|$.
$\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$. The default in regression, kernels, optimization.
$\|\mathbf{x}\|_1 = \sum_i |x_i|$. Induces sparsity (Lasso).
$\|\mathbf{x}\|_\infty = \max_i |x_i|$. Worst-case / minimax bounds.
$\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$, convex for $p\ge 1$.
A matrix $A\in\mathbb{R}^{m\times n}$ represents a linear map $\mathbb{R}^n\to\mathbb{R}^m$ via $\mathbf{x}\mapsto A\mathbf{x}$. Column $j$ of $A$ is the image of $\mathbf{e}_j$, the $j$th standard basis vector. Matrix multiplication $AB$ composes maps; it is associative, distributive, but not commutative.
Transpose: $(A^\top)_{ij}=A_{ji}$. Properties: $(AB)^\top=B^\top A^\top$, $(A^\top)^{-1}=(A^{-1})^\top$.
Inverse: $A^{-1}$ exists iff $A$ is square and $\det A\neq 0$. Then $AA^{-1}=A^{-1}A=I$. For $2\times 2$: $\begin{pmatrix}a&b\\c&d\end{pmatrix}^{-1}=\dfrac{1}{ad-bc}\begin{pmatrix}d&-b\\-c&a\end{pmatrix}$.
Block matrices multiply blockwise when dimensions conform. A 2×2 block-diagonal inverse is $\mathrm{diag}(A,B)^{-1}=\mathrm{diag}(A^{-1},B^{-1})$.
Schur complement: if $M=\begin{pmatrix}A&B\\C&D\end{pmatrix}$ with $A$ invertible, the Schur complement is $M/A = D-CA^{-1}B$, and $\det M=\det A\cdot\det(M/A)$. This appears in Gaussian conditionals and in LDL factorizations.
Woodbury identity (low-rank updates, central to Kalman filtering and kernel methods): $$(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}.$$
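The identity is easy to sanity-check numerically (a minimal NumPy sketch with random, illustrative matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well-conditioned base matrix
U = rng.standard_normal((n, k))                      # low-rank factors, k << n
C = np.eye(k)
V = rng.standard_normal((k, n))

A_inv = np.linalg.inv(A)

# Woodbury: (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
inner = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
woodbury = A_inv - A_inv @ U @ inner @ V @ A_inv

direct = np.linalg.inv(A + U @ C @ V)
assert np.allclose(woodbury, direct)
```

The point of the identity is cost: once $A^{-1}$ is known, the update only requires a $k\times k$ inverse, which is why Kalman filters and recursive least squares stay cheap.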
| Norm | Definition | Use |
|---|---|---|
| Frobenius | $\|A\|_F=\sqrt{\sum_{ij}A_{ij}^2}=\sqrt{\mathrm{tr}(A^\top A)}$ | Default "size" of a matrix; matches $\ell_2$ on $\mathrm{vec}(A)$. |
| Spectral ($\ell_2$-op) | $\|A\|_2 = \sigma_{\max}(A) = \sup_{\|\mathbf{x}\|=1}\|A\mathbf{x}\|$ | Lipschitz constant of the linear map; controls stability. |
| Nuclear | $\|A\|_* = \sum_i \sigma_i(A)$ | Convex surrogate of rank (matrix completion). |
| $\ell_1$/$\ell_\infty$ induced | Max col-sum / max row-sum of $|A|$ | Easy bounds; used in numerical analysis. |
The span of vectors $\{\mathbf{v}_1,\dots,\mathbf{v}_k\}$ is the set of all their linear combinations. A set is linearly independent if $\sum c_i\mathbf{v}_i=\mathbf{0}\implies c_i=0$ for all $i$. A basis of a subspace is a linearly independent spanning set; its size is the dimension.
For $A\in\mathbb{R}^{m\times n}$, Strang's four subspaces are:
- **Column space** $\mathcal{C}(A)$: span of the columns. $A\mathbf{x}=\mathbf{b}$ is solvable $\iff \mathbf{b}\in\mathcal{C}(A)$. Dimension $= r$ (rank).
- **Null space** $\mathcal{N}(A)$: $\{\mathbf{x}:A\mathbf{x}=\mathbf{0}\}$. Dimension $n-r$. Orthogonal complement of $\mathcal{C}(A^\top)$.
- **Row space** $\mathcal{C}(A^\top)$: span of the rows. Dimension $= r$. Orthogonal to $\mathcal{N}(A)$.
- **Left null space** $\mathcal{N}(A^\top)$: $\{\mathbf{y}:A^\top\mathbf{y}=\mathbf{0}\}$. Dimension $m-r$. Orthogonal to $\mathcal{C}(A)$.
Rank is preserved under invertible maps: $\mathrm{rank}(PAQ)=\mathrm{rank}(A)$ for invertible $P,Q$. Also $\mathrm{rank}(AB)\le\min(\mathrm{rank}(A),\mathrm{rank}(B))$ and $\mathrm{rank}(A^\top A)=\mathrm{rank}(A)$.
The determinant $\det A$ of a square matrix is the signed $n$-dimensional volume of the parallelepiped spanned by its columns. Properties: $\det(AB)=\det A\,\det B$; $\det(A^\top)=\det A$; $\det(cA)=c^n\det A$; $\det A=\prod_i\lambda_i$; and $A$ is invertible $\iff\det A\neq 0$.
The trace is $\mathrm{tr}(A)=\sum_i A_{ii}=\sum_i \lambda_i$. Key identities: linearity $\mathrm{tr}(A+B)=\mathrm{tr}(A)+\mathrm{tr}(B)$; the cyclic property $\mathrm{tr}(ABC)=\mathrm{tr}(BCA)=\mathrm{tr}(CAB)$; $\mathrm{tr}(A^\top)=\mathrm{tr}(A)$; and the Frobenius inner product $\langle A,B\rangle_F=\mathrm{tr}(A^\top B)$.
If $A$ has $n$ linearly independent eigenvectors, it is diagonalizable: $$A=V\Lambda V^{-1},\qquad \Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_n),\quad V=[\mathbf{v}_1\,\cdots\,\mathbf{v}_n].$$ Then $A^k=V\Lambda^k V^{-1}$ and $f(A)=V f(\Lambda) V^{-1}$ for analytic $f$. This is why eigendecomposition accelerates matrix powers (Markov chains, PageRank, recurrent neural network — RNN — dynamics) and matrix functions (matrix exponentials).
- **Symmetric**: $A=A^\top$. Real eigenvalues; orthogonal eigenvectors. Diagonalizable as $A=Q\Lambda Q^\top$ with $Q$ orthogonal.
- **Orthogonal**: $Q^\top Q = I$. Preserves norms and angles; $\det Q=\pm1$; eigenvalues lie on the unit circle. Rotations/reflections.
- **Positive semidefinite (PSD)**: symmetric with $\mathbf{x}^\top A\mathbf{x}\ge 0\;\forall \mathbf{x}$. Eigenvalues $\lambda_i\ge 0$. Covariance matrices and kernel Gram matrices are PSD.
- **Positive definite (PD)**: symmetric with $\mathbf{x}^\top A\mathbf{x}>0\;\forall\mathbf{x}\neq\mathbf{0}$. All $\lambda_i>0$. Admits a unique Cholesky factorization $A=LL^\top$.
- **Projection**: $P^2=P$. Orthogonal if additionally $P=P^\top$; then $P=QQ^\top$ for orthonormal $Q$. Eigenvalues are 0 or 1.
- **Normal**: $AA^\top=A^\top A$. Unitarily diagonalizable. Includes symmetric, orthogonal, and skew-symmetric matrices.
The SVD is the single most important factorization in machine learning (ML). It encodes geometry (any linear map is rotate → scale axes → rotate), solves rank-deficient least squares, and powers principal component analysis (PCA), recommender systems, and model compression.
| Factorization | Form | Use |
|---|---|---|
| LU | $A=LU$ ($L$ unit lower, $U$ upper) | Solving $A\mathbf{x}=\mathbf{b}$ in $O(n^3)$; forward/back-substitution afterward is $O(n^2)$. |
| QR | $A=QR$ ($Q$ orthogonal, $R$ upper triangular) | Least-squares (numerically stable), Gram–Schmidt, eigenvalue algorithms. |
| Cholesky | $A=LL^\top$ (symmetric positive-definite — SPD — $A$) | Twice as fast as LU for SPD. Log-det, sampling from $\mathcal{N}(\mathbf{0},\Sigma)$ as $L\mathbf{z}$. |
| Eigen | $A=V\Lambda V^{-1}$ | Spectral analysis, diagonalizable dynamics, PCA on covariance. |
| SVD | $A=U\Sigma V^\top$ | Universal. PCA, low-rank approximation, pseudoinverse, analyzing non-square/rank-deficient maps. |
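The table's factorizations are each one NumPy call away; a sketch with random, illustrative matrices (LU lives in SciPy rather than NumPy, so it is omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))        # generic tall matrix
S = A.T @ A + 0.1 * np.eye(3)          # symmetric positive-definite

# QR: A = QR with Q^T Q = I, R upper triangular
Q, R = np.linalg.qr(A)

# Cholesky: S = L L^T, L lower triangular (requires SPD input)
L = np.linalg.cholesky(S)

# Eigen: S = V diag(w) V^T (eigh exploits symmetry, returns orthonormal V)
w, V = np.linalg.eigh(S)

# SVD: A = U diag(s) Vt, the universal factorization
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each factorization reconstructs its matrix
assert np.allclose(Q @ R, A)
assert np.allclose(L @ L.T, S)
assert np.allclose(V @ np.diag(w) @ V.T, S)
assert np.allclose(U @ np.diag(s) @ Vt, A)
```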
Given a subspace $\mathcal{S}=\mathcal{C}(A)$ with $A$ of full column rank, the orthogonal projection onto $\mathcal{S}$ is $$P=A(A^\top A)^{-1}A^\top,$$ which satisfies $P^2=P=P^\top$. Geometrically, $P\mathbf{b}$ is the closest point in $\mathcal{S}$ to $\mathbf{b}$.
For an overdetermined system $A\mathbf{x}\approx\mathbf{b}$ the normal equations are $$A^\top A\mathbf{x}=A^\top \mathbf{b}\quad\Longleftrightarrow\quad \mathbf{x}^* = (A^\top A)^{-1}A^\top\mathbf{b}.$$ Numerically, solve via QR: if $A=QR$, then $\mathbf{x}^*=R^{-1}Q^\top\mathbf{b}$. Squaring $A^\top A$ squares the condition number, so avoid forming it in ill-conditioned problems.
For any $A$ with SVD $A=U\Sigma V^\top$, the pseudoinverse is $$A^+ = V\Sigma^+ U^\top,\qquad \Sigma^+_{ii}=\begin{cases}1/\sigma_i & \sigma_i>0\\ 0 & \text{else}\end{cases}.$$ The minimum-norm least-squares solution to $A\mathbf{x}=\mathbf{b}$ is $\mathbf{x}^*=A^+\mathbf{b}$. When $A$ has full column rank, $A^+=(A^\top A)^{-1}A^\top$.
Adding $\ell_2$ regularization yields $$\mathbf{x}_\lambda = (A^\top A + \lambda I)^{-1}A^\top\mathbf{b},$$ which always has a unique solution (the matrix is PD for $\lambda>0$) and shrinks components along small singular directions: in SVD coordinates, each coefficient is scaled by $\sigma_i/(\sigma_i^2+\lambda)$.
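The filter-factor claim can be checked directly (a sketch with random data and an arbitrary $\lambda$): solving the regularized normal equations and applying the SVD scaling $\sigma_i/(\sigma_i^2+\lambda)$ give the same $\mathbf{x}_\lambda$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.5

# Direct ridge solution from the regularized normal equations
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

# SVD route: each SVD coefficient scaled by sigma_i / (sigma_i^2 + lam)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ (s / (s**2 + lam) * (U.T @ b))

assert np.allclose(x_direct, x_svd)
```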
Gradients drive training. Adopt numerator layout throughout: if $f:\mathbb{R}^n\to\mathbb{R}^m$, the Jacobian $\partial f/\partial \mathbf{x}\in\mathbb{R}^{m\times n}$; the gradient of a scalar is a row vector (or its transpose, at your preference — just be consistent).
In the table below, $\mathbf{x}\in\mathbb{R}^n$ is the variable, $\mathbf{a},\mathbf{b}\in\mathbb{R}^n$ are constant vectors, and $A,B$ are constant matrices of compatible size (so each expression is well-posed).
| Expression | Gradient w.r.t. $\mathbf{x}$ (or $A$) |
|---|---|
| $f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$ | $\mathbf{a}$ |
| $f(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x}$ | $(A+A^\top)\mathbf{x}$ (= $2A\mathbf{x}$ if symmetric) |
| $f(\mathbf{x}) = \|A\mathbf{x}-\mathbf{b}\|_2^2$ | $2A^\top(A\mathbf{x}-\mathbf{b})$ |
| $f(\mathbf{x}) = \|\mathbf{x}\|_2$ | $\mathbf{x}/\|\mathbf{x}\|_2$ (for $\mathbf{x}\neq\mathbf{0}$) |
| $f(\mathbf{x}) = \log\sum_i e^{x_i}$ (log-sum-exp) | $\mathrm{softmax}(\mathbf{x})$ |
| $f(A) = \mathrm{tr}(A^\top B)$ | $B$ |
| $f(A) = \mathrm{tr}(AB)$ | $B^\top$ |
| $f(A) = \log\det A$ (SPD) | $A^{-\top}$ |
| $f(A) = \det A$ | $\det(A)\cdot A^{-\top}$ (Jacobi formula) |
| $f(A) = \|A\|_F^2$ | $2A$ |
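Any cookbook entry can be validated with central finite differences; a minimal sketch for $\nabla\|A\mathbf{x}-\mathbf{b}\|_2^2 = 2A^\top(A\mathbf{x}-\mathbf{b})$ on random data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)   # f(x) = ||Ax - b||_2^2
grad = 2 * A.T @ (A @ x - b)             # cookbook gradient

# Central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(grad, fd, atol=1e-4)
```

The same pattern (perturb one coordinate, difference, compare) is the standard gradient-check used when debugging backpropagation.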
For $\mathbf{h}=g(\mathbf{x})$ and $\ell=f(\mathbf{h})$, the gradient $\nabla_\mathbf{x}\ell = J_g(\mathbf{x})^\top\,\nabla_\mathbf{h}\ell$. Backpropagation is just repeated application of this rule — with computation graph structure used so each Jacobian-vector product runs in cost proportional to the forward pass.
$H=\nabla^2 f\in\mathbb{R}^{n\times n}$ is the Jacobian of the gradient. It is symmetric (for $C^2$ functions). Convexity $\iff H\succeq 0$ everywhere. Newton's method: $\mathbf{x}_{k+1}=\mathbf{x}_k - H^{-1}\nabla f$.
The sample covariance from a data matrix $X\in\mathbb{R}^{n\times d}$ with centered rows is $\hat\Sigma = \tfrac{1}{n-1}X^\top X\in\mathbb{R}^{d\times d}$ — PSD always, and positive-definite generically when $n>d$. The same object powers PCA (§10), the Mahalanobis distance $\|\mathbf{x}-\boldsymbol\mu\|_\Sigma := \sqrt{(\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)}$, whitening transforms $\mathbf{z} = \Sigma^{-1/2}(\mathbf{x}-\boldsymbol\mu)$, and the Gaussian log-likelihood. In finance, $\Sigma$ is the asset-return covariance; portfolio variance for weights $\mathbf{w}$ is $\mathbf{w}^\top\Sigma\mathbf{w}$.
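A compact sketch of the pipeline (synthetic, illustrative data): center, form $\hat\Sigma$, and whiten with the symmetric inverse square root $\Sigma^{-1/2}$, after which the sample covariance is the identity.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated data

mu = X.mean(axis=0)
Xc = X - mu                                  # center the rows
Sigma = Xc.T @ Xc / (n - 1)                  # sample covariance, d x d

# Whitening via the symmetric inverse square root Sigma^{-1/2}
w, V = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
Z = Xc @ Sigma_inv_sqrt

# Whitened data has identity sample covariance (up to round-off)
Sigma_Z = Z.T @ Z / (n - 1)
assert np.allclose(Sigma_Z, np.eye(d), atol=1e-8)
```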
Three useful intuitions: (i) $k(\mathbf{x},\mathbf{x}')$ quantifies how alike $\mathbf{x}$ and $\mathbf{x}'$ are, with $k(\mathbf{x},\mathbf{x})=\|\phi(\mathbf{x})\|_\mathcal{H}^2$ the "self-similarity"; (ii) the PD condition says no weighted average of pairwise similarities can be negative — similarity has no "imaginary" mass; (iii) kernels let us lift non-linear problems into linear ones: running a linear method in feature space is the same as running a non-linear method in the original space, through $k$ alone.
New kernels are built from old by a small algebra — sums, positive-scalar multiples, products, and pointwise limits of PD kernels are PD; $k(\mathbf{x},\mathbf{x}') = k_1(\mathbf{x},\mathbf{x}')\cdot k_2(\mathbf{x},\mathbf{x}')$ corresponds to the tensor-product feature map. This closure is what makes compositional kernel design tractable.
Common kernels: linear $k(\mathbf{x},\mathbf{x}')=\mathbf{x}^\top\mathbf{x}'$; polynomial $k(\mathbf{x},\mathbf{x}')=(\mathbf{x}^\top\mathbf{x}'+c)^d$ with $c\ge 0$; and Gaussian/RBF $k(\mathbf{x},\mathbf{x}')=\exp\!\big(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\ell^2)\big)$ with lengthscale $\ell>0$.
The Gram matrix is the computational object that kernel methods (support vector machines, kernel ridge regression, Gaussian processes) manipulate instead of the possibly infinite-dimensional feature vectors — the "kernel trick". Training and prediction can often be written using only $K$ and its inverse, never touching $\phi(\mathbf{x})$.
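A minimal kernel ridge regression illustrating the point (assumes an RBF kernel and a toy 1-D sine target; all names and data are illustrative): training and prediction go through the Gram matrix $K$ alone, never an explicit feature map.

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

# Kernel ridge: alpha = (K + lam I)^{-1} y; predictions need only kernel values
lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 7)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha
```

Note that nothing in the fit touched a feature vector $\phi(\mathbf{x})$; the RBF kernel's feature space is infinite-dimensional, yet the computation stayed $40\times 40$.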
Equivalently, top-$k$ components come from truncated SVD of $\tilde X$. Variance explained by the first $k$ components is $\sum_{i=1}^k\lambda_i/\sum_i\lambda_i$.
Given expected returns $\boldsymbol\mu\in\mathbb{R}^d$, covariance $\Sigma\succ 0$, and target return $r$, minimize $\mathbf{w}^\top\Sigma\mathbf{w}$ subject to $\mathbf{1}^\top\mathbf{w}=1$ (weights sum to one) and $\boldsymbol\mu^\top\mathbf{w}=r$. Lagrangian yields a closed form where the optimal $\mathbf{w}^*$ is an affine function of $r$, with $\Sigma^{-1}\boldsymbol\mu$ and $\Sigma^{-1}\mathbf{1}$ as building blocks. With a risk-free asset of rate $r_f$, the tangency portfolio is $\propto \Sigma^{-1}(\boldsymbol\mu-r_f\mathbf{1})$.
To sample $\mathbf{x}\sim\mathcal{N}(\boldsymbol\mu,\Sigma)$ with $\Sigma=LL^\top$ (Cholesky): draw $\mathbf{z}\sim\mathcal{N}(\mathbf{0},I)$, set $\mathbf{x}=\boldsymbol\mu + L\mathbf{z}$. This underpins Monte Carlo simulation of correlated assets.
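A sketch of the recipe for two correlated assets (the means, volatilities, and correlation below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.05, 0.02])
Sigma = np.array([[0.04, 0.018],
                  [0.018, 0.09]])        # 20%/30% vols, correlation 0.3

L = np.linalg.cholesky(Sigma)           # Sigma = L L^T
Z = rng.standard_normal((100_000, 2))   # iid standard normal draws
X = mu + Z @ L.T                        # each row is mu + L z ~ N(mu, Sigma)

# Empirical moments should recover the inputs
emp_mu = X.mean(axis=0)
emp_Sigma = np.cov(X, rowvar=False)
```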
$\kappa_2(A)=\sigma_1/\sigma_n$ controls how errors amplify when solving $A\mathbf{x}=\mathbf{b}$: $\|\Delta\mathbf{x}\|/\|\mathbf{x}\|\lesssim \kappa_2(A)\cdot\|\Delta\mathbf{b}\|/\|\mathbf{b}\|$. In deep nets, poorly conditioned Hessians explain why vanilla stochastic gradient descent (SGD) struggles without preconditioning / adaptive methods.
Each exercise is tagged by difficulty: ★ warm-up, ★★ core, ★★★ stretch. Solutions are hidden behind a toggle — try before peeking.
$\mathbf{a}^\top\mathbf{a}=3$, $\mathbf{a}^\top\mathbf{b}=1\cdot 2 + 1\cdot 0+1\cdot 4=6$. The projection is $$P\mathbf{b}=\frac{\mathbf{a}^\top\mathbf{b}}{\mathbf{a}^\top\mathbf{a}}\mathbf{a}=\frac{6}{3}\mathbf{a}=(2,2,2)^\top.$$ Residual: $\mathbf{r}=\mathbf{b}-P\mathbf{b}=(0,-2,2)^\top$. Check $\mathbf{a}^\top\mathbf{r}=0-2+2=0$. ✓
Characteristic polynomial: $\det(A-\lambda I)=(4-\lambda)(3-\lambda)-2=\lambda^2-7\lambda+10=(\lambda-5)(\lambda-2)$, so $\lambda_1=5,\lambda_2=2$.
For $\lambda=5$: $(A-5I)\mathbf{v}=\mathbf{0}$ gives $-v_1+v_2=0$, so $\mathbf{v}_1=(1,1)^\top$.
For $\lambda=2$: $2v_1+v_2=0$, so $\mathbf{v}_2=(1,-2)^\top$.
Thus $A=V\Lambda V^{-1}$ with $V=\begin{pmatrix}1&1\\1&-2\end{pmatrix}$, $V^{-1}=-\tfrac{1}{3}\begin{pmatrix}-2&-1\\-1&1\end{pmatrix}=\tfrac{1}{3}\begin{pmatrix}2&1\\1&-1\end{pmatrix}$.
Write $\mathbf{x}_0=c_1\mathbf{v}_1+c_2\mathbf{v}_2$. Solving: $c_1+c_2=1,\,c_1-2c_2=0 \Rightarrow c_1=2/3,c_2=1/3$. Therefore $$A^{10}\mathbf{x}_0 = \tfrac{2}{3}\cdot 5^{10}\mathbf{v}_1 + \tfrac{1}{3}\cdot 2^{10}\mathbf{v}_2 = \left(\tfrac{2\cdot 5^{10}+2^{10}}{3},\,\tfrac{2\cdot 5^{10}-2\cdot 2^{10}}{3}\right)^\top.$$ Numerically: $5^{10}=9{,}765{,}625$, $2^{10}=1024$, giving exactly $(6{,}510{,}758,\;6{,}509{,}734)^\top$.
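The arithmetic can be double-checked in a few lines (reconstructing $A=\begin{pmatrix}4&1\\2&3\end{pmatrix}$ and $\mathbf{x}_0=(1,0)^\top$ from the solution steps above):

```python
import numpy as np

A = np.array([[4, 1],
              [2, 3]])
x0 = np.array([1, 0])

# Direct integer power vs. the eigendecomposition route A^10 x0 = V L^10 V^{-1} x0
direct = np.linalg.matrix_power(A, 10) @ x0

w, V = np.linalg.eig(A)
via_eig = (V @ np.diag(w**10) @ np.linalg.inv(V) @ x0).real

assert np.allclose(direct, via_eig)
assert np.array_equal(direct, [6_510_758, 6_509_734])
```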
Sylvester: leading minors must be positive. $M_{11}=2>0$ always. $\det M = 6-a^2 > 0 \Leftrightarrow |a|<\sqrt{6}$. So $M\succ 0$ iff $a\in(-\sqrt 6,\sqrt 6)$.
Using the cookbook: $\nabla\left(\tfrac12 \mathbf{x}^\top A\mathbf{x}\right)=A\mathbf{x}$ (since $A$ symmetric), $\nabla(-\mathbf{b}^\top\mathbf{x})=-\mathbf{b}$. So $\nabla f = A\mathbf{x}-\mathbf{b}$. Setting this to $\mathbf{0}$: $\mathbf{x}^*=A^{-1}\mathbf{b}$, valid when $A\succ 0$ (so $f$ is strictly convex and the minimizer is unique).
$A^\top A = \begin{pmatrix}25&0\\0&0\end{pmatrix}$ with eigenvalues $25,0$, so singular values are $\sigma_1=5,\sigma_2=0$. Right singular vector: $\mathbf{v}_1=(1,0)^\top$, $\mathbf{v}_2=(0,1)^\top$.
$\mathbf{u}_1=A\mathbf{v}_1/\sigma_1 = (3,4)^\top/5 = (3/5,4/5)^\top$. Complete to orthonormal basis: $\mathbf{u}_2=(-4/5,3/5)^\top$. Thus $$A=\underbrace{\begin{pmatrix}3/5&-4/5\\4/5&3/5\end{pmatrix}}_{U}\underbrace{\begin{pmatrix}5&0\\0&0\end{pmatrix}}_{\Sigma}\underbrace{\begin{pmatrix}1&0\\0&1\end{pmatrix}}_{V^\top}.$$ The matrix has rank 1, and the best rank-1 approximation is $A$ itself.
Apply Woodbury with $U=\mathbf{u}$, $C=1$, $V=\mathbf{v}^\top$: $$(A+\mathbf{u}\mathbf{v}^\top)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^\top A^{-1}}{1+\mathbf{v}^\top A^{-1}\mathbf{u}}.$$ The update is invertible iff the denominator $1+\mathbf{v}^\top A^{-1}\mathbf{u}\neq 0$. This formula costs $O(n^2)$ per rank-1 update versus $O(n^3)$ for re-inversion — hence its role in Kalman filters and recursive least squares.
Substitute the SVD. $A^\top A = V\Sigma^\top\Sigma V^\top = V D V^\top$ where $D=\mathrm{diag}(\sigma_i^2)$. Then $A^\top A+\lambda I = V(D+\lambda I)V^\top$, so $(A^\top A+\lambda I)^{-1} = V (D+\lambda I)^{-1} V^\top$.
Also $A^\top\mathbf{b} = V\Sigma^\top U^\top\mathbf{b}$, with $\Sigma^\top U^\top\mathbf{b}$ the vector whose $i$th entry is $\sigma_i\,\mathbf{u}_i^\top\mathbf{b}$. Combining, $$\mathbf{x}_\lambda = V\cdot\mathrm{diag}\!\left(\frac{\sigma_i}{\sigma_i^2+\lambda}\right)\cdot U^\top \mathbf{b} = \sum_i \frac{\sigma_i}{\sigma_i^2+\lambda}(\mathbf{u}_i^\top\mathbf{b})\mathbf{v}_i.$$
The factor $\sigma_i/(\sigma_i^2+\lambda)$ approaches $1/\sigma_i$ for $\sigma_i\gg\sqrt\lambda$ (behaves like ordinary least squares — OLS) and goes to $\sigma_i/\lambda\to 0$ for $\sigma_i\ll\sqrt\lambda$: ridge damps directions where data carry little signal, which is why it tames ill-conditioning and reduces variance.
Diagonalize $A=Q\Lambda Q^\top$. Substitute $\mathbf{y}=Q^\top\mathbf{x}$; the Rayleigh quotient becomes $\mathbf{y}^\top\Lambda\mathbf{y}/\mathbf{y}^\top\mathbf{y} = \sum \lambda_i y_i^2 / \sum y_i^2$. This is a convex combination of eigenvalues, maximized by putting all mass on the $y_i$ with the largest $\lambda_i$. So the max is $\lambda_{\max}$, achieved at $\mathbf{y}=\mathbf{e}_{i^*}$, i.e. $\mathbf{x}$ proportional to the top eigenvector. Analogously the min is $\lambda_{\min}$.
ML connection: the first PCA direction maximizes variance $\mathbf{v}^\top\Sigma\mathbf{v}/\mathbf{v}^\top\mathbf{v}$, so it's the top eigenvector of $\Sigma$.
Write $S=\sum_j e^{z_j}$. Then $\partial p_i/\partial z_k = \partial/\partial z_k (e^{z_i}/S)$. For $i=k$: $p_i - p_i^2 = p_i(1-p_i)$. For $i\neq k$: $-e^{z_i}e^{z_k}/S^2 = -p_i p_k$. Compactly, $J_{ik}=p_i(\delta_{ik}-p_k)$, where $\delta_{ik}$ is the Kronecker delta ($=1$ if $i=k$, else $0$). This is $\mathrm{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^\top$. Note it is symmetric PSD and satisfies $J\mathbf{1}=\mathbf{0}$ — here $\mathbf{1}$ is the all-ones vector — so softmax is invariant to adding the same constant to every logit.
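The Jacobian formula and the shift-invariance both check out numerically (illustrative logits):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0, 0.3])
p = softmax(z)
J = np.diag(p) - np.outer(p, p)    # J_ik = p_i (delta_ik - p_k)

# Finite-difference check: column k approximates dp/dz_k
eps = 1e-6
J_fd = np.stack([
    (softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
    for e in np.eye(4)
], axis=1)
assert np.allclose(J, J_fd, atol=1e-6)

# J 1 = 0, and shifting every logit by a constant leaves p unchanged
assert np.allclose(J @ np.ones(4), 0)
assert np.allclose(softmax(z + 7.0), p)
```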
$\mathrm{Cov}(\mathbf{r}) = \sigma_f^2\boldsymbol\beta\boldsymbol\beta^\top + D$.
With $A=D$, $\mathbf{u}=\sigma_f\boldsymbol\beta$, $\mathbf{v}=\sigma_f\boldsymbol\beta$, Sherman–Morrison gives $$\Sigma^{-1} = D^{-1} - \frac{\sigma_f^2\,D^{-1}\boldsymbol\beta\boldsymbol\beta^\top D^{-1}}{1+\sigma_f^2\boldsymbol\beta^\top D^{-1}\boldsymbol\beta}.$$ Cost: $O(n)$ once $D^{-1}\boldsymbol\beta$ is computed. Generalizes to $k$-factor models via Woodbury, replacing $\sigma_f^2\boldsymbol\beta\boldsymbol\beta^\top$ by $B\,\Sigma_f B^\top$, where $B\in\mathbb{R}^{n\times k}$ is the loadings matrix (column $j$ gives each asset's exposure to factor $j$) and $\Sigma_f\in\mathbb{R}^{k\times k}$ is the factor-return covariance. This is exactly why risk-model inversions are cheap in production.
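A sketch of the one-factor case (random, illustrative betas and idiosyncratic variances): the Sherman–Morrison inverse uses only vector operations plus a scalar denominator, and agrees with the dense inverse.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
beta = rng.normal(1.0, 0.3, n)          # factor loadings (betas)
sigma_f = 0.2                           # factor volatility
d = rng.uniform(0.1, 0.4, n) ** 2       # idiosyncratic variances (diagonal of D)

Sigma = sigma_f**2 * np.outer(beta, beta) + np.diag(d)

# Sherman-Morrison: O(n) work once D^{-1} beta is in hand
Dinv_beta = beta / d
denom = 1.0 + sigma_f**2 * beta @ Dinv_beta
Sigma_inv = np.diag(1.0 / d) - (sigma_f**2 / denom) * np.outer(Dinv_beta, Dinv_beta)

assert np.allclose(Sigma_inv @ Sigma, np.eye(n), atol=1e-8)
```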
The language of change — from gradient flows that train networks to stochastic differential equations (SDEs) that price derivatives.
An ordinary differential equation (ODE) relates a function $y(t)$ to its derivatives. A first-order ODE has the form $y'=f(t,y)$. Under mild conditions (Picard–Lindelöf: $f$ Lipschitz in $y$), a unique local solution exists given $y(t_0)=y_0$.
- **Separable**: $y' = g(t)h(y)$. Write $\frac{dy}{h(y)}=g(t)\,dt$ and integrate both sides.
- **Linear first-order**: $y'+p(t)y=q(t)$. Integrating factor $\mu(t)=e^{\int p\,dt}$ gives $(\mu y)' = \mu q$, so $y=\mu^{-1}\int\mu q\,dt+C\mu^{-1}$.
- **Exact**: $M(t,y)\,dt+N(t,y)\,dy=0$ with $\partial M/\partial y=\partial N/\partial t$. A potential $F$ exists with $F_t=M,\;F_y=N$; the implicit solution is $F(t,y)=C$.
Exponential decay / growth: $y'=\lambda y\Rightarrow y(t)=y_0 e^{\lambda t}$.
Logistic: $y'=ry(1-y/K)\Rightarrow y(t)=K/(1+(K/y_0-1)e^{-rt})$. Models population and sigmoidal activations in continuous time.
Newton's cooling: $y'=-k(y-y_\infty)\Rightarrow y=y_\infty+(y_0-y_\infty)e^{-kt}$. The same first-order linear template appears everywhere: exponential moving averages, Ornstein–Uhlenbeck drift, resistor–capacitor (RC) circuits.
Consider $y''+py'+qy=g(t)$ with constants $p,q$. The general solution is $y=y_h+y_p$ where $y_h$ solves the homogeneous equation and $y_p$ is any particular solution.
Try $y=e^{rt}$. Characteristic equation: $r^2+pr+q=0$, with roots $r_{1,2}=\tfrac{-p\pm\sqrt{p^2-4q}}{2}$.
| Discriminant | Roots | General solution $y_h$ |
|---|---|---|
| $p^2-4q>0$ | $r_1,r_2$ real, distinct | $C_1 e^{r_1 t}+C_2 e^{r_2 t}$ |
| $p^2-4q=0$ | $r$ repeated | $(C_1+C_2 t)e^{rt}$ |
| $p^2-4q<0$ | $\alpha\pm i\beta$ | $e^{\alpha t}(C_1\cos\beta t+C_2\sin\beta t)$ |
$my'' + cy' + ky = 0$. With $\omega_0^2 = k/m$, $\zeta = c/(2\sqrt{mk})$: underdamped for $\zeta<1$ (oscillatory decay at frequency $\omega_0\sqrt{1-\zeta^2}$), critically damped for $\zeta=1$ (fastest non-oscillatory return), overdamped for $\zeta>1$ (slow, non-oscillatory decay).
This framework classifies momentum-based optimizers (heavy-ball, Nesterov) viewed as discretizations of damped oscillators.
A linear system $\mathbf{x}'(t) = A\mathbf{x}(t)$ with $A\in\mathbb{R}^{n\times n}$ has the unique solution $\mathbf{x}(t) = e^{At}\mathbf{x}(0)$, where the matrix exponential is $$e^{At} = \sum_{k=0}^\infty \frac{(At)^k}{k!}.$$
When $A$ is diagonalizable, $A=V\Lambda V^{-1}$ and $e^{At}=V e^{\Lambda t} V^{-1}$ with $e^{\Lambda t}=\mathrm{diag}(e^{\lambda_i t})$. The solution decomposes into modes: writing $\mathbf{x}(0)=\sum c_i \mathbf{v}_i$, $$\mathbf{x}(t)=\sum_i c_i e^{\lambda_i t}\mathbf{v}_i.$$ Each eigenvalue is a mode: $\mathrm{Re}(\lambda_i)>0$ means growth, $<0$ decay, $\mathrm{Im}(\lambda_i)\neq 0$ oscillation.
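Both routes to $e^{At}$ (eigendecomposition and the defining series) agree, and the modes decay as the eigenvalues dictate; a sketch with an illustrative stable matrix:

```python
import numpy as np

A = np.array([[-1.0, 2.0],
              [0.0, -3.0]])            # eigenvalues -1 and -3: two decaying modes
t = 0.5

# Route 1: diagonalize, exponentiate the eigenvalues
w, V = np.linalg.eig(A)
expAt_eig = (V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V)).real

# Route 2: truncated Taylor series sum_k (At)^k / k!
expAt_series = np.zeros_like(A)
term = np.eye(2)
for k in range(30):
    expAt_series += term
    term = term @ (A * t) / (k + 1)

assert np.allclose(expAt_eig, expAt_series)

# x(t) = e^{At} x(0) contracts since both eigenvalues have negative real part
x0 = np.array([1.0, 1.0])
xt = expAt_eig @ x0
assert np.linalg.norm(xt) < np.linalg.norm(x0)
```

(In production, use `scipy.linalg.expm` rather than a hand-rolled series.)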
For $\mathbf{x}'=A\mathbf{x}+\mathbf{g}(t)$, the variation-of-parameters formula gives $$\mathbf{x}(t)=e^{At}\mathbf{x}(0)+\int_0^t e^{A(t-s)}\mathbf{g}(s)\,ds.$$
If $A$ is not diagonalizable, use Jordan form $A=PJP^{-1}$ where $J$ is block-diagonal with Jordan blocks $J_k = \lambda I + N$, $N$ nilpotent. Then $e^{J_k t}=e^{\lambda t}\sum_{j=0}^{k-1}\tfrac{(Nt)^j}{j!}$ — finite sum.
An equilibrium $\mathbf{x}^*$ of $\mathbf{x}'=\mathbf{f}(\mathbf{x})$ satisfies $\mathbf{f}(\mathbf{x}^*)=\mathbf{0}$. Linearize: $\mathbf{y}=\mathbf{x}-\mathbf{x}^*$ evolves approximately as $\mathbf{y}'=J\mathbf{y}$ where $J=\partial\mathbf{f}/\partial\mathbf{x}\big|_{\mathbf{x}^*}$.
For a 2×2 Jacobian with trace $\tau$ and determinant $\Delta$, the equilibrium is:
| Sign pattern | Type | Stability |
|---|---|---|
| $\Delta>0,\tau<0,\tau^2>4\Delta$ | Stable node | Attracting |
| $\Delta>0,\tau<0,\tau^2<4\Delta$ | Stable spiral | Attracting oscillator |
| $\Delta>0,\tau>0$ | Unstable node/spiral | Repelling |
| $\Delta<0$ | Saddle | Unstable (hyperbolic) |
| $\Delta>0,\tau=0$ | Center | Neutrally stable (linear) |
Three archetypal linear second-order partial differential equations (PDEs) span most applications in physics, engineering, and quant.
- **Heat equation**: $u_t = \alpha u_{xx}$ where $\alpha>0$ is the diffusivity (thermal conductivity divided by volumetric heat capacity). Fundamental solution on $\mathbb{R}$: $K(x,t)=\tfrac{1}{\sqrt{4\pi\alpha t}}e^{-x^2/(4\alpha t)}$ (the heat kernel). For general initial condition $u_0$, $u(x,t)=(K(\cdot,t)*u_0)(x)$ where $*$ denotes convolution. The Black–Scholes PDE reduces to the heat equation under a change of variables.
- **Wave equation**: $u_{tt} = c^2 u_{xx}$ where $c>0$ is the propagation speed. d'Alembert: $u(x,t)=\tfrac{1}{2}[u_0(x-ct)+u_0(x+ct)] + \tfrac{1}{2c}\int_{x-ct}^{x+ct}v_0(s)\,ds$, given initial position $u_0$ and velocity $v_0$. Finite speed of propagation $c$.
- **Laplace / Poisson**: $\Delta u = 0$ (Laplace) or $\Delta u = f$ (Poisson). Models steady-state diffusion, electrostatics, harmonic problems. Solutions inherit strong regularity (harmonic functions satisfy the mean-value property).
A Brownian motion (or Wiener process) $W_t$ is a continuous stochastic process with $W_0=0$, independent increments, and $W_t - W_s \sim \mathcal{N}(0,t-s)$. Its paths are continuous but nowhere differentiable; informally, $dW_t \sim \mathcal{N}(0,dt)$.
An Itô SDE has the form $$dX_t = \mu(X_t,t)\,dt + \sigma(X_t,t)\,dW_t,$$ with drift $\mu$ and diffusion $\sigma$. Existence and uniqueness hold under Lipschitz and linear-growth conditions on $\mu,\sigma$.
For a function $f(x,t)$ that is twice continuously differentiable in $x$ and once in $t$ (written $f\in C^{2,1}$), and $X_t$ as above, $$df(X_t,t) = \left(f_t + \mu f_x + \tfrac{1}{2}\sigma^2 f_{xx}\right) dt + \sigma f_x\, dW_t.$$ The extra $\tfrac{1}{2}\sigma^2 f_{xx}$ term is the hallmark of Itô calculus; it arises because $(dW_t)^2 = dt$ in the mean-square sense.
Geometric Brownian motion (GBM): $dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$. Apply Itô to $\log S_t$: $$d(\log S_t) = (\mu - \tfrac{1}{2}\sigma^2)\,dt + \sigma\, dW_t.$$ Hence $S_t = S_0 \exp\big((\mu-\tfrac{1}{2}\sigma^2)t + \sigma W_t\big)$, lognormally distributed. This is the Black–Scholes stock model.
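The closed form means GBM can be simulated exactly at any horizon, with no time-stepping error; a Monte Carlo sketch (illustrative parameters) recovering the lognormal mean $\mathbb{E}[S_T]=S_0 e^{\mu T}$:

```python
import numpy as np

rng = np.random.default_rng(8)
S0, mu, sigma, T = 100.0, 0.07, 0.2, 1.0
n_paths = 200_000

# Exact simulation: S_T = S0 exp((mu - sigma^2/2) T + sigma W_T), W_T ~ N(0, T)
W_T = np.sqrt(T) * rng.standard_normal(n_paths)
S_T = S0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * W_T)

# Lognormal mean: the -sigma^2/2 drift correction makes E[S_T] = S0 e^{mu T}
mean_theory = S0 * np.exp(mu * T)
assert abs(S_T.mean() - mean_theory) / mean_theory < 0.01
```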
Under the risk-neutral measure (with constant risk-free rate $r$), the price $V(S,t)$ of a European option written on an asset $S_t$ following $dS=rS\,dt+\sigma S\,dW$ satisfies $$V_t + rS V_S + \tfrac{1}{2}\sigma^2 S^2 V_{SS} - rV = 0,$$ with terminal condition $V(S,T)=\mathrm{payoff}(S)$ at maturity $T$. For a European call, the closed-form solution is the celebrated Black–Scholes formula.
Ornstein–Uhlenbeck (OU) process: $dX_t = \theta(\bar X - X_t)\,dt + \sigma\,dW_t$, where $\theta>0$ is the mean-reversion speed, $\bar X$ is the long-run mean, and $\sigma>0$ is the diffusion amplitude. Mean-reverting; stationary distribution $\mathcal{N}(\bar X,\sigma^2/(2\theta))$. Used for interest-rate models (Vasicek) and noise schedules in diffusion generative models.
We approximate the solution of $\mathbf{x}'=\mathbf{f}(\mathbf{x},t)$ on a grid $t_0<t_1<\cdots$ with step $h=t_{n+1}-t_n$.

| Method | Update / idea | Notes |
|---|---|---|
| Explicit Euler | $\mathbf{x}_{n+1}=\mathbf{x}_n + h\mathbf{f}(\mathbf{x}_n,t_n)$ | Order 1. Simple, cheap, poor stability for stiff problems. |
| Implicit Euler | $\mathbf{x}_{n+1}=\mathbf{x}_n + h\mathbf{f}(\mathbf{x}_{n+1},t_{n+1})$ | Order 1 but $A$-stable. Requires solving a (possibly nonlinear) system per step. |
| Midpoint / RK2 | Evaluate $\mathbf{f}$ at the midpoint of a trial Euler step | Second-order Runge–Kutta (RK2). Order 2. |
| RK4 | Classical 4-stage Runge–Kutta (RK4) | Order 4. The workhorse for non-stiff, moderate-accuracy problems. |
| Adaptive (Dormand–Prince) | Embedded Runge–Kutta pair RK4(5) with automatic step-size control | Used in `scipy.integrate.solve_ivp`. |
| Symplectic (Verlet, leapfrog) | Preserve energy for Hamiltonian systems over long integrations | Backbone of Hamiltonian Monte Carlo (HMC) samplers. |

Stability and stiffness: a method is $A$-stable if for the test equation $y'=\lambda y$ with $\mathrm{Re}(\lambda)<0$, $y_n\to 0$ for any step $h>0$. Explicit methods have bounded stability regions; stiff problems (wide spread of time scales) demand implicit methods or exponential integrators.

SDE schemes: Euler–Maruyama, $X_{n+1}=X_n+\mu(X_n,t_n)\,h+\sigma(X_n,t_n)\sqrt{h}\,Z_n$ with $Z_n\sim\mathcal{N}(0,1)$, has strong order $1/2$; the Milstein scheme adds the correction $\tfrac{1}{2}\sigma\sigma'\big((\sqrt{h}\,Z_n)^2-h\big)$ and achieves strong order 1.
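The order gap between the simplest and the standard method is dramatic even on a toy problem; a minimal sketch comparing explicit Euler and RK4 on $y'=-y$, $y(0)=1$ (illustrative step size):

```python
import numpy as np

def euler_step(f, x, t, h):
    # Explicit Euler: one slope evaluation, order 1
    return x + h * f(x, t)

def rk4_step(f, x, t, h):
    # Classical 4-stage Runge-Kutta, order 4
    k1 = f(x, t)
    k2 = f(x + h/2 * k1, t + h/2)
    k3 = f(x + h/2 * k2, t + h/2)
    k4 = f(x + h * k3, t + h)
    return x + h/6 * (k1 + 2*k2 + 2*k3 + k4)

f = lambda x, t: -x                     # test problem, exact solution e^{-t}

def integrate(step, h, T=1.0):
    x, t = 1.0, 0.0
    while t < T - 1e-12:
        x = step(f, x, t, h)
        t += h
    return x

exact = np.exp(-1.0)
err_euler = abs(integrate(euler_step, 0.01) - exact)
err_rk4 = abs(integrate(rk4_step, 0.01) - exact)
assert err_rk4 < err_euler / 1000       # RK4's higher order dominates at small h
```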
The continuous-time limit of gradient descent is $\dot{\mathbf{x}} = -\nabla f(\mathbf{x})$. For a quadratic $f(\mathbf{x})=\tfrac{1}{2}\mathbf{x}^\top A\mathbf{x}$ with $A\succ 0$, $\mathbf{x}(t) = e^{-At}\mathbf{x}(0)$. Decay along eigendirection $\mathbf{v}_i$ has rate $\lambda_i$: small eigenvalues give slow modes, which is why preconditioning (rescaling the dynamics by $A^{-1}$) flattens the convergence landscape.
Heavy-ball descent, in the continuous limit, becomes $\ddot{\mathbf{x}} + \gamma \dot{\mathbf{x}} + \nabla f(\mathbf{x}) = 0$, where $\gamma>0$ is the damping coefficient. Optimal tuning critically damps: $\gamma = 2\sqrt{\mu}$ for a $\mu$-strongly convex $f$ (i.e., $\nabla^2 f \succeq \mu I$ with strong-convexity constant $\mu>0$). This gives the accelerated $O(\sqrt{\kappa}\log(1/\varepsilon))$ rate of Nesterov's method, where $\kappa = L/\mu$ is the condition number ($L$ being the Lipschitz constant of $\nabla f$) and $\varepsilon$ is the target accuracy.
A residual block $\mathbf{h}_{n+1}=\mathbf{h}_n + f_\theta(\mathbf{h}_n)$ is one Euler step of $\dot{\mathbf{h}} = f_\theta(\mathbf{h},t)$. Neural ODEs replace finite depth with the flow of an ODE solved by an adaptive integrator, with gradients via the adjoint method (itself an ODE).
The forward noising process is an SDE $d\mathbf{x}_t = -\tfrac{1}{2}\beta_t\mathbf{x}_t\,dt + \sqrt{\beta_t}\,d\mathbf{W}_t$ (an OU with time-varying coefficient), where $\beta_t>0$ is a prescribed noise schedule controlling how fast data is corrupted toward pure noise, and $\mathbf{W}_t$ is a standard multidimensional Brownian motion. Let $p_t(\mathbf{x})$ denote the marginal density of $\mathbf{x}_t$. The reverse-time SDE (Anderson, 1982) exists and has the form $d\mathbf{x}_t = [-\tfrac{1}{2}\beta_t\mathbf{x}_t - \beta_t\nabla_{\mathbf{x}}\log p_t(\mathbf{x})]\,dt + \sqrt{\beta_t}\,d\bar{\mathbf{W}}_t$, where $\bar{\mathbf{W}}_t$ is a Brownian motion running backward in time. The quantity $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ is called the score. Training a score network to approximate it then lets us integrate the reverse SDE (or the deterministic probability-flow ODE with the same marginals) to generate samples.
Separate: $\dfrac{dy}{y(1-y)} = dt$. Partial fractions $\dfrac{1}{y(1-y)} = \dfrac{1}{y}+\dfrac{1}{1-y}$. Integrate: $\ln|y|-\ln|1-y| = t+C$, so $y/(1-y)=Ae^t$. Initial condition: $A=1$. Hence $y(t)=\dfrac{e^t}{1+e^t}=\sigma(t)$ — the standard sigmoid. This is the continuous-time logistic and shows up as the generalized-linear-model (GLM) link function.
Integrating factor $\mu = e^{2t}$. Then $(e^{2t}y)' = e^{2t}\cdot e^{-t} = e^t$, so $e^{2t}y = e^t + C \Rightarrow y = e^{-t} + Ce^{-2t}$. Using $y(0)=0$: $1+C=0\Rightarrow C=-1$. Therefore $y(t) = e^{-t}-e^{-2t}$.
Characteristic: $r^2+4r+5=0 \Rightarrow r=-2\pm i$. Underdamped ($\zeta = 4/(2\sqrt 5) = 2/\sqrt 5 < 1$). General solution $y = e^{-2t}(C_1\cos t + C_2 \sin t)$.
$y(0)=C_1=1$. $y'(0) = -2C_1 + C_2 = 0 \Rightarrow C_2 = 2$. So $y(t) = e^{-2t}(\cos t + 2\sin t)$.
Eigenvalues $\pm i$ with eigenvectors $(1,\mp i)^\top$. Since $A^2=-I$, the Taylor series splits into even/odd parts: $$e^{At} = I\cos t + A\sin t = \begin{pmatrix}\cos t & -\sin t\\ \sin t & \cos t\end{pmatrix}.$$ So $e^{At}$ rotates vectors counterclockwise by angle $t$. With initial condition $(1,0)^\top$: $\mathbf{x}(t)=(\cos t,\sin t)^\top$.
Trace $\tau=-6$, determinant $\Delta=9-1=8$. $\tau^2-4\Delta = 36-32 = 4 > 0$, $\tau<0$, $\Delta>0$ → stable node.
Eigenvalues: $\lambda = -3\pm 1 \in\{-2,-4\}$. Eigenvectors: for $-2$, $(1,1)^\top$; for $-4$, $(1,-1)^\top$. Trajectories approach the origin tangent to the slow direction $(1,1)^\top$ (corresponding to the larger $\lambda=-2$).
$\mathbf{x}(t) = e^{-At}\mathbf{x}(0) = (e^{-t}, e^{-100t})^\top$. The second coordinate decays $100\times$ faster than the first. After $t$ large enough to kill the fast mode, the residual error is dominated by the slow mode — convergence of the loss is governed by $\lambda_{\min}(A)=1$, and the condition number $\kappa(A)=100$ measures how much worse this is than the isotropic case. In discrete time, steepest descent with step $\eta$ converges as $(1-\eta\lambda_{\min})^k$ while requiring $\eta\le 2/\lambda_{\max}$, hence the well-known $(\kappa-1)/(\kappa+1)$ contraction.
With $f(x)=x^2$: $f_x = 2x$, $f_{xx}=2$. Itô: $dY_t = (2S_t\cdot\mu S_t + \tfrac{1}{2}\cdot 2\cdot \sigma^2 S_t^2)\,dt + 2S_t\cdot \sigma S_t\,dW_t$, i.e. $$dY_t = (2\mu + \sigma^2)Y_t\,dt + 2\sigma Y_t\,dW_t.$$ So $Y_t$ is itself GBM with drift $2\mu+\sigma^2$ and volatility $2\sigma$. Note the extra $\sigma^2$ from the Itô correction.
This is linear. Use variation of parameters: $X_t = e^{-\theta t}x_0 + \sigma\int_0^t e^{-\theta(t-s)}\,dW_s$.
The Itô integral is a mean-zero Gaussian. So $\mathbb{E}[X_t] = e^{-\theta t}x_0$ and, by the Itô isometry — which states $\mathbb{E}\big[(\int_0^t g(s)\,dW_s)^2\big] = \int_0^t g(s)^2\,ds$ for deterministic $g$ — $$\mathrm{Var}(X_t) = \sigma^2\int_0^t e^{-2\theta(t-s)}\,ds = \frac{\sigma^2}{2\theta}(1 - e^{-2\theta t}).$$
As $t\to\infty$ (and $\theta>0$), $X_t \Rightarrow \mathcal{N}(0,\sigma^2/(2\theta))$ — the stationary distribution. This is the noise law used in Langevin samplers and score-based generative models.
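An Euler–Maruyama simulation of the same OU process (illustrative parameters, $\bar X=0$) lands on the stationary law after a few mean-reversion times:

```python
import numpy as np

rng = np.random.default_rng(9)
theta, sigma = 2.0, 0.5
h, n_steps, n_paths = 0.01, 2000, 20_000

# Euler-Maruyama for dX = -theta X dt + sigma dW, starting from X_0 = 0
X = np.zeros(n_paths)
for _ in range(n_steps):
    X += -theta * X * h + sigma * np.sqrt(h) * rng.standard_normal(n_paths)

# After t = 20 >> 1/theta, X is close to the stationary N(0, sigma^2 / (2 theta))
var_stationary = sigma**2 / (2 * theta)    # = 0.0625
assert abs(X.var() - var_stationary) < 0.01
assert abs(X.mean()) < 0.01
```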
Start from $V_t + rSV_S + \tfrac{1}{2}\sigma^2 S^2 V_{SS} - rV = 0$. With $x=\log S$: $V_S = V_x/S$, $V_{SS} = (V_{xx}-V_x)/S^2$. Substituting, $$V_t + rV_x + \tfrac{1}{2}\sigma^2(V_{xx}-V_x) - rV = 0 \Rightarrow V_t + (r-\tfrac{1}{2}\sigma^2)V_x + \tfrac{1}{2}\sigma^2 V_{xx} - rV = 0.$$ Now set $\tau = T - t$ (so $V_t = -V_\tau$) and $u = e^{r\tau}V$ (so $V = e^{-r\tau}u$, $V_t = -V_\tau = e^{-r\tau}(ru - u_\tau)$). After simplification, $$u_\tau = (r-\tfrac{1}{2}\sigma^2)u_x + \tfrac{1}{2}\sigma^2 u_{xx}.$$ A further shift $y = x + (r-\tfrac{1}{2}\sigma^2)\tau$ kills the drift and leaves the canonical heat equation $u_\tau = \tfrac{1}{2}\sigma^2 u_{yy}$, which is solved by Gaussian convolution with the terminal payoff — yielding the Black–Scholes formula.
Take $V(\mathbf{x}) = f(\mathbf{x}) - f(\mathbf{0})\ge 0$, with $V(\mathbf{0})=0$ and $V(\mathbf{x})\ge \tfrac{\mu}{2}\|\mathbf{x}\|^2$ by strong convexity.
Along trajectories $\dot V = \nabla f\cdot \dot{\mathbf{x}} = -\|\nabla f\|^2\le 0$. By the Polyak–Łojasiewicz (PL) inequality — which relates gradient magnitude to suboptimality and is implied here by $\mu$-strong convexity — $$\|\nabla f(\mathbf{x})\|^2\ge 2\mu\big(f(\mathbf{x})-f^*\big)=2\mu\, V(\mathbf{x}),$$ where $f^* = \min_\mathbf{x} f(\mathbf{x}) = f(\mathbf{0})$ is the optimal value.
Therefore $\dot V \le -2\mu V$, which gives $V(\mathbf{x}(t))\le V(\mathbf{x}(0))e^{-2\mu t}$ by Grönwall's inequality (integrating the differential inequality $\dot V \le -2\mu V$). Combining with the strong-convexity lower bound $V(\mathbf{x})\ge\tfrac{\mu}{2}\|\mathbf{x}\|^2$, one gets $\|\mathbf{x}(t)\|\le C e^{-\mu t}$ for some constant $C>0$: linear (exponential) convergence. This is the continuous-time analogue of the well-known discrete linear rate for gradient descent under strong convexity.