The mathematical core every machine-learning and quantitative researcher returns to — distilled, annotated, and paired with worked exercises.
The grammar of machine learning — where every model is a map between vector spaces.
A vector $\mathbf{x}\in\mathbb{R}^n$ is an ordered tuple $\mathbf{x}=(x_1,\dots,x_n)^\top$. Vectors live in a vector space: a set closed under addition and scalar multiplication with the usual axioms. In ML, data points, gradients, and parameters are all vectors; in quant finance, asset returns, portfolio weights, and yield-curve shifts are vectors.
The Euclidean inner product of $\mathbf{x},\mathbf{y}\in\mathbb{R}^n$ is $$\langle\mathbf{x},\mathbf{y}\rangle \;=\; \mathbf{x}^\top\mathbf{y} \;=\; \sum_{i=1}^n x_i y_i \;=\; \|\mathbf{x}\|\,\|\mathbf{y}\|\cos\theta.$$ It measures alignment: orthogonality $\iff \langle\mathbf{x},\mathbf{y}\rangle=0$. The Cauchy–Schwarz inequality states $|\langle\mathbf{x},\mathbf{y}\rangle|\le\|\mathbf{x}\|\,\|\mathbf{y}\|$.
$\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$. The default in regression, kernels, optimization.
$\|\mathbf{x}\|_1 = \sum_i |x_i|$. Induces sparsity (Lasso).
$\|\mathbf{x}\|_\infty = \max_i |x_i|$. Worst-case / minimax bounds.
$\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$, convex for $p\ge 1$.
A matrix $A\in\mathbb{R}^{m\times n}$ represents a linear map $\mathbb{R}^n\to\mathbb{R}^m$ via $\mathbf{x}\mapsto A\mathbf{x}$. Column $j$ of $A$ is the image of $\mathbf{e}_j$, the $j$th standard basis vector. Matrix multiplication $AB$ composes maps; it is associative, distributive, but not commutative.
Transpose: $(A^\top)_{ij}=A_{ji}$. Properties: $(AB)^\top=B^\top A^\top$, $(A^\top)^{-1}=(A^{-1})^\top$.
Inverse: $A^{-1}$ exists iff $A$ is square and $\det A\neq 0$. Then $AA^{-1}=A^{-1}A=I$. For $2\times 2$: $\begin{pmatrix}a&b\\c&d\end{pmatrix}^{-1}=\dfrac{1}{ad-bc}\begin{pmatrix}d&-b\\-c&a\end{pmatrix}$.
Block matrices multiply blockwise when dimensions conform. A 2×2 block-diagonal inverse is $\mathrm{diag}(A,B)^{-1}=\mathrm{diag}(A^{-1},B^{-1})$.
Schur complement: if $M=\begin{pmatrix}A&B\\C&D\end{pmatrix}$ with $A$ invertible, the Schur complement is $M/A = D-CA^{-1}B$, and $\det M=\det A\cdot\det(M/A)$. This appears in Gaussian conditionals and in LDL factorizations.
Woodbury identity (low-rank updates, central to Kalman filtering and kernel methods): $$(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}.$$
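The identity is easy to sanity-check numerically (a minimal NumPy sketch with random, illustrative matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well-conditioned base matrix
U = rng.standard_normal((n, k))                      # low-rank factors, k << n
C = np.eye(k)
V = rng.standard_normal((k, n))

A_inv = np.linalg.inv(A)

# Woodbury: (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
inner = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
woodbury = A_inv - A_inv @ U @ inner @ V @ A_inv

direct = np.linalg.inv(A + U @ C @ V)
assert np.allclose(woodbury, direct)
```

The point of the identity is cost: once $A^{-1}$ is known, the update only requires a $k\times k$ inverse, which is why Kalman filters and recursive least squares stay cheap.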
| Norm | Definition | Use |
|---|---|---|
| Frobenius | $\|A\|_F=\sqrt{\sum_{ij}A_{ij}^2}=\sqrt{\mathrm{tr}(A^\top A)}$ | Default "size" of a matrix; matches $\ell_2$ on $\mathrm{vec}(A)$. |
| Spectral ($\ell_2$-op) | $\|A\|_2 = \sigma_{\max}(A) = \sup_{\|\mathbf{x}\|=1}\|A\mathbf{x}\|$ | Lipschitz constant of the linear map; controls stability. |
| Nuclear | $\|A\|_* = \sum_i \sigma_i(A)$ | Convex surrogate of rank (matrix completion). |
| $\ell_1$/$\ell_\infty$ induced | Max col-sum / max row-sum of $|A|$ | Easy bounds; used in numerical analysis. |
The span of vectors $\{\mathbf{v}_1,\dots,\mathbf{v}_k\}$ is the set of all their linear combinations. A set is linearly independent if $\sum c_i\mathbf{v}_i=\mathbf{0}\implies c_i=0$ for all $i$. A basis of a subspace is a linearly independent spanning set; its size is the dimension.
For $A\in\mathbb{R}^{m\times n}$, Strang's four subspaces are:
- **Column space** $\mathcal{C}(A)$: span of the columns. $A\mathbf{x}=\mathbf{b}$ is solvable $\iff \mathbf{b}\in\mathcal{C}(A)$. Dimension $= r$ (rank).
- **Null space** $\mathcal{N}(A)$: $\{\mathbf{x}:A\mathbf{x}=\mathbf{0}\}$. Dimension $n-r$. Orthogonal complement of $\mathcal{C}(A^\top)$.
- **Row space** $\mathcal{C}(A^\top)$: span of the rows. Dimension $= r$. Orthogonal to $\mathcal{N}(A)$.
- **Left null space** $\mathcal{N}(A^\top)$: $\{\mathbf{y}:A^\top\mathbf{y}=\mathbf{0}\}$. Dimension $m-r$. Orthogonal to $\mathcal{C}(A)$.
Rank is preserved under invertible maps: $\mathrm{rank}(PAQ)=\mathrm{rank}(A)$ for invertible $P,Q$. Also $\mathrm{rank}(AB)\le\min(\mathrm{rank}(A),\mathrm{rank}(B))$ and $\mathrm{rank}(A^\top A)=\mathrm{rank}(A)$.
The determinant $\det A$ of a square matrix is the signed $n$-dimensional volume of the parallelepiped spanned by its columns. Properties: $\det(AB)=\det A\,\det B$; $\det(A^\top)=\det A$; $\det(cA)=c^n\det A$; $\det A=\prod_i\lambda_i$; and $A$ is invertible $\iff\det A\neq 0$.
The trace is $\mathrm{tr}(A)=\sum_i A_{ii}=\sum_i \lambda_i$. Key identities: linearity $\mathrm{tr}(A+B)=\mathrm{tr}(A)+\mathrm{tr}(B)$; the cyclic property $\mathrm{tr}(ABC)=\mathrm{tr}(BCA)=\mathrm{tr}(CAB)$; $\mathrm{tr}(A^\top)=\mathrm{tr}(A)$; and the Frobenius inner product $\langle A,B\rangle_F=\mathrm{tr}(A^\top B)$.
If $A$ has $n$ linearly independent eigenvectors, it is diagonalizable: $$A=V\Lambda V^{-1},\qquad \Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_n),\quad V=[\mathbf{v}_1\,\cdots\,\mathbf{v}_n].$$ Then $A^k=V\Lambda^k V^{-1}$ and $f(A)=V f(\Lambda) V^{-1}$ for analytic $f$. This is why eigendecomposition accelerates matrix powers (Markov chains, PageRank, recurrent neural network — RNN — dynamics) and matrix functions (matrix exponentials).
- **Symmetric**: $A=A^\top$. Real eigenvalues; orthogonal eigenvectors. Diagonalizable as $A=Q\Lambda Q^\top$ with $Q$ orthogonal.
- **Orthogonal**: $Q^\top Q = I$. Preserves norms and angles; $\det Q=\pm1$; eigenvalues lie on the unit circle. Rotations/reflections.
- **Positive semidefinite (PSD)**: symmetric with $\mathbf{x}^\top A\mathbf{x}\ge 0\;\forall \mathbf{x}$. Eigenvalues $\lambda_i\ge 0$. Covariance matrices and kernel Gram matrices are PSD.
- **Positive definite (PD)**: symmetric with $\mathbf{x}^\top A\mathbf{x}>0\;\forall\mathbf{x}\neq\mathbf{0}$. All $\lambda_i>0$. Admits a unique Cholesky factorization $A=LL^\top$.
- **Projection**: $P^2=P$. Orthogonal if additionally $P=P^\top$; then $P=QQ^\top$ for orthonormal $Q$. Eigenvalues are 0 or 1.
- **Normal**: $AA^\top=A^\top A$. Unitarily diagonalizable. Includes symmetric, orthogonal, and skew-symmetric matrices.
The SVD is the single most important factorization in machine learning (ML). It encodes geometry (any linear map is rotate → scale axes → rotate), solves rank-deficient least squares, and powers principal component analysis (PCA), recommender systems, and model compression.
| Factorization | Form | Use |
|---|---|---|
| LU | $A=LU$ ($L$ unit lower, $U$ upper) | Solving $A\mathbf{x}=\mathbf{b}$ in $O(n^3)$; forward/back-substitution afterward is $O(n^2)$. |
| QR | $A=QR$ ($Q$ orthogonal, $R$ upper triangular) | Least-squares (numerically stable), Gram–Schmidt, eigenvalue algorithms. |
| Cholesky | $A=LL^\top$ (symmetric positive-definite — SPD — $A$) | Twice as fast as LU for SPD. Log-det, sampling from $\mathcal{N}(\mathbf{0},\Sigma)$ as $L\mathbf{z}$. |
| Eigen | $A=V\Lambda V^{-1}$ | Spectral analysis, diagonalizable dynamics, PCA on covariance. |
| SVD | $A=U\Sigma V^\top$ | Universal. PCA, low-rank approximation, pseudoinverse, analyzing non-square/rank-deficient maps. |
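The table's factorizations are each one NumPy call away; a sketch with random, illustrative matrices (LU lives in SciPy rather than NumPy, so it is omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))        # generic tall matrix
S = A.T @ A + 0.1 * np.eye(3)          # symmetric positive-definite

# QR: A = QR with Q^T Q = I, R upper triangular
Q, R = np.linalg.qr(A)

# Cholesky: S = L L^T, L lower triangular (requires SPD input)
L = np.linalg.cholesky(S)

# Eigen: S = V diag(w) V^T (eigh exploits symmetry, returns orthonormal V)
w, V = np.linalg.eigh(S)

# SVD: A = U diag(s) Vt, the universal factorization
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each factorization reconstructs its matrix
assert np.allclose(Q @ R, A)
assert np.allclose(L @ L.T, S)
assert np.allclose(V @ np.diag(w) @ V.T, S)
assert np.allclose(U @ np.diag(s) @ Vt, A)
```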
Given a subspace $\mathcal{S}=\mathcal{C}(A)$ with $A$ of full column rank, the orthogonal projection onto $\mathcal{S}$ is $$P=A(A^\top A)^{-1}A^\top,$$ which satisfies $P^2=P=P^\top$. Geometrically, $P\mathbf{b}$ is the closest point in $\mathcal{S}$ to $\mathbf{b}$.
For an overdetermined system $A\mathbf{x}\approx\mathbf{b}$ the normal equations are $$A^\top A\mathbf{x}=A^\top \mathbf{b}\quad\Longleftrightarrow\quad \mathbf{x}^* = (A^\top A)^{-1}A^\top\mathbf{b}.$$ Numerically, solve via QR: if $A=QR$, then $\mathbf{x}^*=R^{-1}Q^\top\mathbf{b}$. Squaring $A^\top A$ squares the condition number, so avoid forming it in ill-conditioned problems.
For any $A$ with SVD $A=U\Sigma V^\top$, the pseudoinverse is $$A^+ = V\Sigma^+ U^\top,\qquad \Sigma^+_{ii}=\begin{cases}1/\sigma_i & \sigma_i>0\\ 0 & \text{else}\end{cases}.$$ The minimum-norm least-squares solution to $A\mathbf{x}=\mathbf{b}$ is $\mathbf{x}^*=A^+\mathbf{b}$. When $A$ has full column rank, $A^+=(A^\top A)^{-1}A^\top$.
Adding $\ell_2$ regularization yields $$\mathbf{x}_\lambda = (A^\top A + \lambda I)^{-1}A^\top\mathbf{b},$$ which always has a unique solution (the matrix is PD for $\lambda>0$) and shrinks components along small singular directions: in SVD coordinates, each coefficient is scaled by $\sigma_i/(\sigma_i^2+\lambda)$.
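The filter-factor claim can be checked directly (a sketch with random data and an arbitrary $\lambda$): solving the regularized normal equations and applying the SVD scaling $\sigma_i/(\sigma_i^2+\lambda)$ give the same $\mathbf{x}_\lambda$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.5

# Direct ridge solution from the regularized normal equations
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

# SVD route: each SVD coefficient scaled by sigma_i / (sigma_i^2 + lam)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ (s / (s**2 + lam) * (U.T @ b))

assert np.allclose(x_direct, x_svd)
```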
Gradients drive training. Adopt numerator layout throughout: if $f:\mathbb{R}^n\to\mathbb{R}^m$, the Jacobian $\partial f/\partial \mathbf{x}\in\mathbb{R}^{m\times n}$; the gradient of a scalar is a row vector (or its transpose, at your preference — just be consistent).
In the table below, $\mathbf{x}\in\mathbb{R}^n$ is the variable, $\mathbf{a},\mathbf{b}\in\mathbb{R}^n$ are constant vectors, and $A,B$ are constant matrices of compatible size (so each expression is well-posed).
| Expression | Gradient w.r.t. $\mathbf{x}$ (or $A$) |
|---|---|
| $f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$ | $\mathbf{a}$ |
| $f(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x}$ | $(A+A^\top)\mathbf{x}$ (= $2A\mathbf{x}$ if symmetric) |
| $f(\mathbf{x}) = \|A\mathbf{x}-\mathbf{b}\|_2^2$ | $2A^\top(A\mathbf{x}-\mathbf{b})$ |
| $f(\mathbf{x}) = \|\mathbf{x}\|_2$ | $\mathbf{x}/\|\mathbf{x}\|_2$ (for $\mathbf{x}\neq\mathbf{0}$) |
| $f(\mathbf{x}) = \log\sum_i e^{x_i}$ (log-sum-exp) | $\mathrm{softmax}(\mathbf{x})$ |
| $f(A) = \mathrm{tr}(A^\top B)$ | $B$ |
| $f(A) = \mathrm{tr}(AB)$ | $B^\top$ |
| $f(A) = \log\det A$ (SPD) | $A^{-\top}$ |
| $f(A) = \det A$ | $\det(A)\cdot A^{-\top}$ (Jacobi formula) |
| $f(A) = \|A\|_F^2$ | $2A$ |
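Any cookbook entry can be validated with central finite differences; a minimal sketch for $\nabla\|A\mathbf{x}-\mathbf{b}\|_2^2 = 2A^\top(A\mathbf{x}-\mathbf{b})$ on random data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)   # f(x) = ||Ax - b||_2^2
grad = 2 * A.T @ (A @ x - b)             # cookbook gradient

# Central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(grad, fd, atol=1e-4)
```

The same pattern (perturb one coordinate, difference, compare) is the standard gradient-check used when debugging backpropagation.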
For $\mathbf{h}=g(\mathbf{x})$ and $\ell=f(\mathbf{h})$, the gradient $\nabla_\mathbf{x}\ell = J_g(\mathbf{x})^\top\,\nabla_\mathbf{h}\ell$. Backpropagation is just repeated application of this rule — with computation graph structure used so each Jacobian-vector product runs in cost proportional to the forward pass.
$H=\nabla^2 f\in\mathbb{R}^{n\times n}$ is the Jacobian of the gradient. It is symmetric (for $C^2$ functions). Convexity $\iff H\succeq 0$ everywhere. Newton's method: $\mathbf{x}_{k+1}=\mathbf{x}_k - H^{-1}\nabla f$.
The sample covariance from a data matrix $X\in\mathbb{R}^{n\times d}$ with centered rows is $\hat\Sigma = \tfrac{1}{n-1}X^\top X\in\mathbb{R}^{d\times d}$ — PSD always, and positive-definite generically when $n>d$. The same object powers PCA (§10), the Mahalanobis distance $\|\mathbf{x}-\boldsymbol\mu\|_\Sigma := \sqrt{(\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)}$, whitening transforms $\mathbf{z} = \Sigma^{-1/2}(\mathbf{x}-\boldsymbol\mu)$, and the Gaussian log-likelihood. In finance, $\Sigma$ is the asset-return covariance; portfolio variance for weights $\mathbf{w}$ is $\mathbf{w}^\top\Sigma\mathbf{w}$.
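A compact sketch of the pipeline (synthetic, illustrative data): center, form $\hat\Sigma$, and whiten with the symmetric inverse square root $\Sigma^{-1/2}$, after which the sample covariance is the identity.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated data

mu = X.mean(axis=0)
Xc = X - mu                                  # center the rows
Sigma = Xc.T @ Xc / (n - 1)                  # sample covariance, d x d

# Whitening via the symmetric inverse square root Sigma^{-1/2}
w, V = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
Z = Xc @ Sigma_inv_sqrt

# Whitened data has identity sample covariance (up to round-off)
Sigma_Z = Z.T @ Z / (n - 1)
assert np.allclose(Sigma_Z, np.eye(d), atol=1e-8)
```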
Three useful intuitions: (i) $k(\mathbf{x},\mathbf{x}')$ quantifies how alike $\mathbf{x}$ and $\mathbf{x}'$ are, with $k(\mathbf{x},\mathbf{x})=\|\phi(\mathbf{x})\|_\mathcal{H}^2$ the "self-similarity"; (ii) the PD condition says no weighted average of pairwise similarities can be negative — similarity has no "imaginary" mass; (iii) kernels let us lift non-linear problems into linear ones: running a linear method in feature space is the same as running a non-linear method in the original space, through $k$ alone.
New kernels are built from old by a small algebra — sums, positive-scalar multiples, products, and pointwise limits of PD kernels are PD; $k(\mathbf{x},\mathbf{x}') = k_1(\mathbf{x},\mathbf{x}')\cdot k_2(\mathbf{x},\mathbf{x}')$ corresponds to the tensor-product feature map. This closure is what makes compositional kernel design tractable.
Common kernels: linear $k(\mathbf{x},\mathbf{x}')=\mathbf{x}^\top\mathbf{x}'$; polynomial $k(\mathbf{x},\mathbf{x}')=(\mathbf{x}^\top\mathbf{x}'+c)^d$ with $c\ge 0$; and Gaussian/RBF $k(\mathbf{x},\mathbf{x}')=\exp\!\big(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\ell^2)\big)$ with lengthscale $\ell>0$.
The Gram matrix is the computational object that kernel methods (support vector machines, kernel ridge regression, Gaussian processes) manipulate instead of the possibly infinite-dimensional feature vectors — the "kernel trick". Training and prediction can often be written using only $K$ and its inverse, never touching $\phi(\mathbf{x})$.
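A minimal kernel ridge regression illustrating the point (assumes an RBF kernel and a toy 1-D sine target; all names and data are illustrative): training and prediction go through the Gram matrix $K$ alone, never an explicit feature map.

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

# Kernel ridge: alpha = (K + lam I)^{-1} y; predictions need only kernel values
lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 7)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha
```

Note that nothing in the fit touched a feature vector $\phi(\mathbf{x})$; the RBF kernel's feature space is infinite-dimensional, yet the computation stayed $40\times 40$.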
Equivalently, top-$k$ components come from truncated SVD of $\tilde X$. Variance explained by the first $k$ components is $\sum_{i=1}^k\lambda_i/\sum_i\lambda_i$.
Given expected returns $\boldsymbol\mu\in\mathbb{R}^d$, covariance $\Sigma\succ 0$, and target return $r$, minimize $\mathbf{w}^\top\Sigma\mathbf{w}$ subject to $\mathbf{1}^\top\mathbf{w}=1$ (weights sum to one) and $\boldsymbol\mu^\top\mathbf{w}=r$. Lagrangian yields a closed form where the optimal $\mathbf{w}^*$ is an affine function of $r$, with $\Sigma^{-1}\boldsymbol\mu$ and $\Sigma^{-1}\mathbf{1}$ as building blocks. With a risk-free asset of rate $r_f$, the tangency portfolio is $\propto \Sigma^{-1}(\boldsymbol\mu-r_f\mathbf{1})$.
To sample $\mathbf{x}\sim\mathcal{N}(\boldsymbol\mu,\Sigma)$ with $\Sigma=LL^\top$ (Cholesky): draw $\mathbf{z}\sim\mathcal{N}(\mathbf{0},I)$, set $\mathbf{x}=\boldsymbol\mu + L\mathbf{z}$. This underpins Monte Carlo simulation of correlated assets.
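A sketch of the recipe for two correlated assets (the means, volatilities, and correlation below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.05, 0.02])
Sigma = np.array([[0.04, 0.018],
                  [0.018, 0.09]])        # 20%/30% vols, correlation 0.3

L = np.linalg.cholesky(Sigma)           # Sigma = L L^T
Z = rng.standard_normal((100_000, 2))   # iid standard normal draws
X = mu + Z @ L.T                        # each row is mu + L z ~ N(mu, Sigma)

# Empirical moments should recover the inputs
emp_mu = X.mean(axis=0)
emp_Sigma = np.cov(X, rowvar=False)
```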
$\kappa_2(A)=\sigma_1/\sigma_n$ controls how errors amplify when solving $A\mathbf{x}=\mathbf{b}$: $\|\Delta\mathbf{x}\|/\|\mathbf{x}\|\lesssim \kappa_2(A)\cdot\|\Delta\mathbf{b}\|/\|\mathbf{b}\|$. In deep nets, poorly conditioned Hessians explain why vanilla stochastic gradient descent (SGD) struggles without preconditioning / adaptive methods.
Each exercise is tagged by difficulty: ★ warm-up, ★★ core, ★★★ stretch. Solutions are hidden behind a toggle — try before peeking.
$\mathbf{a}^\top\mathbf{a}=3$, $\mathbf{a}^\top\mathbf{b}=1\cdot 2 + 1\cdot 0+1\cdot 4=6$. The projection is $$P\mathbf{b}=\frac{\mathbf{a}^\top\mathbf{b}}{\mathbf{a}^\top\mathbf{a}}\mathbf{a}=\frac{6}{3}\mathbf{a}=(2,2,2)^\top.$$ Residual: $\mathbf{r}=\mathbf{b}-P\mathbf{b}=(0,-2,2)^\top$. Check $\mathbf{a}^\top\mathbf{r}=0-2+2=0$. ✓
Characteristic polynomial: $\det(A-\lambda I)=(4-\lambda)(3-\lambda)-2=\lambda^2-7\lambda+10=(\lambda-5)(\lambda-2)$, so $\lambda_1=5,\lambda_2=2$.
For $\lambda=5$: $(A-5I)\mathbf{v}=\mathbf{0}$ gives $-v_1+v_2=0$, so $\mathbf{v}_1=(1,1)^\top$.
For $\lambda=2$: $2v_1+v_2=0$, so $\mathbf{v}_2=(1,-2)^\top$.
Thus $A=V\Lambda V^{-1}$ with $V=\begin{pmatrix}1&1\\1&-2\end{pmatrix}$, $V^{-1}=-\tfrac{1}{3}\begin{pmatrix}-2&-1\\-1&1\end{pmatrix}=\tfrac{1}{3}\begin{pmatrix}2&1\\1&-1\end{pmatrix}$.
Write $\mathbf{x}_0=c_1\mathbf{v}_1+c_2\mathbf{v}_2$. Solving: $c_1+c_2=1,\,c_1-2c_2=0 \Rightarrow c_1=2/3,c_2=1/3$. Therefore $$A^{10}\mathbf{x}_0 = \tfrac{2}{3}\cdot 5^{10}\mathbf{v}_1 + \tfrac{1}{3}\cdot 2^{10}\mathbf{v}_2 = \left(\tfrac{2\cdot 5^{10}+2^{10}}{3},\,\tfrac{2\cdot 5^{10}-2\cdot 2^{10}}{3}\right)^\top.$$ Numerically: $5^{10}=9{,}765{,}625$, $2^{10}=1024$, giving exactly $(6{,}510{,}758,\;6{,}509{,}734)^\top$.
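The arithmetic can be double-checked in a few lines (reconstructing $A=\begin{pmatrix}4&1\\2&3\end{pmatrix}$ and $\mathbf{x}_0=(1,0)^\top$ from the solution steps above):

```python
import numpy as np

A = np.array([[4, 1],
              [2, 3]])
x0 = np.array([1, 0])

# Direct integer power vs. the eigendecomposition route A^10 x0 = V L^10 V^{-1} x0
direct = np.linalg.matrix_power(A, 10) @ x0

w, V = np.linalg.eig(A)
via_eig = (V @ np.diag(w**10) @ np.linalg.inv(V) @ x0).real

assert np.allclose(direct, via_eig)
assert np.array_equal(direct, [6_510_758, 6_509_734])
```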
Sylvester: leading minors must be positive. $M_{11}=2>0$ always. $\det M = 6-a^2 > 0 \Leftrightarrow |a|<\sqrt{6}$. So $M\succ 0$ iff $a\in(-\sqrt 6,\sqrt 6)$.
Using the cookbook: $\nabla\left(\tfrac12 \mathbf{x}^\top A\mathbf{x}\right)=A\mathbf{x}$ (since $A$ symmetric), $\nabla(-\mathbf{b}^\top\mathbf{x})=-\mathbf{b}$. So $\nabla f = A\mathbf{x}-\mathbf{b}$. Setting this to $\mathbf{0}$: $\mathbf{x}^*=A^{-1}\mathbf{b}$, valid when $A\succ 0$ (so $f$ is strictly convex and the minimizer is unique).
$A^\top A = \begin{pmatrix}25&0\\0&0\end{pmatrix}$ with eigenvalues $25,0$, so singular values are $\sigma_1=5,\sigma_2=0$. Right singular vector: $\mathbf{v}_1=(1,0)^\top$, $\mathbf{v}_2=(0,1)^\top$.
$\mathbf{u}_1=A\mathbf{v}_1/\sigma_1 = (3,4)^\top/5 = (3/5,4/5)^\top$. Complete to orthonormal basis: $\mathbf{u}_2=(-4/5,3/5)^\top$. Thus $$A=\underbrace{\begin{pmatrix}3/5&-4/5\\4/5&3/5\end{pmatrix}}_{U}\underbrace{\begin{pmatrix}5&0\\0&0\end{pmatrix}}_{\Sigma}\underbrace{\begin{pmatrix}1&0\\0&1\end{pmatrix}}_{V^\top}.$$ The matrix has rank 1, and the best rank-1 approximation is $A$ itself.
Apply Woodbury with $U=\mathbf{u}$, $C=1$, $V=\mathbf{v}^\top$: $$(A+\mathbf{u}\mathbf{v}^\top)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^\top A^{-1}}{1+\mathbf{v}^\top A^{-1}\mathbf{u}}.$$ The update is invertible iff the denominator $1+\mathbf{v}^\top A^{-1}\mathbf{u}\neq 0$. This formula costs $O(n^2)$ per rank-1 update versus $O(n^3)$ for re-inversion — hence its role in Kalman filters and recursive least squares.
Substitute the SVD. $A^\top A = V\Sigma^\top\Sigma V^\top = V D V^\top$ where $D=\mathrm{diag}(\sigma_i^2)$. Then $A^\top A+\lambda I = V(D+\lambda I)V^\top$, so $(A^\top A+\lambda I)^{-1} = V (D+\lambda I)^{-1} V^\top$.
Also $A^\top\mathbf{b} = V\Sigma^\top U^\top\mathbf{b}$, with $\Sigma^\top U^\top\mathbf{b}$ the vector whose $i$th entry is $\sigma_i\,\mathbf{u}_i^\top\mathbf{b}$. Combining, $$\mathbf{x}_\lambda = V\cdot\mathrm{diag}\!\left(\frac{\sigma_i}{\sigma_i^2+\lambda}\right)\cdot U^\top \mathbf{b} = \sum_i \frac{\sigma_i}{\sigma_i^2+\lambda}(\mathbf{u}_i^\top\mathbf{b})\mathbf{v}_i.$$
The factor $\sigma_i/(\sigma_i^2+\lambda)$ approaches $1/\sigma_i$ for $\sigma_i\gg\sqrt\lambda$ (behaves like ordinary least squares — OLS) and goes to $\sigma_i/\lambda\to 0$ for $\sigma_i\ll\sqrt\lambda$: ridge damps directions where data carry little signal, which is why it tames ill-conditioning and reduces variance.
Diagonalize $A=Q\Lambda Q^\top$. Substitute $\mathbf{y}=Q^\top\mathbf{x}$; the Rayleigh quotient becomes $\mathbf{y}^\top\Lambda\mathbf{y}/\mathbf{y}^\top\mathbf{y} = \sum \lambda_i y_i^2 / \sum y_i^2$. This is a convex combination of eigenvalues, maximized by putting all mass on the $y_i$ with the largest $\lambda_i$. So the max is $\lambda_{\max}$, achieved at $\mathbf{y}=\mathbf{e}_{i^*}$, i.e. $\mathbf{x}$ proportional to the top eigenvector. Analogously the min is $\lambda_{\min}$.
ML connection: the first PCA direction maximizes variance $\mathbf{v}^\top\Sigma\mathbf{v}/\mathbf{v}^\top\mathbf{v}$, so it's the top eigenvector of $\Sigma$.
Write $S=\sum_j e^{z_j}$. Then $\partial p_i/\partial z_k = \partial/\partial z_k (e^{z_i}/S)$. For $i=k$: $p_i - p_i^2 = p_i(1-p_i)$. For $i\neq k$: $-e^{z_i}e^{z_k}/S^2 = -p_i p_k$. Compactly, $J_{ik}=p_i(\delta_{ik}-p_k)$, where $\delta_{ik}$ is the Kronecker delta ($=1$ if $i=k$, else $0$). This is $\mathrm{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^\top$. Note it is symmetric PSD and satisfies $J\mathbf{1}=\mathbf{0}$ — here $\mathbf{1}$ is the all-ones vector — so softmax is invariant to adding the same constant to every logit.
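The Jacobian formula and the shift-invariance both check out numerically (illustrative logits):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0, 0.3])
p = softmax(z)
J = np.diag(p) - np.outer(p, p)    # J_ik = p_i (delta_ik - p_k)

# Finite-difference check: column k approximates dp/dz_k
eps = 1e-6
J_fd = np.stack([
    (softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
    for e in np.eye(4)
], axis=1)
assert np.allclose(J, J_fd, atol=1e-6)

# J 1 = 0, and shifting every logit by a constant leaves p unchanged
assert np.allclose(J @ np.ones(4), 0)
assert np.allclose(softmax(z + 7.0), p)
```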
$\mathrm{Cov}(\mathbf{r}) = \sigma_f^2\boldsymbol\beta\boldsymbol\beta^\top + D$.
With $A=D$, $\mathbf{u}=\sigma_f\boldsymbol\beta$, $\mathbf{v}=\sigma_f\boldsymbol\beta$, Sherman–Morrison gives $$\Sigma^{-1} = D^{-1} - \frac{\sigma_f^2\,D^{-1}\boldsymbol\beta\boldsymbol\beta^\top D^{-1}}{1+\sigma_f^2\boldsymbol\beta^\top D^{-1}\boldsymbol\beta}.$$ Cost: $O(n)$ once $D^{-1}\boldsymbol\beta$ is computed. Generalizes to $k$-factor models via Woodbury, replacing $\sigma_f^2\boldsymbol\beta\boldsymbol\beta^\top$ by $B\,\Sigma_f B^\top$, where $B\in\mathbb{R}^{n\times k}$ is the loadings matrix (column $j$ gives each asset's exposure to factor $j$) and $\Sigma_f\in\mathbb{R}^{k\times k}$ is the factor-return covariance. This is exactly why risk-model inversions are cheap in production.
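A sketch of the one-factor case (random, illustrative betas and idiosyncratic variances): the Sherman–Morrison inverse uses only vector operations plus a scalar denominator, and agrees with the dense inverse.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
beta = rng.normal(1.0, 0.3, n)          # factor loadings (betas)
sigma_f = 0.2                           # factor volatility
d = rng.uniform(0.1, 0.4, n) ** 2       # idiosyncratic variances (diagonal of D)

Sigma = sigma_f**2 * np.outer(beta, beta) + np.diag(d)

# Sherman-Morrison: O(n) work once D^{-1} beta is in hand
Dinv_beta = beta / d
denom = 1.0 + sigma_f**2 * beta @ Dinv_beta
Sigma_inv = np.diag(1.0 / d) - (sigma_f**2 / denom) * np.outer(Dinv_beta, Dinv_beta)

assert np.allclose(Sigma_inv @ Sigma, np.eye(n), atol=1e-8)
```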
The language of change — from gradient flows that train networks to stochastic differential equations (SDEs) that price derivatives.
An ordinary differential equation (ODE) relates a function $y(t)$ to its derivatives. A first-order ODE has the form $y'=f(t,y)$. Under mild conditions (Picard–Lindelöf: $f$ Lipschitz in $y$), a unique local solution exists given $y(t_0)=y_0$.
- **Separable**: $y' = g(t)h(y)$. Write $\frac{dy}{h(y)}=g(t)\,dt$ and integrate both sides.
- **Linear first-order**: $y'+p(t)y=q(t)$. Integrating factor $\mu(t)=e^{\int p\,dt}$ gives $(\mu y)' = \mu q$, so $y=\mu^{-1}\int\mu q\,dt+C\mu^{-1}$.
- **Exact**: $M(t,y)\,dt+N(t,y)\,dy=0$ with $\partial M/\partial y=\partial N/\partial t$. A potential $F$ exists with $F_t=M,\;F_y=N$; the implicit solution is $F(t,y)=C$.
Exponential decay / growth: $y'=\lambda y\Rightarrow y(t)=y_0 e^{\lambda t}$.
Logistic: $y'=ry(1-y/K)\Rightarrow y(t)=K/(1+(K/y_0-1)e^{-rt})$. Models population and sigmoidal activations in continuous time.
Newton's cooling: $y'=-k(y-y_\infty)\Rightarrow y=y_\infty+(y_0-y_\infty)e^{-kt}$. The same first-order linear template appears everywhere: exponential moving averages, Ornstein–Uhlenbeck drift, resistor–capacitor (RC) circuits.
Consider $y''+py'+qy=g(t)$ with constants $p,q$. The general solution is $y=y_h+y_p$ where $y_h$ solves the homogeneous equation and $y_p$ is any particular solution.
Try $y=e^{rt}$. Characteristic equation: $r^2+pr+q=0$, with roots $r_{1,2}=\tfrac{-p\pm\sqrt{p^2-4q}}{2}$.
| Discriminant | Roots | General solution $y_h$ |
|---|---|---|
| $p^2-4q>0$ | $r_1,r_2$ real, distinct | $C_1 e^{r_1 t}+C_2 e^{r_2 t}$ |
| $p^2-4q=0$ | $r$ repeated | $(C_1+C_2 t)e^{rt}$ |
| $p^2-4q<0$ | $\alpha\pm i\beta$ | $e^{\alpha t}(C_1\cos\beta t+C_2\sin\beta t)$ |
$my'' + cy' + ky = 0$. With $\omega_0^2 = k/m$, $\zeta = c/(2\sqrt{mk})$: underdamped for $\zeta<1$ (oscillatory decay at frequency $\omega_0\sqrt{1-\zeta^2}$), critically damped for $\zeta=1$ (fastest non-oscillatory return), overdamped for $\zeta>1$ (slow, non-oscillatory decay).
This framework classifies momentum-based optimizers (heavy-ball, Nesterov) viewed as discretizations of damped oscillators.
A linear system $\mathbf{x}'(t) = A\mathbf{x}(t)$ with $A\in\mathbb{R}^{n\times n}$ has the unique solution $\mathbf{x}(t) = e^{At}\mathbf{x}(0)$, where the matrix exponential is $$e^{At} = \sum_{k=0}^\infty \frac{(At)^k}{k!}.$$
When $A$ is diagonalizable, $A=V\Lambda V^{-1}$ and $e^{At}=V e^{\Lambda t} V^{-1}$ with $e^{\Lambda t}=\mathrm{diag}(e^{\lambda_i t})$. The solution decomposes into modes: writing $\mathbf{x}(0)=\sum c_i \mathbf{v}_i$, $$\mathbf{x}(t)=\sum_i c_i e^{\lambda_i t}\mathbf{v}_i.$$ Each eigenvalue is a mode: $\mathrm{Re}(\lambda_i)>0$ means growth, $<0$ decay, $\mathrm{Im}(\lambda_i)\neq 0$ oscillation.
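Both routes to $e^{At}$ (eigendecomposition and the defining series) agree, and the modes decay as the eigenvalues dictate; a sketch with an illustrative stable matrix:

```python
import numpy as np

A = np.array([[-1.0, 2.0],
              [0.0, -3.0]])            # eigenvalues -1 and -3: two decaying modes
t = 0.5

# Route 1: diagonalize, exponentiate the eigenvalues
w, V = np.linalg.eig(A)
expAt_eig = (V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V)).real

# Route 2: truncated Taylor series sum_k (At)^k / k!
expAt_series = np.zeros_like(A)
term = np.eye(2)
for k in range(30):
    expAt_series += term
    term = term @ (A * t) / (k + 1)

assert np.allclose(expAt_eig, expAt_series)

# x(t) = e^{At} x(0) contracts since both eigenvalues have negative real part
x0 = np.array([1.0, 1.0])
xt = expAt_eig @ x0
assert np.linalg.norm(xt) < np.linalg.norm(x0)
```

(In production, use `scipy.linalg.expm` rather than a hand-rolled series.)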
For $\mathbf{x}'=A\mathbf{x}+\mathbf{g}(t)$, the variation-of-parameters formula gives $$\mathbf{x}(t)=e^{At}\mathbf{x}(0)+\int_0^t e^{A(t-s)}\mathbf{g}(s)\,ds.$$
If $A$ is not diagonalizable, use Jordan form $A=PJP^{-1}$ where $J$ is block-diagonal with Jordan blocks $J_k = \lambda I + N$, $N$ nilpotent. Then $e^{J_k t}=e^{\lambda t}\sum_{j=0}^{k-1}\tfrac{(Nt)^j}{j!}$ — finite sum.
An equilibrium $\mathbf{x}^*$ of $\mathbf{x}'=\mathbf{f}(\mathbf{x})$ satisfies $\mathbf{f}(\mathbf{x}^*)=\mathbf{0}$. Linearize: $\mathbf{y}=\mathbf{x}-\mathbf{x}^*$ evolves approximately as $\mathbf{y}'=J\mathbf{y}$ where $J=\partial\mathbf{f}/\partial\mathbf{x}\big|_{\mathbf{x}^*}$.
For a 2×2 Jacobian with trace $\tau$ and determinant $\Delta$, the equilibrium is:
| Sign pattern | Type | Stability |
|---|---|---|
| $\Delta>0,\tau<0,\tau^2>4\Delta$ | Stable node | Attracting |
| $\Delta>0,\tau<0,\tau^2<4\Delta$ | Stable spiral | Attracting oscillator |
| $\Delta>0,\tau>0$ | Unstable node/spiral | Repelling |
| $\Delta<0$ | Saddle | Unstable (hyperbolic) |
| $\Delta>0,\tau=0$ | Center | Neutrally stable (linear) |
Three archetypal linear second-order partial differential equations (PDEs) span most applications in physics, engineering, and quant.
- **Heat equation**: $u_t = \alpha u_{xx}$ where $\alpha>0$ is the diffusivity (thermal conductivity divided by volumetric heat capacity). Fundamental solution on $\mathbb{R}$: $K(x,t)=\tfrac{1}{\sqrt{4\pi\alpha t}}e^{-x^2/(4\alpha t)}$ (the heat kernel). For general initial condition $u_0$, $u(x,t)=(K(\cdot,t)*u_0)(x)$ where $*$ denotes convolution. The Black–Scholes PDE reduces to the heat equation under a change of variables.
- **Wave equation**: $u_{tt} = c^2 u_{xx}$ where $c>0$ is the propagation speed. d'Alembert: $u(x,t)=\tfrac{1}{2}[u_0(x-ct)+u_0(x+ct)] + \tfrac{1}{2c}\int_{x-ct}^{x+ct}v_0(s)\,ds$, given initial position $u_0$ and velocity $v_0$. Finite speed of propagation $c$.
- **Laplace / Poisson**: $\Delta u = 0$ (Laplace) or $\Delta u = f$ (Poisson). Models steady-state diffusion, electrostatics, harmonic problems. Solutions inherit strong regularity (harmonic functions satisfy the mean-value property).
A Brownian motion (or Wiener process) $W_t$ is a continuous stochastic process with $W_0=0$, independent increments, and $W_t - W_s \sim \mathcal{N}(0,t-s)$. Its paths are continuous but nowhere differentiable; informally, $dW_t \sim \mathcal{N}(0,dt)$.
An Itô SDE has the form $$dX_t = \mu(X_t,t)\,dt + \sigma(X_t,t)\,dW_t,$$ with drift $\mu$ and diffusion $\sigma$. Existence and uniqueness hold under Lipschitz and linear-growth conditions on $\mu,\sigma$.
For a function $f(x,t)$ that is twice continuously differentiable in $x$ and once in $t$ (written $f\in C^{2,1}$), and $X_t$ as above, $$df(X_t,t) = \left(f_t + \mu f_x + \tfrac{1}{2}\sigma^2 f_{xx}\right) dt + \sigma f_x\, dW_t.$$ The extra $\tfrac{1}{2}\sigma^2 f_{xx}$ term is the hallmark of Itô calculus; it arises because $(dW_t)^2 = dt$ in the mean-square sense.
Geometric Brownian motion (GBM): $dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$. Apply Itô to $\log S_t$: $$d(\log S_t) = (\mu - \tfrac{1}{2}\sigma^2)\,dt + \sigma\, dW_t.$$ Hence $S_t = S_0 \exp\big((\mu-\tfrac{1}{2}\sigma^2)t + \sigma W_t\big)$, lognormally distributed. This is the Black–Scholes stock model.
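The closed form means GBM can be simulated exactly at any horizon, with no time-stepping error; a Monte Carlo sketch (illustrative parameters) recovering the lognormal mean $\mathbb{E}[S_T]=S_0 e^{\mu T}$:

```python
import numpy as np

rng = np.random.default_rng(8)
S0, mu, sigma, T = 100.0, 0.07, 0.2, 1.0
n_paths = 200_000

# Exact simulation: S_T = S0 exp((mu - sigma^2/2) T + sigma W_T), W_T ~ N(0, T)
W_T = np.sqrt(T) * rng.standard_normal(n_paths)
S_T = S0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * W_T)

# Lognormal mean: the -sigma^2/2 drift correction makes E[S_T] = S0 e^{mu T}
mean_theory = S0 * np.exp(mu * T)
assert abs(S_T.mean() - mean_theory) / mean_theory < 0.01
```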
Under the risk-neutral measure (with constant risk-free rate $r$), the price $V(S,t)$ of a European option written on an asset $S_t$ following $dS=rS\,dt+\sigma S\,dW$ satisfies $$V_t + rS V_S + \tfrac{1}{2}\sigma^2 S^2 V_{SS} - rV = 0,$$ with terminal condition $V(S,T)=\mathrm{payoff}(S)$ at maturity $T$. For a European call, the closed-form solution is the celebrated Black–Scholes formula.
Ornstein–Uhlenbeck (OU) process: $dX_t = \theta(\bar X - X_t)\,dt + \sigma\,dW_t$, where $\theta>0$ is the mean-reversion speed, $\bar X$ is the long-run mean, and $\sigma>0$ is the diffusion amplitude. Mean-reverting; stationary distribution $\mathcal{N}(\bar X,\sigma^2/(2\theta))$. Used for interest-rate models (Vasicek) and noise schedules in diffusion generative models.
We approximate the solution of $\mathbf{x}'=\mathbf{f}(\mathbf{x},t)$ on a grid $t_0<t_1<\cdots$ with step $h=t_{n+1}-t_n$.

| Method | Update / idea | Notes |
|---|---|---|
| Explicit Euler | $\mathbf{x}_{n+1}=\mathbf{x}_n + h\mathbf{f}(\mathbf{x}_n,t_n)$ | Order 1. Simple, cheap, poor stability for stiff problems. |
| Implicit Euler | $\mathbf{x}_{n+1}=\mathbf{x}_n + h\mathbf{f}(\mathbf{x}_{n+1},t_{n+1})$ | Order 1 but $A$-stable. Requires solving a (possibly nonlinear) system per step. |
| Midpoint / RK2 | Evaluate $\mathbf{f}$ at the midpoint of a trial Euler step | Second-order Runge–Kutta (RK2). Order 2. |
| RK4 | Classical 4-stage Runge–Kutta (RK4) | Order 4. The workhorse for non-stiff, moderate-accuracy problems. |
| Adaptive (Dormand–Prince) | Embedded Runge–Kutta pair RK4(5) with automatic step-size control | Used in `scipy.integrate.solve_ivp`. |
| Symplectic (Verlet, leapfrog) | Preserve energy for Hamiltonian systems over long integrations | Backbone of Hamiltonian Monte Carlo (HMC) samplers. |

Stability and stiffness: a method is $A$-stable if for the test equation $y'=\lambda y$ with $\mathrm{Re}(\lambda)<0$, $y_n\to 0$ for any step $h>0$. Explicit methods have bounded stability regions; stiff problems (wide spread of time scales) demand implicit methods or exponential integrators.

SDE schemes: Euler–Maruyama, $X_{n+1}=X_n+\mu(X_n,t_n)\,h+\sigma(X_n,t_n)\sqrt{h}\,Z_n$ with $Z_n\sim\mathcal{N}(0,1)$, has strong order $1/2$; the Milstein scheme adds the correction $\tfrac{1}{2}\sigma\sigma'\big((\sqrt{h}\,Z_n)^2-h\big)$ and achieves strong order 1.
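The order gap between the simplest and the standard method is dramatic even on a toy problem; a minimal sketch comparing explicit Euler and RK4 on $y'=-y$, $y(0)=1$ (illustrative step size):

```python
import numpy as np

def euler_step(f, x, t, h):
    # Explicit Euler: one slope evaluation, order 1
    return x + h * f(x, t)

def rk4_step(f, x, t, h):
    # Classical 4-stage Runge-Kutta, order 4
    k1 = f(x, t)
    k2 = f(x + h/2 * k1, t + h/2)
    k3 = f(x + h/2 * k2, t + h/2)
    k4 = f(x + h * k3, t + h)
    return x + h/6 * (k1 + 2*k2 + 2*k3 + k4)

f = lambda x, t: -x                     # test problem, exact solution e^{-t}

def integrate(step, h, T=1.0):
    x, t = 1.0, 0.0
    while t < T - 1e-12:
        x = step(f, x, t, h)
        t += h
    return x

exact = np.exp(-1.0)
err_euler = abs(integrate(euler_step, 0.01) - exact)
err_rk4 = abs(integrate(rk4_step, 0.01) - exact)
assert err_rk4 < err_euler / 1000       # RK4's higher order dominates at small h
```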
The continuous-time limit of gradient descent is $\dot{\mathbf{x}} = -\nabla f(\mathbf{x})$. For a quadratic $f(\mathbf{x})=\tfrac{1}{2}\mathbf{x}^\top A\mathbf{x}$ with $A\succ 0$, $\mathbf{x}(t) = e^{-At}\mathbf{x}(0)$. Decay along eigendirection $\mathbf{v}_i$ has rate $\lambda_i$: small eigenvalues give slow modes, which is why preconditioning (rescaling the dynamics by $A^{-1}$) flattens the convergence landscape.
Heavy-ball descent, in the continuous limit, becomes $\ddot{\mathbf{x}} + \gamma \dot{\mathbf{x}} + \nabla f(\mathbf{x}) = 0$, where $\gamma>0$ is the damping coefficient. Optimal tuning critically damps: $\gamma = 2\sqrt{\mu}$ for a $\mu$-strongly convex $f$ (i.e., $\nabla^2 f \succeq \mu I$ with strong-convexity constant $\mu>0$). This gives the accelerated $O(\sqrt{\kappa}\log(1/\varepsilon))$ rate of Nesterov's method, where $\kappa = L/\mu$ is the condition number ($L$ being the Lipschitz constant of $\nabla f$) and $\varepsilon$ is the target accuracy.
A residual block $\mathbf{h}_{n+1}=\mathbf{h}_n + f_\theta(\mathbf{h}_n)$ is one Euler step of $\dot{\mathbf{h}} = f_\theta(\mathbf{h},t)$. Neural ODEs replace finite depth with the flow of an ODE solved by an adaptive integrator, with gradients via the adjoint method (itself an ODE).
The forward noising process is an SDE $d\mathbf{x}_t = -\tfrac{1}{2}\beta_t\mathbf{x}_t\,dt + \sqrt{\beta_t}\,d\mathbf{W}_t$ (an OU with time-varying coefficient), where $\beta_t>0$ is a prescribed noise schedule controlling how fast data is corrupted toward pure noise, and $\mathbf{W}_t$ is a standard multidimensional Brownian motion. Let $p_t(\mathbf{x})$ denote the marginal density of $\mathbf{x}_t$. The reverse-time SDE (Anderson, 1982) exists and has the form $d\mathbf{x}_t = [-\tfrac{1}{2}\beta_t\mathbf{x}_t - \beta_t\nabla_{\mathbf{x}}\log p_t(\mathbf{x})]\,dt + \sqrt{\beta_t}\,d\bar{\mathbf{W}}_t$, where $\bar{\mathbf{W}}_t$ is a Brownian motion running backward in time. The quantity $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ is called the score. Training a score network to approximate it then lets us integrate the reverse SDE (or the deterministic probability-flow ODE with the same marginals) to generate samples.
Separate: $\dfrac{dy}{y(1-y)} = dt$. Partial fractions $\dfrac{1}{y(1-y)} = \dfrac{1}{y}+\dfrac{1}{1-y}$. Integrate: $\ln|y|-\ln|1-y| = t+C$, so $y/(1-y)=Ae^t$. Initial condition: $A=1$. Hence $y(t)=\dfrac{e^t}{1+e^t}=\sigma(t)$ — the standard sigmoid. This is the continuous-time logistic and shows up as the generalized-linear-model (GLM) link function.
Integrating factor $\mu = e^{2t}$. Then $(e^{2t}y)' = e^{2t}\cdot e^{-t} = e^t$, so $e^{2t}y = e^t + C \Rightarrow y = e^{-t} + Ce^{-2t}$. Using $y(0)=0$: $1+C=0\Rightarrow C=-1$. Therefore $y(t) = e^{-t}-e^{-2t}$.
Characteristic: $r^2+4r+5=0 \Rightarrow r=-2\pm i$. Underdamped ($\zeta = 4/(2\sqrt 5) = 2/\sqrt 5 < 1$). General solution $y = e^{-2t}(C_1\cos t + C_2 \sin t)$.
$y(0)=C_1=1$. $y'(0) = -2C_1 + C_2 = 0 \Rightarrow C_2 = 2$. So $y(t) = e^{-2t}(\cos t + 2\sin t)$.
Eigenvalues $\pm i$ with eigenvectors $(1,\mp i)^\top$. Since $A^2=-I$, the Taylor series splits into even/odd parts: $$e^{At} = I\cos t + A\sin t = \begin{pmatrix}\cos t & -\sin t\\ \sin t & \cos t\end{pmatrix}.$$ So $e^{At}$ rotates vectors counterclockwise by angle $t$. With initial condition $(1,0)^\top$: $\mathbf{x}(t)=(\cos t,\sin t)^\top$.
Trace $\tau=-6$, determinant $\Delta=9-1=8$. $\tau^2-4\Delta = 36-32 = 4 > 0$, $\tau<0$, $\Delta>0$ → stable node.
Eigenvalues: $\lambda = -3\pm 1 \in\{-2,-4\}$. Eigenvectors: for $-2$, $(1,1)^\top$; for $-4$, $(1,-1)^\top$. Trajectories approach the origin tangent to the slow direction $(1,1)^\top$ (corresponding to the larger $\lambda=-2$).
$\mathbf{x}(t) = e^{-At}\mathbf{x}(0) = (e^{-t}, e^{-100t})^\top$. The second coordinate decays $100\times$ faster than the first. After $t$ large enough to kill the fast mode, the residual error is dominated by the slow mode — convergence of the loss is governed by $\lambda_{\min}(A)=1$, and the condition number $\kappa(A)=100$ measures how much worse this is than the isotropic case. In discrete time, steepest descent with step $\eta$ converges as $(1-\eta\lambda_{\min})^k$ while requiring $\eta\le 2/\lambda_{\max}$, hence the well-known $(\kappa-1)/(\kappa+1)$ contraction.
With $f(x)=x^2$: $f_x = 2x$, $f_{xx}=2$. Itô: $dY_t = (2S_t\cdot\mu S_t + \tfrac{1}{2}\cdot 2\cdot \sigma^2 S_t^2)\,dt + 2S_t\cdot \sigma S_t\,dW_t$, i.e. $$dY_t = (2\mu + \sigma^2)Y_t\,dt + 2\sigma Y_t\,dW_t.$$ So $Y_t$ is itself GBM with drift $2\mu+\sigma^2$ and volatility $2\sigma$. Note the extra $\sigma^2$ from the Itô correction.
This is linear. Use variation of parameters: $X_t = e^{-\theta t}x_0 + \sigma\int_0^t e^{-\theta(t-s)}\,dW_s$.
The Itô integral is a mean-zero Gaussian. So $\mathbb{E}[X_t] = e^{-\theta t}x_0$ and, by the Itô isometry — which states $\mathbb{E}\big[(\int_0^t g(s)\,dW_s)^2\big] = \int_0^t g(s)^2\,ds$ for deterministic $g$ — $$\mathrm{Var}(X_t) = \sigma^2\int_0^t e^{-2\theta(t-s)}\,ds = \frac{\sigma^2}{2\theta}(1 - e^{-2\theta t}).$$
As $t\to\infty$ (and $\theta>0$), $X_t \Rightarrow \mathcal{N}(0,\sigma^2/(2\theta))$ — the stationary distribution. This is the noise law used in Langevin samplers and score-based generative models.
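An Euler–Maruyama simulation of the same OU process (illustrative parameters, $\bar X=0$) lands on the stationary law after a few mean-reversion times:

```python
import numpy as np

rng = np.random.default_rng(9)
theta, sigma = 2.0, 0.5
h, n_steps, n_paths = 0.01, 2000, 20_000

# Euler-Maruyama for dX = -theta X dt + sigma dW, starting from X_0 = 0
X = np.zeros(n_paths)
for _ in range(n_steps):
    X += -theta * X * h + sigma * np.sqrt(h) * rng.standard_normal(n_paths)

# After t = 20 >> 1/theta, X is close to the stationary N(0, sigma^2 / (2 theta))
var_stationary = sigma**2 / (2 * theta)    # = 0.0625
assert abs(X.var() - var_stationary) < 0.01
assert abs(X.mean()) < 0.01
```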
Start from $V_t + rSV_S + \tfrac{1}{2}\sigma^2 S^2 V_{SS} - rV = 0$. With $x=\log S$: $V_S = V_x/S$, $V_{SS} = (V_{xx}-V_x)/S^2$. Substituting, $$V_t + rV_x + \tfrac{1}{2}\sigma^2(V_{xx}-V_x) - rV = 0 \Rightarrow V_t + (r-\tfrac{1}{2}\sigma^2)V_x + \tfrac{1}{2}\sigma^2 V_{xx} - rV = 0.$$ Now set $\tau = T - t$ (so $V_t = -V_\tau$) and $u = e^{r\tau}V$ (so $V = e^{-r\tau}u$, $V_t = -V_\tau = e^{-r\tau}(ru - u_\tau)$). After simplification, $$u_\tau = (r-\tfrac{1}{2}\sigma^2)u_x + \tfrac{1}{2}\sigma^2 u_{xx}.$$ A further shift $y = x + (r-\tfrac{1}{2}\sigma^2)\tau$ kills the drift and leaves the canonical heat equation $u_\tau = \tfrac{1}{2}\sigma^2 u_{yy}$, which is solved by Gaussian convolution with the terminal payoff — yielding the Black–Scholes formula.
Take $V(\mathbf{x}) = f(\mathbf{x}) - f(\mathbf{0})\ge 0$, with $V(\mathbf{0})=0$ and $V(\mathbf{x})\ge \tfrac{\mu}{2}\|\mathbf{x}\|^2$ by strong convexity.
Along trajectories $\dot V = \nabla f\cdot \dot{\mathbf{x}} = -\|\nabla f\|^2\le 0$. By the Polyak–Łojasiewicz (PL) inequality — which relates gradient magnitude to suboptimality and is implied here by $\mu$-strong convexity — $$\|\nabla f(\mathbf{x})\|^2\ge 2\mu\big(f(\mathbf{x})-f^*\big)=2\mu\, V(\mathbf{x}),$$ where $f^* = \min_\mathbf{x} f(\mathbf{x}) = f(\mathbf{0})$ is the optimal value.
Therefore $\dot V \le -2\mu V$, which gives $V(\mathbf{x}(t))\le V(\mathbf{x}(0))e^{-2\mu t}$ by Grönwall's inequality (integrating the differential inequality $\dot V \le -2\mu V$). Combining with the strong-convexity lower bound $V(\mathbf{x})\ge\tfrac{\mu}{2}\|\mathbf{x}\|^2$, one gets $\|\mathbf{x}(t)\|\le C e^{-\mu t}$ for some constant $C>0$: linear (exponential) convergence. This is the continuous-time analogue of the well-known discrete linear rate for gradient descent under strong convexity.