Vol. V · Compiled 2026 · A cheatsheet for practitioners

Diffusion
Models

From Brownian motion and stochastic calculus to score matching, DDPM, DDIM, classifier-free guidance and flow matching — the working toolkit behind modern generative image models, distilled and paired with worked exercises.

Part I

Stochastic Foundations

Brownian motion, Itô calculus, and the Fokker–Planck machinery that makes diffusion generative models legible.

01 · Brownian motion and stochastic calculus

Before we write any equation, fix the three pieces of vocabulary that everything below rests on.

Probability space & sample outcomes We work on a probability space $(\Omega,\mathcal{F},\mathbb{P})$: $\Omega$ is the set of possible "random outcomes" $\omega$, $\mathcal{F}$ is the collection of events we can assign probabilities to (a $\sigma$-algebra), and $\mathbb{P}:\mathcal{F}\to[0,1]$ is the probability measure. A stochastic process $X_t$ is a collection of random variables indexed by time $t$: for each $\omega$, $t\mapsto X_t(\omega)$ is a deterministic curve called a sample path (or realization, trajectory). The process is the whole random object; a sample path is what you actually see when you run the experiment once.
Convergence vocabulary "Almost surely" (a.s., or "with probability 1") means the statement holds on a set of outcomes $\{\omega: \ldots\}\in\mathcal{F}$ of probability $1$. "In probability" means $\mathbb{P}(|X_n-X|>\varepsilon)\to 0$ for every $\varepsilon>0$. "$L^2$ (mean-square) convergence" means $\mathbb{E}|X_n-X|^2\to 0$. These three modes appear throughout stochastic calculus; $L^2$ is the one used to define the Itô integral.
Standard Brownian motion Standard Brownian motion (Wiener process) $W_t$ on $[0,\infty)$ is a stochastic process satisfying
  1. $W_0=0$ almost surely;
  2. independent increments: for $0\le s\le t\le u\le v$, the random variables $W_t-W_s$ and $W_v-W_u$ are independent;
  3. Gaussian increments: $W_t-W_s\sim\mathcal{N}(0,t-s)$ — centered Gaussian with variance equal to the time gap;
  4. continuous sample paths: for almost every $\omega$, the function $t\mapsto W_t(\omega)$ is continuous.

What a Brownian path actually looks like

A Brownian path is the single realized curve $t\mapsto W_t(\omega)$ for one fixed outcome $\omega$. It is the picture you draw when you integrate the SDE once. Three striking facts — all holding almost surely:

  1. the path is continuous but nowhere differentiable;
  2. it has infinite total (first-order) variation on every interval;
  3. it is statistically self-similar: $W_{ct} \stackrel{d}{=} \sqrt{c}\,W_t$ for every $c>0$, so zooming in reveals the same roughness at every scale.

This self-similarity is why $(dW_t)^2 \sim dt$, not $dt^2$: over a short interval $dt$, the increment has standard deviation $\sqrt{dt}$, so its square is of order $dt$, exactly one power below what classical calculus would give.

Quadratic variation For any partition $0=t_0<t_1<\cdots<t_n=t$ of $[0,t]$ with mesh $\max_i(t_{i+1}-t_i)\to 0$, $$[W,W]_t := \lim_{n\to\infty}\sum_{i}(W_{t_{i+1}}-W_{t_i})^2 = t \quad\text{in probability.}$$ We write this as the mnemonic $(dW_t)^2 = dt$. Brownian motion accumulates $t$ units of "quadratic length" over time $t$, even though its classical (first-order) variation is infinite. This single identity is what distinguishes stochastic calculus from ordinary calculus — every Itô correction term is a consequence of it.
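The quadratic-variation identity is easy to check numerically: sum the squared increments of one simulated path over a fine partition of $[0,t]$ and watch the sum concentrate near $t$. A minimal NumPy sketch (seed and step count are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
t, n = 1.0, 100_000                   # horizon and number of partition intervals
dt = t / n
dW = rng.normal(0.0, np.sqrt(dt), n)  # Gaussian increments with Var = dt

W = np.cumsum(dW)                     # one sample path t ↦ W_t(ω)
qv = np.sum(dW**2)                    # sum of squared increments

print(qv)  # ≈ t = 1.0, with O(1/√n) fluctuations
```

Note that the first-order sum `np.sum(np.abs(dW))` grows like $\sqrt{n}$ under refinement, illustrating the infinite classical variation mentioned above.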

Information over time: filtration and adaptedness

Most stochastic-calculus statements refer to "information available by time $t$". That is formalized as a filtration.

Filtration A filtration $\{\mathcal{F}_t\}_{t\ge 0}$ is an increasing family of $\sigma$-algebras inside $\mathcal{F}$, i.e. $\mathcal{F}_s\subseteq\mathcal{F}_t$ whenever $s\le t$. Think of $\mathcal{F}_t$ as "the collection of events whose truth or falsehood has already been decided by time $t$". Under the natural filtration of Brownian motion, $\mathcal{F}_t = \sigma(W_s: s\le t)$ — knowledge of the Brownian path up to time $t$.
Adapted process A stochastic process $\phi_t$ is adapted to $\{\mathcal{F}_t\}$ if $\phi_t$ is $\mathcal{F}_t$-measurable for every $t$: its value at time $t$ depends only on information available up to $t$, never on future noise. Every sensible controller, filter, or score network used in a diffusion model is adapted by construction — it only sees $\mathbf{x}_t$ and $t$, not $\mathbf{x}_{t+\delta}$.
Martingale An adapted process $M_t$ with $\mathbb{E}|M_t|<\infty$ is a martingale if $$\mathbb{E}[M_t\mid\mathcal{F}_s] = M_s \qquad\text{for all } 0\le s\le t.$$ In plain English: the best forecast of the future given what we know now is the current value — a "fair game" with zero expected drift. Brownian motion itself is a martingale; so is the Itô integral of any adapted, square-integrable integrand. Martingales are central because what you cannot predict, you cannot exploit — pricing in finance, variance reduction in Monte Carlo, and many diffusion-model convergence proofs all ultimately rest on a martingale argument.

The Itô integral

Given an adapted process $\phi_t$ with $\mathbb{E}\!\int_0^t\phi_s^2\,ds<\infty$, the Itô integral $\int_0^t\phi_s\,dW_s$ is defined as the $L^2$-limit (i.e. $\mathbb{E}|S_n - S|^2\to 0$) of left-endpoint Riemann sums, $$S_n := \sum_{i=0}^{n-1}\phi_{t_i}\big(W_{t_{i+1}}-W_{t_i}\big),\qquad\text{mesh}\to 0.$$ Evaluating at the left endpoint is what makes the integrand "not peek at the future noise $W_{t_{i+1}}$", and what makes the resulting integral adapted and a martingale. Evaluating at the midpoint instead defines the Stratonovich integral, which obeys the ordinary chain rule but loses the martingale property and has a nonzero mean — less convenient for probability but often preferred in physics and geometric mechanics.

Itô isometry $$\mathbb{E}\!\left[\left(\int_0^t \phi_s\,dW_s\right)^{\!2}\right] = \mathbb{E}\!\left[\int_0^t \phi_s^2\,ds\right].$$ The left-hand side is a variance (the Itô integral has zero mean); the right-hand side is an ordinary Lebesgue integral. This identity is the workhorse for computing variances of diffusion processes — we use it repeatedly below to find the closed-form variance of the OU process, the DDPM forward marginal, and more.
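The isometry can be verified by Monte Carlo for the integrand $\phi_s = W_s$: the isometry predicts $\mathrm{Var}\big(\int_0^t W_s\,dW_s\big) = \mathbb{E}\int_0^t W_s^2\,ds = t^2/2$. A NumPy sketch (path counts are our choice; note the left-endpoint evaluation, which keeps the sums adapted):

```python
import numpy as np

rng = np.random.default_rng(1)
t, n, paths = 1.0, 1000, 20_000
dt = t / n
dW = rng.normal(0.0, np.sqrt(dt), (paths, n))
W = np.cumsum(dW, axis=1)
W_left = np.hstack([np.zeros((paths, 1)), W[:, :-1]])  # left endpoints (no peeking)

I = np.sum(W_left * dW, axis=1)       # Itô sums S_n = Σ φ_{t_i} ΔW_i with φ = W

# Itô isometry: Var(∫ W dW) = E ∫_0^t W_s² ds = ∫_0^t s ds = t²/2
print(I.mean(), I.var())              # ≈ 0 and ≈ 0.5
```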

Multivariate Brownian motion

A $d$-dimensional standard Brownian motion $\mathbf{W}_t = (W_t^{(1)},\ldots,W_t^{(d)})$ has independent coordinate Brownian motions. The increment covariance is $\mathbb{E}[d\mathbf{W}_t d\mathbf{W}_t^\top] = I_d\,dt$. For images we treat the pixel array as a flat vector in $\mathbb{R}^d$ and use $d$-dimensional Brownian noise.

02 · Itô's lemma and stochastic differential equations

An Itô stochastic differential equation (SDE) in $\mathbb{R}^d$ is $$d\mathbf{X}_t = \mathbf{f}(\mathbf{X}_t,t)\,dt + \mathbf{G}(\mathbf{X}_t,t)\,d\mathbf{W}_t,$$ with drift $\mathbf{f}:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d$ and diffusion matrix $\mathbf{G}:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^{d\times k}$ driven by a $k$-dimensional Brownian motion. Existence and uniqueness hold under Lipschitz and linear-growth conditions on $\mathbf{f}$ and $\mathbf{G}$.
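Numerically, sample paths of such an SDE are produced with the Euler–Maruyama scheme: discretize time, draw $\Delta\mathbf{W}\sim\mathcal{N}(0,\Delta t\,I)$, and step. A minimal NumPy sketch (function name and parameters are ours), exercised on the OU process studied below:

```python
import numpy as np

def euler_maruyama(f, g, x0, T, n, rng):
    """Simulate dX = f(X,t) dt + g(X,t) dW on [0, T] with n Euler-Maruyama steps."""
    dt = T / n
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for i in range(n):
        t = i * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)  # ΔW ~ N(0, dt)
        x = x + f(x, t) * dt + g(x, t) * dW              # one explicit step
        path.append(x.copy())
    return np.array(path)

# Example: OU process dX = -θX dt + σ dW (mean-reverting; see below)
rng = np.random.default_rng(2)
theta, sigma = 1.0, 0.5
path = euler_maruyama(lambda x, t: -theta * x,
                      lambda x, t: sigma, x0=[2.0], T=5.0, n=5000, rng=rng)
```

Euler–Maruyama converges weakly at order $\Delta t$ under the same Lipschitz conditions that guarantee existence and uniqueness.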

Itô's lemma For $\phi(\mathbf{x},t)\in C^{2,1}$ and $\mathbf{X}_t$ solving the SDE above, $$d\phi = \phi_t\,dt + (\nabla_\mathbf{x}\phi)^\top d\mathbf{X}_t + \tfrac{1}{2}\mathrm{tr}\!\big(\mathbf{G}\mathbf{G}^\top \nabla_\mathbf{x}^2 \phi\big)\,dt.$$ The trace term — absent from ordinary calculus — is the hallmark of Itô, arising from $(d\mathbf{W}_t)(d\mathbf{W}_t)^\top = I_k\,dt$.

Scalar form

For scalar $X_t$ with $dX_t = \mu\,dt + \sigma\,dW_t$ and $\phi(x,t)\in C^{2,1}$, $$d\phi(X_t,t) = \left(\phi_t + \mu\,\phi_x + \tfrac{1}{2}\sigma^2\,\phi_{xx}\right)dt + \sigma\,\phi_x\,dW_t.$$

Why the diffusion term contributes to the drift

Ordinary calculus would give $d\phi = \phi_x\,dX$. But $(dX_t)^2 = \sigma^2(dW_t)^2 = \sigma^2\,dt$ is non-negligible, producing the convexity correction $\tfrac{1}{2}\sigma^2\phi_{xx}\,dt$. A convex $\phi$ grows in expectation even along a martingale $X_t$ — this is the content of Jensen's inequality made dynamical.

03 · Linear Gaussian SDEs: Ornstein–Uhlenbeck and geometric Brownian motion

Two SDEs dominate both quantitative finance and diffusion generative modeling: they admit closed-form solutions and Gaussian marginals.

Ornstein–Uhlenbeck

The Ornstein–Uhlenbeck (OU) process $$dX_t = -\theta X_t\,dt + \sigma\,dW_t, \qquad \theta>0,$$ is mean-reverting. Multiplying through by $e^{\theta t}$ and applying Itô, $$d(e^{\theta t}X_t) = e^{\theta t}\sigma\,dW_t \implies X_t = e^{-\theta t}X_0 + \sigma\int_0^t e^{-\theta(t-s)}\,dW_s.$$ The Itô integral is a zero-mean Gaussian with variance $\frac{\sigma^2}{2\theta}(1-e^{-2\theta t})$, so $$X_t\mid X_0 \sim \mathcal{N}\!\left(e^{-\theta t}X_0,\ \tfrac{\sigma^2}{2\theta}(1-e^{-2\theta t})\right).$$ Stationary distribution: $\mathcal{N}(0,\sigma^2/(2\theta))$.
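The closed-form mean and variance can be checked against brute-force simulation; a NumPy sketch (constants ours), using Euler–Maruyama over many paths:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, x0, t = 1.5, 0.8, 2.0, 1.0
n, paths = 1000, 20_000
dt = t / n

X = np.full(paths, x0)
for _ in range(n):                    # Euler–Maruyama on dX = -θX dt + σ dW
    X += -theta * X * dt + sigma * rng.normal(0.0, np.sqrt(dt), paths)

mean_cf = x0 * np.exp(-theta * t)                               # e^{-θt} x0
var_cf = sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * t))  # Itô-isometry variance
print(X.mean(), mean_cf)   # both ≈ 0.446
print(X.var(), var_cf)     # both ≈ 0.203
```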

OU is the backbone of the DDPM variance-preserving SDE — only the coefficients are made time-dependent so the stationary distribution is the standard Gaussian.

Geometric Brownian motion

Geometric Brownian motion (GBM): $dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$. Apply Itô to $\log S_t$: $$d(\log S_t) = (\mu - \tfrac{1}{2}\sigma^2)\,dt + \sigma\,dW_t \implies S_t = S_0 \exp\!\big((\mu-\tfrac{1}{2}\sigma^2)t + \sigma W_t\big).$$ Lognormal marginals. The foundation of Black–Scholes option pricing.
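The exact solution makes GBM trivial to sample; a quick NumPy check of the lognormal marginals (constants ours). Note the Itô correction $-\tfrac{1}{2}\sigma^2 t$ appears in the mean of $\log S_t$ but cancels in $\mathbb{E}[S_t] = S_0 e^{\mu t}$:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, S0, t = 0.1, 0.3, 1.0, 2.0
paths = 100_000

W_t = rng.normal(0.0, np.sqrt(t), paths)                   # W_t ~ N(0, t)
S_t = S0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W_t) # exact solution

print(np.log(S_t).mean())  # ≈ (μ - σ²/2) t = 0.11
print(S_t.mean())          # ≈ S0·e^{μt} ≈ 1.221 (Itô correction cancels in the mean)
```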

04 · The Fokker–Planck equation

If $\mathbf{X}_t$ solves $d\mathbf{X}_t = \mathbf{f}\,dt + \mathbf{G}\,d\mathbf{W}_t$ and $p_t(\mathbf{x})$ is the density of $\mathbf{X}_t$, then $p_t$ evolves by the Fokker–Planck (forward Kolmogorov) equation $$\partial_t p_t = -\nabla\!\cdot\!(\mathbf{f}\,p_t) + \tfrac{1}{2}\nabla\!\cdot\!\nabla\!\cdot\!(\mathbf{G}\mathbf{G}^\top p_t),$$ where the double divergence means $\sum_{ij}\partial_i\partial_j\big[(\mathbf{G}\mathbf{G}^\top)_{ij}p_t\big]$. It is a deterministic PDE governing the flow of probability mass.

Mnemonic Drift pushes mass; diffusion spreads it. The first term is advection; the second is a second-order (Laplacian-like) smoothing.

Stationary distributions

A stationary density $p_\infty$ satisfies $\partial_t p_\infty = 0$. For the OU process $dX = -\theta X\,dt + \sigma\,dW$, $$0 = \partial_x(\theta x\,p_\infty) + \tfrac{1}{2}\sigma^2\partial_x^2 p_\infty,$$ whose solution is $p_\infty(x) \propto \exp(-\theta x^2/\sigma^2) = \mathcal{N}(0,\sigma^2/(2\theta))$, recovering the result above.

Backward Kolmogorov

The expectation $u(\mathbf{x},t)=\mathbb{E}[\phi(\mathbf{X}_T)\mid \mathbf{X}_t=\mathbf{x}]$ solves the backward equation $$\partial_t u + \mathbf{f}^\top\nabla u + \tfrac{1}{2}\mathrm{tr}(\mathbf{G}\mathbf{G}^\top\nabla^2 u) = 0, \quad u(\mathbf{x},T)=\phi(\mathbf{x}).$$ This is the Feynman–Kac link between PDEs and stochastic expectations — it underlies Black–Scholes pricing and, more generally, the control-as-inference view of diffusion models.

05 · Time-reversal of diffusions and the score

Anderson (1982) If $d\mathbf{X}_t = \mathbf{f}(\mathbf{X}_t,t)\,dt + g(t)\,d\mathbf{W}_t$ (scalar diffusion for simplicity), then the time-reversed process $\bar{\mathbf{X}}_s := \mathbf{X}_{T-s}$ satisfies $$d\bar{\mathbf{X}}_s = \big[-\mathbf{f}(\bar{\mathbf{X}}_s, T-s) + g(T-s)^2\,\nabla\log p_{T-s}(\bar{\mathbf{X}}_s)\big]\,ds + g(T-s)\,d\tilde{\mathbf{W}}_s,$$ where $\tilde{\mathbf{W}}_s$ is a Brownian motion adapted to the reverse filtration.

The extra term $g^2\nabla\log p_t(\mathbf{x})$ is the hallmark of reversal. Writing it as $g^2 \mathbf{s}_t(\mathbf{x})$ with $$\mathbf{s}_t(\mathbf{x}) := \nabla_\mathbf{x}\log p_t(\mathbf{x}),$$ we call $\mathbf{s}_t$ the (Stein) score. If we can learn the score, we can integrate the reverse SDE to walk backward from noise to data — which is precisely what a diffusion generative model does.

Intuition Forward noising forgets the data: diffusion spreads mass. Reverse remembers: the drift tilts trajectories back toward regions of high data density. The score points up the density's gradient; following it is a kind of gradient ascent into the data manifold.

Why the reverse has a diffusion term too

The reverse process inherits randomness because each "noise step" of the forward is only probabilistically invertible: many paths can produce the same final state. Anderson's result is that this conditional uncertainty collapses to a new Brownian motion under the reverse measure. A deterministic reversal also exists — the probability-flow ODE — at the cost of changing trajectories while preserving marginals (§12).

06 · Exercises · Stochastic foundations

Exercise 1 · Moments of Brownian motion
Show that $\mathbb{E}[W_t^2]=t$ and $\mathbb{E}[W_t^4]=3t^2$. Deduce $\mathrm{Var}(W_t^2)=2t^2$.
Solution

$W_t\sim\mathcal{N}(0,t)$, so the moments of a centered Gaussian with variance $t$ give $\mathbb{E}[W_t^2]=t$ and $\mathbb{E}[W_t^4]=3t^2$ (using $\mathbb{E}[Z^4]=3$ for $Z\sim\mathcal{N}(0,1)$). Therefore $\mathrm{Var}(W_t^2)=\mathbb{E}[W_t^4]-(\mathbb{E}[W_t^2])^2 = 3t^2 - t^2 = 2t^2$. The variance grows as $t^2$, so the quadratic-variation identity $[W,W]_t=t$ is not a statement about $W_t^2$ pointwise but about the sum of squared increments as the mesh shrinks — the fluctuations average out in the limit.

Exercise 2 · Itô applied to $W_t^2$
Compute $d(W_t^2)$ and deduce $\int_0^t W_s\,dW_s = \tfrac{1}{2}(W_t^2 - t)$.
Solution

Apply Itô to $\phi(x)=x^2$ with $X_t=W_t$ (i.e. $\mu=0,\sigma=1$): $\phi_x=2x, \phi_{xx}=2$, so $$d(W_t^2) = 2W_t\,dW_t + \tfrac{1}{2}\cdot 1\cdot 2\,dt = 2W_t\,dW_t + dt.$$ Integrating from $0$ to $t$, $W_t^2 = 2\int_0^t W_s\,dW_s + t$, hence $\int_0^t W_s\,dW_s = \tfrac{1}{2}(W_t^2-t)$. The $-t/2$ is the Itô correction — classical calculus would give the wrong answer $\tfrac{1}{2}W_t^2$.

Exercise 3 · Solving the OU SDE ★★
Show that the OU process $dX_t = -\theta X_t\,dt+\sigma\,dW_t$ with $X_0=x_0$ has the closed form $X_t = x_0 e^{-\theta t} + \sigma\int_0^t e^{-\theta(t-s)}\,dW_s$ and compute its mean and variance. Identify the stationary distribution.
Solution

Let $Y_t=e^{\theta t}X_t$. By Itô (or Leibniz applied to a deterministic integrating factor), $dY_t = \theta e^{\theta t}X_t\,dt + e^{\theta t}dX_t = e^{\theta t}\sigma\,dW_t$. Integrate: $Y_t-Y_0 = \sigma\int_0^t e^{\theta s}\,dW_s$, so $X_t = x_0 e^{-\theta t} + \sigma\int_0^t e^{-\theta(t-s)}\,dW_s$.

Mean: $\mathbb{E}[X_t] = x_0 e^{-\theta t}$ (Itô integral has zero mean). Variance, by Itô isometry: $$\mathrm{Var}(X_t) = \sigma^2\int_0^t e^{-2\theta(t-s)}\,ds = \frac{\sigma^2}{2\theta}(1-e^{-2\theta t}).$$ As $t\to\infty$: $X_t\Rightarrow\mathcal{N}(0,\sigma^2/(2\theta))$. This is the stationary distribution, which will reappear as the terminal noise distribution of the DDPM forward process.

Exercise 4 · Fokker–Planck for OU ★★
Write the Fokker–Planck equation for $dX_t=-\theta X_t\,dt+\sigma\,dW_t$ and verify that $p_\infty(x)\propto \exp(-\theta x^2/\sigma^2)$ is stationary.
Solution

Here $f(x)=-\theta x$, $g=\sigma$. Fokker–Planck: $$\partial_t p = -\partial_x(-\theta x\,p) + \tfrac{1}{2}\sigma^2\partial_x^2 p = \theta\,\partial_x(x\,p) + \tfrac{1}{2}\sigma^2 p_{xx}.$$ Guess $p_\infty(x) = C\exp(-\theta x^2/\sigma^2)$. Then $p'_\infty = -\frac{2\theta x}{\sigma^2}p_\infty$ and $p''_\infty = \big(\frac{4\theta^2 x^2}{\sigma^4} - \frac{2\theta}{\sigma^2}\big)p_\infty$. Also $\partial_x(xp_\infty) = p_\infty + xp'_\infty = \big(1 - \frac{2\theta x^2}{\sigma^2}\big)p_\infty$.

Combining: $\theta\big(1-\frac{2\theta x^2}{\sigma^2}\big) + \tfrac{1}{2}\sigma^2\big(\frac{4\theta^2 x^2}{\sigma^4}-\frac{2\theta}{\sigma^2}\big) = \theta - \frac{2\theta^2 x^2}{\sigma^2} + \frac{2\theta^2 x^2}{\sigma^2} - \theta = 0$. Hence $p_\infty$ is stationary, and normalization gives $C=\sqrt{\theta/(\pi\sigma^2)}$, i.e. $\mathcal{N}(0,\sigma^2/(2\theta))$.

Exercise 5 · Reverse-time SDE for OU ★★★
For the OU process $dX_t=-\theta X_t\,dt+\sigma\,dW_t$ with $X_0\sim\mathcal{N}(0,\sigma^2/(2\theta))$ (stationary), show that the score is $\nabla_x\log p_t(x) = -2\theta x/\sigma^2$ and write Anderson's reverse SDE. Observe the process is time-reversible.
Solution

Starting from the stationary distribution, $X_t\sim p_\infty = \mathcal{N}(0,\sigma^2/(2\theta))$ for all $t$. Hence $p_t(x) \propto \exp(-\theta x^2/\sigma^2)$ and $\nabla_x\log p_t(x) = -2\theta x/\sigma^2$.

Anderson's reversal (scalar, $g=\sigma$): $d\bar X_s = [-f + g^2\nabla\log p_t]ds + g\,d\tilde W_s = [\theta \bar X_s + \sigma^2\cdot(-2\theta\bar X_s/\sigma^2)]ds + \sigma\,d\tilde W_s = -\theta \bar X_s\,ds + \sigma\,d\tilde W_s$.

The reverse SDE has the same form as the forward! This is the classical statement that a stationary OU process is time-reversible. Away from stationarity, the score is time-dependent and the reverse drift differs.

Part II

Diffusion Generative Models

A noising process you choose, a reverse you learn. DDPM, score matching, the SDE framework, and the guidance tricks that power modern image generators.

07 · The forward noising process (DDPM)

A diffusion generative model (Sohl-Dickstein 2015; Ho, Jain & Abbeel, 2020) destroys data with a fixed Markov chain and learns to reverse it. Let $\mathbf{x}_0\sim q_{\text{data}}$ and fix a variance schedule $\beta_1,\ldots,\beta_T\in(0,1)$. The forward process is $$q(\mathbf{x}_t\mid\mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t;\sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\,\beta_t I).$$ Write $\alpha_t := 1-\beta_t$ and $\bar\alpha_t := \prod_{s=1}^t \alpha_s$.

Closed-form marginal Iterating the recursion, $$q(\mathbf{x}_t\mid\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t;\sqrt{\bar\alpha_t}\,\mathbf{x}_0,\,(1-\bar\alpha_t)I).$$ Equivalently, $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)$.
Fig. 1The forward process (top, blue) corrupts data to isotropic noise via a fixed Markov chain. The reverse process (bottom, dark) is a learned chain that inverts it — the generative model walks backward from $\mathbf{x}_T\sim\mathcal{N}(0,I)$ to a data sample $\mathbf{x}_0$.

As $t\to T$ with $\bar\alpha_T\to 0$, $q(\mathbf{x}_T\mid\mathbf{x}_0)\approx\mathcal{N}(0,I)$ — the prior. The schedule $\{\beta_t\}$ chooses how quickly signal is destroyed.
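The closed-form marginal can be verified numerically by comparing it against the step-by-step chain; a NumPy sketch on toy 1-D "data" under the linear schedule (all names ours):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 1000
beta = np.linspace(1e-4, 0.02, T)     # linear schedule (Ho et al., 2020)
alpha_bar = np.cumprod(1.0 - beta)

x0 = rng.normal(0.0, 1.0, 50_000)     # toy 1-D "data": q_data = N(0,1)
t = 400                               # 1-indexed timestep

# Route 1: iterate q(x_t | x_{t-1}) step by step
x = x0.copy()
for s in range(t):
    x = np.sqrt(1 - beta[s]) * x + np.sqrt(beta[s]) * rng.normal(size=x.shape)

# Route 2: closed form  x_t = √ᾱ_t x0 + √(1-ᾱ_t) ε
eps = rng.normal(size=x0.shape)
x_direct = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps

print(x.var(), x_direct.var())        # distributions agree (both ≈ 1 for N(0,1) data)
```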

Schedules in practice

Linear

$\beta_t$ linear in $t$ from $10^{-4}$ to $0.02$ with $T=1000$ (Ho et al., 2020). Simple, widely used.

Cosine

$\bar\alpha_t = \cos^2\!\big(\tfrac{t/T+s}{1+s}\cdot\tfrac{\pi}{2}\big)$ with small $s$. Degrades signal more slowly near $t=0$; better for images at high resolution (Nichol & Dhariwal, 2021).

SNR parametrization

Signal-to-noise ratio $\mathrm{SNR}(t) = \bar\alpha_t/(1-\bar\alpha_t)$. A monotonically decreasing SNR is the true invariant — equivalent schedules give equivalent models (Kingma & Gao, 2023).
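The three views above can be computed side by side; a NumPy sketch (constants follow the cited papers; variable names are ours):

```python
import numpy as np

T = 1000
t = np.arange(1, T + 1)

# Linear schedule: β_t from 1e-4 to 0.02, then ᾱ_t = Π(1-β_s)
beta_lin = np.linspace(1e-4, 0.02, T)
abar_lin = np.cumprod(1 - beta_lin)

# Cosine schedule defines ᾱ_t directly (Nichol & Dhariwal, 2021), offset s = 0.008
s = 0.008
f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
abar_cos = f(t) / f(0)                # normalized so ᾱ_0 = 1

# SNR parametrization: both schedules are monotone-decreasing-SNR paths
snr_lin = abar_lin / (1 - abar_lin)
snr_cos = abar_cos / (1 - abar_cos)
# The cosine schedule keeps more signal (higher ᾱ_t) through the mid-range of t.
```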

08 · The reverse process and the evidence lower bound

The generative model is a learned reverse Markov chain $$p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1};\boldsymbol{\mu}_\theta(\mathbf{x}_t,t),\,\Sigma_\theta(\mathbf{x}_t,t))$$ with $p(\mathbf{x}_T) = \mathcal{N}(0,I)$. The log-likelihood is bounded by the evidence lower bound $$-\log p_\theta(\mathbf{x}_0) \le \mathbb{E}_q\left[D_{\mathrm{KL}}(q(\mathbf{x}_T\mid\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)) + \sum_{t>1} L_{t-1} - \log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)\right]$$ where $L_{t-1} = D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t))$.

Tractable posterior $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1};\tilde{\boldsymbol{\mu}}_t,\tilde\beta_t I)$ with \[ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\mathbf{x}_t,\qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t. \]

Noise-prediction parametrization

Substituting $\mathbf{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}}(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon})$ and parameterizing the network to predict $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$, $$\boldsymbol{\mu}_\theta(\mathbf{x}_t,t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right).$$ With fixed variance $\Sigma_\theta = \tilde\beta_t I$ (or $\beta_t I$ — both work well), the KL $L_{t-1}$ reduces to a weighted $\ell^2$ loss on noise, and Ho et al.'s simple training objective drops the weights: $$L_{\text{simple}} = \mathbb{E}_{t\sim\mathcal{U}\{1..T\},\,\mathbf{x}_0\sim q_{\text{data}},\,\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)}\!\left[\,\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\big(\sqrt{\bar\alpha_t}\mathbf{x}_0+\sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon},t\big)\big\|^2\,\right].$$ Empirically, the unweighted loss trains better than the true ELBO.
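To see $L_{\text{simple}}$ concretely, here is a toy 1-D instance in NumPy (all choices ours): with standard-Gaussian "data" and a linear "network" $\hat{\epsilon} = w\,x_t$ at a single fixed noise level, the minimizer is known in closed form, $w^\star = \mathbb{E}[\epsilon\,x_t] = \sqrt{1-\bar\alpha_t}$, so SGD on the simple loss can be checked end to end:

```python
import numpy as np

rng = np.random.default_rng(6)
abar = 0.5                  # ᾱ_t at one fixed timestep t
w = 0.0                     # toy "network": ε̂(x_t) = w · x_t
lr = 0.05

for step in range(3000):
    x0 = rng.normal(size=256)                 # x0 ~ N(0,1) toy data
    eps = rng.normal(size=256)
    x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps  # closed-form noising
    resid = w * x_t - eps                     # ε̂ − ε
    w -= lr * np.mean(2 * resid * x_t)        # SGD on ‖ε − ε̂‖²

# For N(0,1) data the optimal predictor is E[ε | x_t] = √(1−ᾱ_t)·x_t
print(w)   # ≈ √0.5 ≈ 0.707
```

In a real model the same regression runs jointly over all $t$, with $w$ replaced by a U-Net.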

In practice The network is almost always a U-Net with sinusoidal time embeddings, self-attention at low spatial resolutions, and group normalization. For text-conditioned image generation, cross-attention injects text-encoder features at every resolution.
Fig. 2DDPM training loop. For each minibatch: sample a datum $\mathbf{x}_0$, a timestep $t$, and Gaussian noise $\boldsymbol{\epsilon}$; compute $\mathbf{x}_t$ in closed form; ask the network to predict the noise; regress on $\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\|^2$.

09 · Score matching and denoising score matching

A second route to diffusion models, developed before DDPM, is to learn the score $\mathbf{s}_t(\mathbf{x}) := \nabla_\mathbf{x}\log p_t(\mathbf{x})$ directly.

Explicit score matching (Hyvärinen, 2005)

Minimize $J(\theta) = \tfrac{1}{2}\mathbb{E}_{p}[\|\mathbf{s}_\theta(\mathbf{x}) - \nabla\log p(\mathbf{x})\|^2]$. Integrating by parts eliminates the unknown $\nabla\log p$: $$J(\theta) = \mathbb{E}_p\!\left[\,\mathrm{tr}\!\big(\nabla \mathbf{s}_\theta(\mathbf{x})\big) + \tfrac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|^2\,\right] + \text{const}.$$ The trace of the Jacobian is expensive for high-dimensional $\mathbf{x}$ — impractical for images.

Denoising score matching (Vincent, 2011)

Add a fixed perturbation kernel $q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}};\mathbf{x},\sigma^2 I)$ and marginal $q_\sigma(\tilde{\mathbf{x}}) = \int q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})p(\mathbf{x})d\mathbf{x}$. Then $$J_{\text{ESM}}(\theta) = \tfrac{1}{2}\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}})}[\|\mathbf{s}_\theta(\tilde{\mathbf{x}})-\nabla\log q_\sigma(\tilde{\mathbf{x}})\|^2] = \tfrac{1}{2}\mathbb{E}_{p(\mathbf{x})q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})}\!\left[\big\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}}\log q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})\big\|^2\right] + C.$$ For Gaussian kernels $\nabla_{\tilde{\mathbf{x}}}\log q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = -(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2$, which is cheap — just the normalized noise. So denoising-score matching trains a denoiser.

Equivalence with noise prediction

In DDPM variables, $\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t\mid\mathbf{x}_0) = -(\mathbf{x}_t - \sqrt{\bar\alpha_t}\mathbf{x}_0)/(1-\bar\alpha_t) = -\boldsymbol{\epsilon}/\sqrt{1-\bar\alpha_t}$. Hence $$\mathbf{s}_\theta(\mathbf{x}_t,t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)}{\sqrt{1-\bar\alpha_t}}.$$ Noise prediction and score prediction are the same model up to a time-dependent rescaling. DDPM and score-based models are two vocabularies for one object.
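The identity can be verified on the one case where everything is analytic: a dataset consisting of a single point $x_0^\star$, for which $p_t = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0^\star,\,1-\bar\alpha_t)$ and the optimal noise predictor is exact. A NumPy sketch (constants ours; the score is cross-checked by finite differences):

```python
import numpy as np

abar, x0_star = 0.6, 1.3              # δ-data: p_t = N(√ᾱ x0*, 1-ᾱ)
mu, var = np.sqrt(abar) * x0_star, 1 - abar

def log_p_t(x):                       # exact noisy marginal for δ-data
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

x = 0.4
eps_opt = (x - mu) / np.sqrt(var)     # the noise that produced x (the optimal ε̂)
score_from_eps = -eps_opt / np.sqrt(var)   # claimed identity s = -ε̂ / √(1-ᾱ)

h = 1e-5                              # central finite difference of log p_t
score_fd = (log_p_t(x + h) - log_p_t(x - h)) / (2 * h)
print(score_from_eps, score_fd)       # agree
```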

10 · Tweedie's formula and the posterior-mean denoiser

Tweedie (Robbins, 1956) If $\mathbf{X} = \mathbf{X}_0 + \sigma \mathbf{Z}$ with $\mathbf{Z}\sim\mathcal{N}(0,I)$ and marginal $p_\mathbf{X}$, then $$\mathbb{E}[\mathbf{X}_0\mid \mathbf{X}=\mathbf{x}] = \mathbf{x} + \sigma^2\,\nabla_\mathbf{x}\log p_\mathbf{X}(\mathbf{x}).$$

The MMSE denoiser is the identity plus the (noisy) score, scaled by noise power. Translated to DDPM, with $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}$, $$\hat{\mathbf{x}}_0(\mathbf{x}_t) := \mathbb{E}[\mathbf{x}_0\mid \mathbf{x}_t] = \frac{1}{\sqrt{\bar\alpha_t}}\!\left(\mathbf{x}_t + (1-\bar\alpha_t)\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)\right) = \frac{1}{\sqrt{\bar\alpha_t}}\!\left(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right).$$

Why it matters Tweedie unifies three parametrizations a diffusion network can use: predicting the noise $\boldsymbol{\epsilon}$, the clean sample $\mathbf{x}_0$, or the score $\mathbf{s}$. All are linearly interconvertible. Modern training often uses the "$\mathbf{v}$-parametrization" $\mathbf{v} = \sqrt{\bar\alpha_t}\boldsymbol{\epsilon}-\sqrt{1-\bar\alpha_t}\mathbf{x}_0$ for better conditioning across noise levels (Salimans & Ho, 2022).
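Tweedie's formula can be checked directly in the conjugate-Gaussian case, where the posterior mean is available from Bayes' rule; a short sketch (constants ours):

```python
import numpy as np

m0, s0, sigma = 2.0, 0.5, 1.5         # X0 ~ N(m0, s0²),  X = X0 + σZ
x = 3.1                               # observed noisy value

# Tweedie: E[X0 | X=x] = x + σ² ∇log p_X(x), with marginal p_X = N(m0, s0² + σ²)
score = -(x - m0) / (s0**2 + sigma**2)
tweedie = x + sigma**2 * score

# Conjugate-Gaussian Bayes' rule gives the posterior mean directly
posterior_mean = (s0**2 * x + sigma**2 * m0) / (s0**2 + sigma**2)
print(tweedie, posterior_mean)        # identical
```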

11 · The continuous-time SDE framework (VP, VE, sub-VP)

Song et al. (2021) unified DDPM and score-based models as discretizations of continuous-time SDEs on $t\in[0,T]$ of the form $$d\mathbf{x} = \mathbf{f}(\mathbf{x},t)\,dt + g(t)\,d\mathbf{W}_t,$$ with reverse $$d\mathbf{x} = \big[\mathbf{f}(\mathbf{x},t) - g(t)^2\,\nabla\log p_t(\mathbf{x})\big]\,dt + g(t)\,d\bar{\mathbf{W}}_t.$$

| Family | Forward SDE | Marginal $p_{0t}(\mathbf{x}_t\mid\mathbf{x}_0)$ | Limit as $t\to T$ |
|---|---|---|---|
| VP-SDE (DDPM) | $d\mathbf{x} = -\tfrac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{W}$ | $\mathcal{N}\big(\mathbf{x}_0 e^{-\frac{1}{2}\int_0^t\beta(s)ds},\,(1-e^{-\int_0^t\beta(s)ds})I\big)$ | $\mathcal{N}(0,I)$ |
| VE-SDE (SMLD) | $d\mathbf{x} = \sqrt{\tfrac{d\sigma^2(t)}{dt}}\,d\mathbf{W}$ | $\mathcal{N}(\mathbf{x}_0,\,\sigma(t)^2 I)$ | $\mathcal{N}(\mathbf{x}_0,\sigma_{\max}^2 I)$ |
| sub-VP | $d\mathbf{x} = -\tfrac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)(1-e^{-2\int_0^t\beta(s)ds})}\,d\mathbf{W}$ | tighter variance than VP | $\mathcal{N}(0,I)$ |

VP is the OU with a time-varying coefficient — the DDPM forward process in continuous time. VE is pure noise addition with a growing variance, as in Song & Ermon's original noise-conditional score network (NCSN). Sub-VP was introduced for slightly tighter likelihood bounds.

Training minimizes a denoising score-matching loss weighted across noise levels: $$\mathcal{L}(\theta) = \mathbb{E}_{t\sim\mathcal{U}[0,T]}\,\mathbb{E}_{\mathbf{x}_0}\,\mathbb{E}_{\mathbf{x}_t\mid\mathbf{x}_0}\!\left[\lambda(t)\,\|\mathbf{s}_\theta(\mathbf{x}_t,t) - \nabla_{\mathbf{x}_t}\log p_{0t}(\mathbf{x}_t\mid\mathbf{x}_0)\|^2\right].$$ Common choices: $\lambda(t) = g(t)^2$ (likelihood weighting, giving an exact ELBO) or $\lambda(t) = 1$ (the "simple" DDPM loss, better FID).

12 · Probability-flow ODE and deterministic sampling (DDIM)

Probability-flow ODE For the SDE $d\mathbf{x}=\mathbf{f}\,dt+g\,d\mathbf{W}$ with marginal density $p_t$, the deterministic ODE $$\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}g(t)^2\,\nabla\log p_t(\mathbf{x})$$ has the same marginals $p_t$ as the SDE, though individual trajectories differ.

Proof sketch. Write Fokker–Planck as $\partial_t p_t = -\nabla\!\cdot\!(\mathbf{v}_t p_t)$ with effective velocity $\mathbf{v}_t(\mathbf{x}) = \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}g(t)^2\nabla\log p_t(\mathbf{x})$ (use $\tfrac{1}{2}g^2\nabla^2 p = \tfrac{1}{2}g^2\nabla\cdot(p\nabla\log p)$). Any continuity equation of this form is the density transport induced by the ODE $\dot{\mathbf{x}} = \mathbf{v}_t(\mathbf{x})$.

Fig. 3SDE and ODE reversals share the same marginals $p_t$ (shaded envelope) but travel different trajectories. The ODE is the deterministic backbone; DDIM discretizes it.

DDIM as a discretization

The denoising diffusion implicit model (DDIM; Song, Meng & Ermon, 2021) derives a non-Markovian family of samplers parameterized by $\eta\in[0,1]$: $$\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0(\mathbf{x}_t) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t) + \sigma_t\,\mathbf{z}, \quad \mathbf{z}\sim\mathcal{N}(0,I),$$ with $\sigma_t = \eta\sqrt{\tilde\beta_t}$ and $\hat{\mathbf{x}}_0$ the Tweedie denoiser. Setting $\eta=1$ recovers DDPM; $\eta=0$ gives a deterministic map — the Euler discretization of the probability-flow ODE. Deterministic DDIM is:

Initialize $\mathbf{x}_T\sim\mathcal{N}(0,I)$. For $t = T, T{-}1, \ldots, 1$, repeat:
  1. predict noise: $\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c)$ (apply CFG here);
  2. Tweedie estimate: $\hat{\mathbf{x}}_0 = \big(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\hat{\boldsymbol{\epsilon}}\big)/\sqrt{\bar\alpha_t}$;
  3. step down: $\mathbf{x}_{t-1} = \tilde{\boldsymbol{\mu}}(\hat{\mathbf{x}}_0, \hat{\boldsymbol{\epsilon}}) + \sigma_t\mathbf{z}$, with $\sigma_t = 0$ for deterministic DDIM and $\sigma_t = \sqrt{\tilde\beta_t}$ for DDPM.
At $t=0$, return the sample $\mathbf{x}_0$ (decode if latent).
Fig. 4Inference loop. Starting from pure noise $\mathbf{x}_T$, each step calls the network once, forms a clean-sample estimate via Tweedie, and moves to the next lower noise level. For latent diffusion, the final $\mathbf{x}_0$ is a latent and is decoded by $\mathcal{D}$ into an image.
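The whole loop can be exercised on the analytic toy case of a single-point dataset, where the exact noise predictor is known and deterministic DDIM should transport any initial noise to that data point. A NumPy sketch (schedule and names ours):

```python
import numpy as np

rng = np.random.default_rng(7)
T = 100
beta = np.linspace(1e-4, 0.05, T)
abar = np.cumprod(1 - beta)

x0_star = 1.7                         # toy dataset = one point, so ε* is exact

def eps_opt(x, t):                    # exact noise predictor for δ-data (0-indexed t)
    return (x - np.sqrt(abar[t]) * x0_star) / np.sqrt(1 - abar[t])

x = rng.normal()                      # x_T ~ N(0,1)  (ᾱ_T ≈ 0 only approximately here)
for t in range(T - 1, 0, -1):
    e = eps_opt(x, t)
    x0_hat = (x - np.sqrt(1 - abar[t]) * e) / np.sqrt(abar[t])        # Tweedie
    x = np.sqrt(abar[t - 1]) * x0_hat + np.sqrt(1 - abar[t - 1]) * e  # η = 0 step

print(x)   # ≈ x0_star: deterministic DDIM maps noise onto the data point
```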

13 · Conditional generation and classifier-free guidance

Conditional diffusion trains $p_\theta(\mathbf{x}_0\mid c)$ for condition $c$ (class label, text embedding, image). The naive approach is to feed $c$ as an extra network input: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t,c)$. But simply conditioning is not enough — text-to-image samples then tend to ignore the prompt.

Classifier guidance (Dhariwal & Nichol, 2021)

Bayes' rule on densities: $\nabla\log p_t(\mathbf{x}\mid c) = \nabla\log p_t(\mathbf{x}) + \nabla\log p_t(c\mid\mathbf{x})$. Train a classifier $p_\phi(c\mid\mathbf{x}_t)$ on noisy inputs and replace the score with $$\mathbf{s}_\theta(\mathbf{x}_t,t) + w\,\nabla_{\mathbf{x}_t}\log p_\phi(c\mid\mathbf{x}_t),$$ where $w>1$ exaggerates the classifier's effect. Strong but requires a separate noisy-image classifier.

Classifier-free guidance (Ho & Salimans, 2022)

Train a single model on both conditional and unconditional objectives, dropping $c$ to $\emptyset$ with probability $p_{\text{drop}}\approx 10\%$. At sampling time, combine: $$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t,t,c) = (1+w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t,c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t,\emptyset).$$ Interpreting via Bayes, this corresponds to sampling from a sharpened conditional $$\tilde p_t(\mathbf{x}\mid c) \propto p_t(\mathbf{x}\mid c)^{1+w}\,p_t(\mathbf{x})^{-w}.$$ Classifier-free guidance (CFG) is the dominant conditioning mechanism in Stable Diffusion, Imagen, DALL·E 3, Midjourney. Typical scales $w\in[5,15]$ for text-to-image; higher values trade diversity for fidelity to the prompt.

Fig. 5Classifier-free guidance. Left: the guided score is a linear extrapolation from the unconditional score toward the conditional score by factor $(1+w)$. Right: larger $w$ sharpens the effective conditional $\tilde p(\mathbf{x}\mid c)$, concentrating mass on prompt-consistent modes at the cost of diversity.
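As a sketch, the guidance combination itself is one line (function name ours; some codebases use the equivalent form $\boldsymbol{\epsilon}_u + s(\boldsymbol{\epsilon}_c - \boldsymbol{\epsilon}_u)$ with guidance scale $s = 1+w$, so conventions for "scale" differ by one):

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: ε̃ = (1+w)·ε_cond − w·ε_uncond."""
    return (1 + w) * np.asarray(eps_cond) - w * np.asarray(eps_uncond)

# w = 0 is the plain conditional model; larger w extrapolates further
# along the direction (ε_cond − ε_uncond).
e_c, e_u = np.array([0.2, -0.1]), np.array([0.5, 0.3])
print(cfg_combine(e_c, e_u, 0.0))   # → the conditional prediction itself
```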

14 · Latent diffusion and architectural notes

Running diffusion at $512^2{\times}3$ pixels is expensive. Latent diffusion models (LDMs; Rombach et al., 2022 — the Stable Diffusion architecture) push the problem into a perceptually compressed latent space.

  1. Stage 1 — autoencoder. A VQ- or KL-regularized variational autoencoder learns $\mathcal{E}:\mathbf{x}\to\mathbf{z}$ with compression factor $f$ (typically $8$), so a $512{\times}512$ RGB image becomes a $64{\times}64{\times}4$ latent. The decoder $\mathcal{D}:\mathbf{z}\to\hat{\mathbf{x}}$ is trained with an adversarial + perceptual loss.
  2. Stage 2 — diffusion in latent space. A U-Net is trained on the DDPM/VP objective applied to $\mathbf{z}_0 = \mathcal{E}(\mathbf{x}_0)$. Conditioning (text, pose, depth) enters via cross-attention.
  3. Sampling. Run DDIM/DPM-Solver in latent space, then decode.
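The compression arithmetic of the two stages can be made concrete with a shape-level sketch. The `encode` / `decode` functions here are hypothetical zero stand-ins for bookkeeping only (a real LDM uses a trained KL- or VQ-regularized autoencoder):

```python
import numpy as np

# Shape bookkeeping for the two-stage pipeline.
f, c_lat = 8, 4                       # compression factor, latent channels

def encode(x):                        # (H, W, 3) -> (H/f, W/f, c_lat)
    H, W, _ = x.shape
    return np.zeros((H // f, W // f, c_lat))

def decode(z):                        # inverse shape map
    h, w, _ = z.shape
    return np.zeros((h * f, w * f, 3))

x = np.zeros((512, 512, 3))
z = encode(x)                         # diffusion (U-Net + cross-attention) runs here
print(z.shape)                        # (64, 64, 4)
print(x.size // z.size)               # 48x fewer elements than pixel space
x_hat = decode(z)
```

Every U-Net forward pass therefore touches roughly 48× fewer elements than a pixel-space model at the same resolution.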

Architectural patterns

U-Net backbone

Residual ConvBlocks + self-attention at lower resolutions (16², 8²); sinusoidal time embedding added to each block.

Cross-attention

Query = spatial feature, Key/Value = text-encoder output (CLIP or T5). Injects the prompt at every resolution.

Diffusion Transformer (DiT)

Replace the U-Net with a pure transformer on patched latents; scales cleanly to high capacity (Peebles & Xie, 2023).

ControlNet / adapters

Freeze the base U-Net, train a parallel encoder that takes auxiliary inputs (edges, depth, pose) and is summed into the skip connections.

15Flow matching and rectified flow

Flow matching (Lipman et al., 2023) generalizes diffusion as learning a time-dependent vector field $\mathbf{v}_\theta(\mathbf{x},t)$ that generates a prescribed probability path $\{p_t\}_{t\in[0,1]}$ from noise $p_0$ to data $p_1$. The ODE $d\mathbf{x}/dt = \mathbf{v}_\theta(\mathbf{x},t)$ transports samples.

Conditional flow-matching loss Choose a coupling $(\mathbf{x}_0,\mathbf{x}_1)\sim\pi$ and an interpolating path $\mathbf{x}_t = \psi_t(\mathbf{x}_0,\mathbf{x}_1)$ with target field $\mathbf{u}_t(\mathbf{x}_t\mid \mathbf{x}_0,\mathbf{x}_1) = \dot\psi_t$. Minimize $$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t\sim\mathcal{U}[0,1],\,(\mathbf{x}_0,\mathbf{x}_1)\sim\pi}\left[\|\mathbf{v}_\theta(\mathbf{x}_t,t) - \dot\psi_t\|^2\right].$$ The resulting $\mathbf{v}_\theta$ generates the correct marginal path.

Rectified flow — the linear interpolant

The simplest choice: $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$ with $\pi = p_0\otimes p_{\text{data}}$. Target field is the constant $\mathbf{x}_1 - \mathbf{x}_0$. This is rectified flow (Liu, Gong & Liu, 2022) and the basis of Stable Diffusion 3 and Flux. Advantages: straight trajectories → few-step sampling, simpler conceptual picture than DDPM.
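The rectified-flow objective fits in a few lines. A sketch with NumPy, checked on a degenerate coupling where the oracle field is known (the `oracle` closure is illustrative, not a trained model):

```python
import numpy as np

def cfm_loss(v_theta, x0, x1, t):
    """Conditional flow-matching loss for the linear interpolant
    x_t = (1-t) x0 + t x1; the regression target is the constant x1 - x0."""
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1
    pred = v_theta(xt, t)
    return np.mean(np.sum((pred - (x1 - x0)) ** 2, axis=1))

# Sanity check on a degenerate coupling (all mass on one pair): the oracle
# constant field x1 - x0 achieves exactly zero loss.
rng = np.random.default_rng(0)
a = np.array([3.0, -1.0])
x0 = np.zeros((256, 2))                        # "noise" endpoint (degenerate here)
x1 = np.tile(a, (256, 1))                      # "data" endpoint
t = rng.uniform(size=256)
oracle = lambda xt, tt: np.tile(a, (len(xt), 1))
print(cfm_loss(oracle, x0, x1, t))  # → 0.0
```

In practice `v_theta` is a neural network and $(\mathbf{x}_0,\mathbf{x}_1)$ are independent noise/data draws; the loss shape is unchanged.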

Relation to diffusion

Diffusion is a special case of flow matching with a particular (Gaussian) probability path. Specifically, the VP-SDE marginals can be written as $\mathbf{x}_t = \alpha_t \mathbf{x}_1 + \sigma_t \mathbf{x}_0$ with $\mathbf{x}_0\sim\mathcal{N}(0,I)$ and $\alpha_t^2 + \sigma_t^2 = 1$; the probability-flow ODE velocity equals the conditional-flow-matching target under this path. Training a score model or a flow-matching model is largely a matter of parametrization.

Fig. 6Rectified flow learns a velocity field whose integral curves are straight lines from noise to data. Because the conditional target is a constant $(\mathbf{x}_1-\mathbf{x}_0)$, few-step integrators remain accurate — the basis of Stable Diffusion 3 and Flux.

16Samplers and numerical integration

Sampling from a trained diffusion model means integrating an SDE or ODE from $t=T$ down to $t=0$. Quality is measured in both image fidelity (FID, CLIP score) and compute — typically reported as number of function evaluations (NFE), i.e. forward passes of the U-Net.

DDPM (ancestral)

Reverse SDE, Euler–Maruyama. $T\approx 1000$ steps. Strong quality, high NFE.

DDIM ($\eta=0$)

Euler on the probability-flow ODE. 20–50 NFE. Deterministic and invertible.

Heun / RK2 (EDM)

Karras et al. (2022): Heun's second-order predictor–corrector on a rescaled VE-SDE. Around 35 NFE for near-state-of-the-art FID.

DPM-Solver (2/3)

Lu et al. (2022): exponential-integrator solvers exploiting the semilinear structure of VP/VE ODEs. 10–20 NFE.

Consistency models

Song et al. (2023): train a one-step model that maps any noise level directly to $\mathbf{x}_0$. Single-step (or few-step) generation at reasonable quality.

Distillation

Progressive / adversarial distillation (Salimans & Ho, 2022; SDXL-Turbo): compress an $N$-step teacher into an $N/2$-step or one-step student.

Rule of thumb For likelihood, use the SDE sampler with likelihood weighting; for FID / human preference, use the probability-flow ODE with a higher-order solver. CFG scales interact with sampler choice — higher $w$ is more brittle with aggressive (low-NFE) solvers.
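A minimal deterministic DDIM loop over a strided time grid, checked on a toy case where the optimal noise prediction is known in closed form (data concentrated at a single point `mu`); the linear schedule values are illustrative:

```python
import numpy as np

def ddim_sample(eps_theta, x_T, alpha_bar, ts):
    """Deterministic DDIM (eta = 0): at each step, form the Tweedie estimate
    x0_hat and re-noise it to the previous marginal with the same eps."""
    x = x_T
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_theta(x, t)
        x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1 - alpha_bar[t_prev]) * eps
    return x

T = 1000
alpha_bar = np.concatenate([[1.0], np.cumprod(1 - np.linspace(1e-4, 0.02, T))])
mu = np.array([2.0, -1.0])
# Optimal noise prediction for a point mass at mu (x0_hat = mu exactly).
eps_exact = lambda x, t: (x - np.sqrt(alpha_bar[t]) * mu) / np.sqrt(1 - alpha_bar[t])

ts = np.linspace(T, 0, 51).astype(int)          # 50 NFE instead of 1000
x_T = np.random.default_rng(0).standard_normal(2)
x_0_sample = ddim_sample(eps_exact, x_T, alpha_bar, ts)
print(x_0_sample)   # ≈ [ 2. -1.]
```

In this toy the strided grid introduces no error because $\hat{\mathbf{x}}_0$ is exact at every step; with a learned $\boldsymbol{\epsilon}_\theta$, the 20–50-step grids above are the usual quality/NFE trade-off.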

17Exercises · Diffusion models

Exercise 6 · Closed-form forward marginal★★
Given $q(\mathbf{x}_t\mid \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}\mathbf{x}_{t-1},\,\beta_t I)$ with $\alpha_t=1-\beta_t$, prove by induction that $q(\mathbf{x}_t\mid\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\mathbf{x}_0,(1-\bar\alpha_t)I)$, where $\bar\alpha_t = \prod_{s\le t}\alpha_s$.
Show solution

Base case. For $t=1$: $q(\mathbf{x}_1\mid\mathbf{x}_0) = \mathcal{N}(\sqrt{\alpha_1}\mathbf{x}_0,\beta_1 I) = \mathcal{N}(\sqrt{\bar\alpha_1}\mathbf{x}_0,(1-\bar\alpha_1)I)$ since $\bar\alpha_1=\alpha_1$ and $\beta_1 = 1-\alpha_1$.

Induction. Assume $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\boldsymbol{\epsilon}_1$ with $\boldsymbol{\epsilon}_1\sim\mathcal{N}(0,I)$. Then $\mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1}+\sqrt{\beta_t}\boldsymbol{\epsilon}_2$ with independent $\boldsymbol{\epsilon}_2\sim\mathcal{N}(0,I)$, so $$\mathbf{x}_t = \sqrt{\alpha_t\bar\alpha_{t-1}}\mathbf{x}_0 + \sqrt{\alpha_t(1-\bar\alpha_{t-1})}\boldsymbol{\epsilon}_1 + \sqrt{\beta_t}\boldsymbol{\epsilon}_2.$$ The last two independent Gaussians combine into a single Gaussian with variance $\alpha_t(1-\bar\alpha_{t-1}) + \beta_t = \alpha_t - \alpha_t\bar\alpha_{t-1} + 1 - \alpha_t = 1-\bar\alpha_t$. And $\alpha_t\bar\alpha_{t-1}=\bar\alpha_t$. Hence $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}$, completing the induction.
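The induction can also be confirmed numerically by composing the one-step Gaussians and tracking the mean coefficient and variance (schedule values are illustrative):

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1 - betas
alpha_bar = np.cumprod(alphas)

coef, var = 1.0, 0.0                      # x_t = coef * x_0 + N(0, var)
for a, b in zip(alphas, betas):
    coef *= np.sqrt(a)                    # mean scales by sqrt(alpha_t) each step
    var = a * var + b                     # variance: alpha_t * var + beta_t
print(np.isclose(coef, np.sqrt(alpha_bar[-1])),
      np.isclose(var, 1 - alpha_bar[-1]))  # → True True
```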

Exercise 7 · Tractable posterior★★★
Derive $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) = \mathcal{N}(\tilde{\boldsymbol{\mu}}_t,\tilde\beta_t I)$ with the coefficients of §8. Use Bayes' rule and complete the square.
Show solution

By Bayes, $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) \propto q(\mathbf{x}_t\mid\mathbf{x}_{t-1})\,q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)$ — the Markov property eliminates $\mathbf{x}_0$ from the first factor. Both factors are Gaussian, so the product is Gaussian.

Log-density ignoring constants: $$-\tfrac{1}{2\beta_t}\|\mathbf{x}_t - \sqrt{\alpha_t}\mathbf{x}_{t-1}\|^2 - \tfrac{1}{2(1-\bar\alpha_{t-1})}\|\mathbf{x}_{t-1} - \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0\|^2.$$ Collecting the quadratic term in $\mathbf{x}_{t-1}$: $-\tfrac{1}{2}\big(\tfrac{\alpha_t}{\beta_t} + \tfrac{1}{1-\bar\alpha_{t-1}}\big)\|\mathbf{x}_{t-1}\|^2$. The coefficient simplifies using $\alpha_t(1-\bar\alpha_{t-1})+\beta_t = 1-\bar\alpha_t$ to give precision $\frac{1-\bar\alpha_t}{\beta_t(1-\bar\alpha_{t-1})}$, i.e. variance $\tilde\beta_t = \frac{\beta_t(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$.

The linear term is $\big(\tfrac{\sqrt{\alpha_t}}{\beta_t}\mathbf{x}_t + \tfrac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}\mathbf{x}_0\big)^\top\mathbf{x}_{t-1}$. Multiply by $\tilde\beta_t$: $$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\mathbf{x}_t + \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\mathbf{x}_0.$$ This is the mean of the tractable posterior, a convex combination of the noisy sample and the clean one.
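The completed square can be checked numerically in one dimension: build the posterior precision and mean directly from the two Gaussian factors and compare with the closed-form coefficients (schedule values are illustrative):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 100)
alphas = 1 - betas
abar = np.cumprod(alphas)
t = 50                                    # arbitrary interior step
b, a, ab_t, ab_prev = betas[t], alphas[t], abar[t], abar[t - 1]
x_t, x_0 = 0.7, -0.3

# Direct completion of the square from the two quadratic forms.
prec = a / b + 1 / (1 - ab_prev)
mean = (np.sqrt(a) / b * x_t + np.sqrt(ab_prev) / (1 - ab_prev) * x_0) / prec

# Closed-form coefficients from the derivation above.
beta_tilde = b * (1 - ab_prev) / (1 - ab_t)
mu_tilde = (np.sqrt(a) * (1 - ab_prev) * x_t + np.sqrt(ab_prev) * b * x_0) / (1 - ab_t)
print(np.isclose(1 / prec, beta_tilde), np.isclose(mean, mu_tilde))  # → True True
```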

Exercise 8 · Noise prediction ↔ score★★
Show that $\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t\mid\mathbf{x}_0) = -\boldsymbol{\epsilon}/\sqrt{1-\bar\alpha_t}$, and conclude $\mathbf{s}_\theta(\mathbf{x}_t,t) = -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)/\sqrt{1-\bar\alpha_t}$.
Show solution

$q(\mathbf{x}_t\mid\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\mathbf{x}_0, (1-\bar\alpha_t)I)$, so $\log q = -\tfrac{1}{2(1-\bar\alpha_t)}\|\mathbf{x}_t-\sqrt{\bar\alpha_t}\mathbf{x}_0\|^2 + \text{const}$, and $$\nabla_{\mathbf{x}_t}\log q = -\frac{\mathbf{x}_t - \sqrt{\bar\alpha_t}\mathbf{x}_0}{1-\bar\alpha_t} = -\frac{\sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}}{1-\bar\alpha_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar\alpha_t}},$$ using $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}$. Taking the conditional expectation over $\mathbf{x}_0\mid\mathbf{x}_t$ gives the marginal score, and the model's noise prediction is trained to match the posterior mean of $\boldsymbol{\epsilon}$, so $\mathbf{s}_\theta = -\boldsymbol{\epsilon}_\theta/\sqrt{1-\bar\alpha_t}$.

Exercise 9 · Tweedie's formula★★
Prove Tweedie's formula in the scalar case: if $X = X_0 + \sigma Z$ with $Z\sim\mathcal{N}(0,1)$ and marginal density $p_X$, then $\mathbb{E}[X_0\mid X=x] = x + \sigma^2\,p_X'(x)/p_X(x)$.
Show solution

Let $p_{X_0}$ be the density of $X_0$ and $\phi_\sigma$ the Gaussian kernel with variance $\sigma^2$. Then $$p_X(x) = \int p_{X_0}(x_0)\,\phi_\sigma(x-x_0)\,dx_0.$$ Differentiating under the integral and using $\phi_\sigma'(u) = -\frac{u}{\sigma^2}\phi_\sigma(u)$, $$p_X'(x) = \int p_{X_0}(x_0)\cdot\!\left(-\tfrac{x-x_0}{\sigma^2}\right)\phi_\sigma(x-x_0)\,dx_0 = -\tfrac{1}{\sigma^2}\big(x\,p_X(x) - \int x_0\,p_{X,X_0}(x,x_0)dx_0\big).$$ Dividing by $p_X(x)$: $p_X'(x)/p_X(x) = -\tfrac{1}{\sigma^2}\big(x - \mathbb{E}[X_0\mid X=x]\big)$, i.e. $\mathbb{E}[X_0\mid X=x] = x + \sigma^2\,p_X'(x)/p_X(x) = x + \sigma^2\,\partial_x\log p_X(x)$. The MMSE denoiser is the identity plus the score times the noise power.
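The formula can be checked numerically on a two-point mixture $X_0\in\{-a,+a\}$ with equal weights, where the posterior mean has the closed form $\mathbb{E}[X_0\mid X=x] = a\tanh(ax/\sigma^2)$:

```python
import numpy as np

a, sigma = 1.5, 0.7
phi = lambda u: np.exp(-u**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def tweedie(x, h=1e-5):
    """x + sigma^2 * d/dx log p_X(x), with the score by central difference."""
    p = lambda x: 0.5 * phi(x - a) + 0.5 * phi(x + a)
    score = (np.log(p(x + h)) - np.log(p(x - h))) / (2 * h)
    return x + sigma**2 * score

xs = np.linspace(-3, 3, 7)
err = np.max(np.abs(tweedie(xs) - a * np.tanh(a * xs / sigma**2)))
print(err)   # small (finite-difference error only)
```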

Exercise 10 · ELBO reduces to $\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\|^2$★★★
Using the parametrization $\boldsymbol{\mu}_\theta(\mathbf{x}_t,t) = \frac{1}{\sqrt{\alpha_t}}\big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\boldsymbol{\epsilon}_\theta\big)$ and fixed variance $\Sigma_\theta = \tilde\beta_t I$, show that the KL term $L_{t-1}$ equals $\frac{\beta_t^2}{2\alpha_t(1-\bar\alpha_t)\tilde\beta_t}\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\|^2$ up to constants. Argue that dropping the weight gives the "simple" loss.
Show solution

KL of two Gaussians with equal covariance $\Sigma = \tilde\beta_t I$: $D_{\mathrm{KL}} = \frac{1}{2\tilde\beta_t}\|\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta\|^2$.

From Ex 7, $\tilde{\boldsymbol{\mu}}_t$ can also be written in terms of $\mathbf{x}_t$ and $\boldsymbol{\epsilon}$ (substitute $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon})/\sqrt{\bar\alpha_t}$): $$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\boldsymbol{\epsilon}\right).$$ Hence $\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta = \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}(\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon})$, and $$L_{t-1} = \frac{\beta_t^2}{2\alpha_t(1-\bar\alpha_t)\tilde\beta_t}\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\|^2.$$ Dropping the time-dependent prefactor leaves $L_{\text{simple}} = \mathbb{E}\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\|^2$. Empirically, the uniform weighting improves perceptual quality: the ELBO weights blow up at the low-noise (small $t$) steps, so dropping them shifts capacity toward the harder, perceptually important high-noise steps.
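The resulting training step is short. A sketch of one minibatch of $L_{\text{simple}}$, with a zero function as a hypothetical placeholder for the network (its loss is just the variance of $\boldsymbol{\epsilon}$, i.e. about 1):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bar = np.cumprod(1 - np.linspace(1e-4, 0.02, T))

def simple_loss(eps_theta, x0):
    """One minibatch of L_simple: sample t and eps, form x_t via the
    closed-form marginal, regress the model output onto eps."""
    t = rng.integers(0, T, size=len(x0))        # uniform timestep per example
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t][:, None]
    xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
    return np.mean((eps_theta(xt, t) - eps) ** 2)

x0 = rng.standard_normal((64, 8))               # stand-in data batch
zero_model = lambda xt, t: np.zeros_like(xt)    # hypothetical placeholder network
loss = simple_loss(zero_model, x0)
print(loss)                                     # ≈ 1 for the zero "model"
```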

Exercise 11 · Probability-flow ODE preserves marginals★★★
Show that the ODE $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}g(t)^2\nabla\log p_t(\mathbf{x})$ has the same time-marginals $p_t$ as the SDE $d\mathbf{x} = \mathbf{f}\,dt + g\,d\mathbf{W}$, via the Fokker–Planck equation.
Show solution

The SDE's Fokker–Planck equation: $\partial_t p_t = -\nabla\!\cdot\!(\mathbf{f} p_t) + \tfrac{1}{2}g^2 \nabla^2 p_t$. Rewrite the Laplacian as a divergence using $\nabla^2 p = \nabla\cdot(\nabla p) = \nabla\cdot(p\,\nabla\log p)$: $$\partial_t p_t = -\nabla\!\cdot\!(\mathbf{f} p_t) + \tfrac{1}{2}g^2\nabla\!\cdot\!(p_t \nabla\log p_t) = -\nabla\!\cdot\!\big((\mathbf{f} - \tfrac{1}{2}g^2 \nabla\log p_t) p_t\big).$$ The right-hand side is the continuity equation for the deterministic flow $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}g^2\nabla\log p_t(\mathbf{x})$. Both the ODE and SDE therefore evolve the same density $p_t$, even though individual trajectories differ.
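An Ornstein–Uhlenbeck sanity check of this result: for the VP-SDE with constant $\beta$ and Gaussian initial law $\mathcal{N}(0,v_0)$, $p_t$ stays Gaussian with variance $v_t = 1+(v_0-1)e^{-\beta t}$, so the score is $-x/v_t$ in closed form. Pushing samples through the probability-flow ODE (Euler steps) should reproduce exactly that variance:

```python
import numpy as np

beta, v0, T, n_steps = 1.0, 4.0, 2.0, 1000
rng = np.random.default_rng(1)
x = np.sqrt(v0) * rng.standard_normal(50_000)   # samples from p_0

dt = T / n_steps
for k in range(n_steps):
    t = k * dt
    v_t = 1 + (v0 - 1) * np.exp(-beta * t)      # exact Gaussian variance
    score = -x / v_t                            # exact score of p_t
    x = x + dt * (-0.5 * beta * x - 0.5 * beta * score)   # PF-ODE Euler step

v_T = 1 + (v0 - 1) * np.exp(-beta * T)
print(np.var(x), v_T)   # the sample variance should match v_T
```

The trajectories here are deterministic, yet the ensemble variance tracks the SDE's marginal law, which is exactly the claim of the exercise.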

Exercise 12 · DDIM update (deterministic)★★★
Starting from the Tweedie denoiser $\hat{\mathbf{x}}_0(\mathbf{x}_t) = (\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}_\theta)/\sqrt{\bar\alpha_t}$ and the constraint that the DDIM step lands on the correct marginal $q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)$, derive the deterministic update $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\hat{\mathbf{x}}_0 + \sqrt{1-\bar\alpha_{t-1}}\boldsymbol{\epsilon}_\theta$.
Show solution

On the training path, $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\boldsymbol{\epsilon}$ for some standard Gaussian $\boldsymbol{\epsilon}$. If we want the step from $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$ to be a deterministic map that respects the right marginals, plug in the model's posterior estimates:

  • $\mathbf{x}_0 \to \hat{\mathbf{x}}_0 = (\mathbf{x}_t-\sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}_\theta)/\sqrt{\bar\alpha_t}$ (Tweedie);
  • $\boldsymbol{\epsilon}\to\boldsymbol{\epsilon}_\theta$ (use the same noise, since it is the model's estimate of the added Gaussian direction at $\mathbf{x}_t$).

Then $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\hat{\mathbf{x}}_0 + \sqrt{1-\bar\alpha_{t-1}}\boldsymbol{\epsilon}_\theta$. This is the deterministic DDIM update. Adding stochasticity $\sigma_t \mathbf{z}$ recovers the general $\eta$-parametrized DDIM family and, at $\sigma_t = \sqrt{\tilde\beta_t}$, the DDPM sampler.

Exercise 13 · CFG as score modification★★
Show that the classifier-free-guidance combination $\tilde{\boldsymbol{\epsilon}} = (1+w)\boldsymbol{\epsilon}_\theta(\mathbf{x},c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x},\emptyset)$ corresponds to sampling from the tilted density $\tilde p_t(\mathbf{x}\mid c) \propto p_t(\mathbf{x}\mid c)^{1+w}p_t(\mathbf{x})^{-w}$.
Show solution

Using $\mathbf{s} = -\boldsymbol{\epsilon}/\sqrt{1-\bar\alpha_t}$, the same linear combination of scores is $$\tilde{\mathbf{s}}(\mathbf{x}) = (1+w)\,\nabla\log p_t(\mathbf{x}\mid c) - w\,\nabla\log p_t(\mathbf{x}) = \nabla\log\big[p_t(\mathbf{x}\mid c)^{1+w}\,p_t(\mathbf{x})^{-w}\big].$$ So the guided sampler's score is the gradient of $\log\tilde p_t$ with $\tilde p_t \propto p_t(\mathbf{x}\mid c)^{1+w}p_t(\mathbf{x})^{-w}$. Using Bayes $p_t(\mathbf{x}\mid c) \propto p_t(c\mid\mathbf{x})p_t(\mathbf{x})$, we get $\tilde p_t(\mathbf{x}\mid c) \propto p_t(c\mid\mathbf{x})^{1+w}p_t(\mathbf{x})$ — a sharpened conditional, stretching the likelihood ratio. Large $w$ collapses the sample toward the conditional mode, at the cost of diversity.

Exercise 14 · Rectified flow and the conditional vector field★★
For the linear interpolant $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$ with $\mathbf{x}_0\sim\mathcal{N}(0,I)$, $\mathbf{x}_1\sim p_{\text{data}}$, compute the conditional target field $\mathbf{u}_t(\mathbf{x}_t\mid \mathbf{x}_0,\mathbf{x}_1)$ and write the conditional-flow-matching loss.
Show solution

$\mathbf{u}_t = d\mathbf{x}_t/dt = \mathbf{x}_1 - \mathbf{x}_0$, constant in $t$. The CFM loss becomes $$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t\sim\mathcal{U}[0,1]}\,\mathbb{E}_{\mathbf{x}_0,\mathbf{x}_1}\big\|\mathbf{v}_\theta((1-t)\mathbf{x}_0+t\mathbf{x}_1, t) - (\mathbf{x}_1-\mathbf{x}_0)\big\|^2.$$ This is the rectified-flow training objective. Sampling: integrate $\dot{\mathbf{x}}=\mathbf{v}_\theta(\mathbf{x},t)$ from $\mathbf{x}(0)\sim\mathcal{N}(0,I)$ to $t=1$. Compared to DDPM-style parametrizations, the target field is constant along the trajectory, which empirically makes few-step sampling much easier — the path is approximately a straight line.

Exercise 15 · Reverse-SDE form of VP★★
Write out the reverse SDE and the probability-flow ODE for the VP-SDE $d\mathbf{x} = -\tfrac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{W}_t$, expressed in terms of the learned noise prediction $\boldsymbol{\epsilon}_\theta(\mathbf{x},t)$.
Show solution

Here $\mathbf{f}(\mathbf{x},t) = -\tfrac{1}{2}\beta(t)\mathbf{x}$ and $g(t) = \sqrt{\beta(t)}$. The reverse SDE (Anderson): $$d\mathbf{x} = \!\left[-\tfrac{1}{2}\beta(t)\mathbf{x} - \beta(t)\mathbf{s}_\theta(\mathbf{x},t)\right]\!dt + \sqrt{\beta(t)}\,d\bar{\mathbf{W}}_t.$$ Probability-flow ODE: halve the coefficient of the score term and drop the noise: $\dot{\mathbf{x}} = -\tfrac{1}{2}\beta(t)\mathbf{x} - \tfrac{1}{2}\beta(t)\mathbf{s}_\theta(\mathbf{x},t)$.

Substitute $\mathbf{s}_\theta = -\boldsymbol{\epsilon}_\theta/\sqrt{1-\bar\alpha_t}$ (with $\bar\alpha_t = e^{-\int_0^t\beta(s)ds}$ in continuous time): $$d\mathbf{x} = \!\left[-\tfrac{1}{2}\beta(t)\mathbf{x} + \frac{\beta(t)}{\sqrt{1-\bar\alpha_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x},t)\right]\!dt + \sqrt{\beta(t)}\,d\bar{\mathbf{W}}_t.$$ This is the continuous-time counterpart of the DDPM ancestral sampler. Discretizing the corresponding ODE with Euler on a non-uniform time grid reproduces deterministic DDIM.
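This reverse SDE can be simulated directly with Euler–Maruyama. A sketch using the exact score for data concentrated at a single point `mu` (so $\mathbf{s}(x,t) = -(x-\sqrt{\bar\alpha_t}\,\mu)/(1-\bar\alpha_t)$ in closed form); samples started from $\mathcal{N}(0,1)$ at $t=1$ should concentrate near `mu` as $t\to 0$:

```python
import numpy as np

b0, b1, mu = 0.1, 20.0, 1.5                     # linear beta(t) schedule, data point
beta = lambda t: b0 + (b1 - b0) * t
abar = lambda t: np.exp(-(b0 * t + 0.5 * (b1 - b0) * t**2))

rng = np.random.default_rng(0)
n, n_steps, t_min = 10_000, 1000, 1e-2
x = rng.standard_normal(n)                      # x_T ~ N(0, 1), since abar(1) ≈ 0
ts = np.linspace(1.0, t_min, n_steps + 1)
for t, t_next in zip(ts[:-1], ts[1:]):
    dt = t - t_next
    score = -(x - np.sqrt(abar(t)) * mu) / (1 - abar(t))   # exact point-mass score
    drift = -0.5 * beta(t) * x - beta(t) * score           # reverse-SDE drift
    x = x - dt * drift + np.sqrt(beta(t) * dt) * rng.standard_normal(n)
print(x.mean(), x.std())   # mean ≈ 1.5 with small spread
```

Integration stops at a small `t_min` rather than exactly $t=0$, since the score stiffens as $1-\bar\alpha_t\to 0$; real samplers handle the last step specially for the same reason.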