From Brownian motion and stochastic calculus to score matching, DDPM, DDIM, classifier-free guidance and flow matching — the working toolkit behind modern generative image models, distilled and paired with worked exercises.
Brownian motion, Itô calculus, and the Fokker–Planck machinery that makes diffusion generative models legible.
Before we write any equation, fix the three pieces of vocabulary that everything below rests on.
A Brownian path is the single realized curve $t\mapsto W_t(\omega)$ for one fixed outcome $\omega$. It is the picture you draw when you integrate the SDE once. Three striking facts — all holding almost surely: the path is continuous everywhere but differentiable nowhere; it has infinite total variation on every interval, however short; and its quadratic variation over $[0,t]$ is exactly $t$, i.e. $\sum_i (W_{t_{i+1}}-W_{t_i})^2\to t$ as the partition mesh shrinks.
Brownian scaling — $W_{ct}\overset{d}{=}\sqrt{c}\,W_t$ for every $c>0$ — is why $(dW_t)^2 \sim dt$, not $dt^2$: over a short interval $dt$, the increment has standard deviation $\sqrt{dt}$, so its square is of order $dt$, exactly one power below what classical calculus would give.
Most stochastic-calculus statements refer to "information available by time $t$". That is formalized as a filtration.
Given an adapted process $\phi_t$ with $\mathbb{E}\!\int_0^t\phi_s^2\,ds<\infty$, the Itô integral $\int_0^t\phi_s\,dW_s$ is defined as the $L^2$-limit (i.e. $\mathbb{E}|S_n - S|^2\to 0$) of left-endpoint Riemann sums, $$S_n := \sum_{i=0}^{n-1}\phi_{t_i}\big(W_{t_{i+1}}-W_{t_i}\big),\qquad\text{mesh}\to 0.$$ Evaluating at the left endpoint is what makes the integrand "not peek at the future noise $W_{t_{i+1}}$", and what makes the resulting integral adapted and a martingale. Evaluating at the midpoint instead defines the Stratonovich integral, which obeys the ordinary chain rule but loses the martingale property and has a nonzero mean — less convenient for probability but often preferred in physics and geometric mechanics.
A $d$-dimensional standard Brownian motion $\mathbf{W}_t = (W_t^{(1)},\ldots,W_t^{(d)})$ has independent coordinate Brownian motions. The increment covariance is $\mathbb{E}[d\mathbf{W}_t d\mathbf{W}_t^\top] = I_d\,dt$. For images we treat the pixel array as a flat vector in $\mathbb{R}^d$ and use $d$-dimensional Brownian noise.
An Itô stochastic differential equation (SDE) in $\mathbb{R}^d$ is $$d\mathbf{X}_t = \mathbf{f}(\mathbf{X}_t,t)\,dt + \mathbf{G}(\mathbf{X}_t,t)\,d\mathbf{W}_t,$$ with drift $\mathbf{f}:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d$ and diffusion matrix $\mathbf{G}:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^{d\times k}$ driven by a $k$-dimensional Brownian motion. Existence and uniqueness hold under Lipschitz and linear-growth conditions on $\mathbf{f}$ and $\mathbf{G}$.
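To make the definition concrete, here is a minimal Euler–Maruyama integrator — the simplest strong scheme for Itô SDEs — demonstrated on the OU process treated just below. A sketch: the step count, parameters, and tolerances are illustrative, not a production solver.

```python
import numpy as np

def euler_maruyama(f, g, x0, t_grid, rng):
    """Simulate dX = f(X,t) dt + g(X,t) dW along t_grid; returns terminal states."""
    x = np.array(x0, dtype=float)
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)  # increments ~ N(0, dt)
        x = x + f(x, t0) * dt + g(x, t0) * dW
    return x

# Demo on the OU process dX = -theta X dt + sigma dW started at x0 = 1:
theta, sigma = 1.0, 0.5
rng = np.random.default_rng(0)
xT = euler_maruyama(lambda x, t: -theta * x, lambda x, t: sigma,
                    np.ones((100_000, 1)), np.linspace(0.0, 2.0, 401), rng)
# Compare against the OU closed form derived below:
# mean e^{-theta t} x0, variance sigma^2/(2 theta) (1 - e^{-2 theta t}).
```

At $t=2$ the sample mean and variance should be close to $e^{-2}\approx 0.135$ and $\tfrac{0.25}{2}(1-e^{-4})\approx 0.123$.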
For scalar $X_t$ with $dX_t = \mu\,dt + \sigma\,dW_t$ and $\phi(x,t)\in C^{2,1}$, $$d\phi(X_t,t) = \left(\phi_t + \mu\,\phi_x + \tfrac{1}{2}\sigma^2\,\phi_{xx}\right)dt + \sigma\,\phi_x\,dW_t.$$
Ordinary calculus would give $d\phi = \phi_x\,dX$. But $(dX_t)^2 = \sigma^2(dW_t)^2 = \sigma^2\,dt$ is non-negligible, producing the convexity correction $\tfrac{1}{2}\sigma^2\phi_{xx}\,dt$. A convex $\phi$ grows in expectation even along a martingale $X_t$ — this is the content of Jensen's inequality made dynamical.
Two SDEs dominate both quantitative finance and diffusion generative modeling: they admit closed-form solutions and Gaussian marginals.
The Ornstein–Uhlenbeck (OU) process $$dX_t = -\theta X_t\,dt + \sigma\,dW_t, \qquad \theta>0,$$ is mean-reverting. Multiplying through by $e^{\theta t}$ and applying Itô, $$d(e^{\theta t}X_t) = e^{\theta t}\sigma\,dW_t \implies X_t = e^{-\theta t}X_0 + \sigma\int_0^t e^{-\theta(t-s)}\,dW_s.$$ The Itô integral is a zero-mean Gaussian with variance $\frac{\sigma^2}{2\theta}(1-e^{-2\theta t})$, so $$X_t\mid X_0 \sim \mathcal{N}\!\left(e^{-\theta t}X_0,\ \tfrac{\sigma^2}{2\theta}(1-e^{-2\theta t})\right).$$ Stationary distribution: $\mathcal{N}(0,\sigma^2/(2\theta))$.
OU is the backbone of the DDPM variance-preserving SDE — only the coefficients are made time-dependent so the stationary distribution is the standard Gaussian.
Geometric Brownian motion (GBM): $dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$. Apply Itô to $\log S_t$: $$d(\log S_t) = (\mu - \tfrac{1}{2}\sigma^2)\,dt + \sigma\,dW_t \implies S_t = S_0 \exp\!\big((\mu-\tfrac{1}{2}\sigma^2)t + \sigma W_t\big).$$ Lognormal marginals. The foundation of Black–Scholes option pricing.
If $\mathbf{X}_t$ solves $d\mathbf{X}_t = \mathbf{f}\,dt + \mathbf{G}\,d\mathbf{W}_t$ and $p_t(\mathbf{x})$ is the density of $\mathbf{X}_t$, then $p_t$ evolves by the Fokker–Planck (forward Kolmogorov) equation $$\partial_t p_t = -\nabla\!\cdot\!(\mathbf{f}\,p_t) + \tfrac{1}{2}\nabla\!\cdot\!\nabla\!\cdot\!(\mathbf{G}\mathbf{G}^\top p_t),$$ where the double divergence means $\sum_{ij}\partial_i\partial_j\big[(\mathbf{G}\mathbf{G}^\top)_{ij}p_t\big]$. It is a deterministic PDE governing the flow of probability mass.
A stationary density $p_\infty$ satisfies $\partial_t p_\infty = 0$. For the OU process $dX = -\theta X\,dt + \sigma\,dW$, $$0 = \partial_x(\theta x\,p_\infty) + \tfrac{1}{2}\sigma^2\partial_x^2 p_\infty,$$ whose solution is $p_\infty(x) \propto \exp(-\theta x^2/\sigma^2) = \mathcal{N}(0,\sigma^2/(2\theta))$, recovering the result above.
The expectation $u(\mathbf{x},t)=\mathbb{E}[\phi(\mathbf{X}_T)\mid \mathbf{X}_t=\mathbf{x}]$ solves the backward equation $$\partial_t u + \mathbf{f}^\top\nabla u + \tfrac{1}{2}\mathrm{tr}(\mathbf{G}\mathbf{G}^\top\nabla^2 u) = 0, \quad u(\mathbf{x},T)=\phi(\mathbf{x}).$$ This is the Feynman–Kac link between PDEs and stochastic expectations — it underlies Black–Scholes pricing and, more generally, the control-as-inference view of diffusion models.
Anderson (1982) showed that running the forward SDE $d\mathbf{X}_t = \mathbf{f}\,dt + g\,d\mathbf{W}_t$ backward in time is again a diffusion: $$d\mathbf{X}_t = \big[\mathbf{f}(\mathbf{X}_t,t) - g(t)^2\,\nabla\log p_t(\mathbf{X}_t)\big]\,dt + g(t)\,d\bar{\mathbf{W}}_t,$$ with $\bar{\mathbf{W}}_t$ a reverse-time Brownian motion. The extra term $g^2\nabla\log p_t(\mathbf{x})$ is the hallmark of reversal. Writing it as $g^2 \mathbf{s}_t(\mathbf{x})$ with $$\mathbf{s}_t(\mathbf{x}) := \nabla_\mathbf{x}\log p_t(\mathbf{x}),$$ we call $\mathbf{s}_t$ the (Stein) score. If we can learn the score, we can integrate the reverse SDE to walk backward from noise to data — which is precisely what a diffusion generative model does.
The reverse process inherits randomness because each "noise step" of the forward is only probabilistically invertible: many paths can produce the same final state. Anderson's result is that this conditional uncertainty collapses to a new Brownian motion under the reverse measure. A deterministic reversal also exists — the probability-flow ODE — at the cost of changing trajectories while preserving marginals (§12).
$W_t\sim\mathcal{N}(0,t)$, so the moments of a centered Gaussian with variance $t$ give $\mathbb{E}[W_t^2]=t$ and $\mathbb{E}[W_t^4]=3t^2$ (using $\mathbb{E}[Z^4]=3$ for $Z\sim\mathcal{N}(0,1)$). Therefore $\mathrm{Var}(W_t^2)=\mathbb{E}[W_t^4]-(\mathbb{E}[W_t^2])^2 = 3t^2 - t^2 = 2t^2$. The variance grows as $t^2$, so the quadratic-variation identity $[W,W]_t=t$ is not a statement about $W_t^2$ pointwise but about the sum of squared increments as the mesh shrinks — the fluctuations average out in the limit.
Apply Itô to $\phi(x)=x^2$ with $X_t=W_t$ (i.e. $\mu=0,\sigma=1$): $\phi_x=2x, \phi_{xx}=2$, so $$d(W_t^2) = 2W_t\,dW_t + \tfrac{1}{2}\cdot 1\cdot 2\,dt = 2W_t\,dW_t + dt.$$ Integrating from $0$ to $t$, $W_t^2 = 2\int_0^t W_s\,dW_s + t$, hence $\int_0^t W_s\,dW_s = \tfrac{1}{2}(W_t^2-t)$. The $-t/2$ is the Itô correction — classical calculus would give the wrong answer $\tfrac{1}{2}W_t^2$.
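The Itô correction is easy to see numerically. A small Monte Carlo check (path and step counts are illustrative) that the left-endpoint Riemann sums converge to $\tfrac{1}{2}(W_T^2 - T)$ in $L^2$:

```python
import numpy as np

# Monte Carlo check of W_T^2 = 2 * int_0^T W dW + T on [0, 1].
rng = np.random.default_rng(1)
n_paths, n_steps, T = 5_000, 1_000, 1.0
dt = T / n_steps
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
W_left = np.concatenate([np.zeros((n_paths, 1)), W[:, :-1]], axis=1)  # left endpoints
ito = np.sum(W_left * dW, axis=1)            # left-endpoint Riemann sum
claim = 0.5 * (W[:, -1] ** 2 - T)            # (W_T^2 - T) / 2
err = float(np.mean((ito - claim) ** 2))     # L^2 error, shrinks with the mesh
```

The discrepancy is exactly $\tfrac{1}{2}\big(T - \sum_i (\Delta W_i)^2\big)$, whose variance is $T^2/(2\,n_{\text{steps}})$ — another view of the quadratic-variation limit.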
Let $Y_t=e^{\theta t}X_t$. By Itô (or Leibniz applied to a deterministic integrating factor), $dY_t = \theta e^{\theta t}X_t\,dt + e^{\theta t}dX_t = e^{\theta t}\sigma\,dW_t$. Integrate: $Y_t-Y_0 = \sigma\int_0^t e^{\theta s}\,dW_s$, so $X_t = x_0 e^{-\theta t} + \sigma\int_0^t e^{-\theta(t-s)}\,dW_s$.
Mean: $\mathbb{E}[X_t] = x_0 e^{-\theta t}$ (Itô integral has zero mean). Variance, by Itô isometry: $$\mathrm{Var}(X_t) = \sigma^2\int_0^t e^{-2\theta(t-s)}\,ds = \frac{\sigma^2}{2\theta}(1-e^{-2\theta t}).$$ As $t\to\infty$: $X_t\Rightarrow\mathcal{N}(0,\sigma^2/(2\theta))$. This is the stationary distribution, which will reappear as the terminal noise distribution of the DDPM forward process.
Here $f(x)=-\theta x$, $g=\sigma$. Fokker–Planck: $$\partial_t p = -\partial_x(-\theta x\,p) + \tfrac{1}{2}\sigma^2\partial_x^2 p = \theta\,\partial_x(x\,p) + \tfrac{1}{2}\sigma^2 p_{xx}.$$ Guess $p_\infty(x) = C\exp(-\theta x^2/\sigma^2)$. Then $p'_\infty = -\frac{2\theta x}{\sigma^2}p_\infty$ and $p''_\infty = \big(\frac{4\theta^2 x^2}{\sigma^4} - \frac{2\theta}{\sigma^2}\big)p_\infty$. Also $\partial_x(xp_\infty) = p_\infty + xp'_\infty = \big(1 - \frac{2\theta x^2}{\sigma^2}\big)p_\infty$.
Combining: $\theta\big(1-\frac{2\theta x^2}{\sigma^2}\big) + \tfrac{1}{2}\sigma^2\big(\frac{4\theta^2 x^2}{\sigma^4}-\frac{2\theta}{\sigma^2}\big) = \theta - \frac{2\theta^2 x^2}{\sigma^2} + \frac{2\theta^2 x^2}{\sigma^2} - \theta = 0$. Hence $p_\infty$ is stationary, and normalization gives $C=\sqrt{\theta/(\pi\sigma^2)}$, i.e. $\mathcal{N}(0,\sigma^2/(2\theta))$.
Starting from the stationary distribution, $X_t\sim p_\infty = \mathcal{N}(0,\sigma^2/(2\theta))$ for all $t$. Hence $p_t(x) \propto \exp(-\theta x^2/\sigma^2)$ and $\nabla_x\log p_t(x) = -2\theta x/\sigma^2$.
Anderson's reversal (scalar, $g=\sigma$): $d\bar X_s = [-f + g^2\nabla\log p_t]ds + g\,d\tilde W_s = [\theta \bar X_s + \sigma^2\cdot(-2\theta\bar X_s/\sigma^2)]ds + \sigma\,d\tilde W_s = -\theta \bar X_s\,ds + \sigma\,d\tilde W_s$.
The reverse SDE has the same form as the forward! This is the classical statement that a stationary OU process is time-reversible. Away from stationarity, the score is time-dependent and the reverse drift differs.
A noising process you choose, a reverse you learn. DDPM, score matching, the SDE framework, and the guidance tricks that power modern image generators.
A diffusion generative model (Sohl-Dickstein et al., 2015; Ho, Jain & Abbeel, 2020) destroys data with a fixed Markov chain and learns to reverse it. Let $\mathbf{x}_0\sim q_{\text{data}}$ and fix a variance schedule $\beta_1,\ldots,\beta_T\in(0,1)$. The forward process is $$q(\mathbf{x}_t\mid\mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t;\sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\,\beta_t I).$$ Write $\alpha_t := 1-\beta_t$ and $\bar\alpha_t := \prod_{s=1}^t \alpha_s$.
As $t\to T$ with $\bar\alpha_T\to 0$, $q(\mathbf{x}_T\mid\mathbf{x}_0)\approx\mathcal{N}(0,I)$ — the prior. The schedule $\{\beta_t\}$ chooses how quickly signal is destroyed.
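A quick numerical sanity check of this limit (toy 1-D "data"; all sizes are illustrative): composing the $T$ one-step kernels matches the closed-form marginal $q(\mathbf{x}_T\mid\mathbf{x}_0)=\mathcal{N}(\sqrt{\bar\alpha_T}\mathbf{x}_0,(1-\bar\alpha_T)I)$, and both end essentially at $\mathcal{N}(0,I)$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                 # linear schedule (Ho et al.)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

rng = np.random.default_rng(2)
x0 = 0.3 * rng.standard_normal((20_000, 1)) + 1.0  # toy 1-D "data"

# Route 1: run the Markov chain q(x_t | x_{t-1}) step by step.
x = x0.copy()
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

# Route 2: the closed-form marginal q(x_T | x_0) = N(sqrt(abar_T) x0, (1 - abar_T) I).
x_direct = (np.sqrt(alpha_bar[-1]) * x0
            + np.sqrt(1.0 - alpha_bar[-1]) * rng.standard_normal(x0.shape))
# alpha_bar[-1] is ~4e-5 here, so the signal is essentially gone.
```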
$\beta_t$ linear in $t$ from $10^{-4}$ to $0.02$ with $T=1000$ (Ho et al., 2020). Simple, widely used.
$\bar\alpha_t = \cos^2\!\big(\tfrac{t/T+s}{1+s}\cdot\tfrac{\pi}{2}\big)$ with small $s$. Degrades signal more slowly near $t=0$; better for images at high resolution (Nichol & Dhariwal, 2021).
Signal-to-noise ratio $\mathrm{SNR}(t) = \bar\alpha_t/(1-\bar\alpha_t)$. A monotonically decreasing SNR is the true invariant — equivalent schedules give equivalent models (Kingma & Gao, 2023).
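The two schedules and the SNR view side by side, as a short sketch (constants as cited above; the small `eps` guard is an implementation detail, not part of the definition):

```python
import numpy as np

T, s = 1000, 0.008
t = np.arange(T + 1) / T

# Cosine schedule (Nichol & Dhariwal): specify alpha_bar directly.
abar_cos = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
abar_cos /= abar_cos[0]                                # so that abar_0 = 1

# Linear beta schedule (Ho et al.), accumulated to alpha_bar.
betas = np.linspace(1e-4, 0.02, T)
abar_lin = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def snr(abar, eps=1e-12):
    return abar / (1.0 - abar + eps)
# Both SNR curves decrease monotonically; the cosine schedule retains
# more signal early on (larger alpha_bar at small t).
```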
The generative model is a learned reverse Markov chain $$p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1};\boldsymbol{\mu}_\theta(\mathbf{x}_t,t),\,\Sigma_\theta(\mathbf{x}_t,t))$$ with $p(\mathbf{x}_T) = \mathcal{N}(0,I)$. The log-likelihood is bounded by the evidence lower bound $$-\log p_\theta(\mathbf{x}_0) \le \mathbb{E}_q\left[D_{\mathrm{KL}}(q(\mathbf{x}_T\mid\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)) + \sum_{t>1} L_{t-1} - \log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)\right]$$ where $L_{t-1} = D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t))$.
Substituting $\mathbf{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}}(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon})$ and parameterizing the network to predict $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$, $$\boldsymbol{\mu}_\theta(\mathbf{x}_t,t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right).$$ With fixed variance $\Sigma_\theta = \tilde\beta_t I$ (or $\beta_t I$ — both work well), the KL $L_{t-1}$ reduces to a weighted $\ell^2$ loss on noise, and Ho et al.'s simple training objective drops the weights: $$L_{\text{simple}} = \mathbb{E}_{t\sim\mathcal{U}\{1..T\},\,\mathbf{x}_0\sim q_{\text{data}},\,\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)}\!\left[\,\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\big(\sqrt{\bar\alpha_t}\mathbf{x}_0+\sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon},t\big)\big\|^2\,\right].$$ Empirically, the unweighted loss trains better than the true ELBO.
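A minimal sketch of the $L_{\text{simple}}$ training loop on 1-D toy data. A per-timestep affine model $\hat{\boldsymbol{\epsilon}} = a_t x_t + b_t$ stands in for the $\boldsymbol{\epsilon}_\theta$ U-Net; the data distribution, learning rate, and step counts are all illustrative assumptions.

```python
import numpy as np

# L_simple on toy 1-D data with a per-timestep affine "network".
T = 100
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)
a = np.zeros(T); b = np.zeros(T)

rng = np.random.default_rng(3)
lr = 0.05
for step in range(10_000):
    t = int(rng.integers(T))                      # t ~ Uniform{0..T-1}
    x0 = 0.5 * rng.standard_normal(256) + 2.0     # toy data batch
    eps = rng.standard_normal(256)
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps   # forward noising
    grad = 2.0 * (a[t] * xt + b[t] - eps)         # d/d(pred) of |eps - pred|^2
    a[t] -= lr * float(np.mean(grad * xt))        # SGD on the simple loss
    b[t] -= lr * float(np.mean(grad))
```

After training, the model's noise-prediction error on fresh batches should be well below the trivial baseline $\mathbb{E}\|\boldsymbol{\epsilon}\|^2 = 1$ of predicting zero.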
A second route to diffusion models, developed before DDPM, is to learn the score $\mathbf{s}_t(\mathbf{x}) := \nabla_\mathbf{x}\log p_t(\mathbf{x})$ directly.
Score matching (Hyvärinen, 2005): minimize $J(\theta) = \tfrac{1}{2}\mathbb{E}_{p}[\|\mathbf{s}_\theta(\mathbf{x}) - \nabla\log p(\mathbf{x})\|^2]$. Integrating by parts eliminates the unknown $\nabla\log p$: $$J(\theta) = \mathbb{E}_p\!\left[\,\mathrm{tr}\!\big(\nabla \mathbf{s}_\theta(\mathbf{x})\big) + \tfrac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|^2\,\right] + \text{const}.$$ The trace of the Jacobian is expensive for high-dimensional $\mathbf{x}$ — impractical for images.
Add a fixed perturbation kernel $q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}};\mathbf{x},\sigma^2 I)$ and marginal $q_\sigma(\tilde{\mathbf{x}}) = \int q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})p(\mathbf{x})d\mathbf{x}$. Then $$J_{\text{ESM}}(\theta) = \tfrac{1}{2}\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}})}[\|\mathbf{s}_\theta(\tilde{\mathbf{x}})-\nabla\log q_\sigma(\tilde{\mathbf{x}})\|^2] = \tfrac{1}{2}\mathbb{E}_{p(\mathbf{x})q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})}\!\left[\big\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}}\log q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})\big\|^2\right] + C.$$ For Gaussian kernels $\nabla_{\tilde{\mathbf{x}}}\log q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = -(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2$, which is cheap — just the normalized noise. So denoising-score matching trains a denoiser.
In DDPM variables, $\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t\mid\mathbf{x}_0) = -(\mathbf{x}_t - \sqrt{\bar\alpha_t}\mathbf{x}_0)/(1-\bar\alpha_t) = -\boldsymbol{\epsilon}/\sqrt{1-\bar\alpha_t}$. Hence $$\mathbf{s}_\theta(\mathbf{x}_t,t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)}{\sqrt{1-\bar\alpha_t}}.$$ Noise prediction and score prediction are the same model up to a time-dependent rescaling. DDPM and score-based models are two vocabularies for one object.
Tweedie's formula: the MMSE denoiser is the identity plus the score of the noisy marginal, scaled by the noise power. Translated to DDPM, with $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}$, $$\hat{\mathbf{x}}_0(\mathbf{x}_t) := \mathbb{E}[\mathbf{x}_0\mid \mathbf{x}_t] = \frac{1}{\sqrt{\bar\alpha_t}}\!\left(\mathbf{x}_t + (1-\bar\alpha_t)\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)\right) = \frac{1}{\sqrt{\bar\alpha_t}}\!\left(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right).$$
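A tiny sanity check of this identity in the one case where everything is analytic — unit-Gaussian data, where $\mathbf{x}_t\sim\mathcal{N}(0,1)$ for every $t$ and the score is simply $-x$ (values below are arbitrary):

```python
import numpy as np

# Tweedie sanity check on unit-Gaussian data: x0 ~ N(0,1) implies
# x_t ~ N(0,1), so the marginal score at x is just -x.
abar = 0.3                                # any noise level
x = 0.7                                   # any observed noisy value
x0_hat = (x + (1.0 - abar) * (-x)) / np.sqrt(abar)   # Tweedie denoiser
x0_posterior = np.sqrt(abar) * x          # exact E[x0 | x_t] (jointly Gaussian)
```

Both expressions reduce to $\sqrt{\bar\alpha}\,x$, the standard Gaussian conditional mean.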
Song et al. (2021) unified DDPM and score-based models as discretizations of continuous-time SDEs on $t\in[0,T]$ of the form $$d\mathbf{x} = \mathbf{f}(\mathbf{x},t)\,dt + g(t)\,d\mathbf{W}_t,$$ with reverse $$d\mathbf{x} = \big[\mathbf{f}(\mathbf{x},t) - g(t)^2\,\nabla\log p_t(\mathbf{x})\big]\,dt + g(t)\,d\bar{\mathbf{W}}_t.$$
| Family | Forward SDE | Marginal $p_{0t}(\mathbf{x}_t\mid\mathbf{x}_0)$ | Limit as $t\to T$ |
|---|---|---|---|
| VP-SDE (DDPM) | $d\mathbf{x} = -\tfrac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{W}$ | $\mathcal{N}(\mathbf{x}_0 e^{-\tfrac{1}{2}\!\int_0^t\!\beta(s)ds},\,(1-e^{-\int_0^t\!\beta(s)ds})I)$ | $\mathcal{N}(0,I)$ |
| VE-SDE (SMLD) | $d\mathbf{x} = \sqrt{\tfrac{d\sigma^2(t)}{dt}}\,d\mathbf{W}$ | $\mathcal{N}(\mathbf{x}_0,\,\sigma(t)^2 I)$ | $\mathcal{N}(\mathbf{x}_0,\sigma_{\max}^2 I)$ |
| sub-VP | $d\mathbf{x} = -\tfrac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)(1-e^{-2\!\int_0^t\!\beta(s)ds})}\,d\mathbf{W}$ | tighter variance than VP | $\mathcal{N}(0,I)$ |
VP is the OU with a time-varying coefficient — the DDPM forward process in continuous time. VE is pure noise addition with a growing variance, as in Song & Ermon's original noise-conditional score network (NCSN). Sub-VP was introduced for slightly tighter likelihood bounds.
Training minimizes a denoising score-matching loss weighted across noise levels: $$\mathcal{L}(\theta) = \mathbb{E}_{t\sim\mathcal{U}[0,T]}\,\mathbb{E}_{\mathbf{x}_0}\,\mathbb{E}_{\mathbf{x}_t\mid\mathbf{x}_0}\!\left[\lambda(t)\,\|\mathbf{s}_\theta(\mathbf{x}_t,t) - \nabla_{\mathbf{x}_t}\log p_{0t}(\mathbf{x}_t\mid\mathbf{x}_0)\|^2\right].$$ Common choices: $\lambda(t) = g(t)^2$ (likelihood weighting, giving an exact ELBO) or $\lambda(t) = 1$ (the "simple" DDPM loss, better FID).
The SDE and a deterministic ODE, $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}g(t)^2\nabla\log p_t(\mathbf{x})$ — the probability-flow ODE — share the same marginals $p_t$. Proof sketch. Write Fokker–Planck as $\partial_t p_t = -\nabla\!\cdot\!(\mathbf{v}_t p_t)$ with effective velocity $\mathbf{v}_t(\mathbf{x}) = \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}g(t)^2\nabla\log p_t(\mathbf{x})$ (use $\tfrac{1}{2}g^2\nabla^2 p = \tfrac{1}{2}g^2\nabla\cdot(p\nabla\log p)$). Any continuity equation of this form is the density transport induced by the ODE $\dot{\mathbf{x}} = \mathbf{v}_t(\mathbf{x})$.
The denoising diffusion implicit model (DDIM; Song, Meng & Ermon, 2021) derives a non-Markovian family of samplers parameterized by $\eta\in[0,1]$: $$\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0(\mathbf{x}_t) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t) + \sigma_t\,\mathbf{z}, \quad \mathbf{z}\sim\mathcal{N}(0,I),$$ with $\sigma_t = \eta\sqrt{\tilde\beta_t}$ and $\hat{\mathbf{x}}_0$ the Tweedie denoiser. Setting $\eta=1$ recovers DDPM; $\eta=0$ gives a deterministic map — the Euler discretization of the probability-flow ODE. Deterministic DDIM ($\sigma_t=0$) is $$\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0(\mathbf{x}_t) + \sqrt{1-\bar\alpha_{t-1}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t).$$
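The update as code, a sketch (function names and the demo setup are illustrative). With the exact noise model for unit-Gaussian data, $\boldsymbol{\epsilon}^*(x,t)=\sqrt{1-\bar\alpha_t}\,x$, deterministic DDIM should approximately preserve the $\mathcal{N}(0,1)$ marginal at every step:

```python
import numpy as np

def ddim_step(x_t, t, t_prev, abar, eps_theta, eta=0.0, rng=None):
    """One DDIM update from time t down to t_prev, given the abar array
    and a noise-prediction model eps_theta(x, t). eta=0 is deterministic."""
    a_t, a_p = abar[t], abar[t_prev]
    eps = eps_theta(x_t, t)
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)    # Tweedie denoiser
    sigma = eta * np.sqrt((1 - a_p) / (1 - a_t)) * np.sqrt(1 - a_t / a_p)
    x = np.sqrt(a_p) * x0_hat + np.sqrt(1.0 - a_p - sigma**2) * eps
    if eta > 0.0:
        x = x + sigma * rng.standard_normal(np.shape(x_t))
    return x

# Demo: unit-Gaussian "data", exact eps model, eta = 0.
N = 200
abar = np.linspace(0.99, 0.01, N + 1)     # abar[0] ~ clean, abar[N] ~ noisy
rng = np.random.default_rng(5)
x = rng.standard_normal(50_000)           # start at the t = N marginal, N(0,1)
for t in range(N, 0, -1):
    x = ddim_step(x, t, t - 1, abar, lambda z, t: np.sqrt(1 - abar[t]) * z)
```

The per-step shrinkage is second order in the step size, so with 200 steps the terminal variance stays within about a percent of 1.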
Conditional diffusion trains $p_\theta(\mathbf{x}_0\mid c)$ for condition $c$ (class label, text embedding, image). The naive approach is to feed $c$ as an extra network input: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t,c)$. But simply conditioning is not enough — text-to-image samples then tend to ignore the prompt.
Bayes' rule on densities: $\nabla\log p_t(\mathbf{x}\mid c) = \nabla\log p_t(\mathbf{x}) + \nabla\log p_t(c\mid\mathbf{x})$. Train a classifier $p_\phi(c\mid\mathbf{x}_t)$ on noisy inputs and replace the score with $$\mathbf{s}_\theta(\mathbf{x}_t,t) + w\,\nabla_{\mathbf{x}_t}\log p_\phi(c\mid\mathbf{x}_t),$$ where $w>1$ exaggerates the classifier's effect. Strong but requires a separate noisy-image classifier.
Train a single model on both conditional and unconditional objectives, dropping $c$ to $\emptyset$ with probability $p_{\text{drop}}\approx 10\%$. At sampling time, combine: $$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t,t,c) = (1+w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t,c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t,\emptyset).$$ Interpreting via Bayes, this corresponds to sampling from a sharpened conditional $$\tilde p_t(\mathbf{x}\mid c) \propto p_t(\mathbf{x}\mid c)^{1+w}\,p_t(\mathbf{x})^{-w}.$$ Classifier-free guidance (CFG) is the dominant conditioning mechanism in Stable Diffusion, Imagen, DALL·E 3, Midjourney. Typical scales $w\in[5,15]$ for text-to-image; higher values trade diversity for fidelity to the prompt.
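As code, the two CFG ingredients are one-liners. A sketch: `null_token` is a placeholder for however the $\emptyset$ condition is represented (e.g. a learned null embedding), and `rng` is any object with a `random()` method.

```python
def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps(x,t,c) - w * eps(x,t,None)."""
    return (1.0 + w) * eps_cond - w * eps_uncond

def maybe_drop(c, null_token, p_drop, rng):
    """Training-time condition dropout, so one network learns both scores."""
    return null_token if rng.random() < p_drop else c
```

Note $w=0$ recovers the plain conditional model, and the combination is applied at every sampling step before the DDIM/DDPM update.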
Running diffusion at $512^2{\times}3$ pixels is expensive. Latent diffusion models (LDMs; Rombach et al., 2022 — the Stable Diffusion architecture) push the problem into a perceptually compressed latent space.
U-Net backbone: residual ConvBlocks + self-attention at lower resolutions ($16^2$, $8^2$); a sinusoidal time embedding is added to each block.
Cross-attention: Query = spatial feature, Key/Value = text-encoder output (CLIP or T5). Injects the prompt at every resolution.
Diffusion Transformer (DiT): replace the U-Net with a pure transformer on patched latents; scales cleanly to high capacity (Peebles & Xie, 2023).
ControlNet: freeze the base U-Net, train a parallel encoder that takes auxiliary inputs (edges, depth, pose) and is summed into the skip connections.
Flow matching (Lipman et al., 2023) generalizes diffusion as learning a time-dependent vector field $\mathbf{v}_\theta(\mathbf{x},t)$ that generates a prescribed probability path $\{p_t\}_{t\in[0,1]}$ from noise $p_0$ to data $p_1$. The ODE $d\mathbf{x}/dt = \mathbf{v}_\theta(\mathbf{x},t)$ transports samples.
The simplest choice: $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$ under the independent coupling $\pi = p_0\otimes p_{\text{data}}$ of noise $\mathbf{x}_0$ and data $\mathbf{x}_1$. The target field is the constant $\mathbf{x}_1 - \mathbf{x}_0$. This is rectified flow (Liu, Gong & Liu, 2022) and the basis of Stable Diffusion 3 and Flux. Advantages: straight trajectories → few-step sampling, simpler conceptual picture than DDPM.
Diffusion is a special case of flow matching with a particular (Gaussian) probability path. Specifically, the VP-SDE marginals can be written as $\mathbf{x}_t = \alpha_t \mathbf{x}_1 + \sigma_t \mathbf{x}_0$ with $\mathbf{x}_0\sim\mathcal{N}(0,I)$ and $\alpha_t^2 + \sigma_t^2 = 1$; the probability-flow ODE velocity equals the conditional-flow-matching target under this path. Training a score model or a flow-matching model is largely a matter of parametrization.
Sampling from a trained diffusion model means integrating an SDE or ODE from $t=T$ down to $t=0$. Quality is measured in both image fidelity (FID, CLIP score) and compute — typically reported as number of function evaluations (NFE), i.e. forward passes of the U-Net.
Ancestral (DDPM) sampling: reverse SDE, Euler–Maruyama. $T\approx 1000$ steps. Strong quality, high NFE.
DDIM: Euler on the probability-flow ODE. 20–50 NFE. Deterministic and invertible.
EDM sampler — Karras et al. (2022): Heun's second-order predictor–corrector on a rescaled VE-SDE. Around 35 NFE for near-state-of-the-art FID.
DPM-Solver — Lu et al. (2022): exponential-integrator solvers exploiting the semilinear structure of VP/VE ODEs. 10–20 NFE.
Consistency models — Song et al. (2023): train a one-step model that maps any noise level directly to $\mathbf{x}_0$. Single-step (or few-step) generation at reasonable quality.
Progressive / adversarial distillation (Salimans & Ho, 2022; SDXL-Turbo): compress an $N$-step teacher into an $N/2$-step or $1$-step student.
Base case. For $t=1$: $q(\mathbf{x}_1\mid\mathbf{x}_0) = \mathcal{N}(\sqrt{\alpha_1}\mathbf{x}_0,\beta_1 I) = \mathcal{N}(\sqrt{\bar\alpha_1}\mathbf{x}_0,(1-\bar\alpha_1)I)$ since $\bar\alpha_1=\alpha_1$ and $\beta_1 = 1-\alpha_1$.
Induction. Assume $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\boldsymbol{\epsilon}_1$ with $\boldsymbol{\epsilon}_1\sim\mathcal{N}(0,I)$. Then $\mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1}+\sqrt{\beta_t}\boldsymbol{\epsilon}_2$ with independent $\boldsymbol{\epsilon}_2\sim\mathcal{N}(0,I)$, so $$\mathbf{x}_t = \sqrt{\alpha_t\bar\alpha_{t-1}}\mathbf{x}_0 + \sqrt{\alpha_t(1-\bar\alpha_{t-1})}\boldsymbol{\epsilon}_1 + \sqrt{\beta_t}\boldsymbol{\epsilon}_2.$$ The last two independent Gaussians combine into a single Gaussian with variance $\alpha_t(1-\bar\alpha_{t-1}) + \beta_t = \alpha_t - \alpha_t\bar\alpha_{t-1} + 1 - \alpha_t = 1-\bar\alpha_t$. And $\alpha_t\bar\alpha_{t-1}=\bar\alpha_t$. Hence $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}$, completing the induction.
By Bayes, $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) \propto q(\mathbf{x}_t\mid\mathbf{x}_{t-1})\,q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)$ — the Markov property eliminates $\mathbf{x}_0$ from the first factor. Both factors are Gaussian, so the product is Gaussian.
Log-density ignoring constants: $$-\tfrac{1}{2\beta_t}\|\mathbf{x}_t - \sqrt{\alpha_t}\mathbf{x}_{t-1}\|^2 - \tfrac{1}{2(1-\bar\alpha_{t-1})}\|\mathbf{x}_{t-1} - \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0\|^2.$$ Collecting the quadratic term in $\mathbf{x}_{t-1}$: $-\tfrac{1}{2}\big(\tfrac{\alpha_t}{\beta_t} + \tfrac{1}{1-\bar\alpha_{t-1}}\big)\|\mathbf{x}_{t-1}\|^2$. The coefficient simplifies using $\alpha_t(1-\bar\alpha_{t-1})+\beta_t = 1-\bar\alpha_t$ to give precision $\frac{1-\bar\alpha_t}{\beta_t(1-\bar\alpha_{t-1})}$, i.e. variance $\tilde\beta_t = \frac{\beta_t(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$.
The linear term is $\big(\tfrac{\sqrt{\alpha_t}}{\beta_t}\mathbf{x}_t + \tfrac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}\mathbf{x}_0\big)^\top\mathbf{x}_{t-1}$. Multiply by $\tilde\beta_t$: $$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\mathbf{x}_t + \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\mathbf{x}_0.$$ This is the mean of the tractable posterior, a convex combination of the noisy sample and the clean one.
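The algebra can be double-checked numerically by conditioning the joint Gaussian $(\mathbf{x}_{t-1},\mathbf{x}_t)$ given $\mathbf{x}_0$ directly (the scalar example values below are arbitrary):

```python
import numpy as np

# Check the posterior mean/variance against brute-force Gaussian conditioning.
alpha_t, abar_prev = 0.98, 0.6
beta_t = 1.0 - alpha_t
abar_t = alpha_t * abar_prev
x0, xt = 1.3, -0.4

# Closed form from the text.
beta_tilde = beta_t * (1.0 - abar_prev) / (1.0 - abar_t)
mu_tilde = (np.sqrt(alpha_t) * (1.0 - abar_prev) * xt
            + np.sqrt(abar_prev) * beta_t * x0) / (1.0 - abar_t)

# Brute force: x_{t-1} | x0 ~ N(sqrt(abar_prev) x0, 1 - abar_prev), and
# x_t = sqrt(alpha_t) x_{t-1} + sqrt(beta_t) eps; condition on x_t.
v1 = 1.0 - abar_prev                      # Var(x_{t-1} | x0)
cov = np.sqrt(alpha_t) * v1               # Cov(x_{t-1}, x_t | x0)
vt = alpha_t * v1 + beta_t                # Var(x_t | x0) = 1 - abar_t
mu_cond = np.sqrt(abar_prev) * x0 + (cov / vt) * (xt - np.sqrt(abar_t) * x0)
var_cond = v1 - cov ** 2 / vt
```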
$q(\mathbf{x}_t\mid\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\mathbf{x}_0, (1-\bar\alpha_t)I)$, so $\log q = -\tfrac{1}{2(1-\bar\alpha_t)}\|\mathbf{x}_t-\sqrt{\bar\alpha_t}\mathbf{x}_0\|^2 + \text{const}$, and $$\nabla_{\mathbf{x}_t}\log q = -\frac{\mathbf{x}_t - \sqrt{\bar\alpha_t}\mathbf{x}_0}{1-\bar\alpha_t} = -\frac{\sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}}{1-\bar\alpha_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar\alpha_t}},$$ using $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}$. Taking the conditional expectation over $\mathbf{x}_0\mid\mathbf{x}_t$ gives the marginal score, and the model's noise prediction is trained to match the posterior mean of $\boldsymbol{\epsilon}$, so $\mathbf{s}_\theta = -\boldsymbol{\epsilon}_\theta/\sqrt{1-\bar\alpha_t}$.
Let $p_{X_0}$ be the density of $X_0$ and $\phi_\sigma$ the Gaussian kernel with variance $\sigma^2$. Then $$p_X(x) = \int p_{X_0}(x_0)\,\phi_\sigma(x-x_0)\,dx_0.$$ Differentiating under the integral and using $\phi_\sigma'(u) = -\frac{u}{\sigma^2}\phi_\sigma(u)$, $$p_X'(x) = \int p_{X_0}(x_0)\cdot\!\left(-\tfrac{x-x_0}{\sigma^2}\right)\phi_\sigma(x-x_0)\,dx_0 = -\tfrac{1}{\sigma^2}\big(x\,p_X(x) - \int x_0\,p_{X,X_0}(x,x_0)dx_0\big).$$ Dividing by $p_X(x)$: $p_X'(x)/p_X(x) = -\tfrac{1}{\sigma^2}\big(x - \mathbb{E}[X_0\mid X=x]\big)$, i.e. $\mathbb{E}[X_0\mid X=x] = x + \sigma^2\,p_X'(x)/p_X(x) = x + \sigma^2\,\partial_x\log p_X(x)$. The MMSE denoiser is the identity plus the score times the noise power.
KL of two Gaussians with equal covariance $\Sigma = \tilde\beta_t I$: $D_{\mathrm{KL}} = \frac{1}{2\tilde\beta_t}\|\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta\|^2$.
From Ex 7, $\tilde{\boldsymbol{\mu}}_t$ can also be written in terms of $\mathbf{x}_t$ and $\boldsymbol{\epsilon}$ (substitute $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon})/\sqrt{\bar\alpha_t}$): $$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\boldsymbol{\epsilon}\right).$$ Hence $\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta = \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}(\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon})$, and $$L_{t-1} = \frac{\beta_t^2}{2\alpha_t(1-\bar\alpha_t)\tilde\beta_t}\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\|^2.$$ Dropping the time-dependent prefactor leaves $L_{\text{simple}} = \mathbb{E}\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\|^2$. Empirically, the uniform weighting improves perceptual quality: the ELBO weights over-emphasize very high-noise steps which contribute little to perceptual detail.
The SDE's Fokker–Planck equation: $\partial_t p_t = -\nabla\!\cdot\!(\mathbf{f} p_t) + \tfrac{1}{2}g^2 \nabla^2 p_t$. Rewrite the Laplacian as a divergence using $\nabla^2 p = \nabla\cdot(\nabla p) = \nabla\cdot(p\,\nabla\log p)$: $$\partial_t p_t = -\nabla\!\cdot\!(\mathbf{f} p_t) + \tfrac{1}{2}g^2\nabla\!\cdot\!(p_t \nabla\log p_t) = -\nabla\!\cdot\!\big((\mathbf{f} - \tfrac{1}{2}g^2 \nabla\log p_t) p_t\big).$$ The right-hand side is the continuity equation for the deterministic flow $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}g^2\nabla\log p_t(\mathbf{x})$. Both the ODE and SDE therefore evolve the same density $p_t$, even though individual trajectories differ.
On the training path, $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\boldsymbol{\epsilon}$ for some standard Gaussian $\boldsymbol{\epsilon}$. If we want the step from $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$ to be a deterministic map that respects the right marginals, plug in the model's posterior estimates: $\hat{\mathbf{x}}_0(\mathbf{x}_t)$ in place of $\mathbf{x}_0$ and $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ in place of $\boldsymbol{\epsilon}$.
Then $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\hat{\mathbf{x}}_0 + \sqrt{1-\bar\alpha_{t-1}}\boldsymbol{\epsilon}_\theta$. This is the deterministic DDIM update. Adding stochasticity $\sigma_t \mathbf{z}$ recovers the general $\eta$-parametrized DDIM family and, at $\sigma_t = \sqrt{\tilde\beta_t}$, the DDPM sampler.
Using $\mathbf{s} = -\boldsymbol{\epsilon}/\sqrt{1-\bar\alpha_t}$, the same linear combination of scores is $$\tilde{\mathbf{s}}(\mathbf{x}) = (1+w)\,\nabla\log p_t(\mathbf{x}\mid c) - w\,\nabla\log p_t(\mathbf{x}) = \nabla\log\big[p_t(\mathbf{x}\mid c)^{1+w}\,p_t(\mathbf{x})^{-w}\big].$$ So the guided sampler's score is the gradient of $\log\tilde p_t$ with $\tilde p_t \propto p_t(\mathbf{x}\mid c)^{1+w}p_t(\mathbf{x})^{-w}$. Using Bayes $p_t(\mathbf{x}\mid c) \propto p_t(c\mid\mathbf{x})p_t(\mathbf{x})$, we get $\tilde p_t(\mathbf{x}\mid c) \propto p_t(c\mid\mathbf{x})^{1+w}p_t(\mathbf{x})$ — a sharpened conditional, stretching the likelihood ratio. Large $w$ collapses the sample toward the conditional mode, at the cost of diversity.
$\mathbf{u}_t = d\mathbf{x}_t/dt = \mathbf{x}_1 - \mathbf{x}_0$, constant in $t$. The CFM loss becomes $$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t\sim\mathcal{U}[0,1]}\,\mathbb{E}_{\mathbf{x}_0,\mathbf{x}_1}\big\|\mathbf{v}_\theta((1-t)\mathbf{x}_0+t\mathbf{x}_1, t) - (\mathbf{x}_1-\mathbf{x}_0)\big\|^2.$$ This is the rectified-flow training objective. Sampling: integrate $\dot{\mathbf{x}}=\mathbf{v}_\theta(\mathbf{x},t)$ from $\mathbf{x}(0)\sim\mathcal{N}(0,I)$ to $t=1$. Compared to DDPM-style parametrizations, the target field is constant along the trajectory, which empirically makes few-step sampling much easier — the path is approximately a straight line.
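A sketch of rectified flow on 1-D toy data. For Gaussian endpoints the CFM regression target $\mathbb{E}[\mathbf{x}_1-\mathbf{x}_0\mid\mathbf{x}_t]$ is affine in $\mathbf{x}_t$, so a per-time-bin least-squares fit (standing in for the network $\mathbf{v}_\theta$; endpoint distributions and bin count are illustrative) solves the regression exactly:

```python
import numpy as np

K = 50                                     # time bins; v(x, t) ~ a[k] x + b[k]
a = np.zeros(K); b = np.zeros(K)
rng = np.random.default_rng(4)
for k in range(K):
    t = (k + 0.5) / K
    x0 = rng.standard_normal(20_000)               # noise endpoint
    x1 = 0.5 * rng.standard_normal(20_000) + 3.0   # toy "data" endpoint
    xt = (1.0 - t) * x0 + t * x1
    A = np.stack([xt, np.ones_like(xt)], axis=1)
    coef, *_ = np.linalg.lstsq(A, x1 - x0, rcond=None)  # CFM regression per bin
    a[k], b[k] = coef

# Sampling: Euler-integrate dx/dt = v(x, t) from x(0) ~ N(0, 1) to t = 1.
x = rng.standard_normal(20_000)
for k in range(K):
    x = x + (a[k] * x + b[k]) / K
# x should now look like the data distribution, roughly N(3, 0.25).
```

Fifty Euler steps suffice here because the conditional paths are straight; the residual error comes only from discretization and the finite fitting sample.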
Here $\mathbf{f}(\mathbf{x},t) = -\tfrac{1}{2}\beta(t)\mathbf{x}$ and $g(t) = \sqrt{\beta(t)}$. The reverse SDE (Anderson): $$d\mathbf{x} = \!\left[-\tfrac{1}{2}\beta(t)\mathbf{x} - \beta(t)\mathbf{s}_\theta(\mathbf{x},t)\right]\!dt + \sqrt{\beta(t)}\,d\bar{\mathbf{W}}_t.$$ Probability-flow ODE: halve the score term and drop the noise: $\dot{\mathbf{x}} = -\tfrac{1}{2}\beta(t)\mathbf{x} - \tfrac{1}{2}\beta(t)\mathbf{s}_\theta(\mathbf{x},t)$.
Substitute $\mathbf{s}_\theta = -\boldsymbol{\epsilon}_\theta/\sqrt{1-\bar\alpha_t}$ (with $\bar\alpha_t = e^{-\int_0^t\beta(s)ds}$ in continuous time): $$d\mathbf{x} = \!\left[-\tfrac{1}{2}\beta(t)\mathbf{x} + \frac{\beta(t)}{\sqrt{1-\bar\alpha_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x},t)\right]\!dt + \sqrt{\beta(t)}\,d\bar{\mathbf{W}}_t.$$ This is the continuous-time counterpart of the DDPM ancestral sampler. Discretizing the corresponding ODE with Euler on a non-uniform time grid reproduces deterministic DDIM.