Reinforcement
Learning

A terse, navigable reference to the algorithms an RL practitioner reaches for — with the math that justifies them and the code that runs them.

Compiled 2026
§ I

Foundations

Formalism · 0.1

Markov Decision Process

The frame on which every RL algorithm is painted.
Definition

An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

  • $\mathcal{S}$ — state space
  • $\mathcal{A}$ — action space
  • $P(s' \mid s, a)$ — transition probabilities
  • $R(s, a)$ — reward function
  • $\gamma \in [0, 1)$ — discount factor

Return: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

State-value: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$

Action-value: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$
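A quick numerical check of the return definition (reward values illustrative):

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0]   # r_{t+1}, r_{t+2}, r_{t+3}; episode terminates after
G = sum(gamma**k * r for k, r in enumerate(rewards))
# G = 1.0 + 0.9 * 0.0 + 0.81 * 2.0 = 2.62
```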

Bellman Equations

Expectation (policy $\pi$):

$$V^\pi(s) = \sum_a \pi(a|s) \Big[R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')\Big]$$

Optimality (for $Q^*$):

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s', a')$$

Theory & intuition

Why Markov: The Markov property — the future depends only on the present state, not on history — is what makes RL tractable. Without it, value functions wouldn't be well-defined on states alone; you'd have to condition on the entire trajectory, and everything that follows would collapse.

Why γ<1: The discount factor serves two purposes. Mathematically, it makes the Bellman operator a contraction, guaranteeing a unique fixed point and geometric convergence. Behaviourally, it encodes how much the agent values the distant future; $\gamma \to 1$ gives a far-sighted agent, $\gamma \to 0$ a myopic one.

V vs Q: Both encode the same object, but answer different questions. $V(s)$ asks "how good is this state?"; $Q(s,a)$ asks "how good is this action in this state?" — and the latter is what you need for model-free control, because you can choose actions without simulating transitions.

Two problems: RL really splits into two tasks: prediction (given $\pi$, find $V^\pi$) and control (find $\pi^*$). Every algorithm that follows is either one or the other — or, like policy iteration, both alternating.
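For the prediction side, tabular policy evaluation even has a closed form, $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$; a minimal sketch on a randomly generated MDP (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9
P  = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, :] is a distribution over s'
R  = rng.normal(size=(n_s, n_a))                   # R[s, a]
pi = np.full((n_s, n_a), 1.0 / n_a)                # uniform-random policy

P_pi = np.einsum('sa,sat->st', pi, P)              # state-to-state kernel under pi
R_pi = (pi * R).sum(axis=1)                        # expected one-step reward under pi
V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

assert np.allclose(V, R_pi + gamma * P_pi @ V)     # V is the Bellman fixed point
```

The linear solve is exact but costs $O(|\mathcal{S}|^3)$, which is why iterative sweeps take over for large state spaces.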

Exercises
  1. warm-up · Write bellman_expectation(V, P, R, gamma, pi) performing one synchronous sweep of the Bellman expectation operator. Test on a 3-state chain with a uniform-random policy.
    Solution

    For each state, sum over actions weighted by $\pi(a|s)$, then over next states weighted by $P(s'|s,a)$:

    def bellman_expectation(V, P, R, gamma, pi):
        V_new = np.zeros_like(V)
        for s in range(len(V)):
            for a in range(pi.shape[1]):
                V_new[s] += pi[s, a] * (R[s, a] + gamma * P[s, a] @ V)
        return V_new

    Sanity check: under a uniform policy, your iterated values should converge to the closed-form solution $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$.

  2. build · Verify the contraction property empirically. Initialise two random value functions $V_0, V_0'$, apply the Bellman optimality operator repeatedly to both, and plot $\|V_k - V_k'\|_\infty$. It should decay by a factor of $\gamma$ each iteration.
    Solution

    Apply $T^*V(s) = \max_a [R(s,a) + \gamma P(\cdot|s,a)^\top V]$ to both vectors. Plot $\log\|V_k - V_k'\|_\infty$ versus $k$ — the slope is $\log \gamma$. This is the Banach contraction in action.

    Subtle point: $T^*$ is a contraction in sup-norm, not L2. Use np.max(np.abs(...)), not np.linalg.norm.

  3. stretch · Construct a 5-state MDP whose optimal policy under $\gamma = 0.5$ differs from the optimal policy under $\gamma = 0.99$. Solve both via policy iteration and inspect the policies.
    Solution

    Idea: two paths from start, one giving a small immediate reward then loops, the other paying a small cost now and a big reward several steps later. Example: Action A yields $+1$ now and transitions to a self-loop at $0$. Action B yields $-0.5$ and after 3 steps reaches a $+10$ terminal state.

    Under $\gamma = 0.5$: discounted $B$-reward is $\approx 0.5^3 \cdot 10 - 0.5 = 0.75$, but $A$-reward is $1$ — A wins. Under $\gamma = 0.99$: $B$'s discounted value $\approx 9.7 - 0.5$ dwarfs $A$'s $1$ — B wins.

Dynamic Programming · 0.2

Value Iteration

Requires a known model. Converges geometrically to $V^*$.
Update rule
$$V_{k+1}(s) = \max_a \Big[R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')\Big]$$
Theory & intuition

Why it converges: The Bellman optimality operator $T$ is a $\gamma$-contraction in the sup-norm: $\|TV - TV'\|_\infty \le \gamma \|V - V'\|_\infty$. By the Banach fixed-point theorem, repeatedly applying $T$ from any starting $V$ converges to the unique fixed point $V^*$ at a geometric rate $\gamma^k$.

One sentence: Keep applying the Bellman optimality operator until the value function stops changing; then read off the greedy policy.

Cousin (policy iteration): Alternates policy evaluation (solve the linear system for $V^\pi$ exactly) with policy improvement (act greedily w.r.t. $V^\pi$). Fewer iterations than value iteration but each one is heavier. In practice, modified policy iteration — a few sweeps of evaluation between improvements — often beats both.

Limitation: Requires the model $(P, R)$. Everything below (SARSA, Q-learning, …) drops that requirement at the price of having to sample.
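A minimal policy-iteration sketch, in the same $P[s,a,s']$ / $R[s,a]$ convention as the value-iteration code in this section (assumes a finite MDP and exact linear-solve evaluation):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """P[s,a,s'] transition probs; R[s,a] rewards."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # evaluation: solve (I - gamma * P_pi) V = R_pi exactly
        P_pi = P[np.arange(n_states), policy]        # (S, S)
        R_pi = R[np.arange(n_states), policy]        # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # improvement: act greedily w.r.t. V
        Q = R + gamma * P @ V                        # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # stable policy => optimal
            return V, policy
        policy = new_policy
```

Because there are finitely many deterministic policies and each improvement step is monotone, the loop terminates at an optimal policy.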

Exercises
  1. warm-up · Run value iteration on a 4×4 gridworld with reward $-1$ per step and $+10$ at the goal. Print $V^*$ as a grid and the greedy policy as arrows.
    Solution

    Represent the grid as $P[s, a, s']$ with a deterministic transition for each action and a wall bounce back to $s$. After VI converges:

    arrows = {0:'→', 1:'↓', 2:'←', 3:'↑'}
    print(V.reshape(4, 4))
    print(np.array([arrows[a] for a in policy]).reshape(4, 4))

    Values should increase monotonically toward the goal along any optimal path.

  2. build · Implement both synchronous and in-place (Gauss–Seidel) value iteration. Compare iterations-to-convergence on a 20×20 grid — in-place should win.
    Solution

    Synchronous: compute $V_{k+1}$ from $V_k$ (hold a copy). In-place: overwrite $V$ directly during the sweep, so later states in the sweep already see updated values.

    The in-place variant typically needs 30–50% fewer iterations because information propagates within a single sweep. Order matters: sweep from states near terminal rewards first for the biggest speedup.

  3. stretch · Implement Prioritised Sweeping: maintain a priority queue of states keyed by Bellman residual, always update the largest. Compare wall-clock convergence against plain value iteration.
    Solution

    After updating state $s$, push every predecessor $s'$ (states with $P(s|s',a) > 0$) onto the heap with priority $|\Delta V(s)| \cdot \max_a P(s|s',a)$. Pop the largest, update it, repeat.

    Use heapq with negated priorities for a max-heap. Moore & Atkeson (1993) is the original reference. On sparse-reward gridworlds, prioritised sweeping converges an order of magnitude faster in wall-clock time.

Python
import numpy as np

def value_iteration(P, R, gamma=0.99, theta=1e-6):
    """P[s,a,s'] transition probs; R[s,a] rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        delta = 0
        for s in range(n_states):
            v_old = V[s]
            V[s] = max(R[s, a] + gamma * P[s, a] @ V
                       for a in range(n_actions))
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    policy = np.argmax(
        [[R[s, a] + gamma * P[s, a] @ V for a in range(n_actions)]
         for s in range(n_states)], axis=1)
    return V, policy
§ II

Model-Free, Value-Based

On-policy · TD(0) · 2.1

SARSA

The letters spell the transition: $(s, a, r, s', a')$.
Update
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$
Theory & intuition

The name: The letters spell the transition tuple used in the update: $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$. You need all five before you can apply the update, which is why SARSA is naturally online and step-by-step.

Why on-policy: The bootstrap target uses $a_{t+1}$ actually sampled from the current behaviour policy (e.g. $\varepsilon$-greedy). So SARSA learns $Q^\pi$ for the policy being followed — including its exploration. Under GLIE schedules (Greedy in the Limit with Infinite Exploration), $\pi$ converges to $\pi^*$ and $Q$ to $Q^*$.

Cliff-walking: The textbook demonstration: on a gridworld with a cliff, SARSA learns that $\varepsilon$-greedy exploration sometimes pushes it off the edge, so it takes a safer path further from the cliff. Q-learning (below) learns the optimal path along the cliff edge and falls off regularly during training.

When to prefer it: When exploration itself is costly — learning on real hardware, safety-critical settings — SARSA's pessimism about its own randomness is a feature, not a bug.

Exercises
  1. warm-up · Run SARSA on the Sutton & Barto Cliff Walking env. Plot learning curves for $\varepsilon = 0.1$ and $\varepsilon = 0.01$.
    Solution

    Use gymnasium's CliffWalking-v0. Track the episode return (should converge around $-17$ for the safe path at $\varepsilon = 0.1$, closer to the optimal $-13$ at $\varepsilon = 0.01$).

    Key watch-out: SARSA's last action in an episode has no next action. Handle by using the terminal flag to zero-out the bootstrap term.

  2. build · Implement Expected SARSA: replace $Q(s', a')$ in the target with $\mathbb{E}_{a' \sim \pi}[Q(s', a')]$. Compare the variance of TD errors against vanilla SARSA.
    Solution
    pi = np.full(Q.shape[1], eps / Q.shape[1])
    pi[Q[s2].argmax()] += 1 - eps
    expected_q = pi @ Q[s2]
    delta = r + gamma * expected_q - Q[s, a]

    Expected SARSA removes the variance from sampling $a'$. Track np.var(td_errors) over a rolling window — it should drop by ~20–40% depending on $\varepsilon$.

  3. stretch · Anneal $\varepsilon$ on a GLIE schedule ($\varepsilon_k = 1/k$). Show empirically that $Q$ converges to $Q^*$ (track $\|Q_k - Q_{k-1}\|_\infty$).
    Solution

    GLIE = Greedy in the Limit with Infinite Exploration. $\varepsilon_k = 1/k$ satisfies both: every action still sampled infinitely often ($\sum \varepsilon_k = \infty$), but the policy converges to greedy.

    Plot $\log \|Q_k - Q_{k-1}\|_\infty$ over episodes — it should trend toward zero. Compare against constant $\varepsilon = 0.1$ where this gap plateaus at a nonzero value.

Python
import numpy as np

def eps_greedy(Q, s, eps):
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa(env, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(Q, s, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(Q, s2, eps)
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
            s, a = s2, a2
    return Q
Off-policy · TD(0) · 2.2

Q-Learning

Bootstraps off the greedy next action, not the behavioural one.
Update
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big]$$
Theory & intuition

History: Introduced by Watkins in his 1989 Cambridge PhD thesis, Q-learning was the first proven off-policy TD control algorithm — a small equation with an outsized legacy.

Why off-policy: The target $r + \gamma \max_{a'} Q(s', a')$ evaluates the greedy policy regardless of how you actually explored. So you can follow $\varepsilon$-greedy (or anything else with sufficient coverage) and still converge to $Q^*$, provided every $(s,a)$ is visited infinitely often and step sizes satisfy Robbins–Monro conditions ($\sum \alpha = \infty$, $\sum \alpha^2 < \infty$).

Maximisation bias: Because $\mathbb{E}[\max_a X_a] \ge \max_a \mathbb{E}[X_a]$, the $\max$ over noisy $Q$-estimates is systematically optimistic. In small tabular problems this is a rounding error; in deep RL it becomes a structural pathology that Double DQN was invented to fix.

One sentence: Learn the value of behaving greedily, even while you don't.
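The maximisation bias is easy to demonstrate numerically; a sketch with ten actions whose true values are all zero (the last two lines preview the double-estimator fix):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_runs = 10, 100_000
# k actions, true Q = 0 for all of them; each run draws noisy estimates
noisy_Q = rng.normal(size=(n_runs, k))

# single estimator: the max over noisy estimates is biased upward
single = noisy_Q.max(axis=1).mean()   # about 1.5, even though max_a E[Q_hat(a)] = 0

# double estimator: select with one sample, evaluate with an independent one
noisy_Q2 = rng.normal(size=(n_runs, k))
double = noisy_Q2[np.arange(n_runs), noisy_Q.argmax(axis=1)].mean()   # about 0
```

Because the evaluating sample is independent of the selecting one, the double estimate is unbiased — the core idea behind Double Q-learning and Double DQN.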

Exercises
  1. warm-up · On Cliff Walking, train SARSA and Q-learning with the same $\varepsilon$. Verify: SARSA takes the safe path far from the cliff; Q-learning takes the optimal-but-dangerous path along the edge.
    Solution

    After training, run each policy greedily (no exploration). Trace the path both take from the start cell. Q-learning's path will skim the cliff edge (shortest); SARSA's will hug the top (safer from exploratory missteps during training).

    During training, Q-learning's total reward is worse because it falls off the cliff often — this is the canonical illustration of why "on-policy value of an optimal policy" ≠ "value you collect while learning it."

  2. build · Use a state-action-specific learning rate $\alpha_t(s,a) = 1 / N(s,a)$. Show that convergence is smoother than with constant $\alpha$.
    Solution
    N = np.zeros_like(Q)
    # inside the loop:
    N[s, a] += 1
    alpha = 1.0 / N[s, a]
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

    This schedule satisfies Robbins–Monro ($\sum \alpha = \infty, \sum \alpha^2 < \infty$), unlike constant $\alpha$. Q-value curves should look much smoother, especially for rarely-visited $(s, a)$.

  3. stretch · Implement tabular Double Q-learning (two Q-tables, coin-flip which to update). Construct a noisy-reward MDP where standard Q-learning's max bias is visible, and plot the two methods' $\max_a Q(s, a)$ trajectories.
    Solution

    Van Hasselt's example, reproduced in Sutton & Barto: two states, where "state B" has many actions each giving $\mathcal{N}(-0.1, 1)$ reward. The true $Q$ is slightly negative, but vanilla Q-learning's $\max$ over noisy estimates reports it as strongly positive. Double Q-learning's estimate stays near zero.

    if np.random.rand() < 0.5:
        a_star = Q1[s2].argmax()
        Q1[s, a] += alpha * (r + gamma * Q2[s2, a_star] - Q1[s, a])
    else:
        a_star = Q2[s2].argmax()
        Q2[s, a] += alpha * (r + gamma * Q1[s2, a_star] - Q2[s, a])
Python
def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(Q, s, eps)
            s2, r, done = env.step(a)
            # bootstrap off greedy next action
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = s2
    return Q
Deep · Off-policy · 2.3

Deep Q-Network (DQN)

Q-learning with a neural net — stabilised by a replay buffer and a target network.
Loss
$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \Big[\big(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\big)^2\Big]$$

with replay buffer $\mathcal{D}$ and target params $\theta^-$ copied from $\theta$ every $N$ steps.

Theory & intuition

The breakthrough: DQN's contribution wasn't "combine Q-learning with a neural net" — that had been tried. It was discovering how to make it stable. The 2015 Nature paper achieved superhuman play on 49 Atari games from pixels alone, using a single architecture and the same hyperparameters.

The deadly triad: Function approximation + bootstrapping + off-policy learning can cause value estimates to diverge. DQN doesn't solve this in theory; it tames it in practice with two engineering tricks.

Replay buffer: Uniformly sampling past transitions breaks the temporal correlation between consecutive samples. Without it, SGD sees a sequence of highly correlated mini-batches (all from the same episode), violating the i.i.d. assumption and causing catastrophic forgetting as the distribution shifts.

Target network: If you bootstrap off the live network, every update to $Q_\theta$ immediately moves the very target you're regressing toward — you're chasing your own tail. Freezing $\theta^- \leftarrow \theta$ every $N$ steps gives the optimizer a stationary target for the span of an update cycle.

Descendants: Double DQN, Prioritised Replay, Dueling, Noisy Nets, Distributional (C51/QR-DQN), and n-step returns — all combined in Rainbow, which is still a strong baseline for discrete-action deep RL.

Exercises
  1. warm-up · Train DQN on CartPole-v1: replay buffer 10k, target update every 500 steps, batch 64. Reach >450 average return over 100 evaluation episodes.
    Solution

    Architecture: 2-layer MLP (64-64 ReLU). Adam, lr=1e-3. Linear $\varepsilon$ decay from 1.0 to 0.05 over 10k steps. Start learning after 1000 random steps to fill the buffer.

    Common failure: target network updated too frequently → oscillation. If learning is unstable, increase the target-update interval to 1000 or use soft Polyak updates $\theta^- \leftarrow (1-\tau)\theta^- + \tau\theta$ with $\tau = 0.005$.

  2. build · Add Prioritised Experience Replay (proportional variant): sample with probability $\propto |\delta|^\alpha$, correct with importance-sampling weights. Compare sample-efficiency curves.
    Solution

    Use a sum-tree for $O(\log n)$ proportional sampling (Schaul et al. 2015). Sampling probability: $p_i = (|\delta_i| + \varepsilon)^\alpha / \sum_j (|\delta_j| + \varepsilon)^\alpha$. Correction weight: $w_i = (N \cdot p_i)^{-\beta}$, normalised by $\max w$.

    Use $\alpha = 0.6, \beta$ annealed from $0.4$ to $1.0$. Update priorities with fresh $|\delta|$ after every gradient step. Expect ~2× sample efficiency on Atari.

  3. stretch · Train DQN on LunarLander-v2 with NoisyNets for exploration (factorised noisy linear layers, no $\varepsilon$-greedy). Show it still solves the environment.
    Solution

    Replace the last nn.Linear with a NoisyLinear that has learnable $\mu_W, \sigma_W, \mu_b, \sigma_b$. Forward uses $W = \mu_W + \sigma_W \odot \varepsilon^W$ with factorised Gaussian noise $\varepsilon^W = f(\varepsilon^{\text{in}}) \otimes f(\varepsilon^{\text{out}})$, where $f(x) = \mathrm{sign}(x)\sqrt{|x|}$.

    Resample the noise at each forward pass. The learned $\sigma$ values decrease over training as the policy commits — giving adaptive, state-dependent exploration for free.
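Exercise 1's solution mentions soft Polyak updates; a minimal sketch over NumPy arrays standing in for parameter tensors (a PyTorch version loops over `parameters()` the same way):

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """In place: theta_minus <- (1 - tau) * theta_minus + tau * theta."""
    for tp, p in zip(target_params, online_params):
        tp *= (1.0 - tau)
        tp += tau * p
```

Applied after every gradient step, this keeps the target network a slowly moving exponential average of the online network rather than a periodically jumping copy.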

PyTorch
import torch, torch.nn as nn, torch.nn.functional as F

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128),     nn.ReLU(),
            nn.Linear(128, n_actions))
    def forward(self, x): return self.net(x)

def dqn_update(q_net, target_net, batch, optim, gamma=0.99):
    s, a, r, s2, done = batch                        # tensors
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_target = r + gamma * target_net(s2).max(1).values * (1 - done)
    loss = F.mse_loss(q_sa, q_target)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

# periodically: target_net.load_state_dict(q_net.state_dict())
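The `batch` argument above is assumed to come from a replay buffer; a minimal uniform-sampling sketch (conversion to tensors is left to the caller):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)   # oldest transitions fall off the left

    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)   # uniform, without replacement
        return tuple(zip(*batch))   # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buf)
```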
Deep · Variant · 2.4

Double DQN

Decouples action selection from evaluation to cut the overestimation bias.
Target
$$y = r + \gamma \, Q_{\theta^-}\big(s',\; \arg\max_{a'} Q_\theta(s', a')\big)$$

Online net picks the action; target net evaluates it.

Theory & intuition

The diagnosis: Van Hasselt (2010, 2015) showed that DQN's $\max$ systematically overestimates: if each $Q(s', a)$ has mean-zero noise $\varepsilon_a$, then $\max_a (Q^* + \varepsilon_a)$ is upward-biased because the $\max$ selectively picks out the luckiest estimate. That bias propagates through bootstrapping.

The fix: Decouple action selection from action evaluation. Use one network to pick the arg-max action, and a different network to estimate its value. Since the two networks have independent noise, the evaluation of the chosen action is no longer conditioned on it looking lucky.

Why it's almost free: DQN already has two networks ($\theta$ online, $\theta^-$ target). Double DQN simply reuses them: online selects, target evaluates. No new parameters, no new hyperparameters, often dramatically more stable learning.

When it matters most: Environments with many actions, sparse rewards, or long bootstrapping chains — exactly the settings where DQN's optimism compounds badly.

Exercises
  1. warm-up · Modify your DQN to use Double DQN's decoupled max. On CartPole, plot $\mathbb{E}[\max_a Q(s,a)]$ over training for both methods — vanilla should trend noticeably higher.
    Solution
    with torch.no_grad():
        a_star = q_net(s2).argmax(1, keepdim=True)       # online picks
        q_next = target_net(s2).gather(1, a_star).squeeze(1)  # target eval
        y = r + gamma * q_next * (1 - done)

    Track q_net(s_batch).max(1).values.mean() every 1000 steps. Vanilla DQN's curve drifts upward well beyond the true optimal discounted value ($\approx (1-\gamma^{500})/(1-\gamma) \approx 99$ on CartPole-v1 with $\gamma = 0.99$); Double DQN stays close to reality.

  2. build · Design a "fuzzy bandit" MDP where all true Q-values are zero but rewards are Gaussian. Show DQN's $Q$ estimates drift upward with training while Double DQN's stay calibrated.
    Solution

    One state, $k$ actions, each giving $\mathcal{N}(0, 1)$ reward, episode ends immediately. True $Q^* = 0$ for all $a$. With just 10 actions, vanilla $\max$-estimate drifts to around $+0.5$ over training — pure noise amplified by optimism.

    Double DQN requires two estimators; a clean setup uses replay + periodic target update as usual. Plot Q.mean() vs training steps for both — DQN climbs, Double DQN hovers around 0.

  3. stretch · Implement Clipped Double Q (TD3-style twin critics with $\min$) as a discrete-action variant. Compare DQN, Double DQN, and Clipped Double Q on a gridworld with very noisy rewards.
    Solution

    Maintain two independent Q-networks $Q_1, Q_2$ with their own target nets. Target: $y = r + \gamma \min(Q^-_1(s', a^*), Q^-_2(s', a^*))$ where $a^* = \arg\max_a \tfrac{1}{2}(Q_1 + Q_2)(s', a)$.

    Expect: Clipped Double Q gives the lowest bias but slightly underestimates (the $\min$ is pessimistic). On very noisy rewards, this trade is usually worth it — under-estimation is far less harmful than over-estimation since it doesn't compound through bootstrapping.

Drop-in replacement
with torch.no_grad():
    a_star   = q_net(s2).argmax(1, keepdim=True)         # select with online
    q_next   = target_net(s2).gather(1, a_star).squeeze(1)  # eval with target
    q_target = r + gamma * q_next * (1 - done)
Deep · Architecture · 2.5

Dueling DQN

Factor $Q$ into a state-value $V$ and an advantage $A$ stream.
Decomposition
$$Q(s, a) = V(s) + \Big(A(s, a) - \tfrac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\Big)$$

Subtracting the mean advantage resolves the unidentifiability of $V$ and $A$.

Theory & intuition

The observation: Wang et al. (2016) noticed that in many states, the choice of action barely matters — standing still in a Breakout frame when the ball is far away has similar consequences regardless of which action you pick. Yet vanilla DQN must learn $Q(s, a)$ separately for every action, wasting capacity.

The decomposition: Split the network: one head predicts $V(s)$ (how good is this state?), another predicts $A(s, a)$ (how much better is each action than average?). Then $Q(s, a) = V(s) + A(s, a)$.

The identifiability trick: The decomposition is not unique: adding a constant to $V$ and subtracting it from $A$ gives the same $Q$. Subtracting the mean advantage, $A(s,a) - \tfrac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$, pins down $V$ as the average-action value and removes the gauge freedom, making the decomposition unique.

Why it helps: The $V$ stream receives gradient on every transition, not only on transitions taking action $a$. Sample efficiency improves markedly in environments with many actions or long periods where action choice is irrelevant.

Exercises
  1. warm-up · Implement the Dueling head (shared trunk → $V$ and $A$ streams → mean-subtraction). Pass a test batch and verify the output shape matches $Q(s, \cdot)$.
    Solution
    def forward(self, x):
        h = self.trunk(x)              # (B, hidden)
        v = self.v_head(h)             # (B, 1)
        a = self.a_head(h)             # (B, |A|)
        return v + (a - a.mean(dim=1, keepdim=True))  # (B, |A|)

    Sanity checks: output shape is (batch, n_actions); for any fixed $s$, $\sum_a (Q(s,a) - V(s)) = 0$ (mean-subtraction enforces this).

  2. build · Combine Dueling with Double DQN on LunarLander-v2. Plot learning curves for DQN, Double DQN, and Dueling+Double DQN on the same axes.
    Solution

    No new math needed — just swap your architecture to DuelingQNet and keep the Double DQN target. Both techniques are orthogonal and compose cleanly.

    Run 5 seeds each and plot mean ± std. Dueling typically adds a modest but consistent bump on LunarLander (~5–10% in return, ~30% faster to convergence).

  3. stretch · Visualise what $V$ and $A$ learn on a gridworld: render $V(s)$ as one heatmap and $\max_a A(s,a) - \min_a A(s,a)$ as another. $V$ should highlight where to be; $A$ should highlight where the action choice matters.
    Solution

    Train on a gridworld with walls. Then for every cell $s$, query the net and record v(s) and a_spread = A(s).max() - A(s).min().

    Expect: $V$-heatmap shows a gradient toward the goal. $A$-spread heatmap is brightest at decision points (corridors that branch, cells next to walls) and dim in open rooms where all actions lead to similar outcomes. This is Dueling's intuition made visible.

PyTorch head
class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128),     nn.ReLU())
        self.v_head = nn.Linear(128, 1)
        self.a_head = nn.Linear(128, n_actions)

    def forward(self, x):
        h = self.trunk(x)
        v = self.v_head(h)                       # (B, 1)
        a = self.a_head(h)                       # (B, |A|)
        return v + (a - a.mean(dim=1, keepdim=True))
§ III

Policy Gradient & Actor-Critic

Monte-Carlo PG · 3.1

REINFORCE

Williams (1992). The plainest policy gradient estimator there is.
Policy gradient theorem
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \, G_t \right]$$

With a baseline $b(s_t)$ to reduce variance:

$$\nabla_\theta J \approx \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \big(G_t - b(s_t)\big)$$
Theory & intuition

Policy gradient theorem: The gradient of expected return equals the expectation of the score function times the return: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot G]$. This is the log-derivative trick, and it's beautiful because it lets you estimate a gradient through a stochastic, possibly non-differentiable environment — you only need gradients of your own policy.

Why baselines are free: Because $\mathbb{E}_a[\nabla_\theta \log \pi_\theta(a|s)] = 0$ for any fixed $s$, subtracting any function $b(s)$ from the return leaves the gradient unbiased but can dramatically reduce variance. This is the seed of every actor-critic method that follows.

Why it's unreliable: REINFORCE is unbiased but its variance scales with trajectory length and stochasticity — sample-inefficient, and often needs very careful reward scaling and baselines to make progress. It works but rarely wins on standard benchmarks; its descendants do.

One sentence: Push up the log-probability of actions whose returns beat the baseline; push down the rest.
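The "baselines are free" argument rests on $\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a|s)] = 0$, which is easy to verify numerically for a softmax policy, where $\partial \log \pi(a) / \partial \text{logit}_j = \mathbf{1}[a{=}j] - \pi(j)$:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=4)
p = np.exp(logits - logits.max())
p /= p.sum()                         # softmax policy over 4 actions

# E_a[score] = sum_a pi(a) * (onehot(a) - p) = p - p = 0
expected_score = sum(p[a] * (np.eye(4)[a] - p) for a in range(4))
assert np.allclose(expected_score, 0.0)
```

Since the score has zero mean, $\mathbb{E}[\text{score} \cdot b(s)] = b(s) \cdot 0 = 0$ for any action-independent baseline, so subtracting it cannot bias the gradient.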

Exercises
  1. warm-up · Train REINFORCE on CartPole without a baseline. Plot the variance of the per-episode gradient norm over training — it should be uncomfortably large.
    Solution

    Track grad_norm = sum((p.grad**2).sum() for p in policy.parameters()).sqrt() each episode. Plot variance over a rolling window (e.g. last 50 episodes).

    You'll see variance climb with episode length — longer episodes mean more stochastic return, more gradient variance. This is exactly what baselines fix.

  2. build · Add a running-mean baseline (subtract a moving average of returns). Plot the new gradient-norm variance — it should drop significantly while the expected return is unchanged.
    Solution
    baseline = 0.9 * baseline + 0.1 * returns.mean()
    advantages = returns - baseline

    Theory check: $\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi(a|s) \cdot b] = b \cdot \nabla_\theta 1 = 0$ for any baseline not depending on $a$. So expected gradient is unchanged; variance drops. Often a 5–10× variance reduction.

  3. stretch · Replace the running-mean baseline with a learned state-value baseline $V_\phi(s)$ trained by MSE regression onto $G_t$. This is actor-critic without bootstrapping — compare it to A2C on the same env.
    Solution

    Two networks: policy and value. Value loss: F.mse_loss(V(s), G). Advantage: $\hat A_t = G_t - V_\phi(s_t)$ (detach $V$ in the policy loss).

    This is sometimes called Monte-Carlo actor-critic. Its advantage is unbiased (no bootstrap) but higher variance than A2C. The trade-off matters: on short-horizon envs MC-AC often beats A2C; on long-horizon ones it loses badly.

PyTorch
import torch, torch.nn as nn
from torch.distributions import Categorical

class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.Tanh(),
            nn.Linear(128, n_actions))
    def forward(self, x): return self.net(x)

def reinforce_update(policy, trajectory, optim, gamma=0.99):
    # trajectory: list of (state, action, reward)
    rewards = [r for *_, r in trajectory]
    # discounted returns
    G, returns = 0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # standardise; the mean acts as a baseline

    loss = 0
    for (s, a, _), Gt in zip(trajectory, returns):
        logits = policy(s)
        logp   = Categorical(logits=logits).log_prob(a)
        loss  += -logp * Gt
    optim.zero_grad(); loss.backward(); optim.step()
Actor-Critic · 3.2

A2C (Advantage Actor-Critic)

Replace the Monte-Carlo return with a learned critic's advantage.
Objectives

Advantage: $A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$

Actor loss: $\mathcal{L}_\pi = -\mathbb{E}[\log \pi_\theta(a_t|s_t) \, A_t] - \beta \, H[\pi_\theta]$

Critic loss: $\mathcal{L}_V = \mathbb{E}\big[(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t))^2\big]$

The $H$ term is an entropy bonus that keeps the policy exploring.

Theory & intuition

Bias–variance trade: REINFORCE uses the full Monte-Carlo return $G_t$ (unbiased, high variance). Replacing it with the one-step advantage $r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ introduces bias (the critic is wrong) but slashes variance (you no longer depend on every future reward). Actor-critic methods live on this spectrum.

GAE: Generalised Advantage Estimation interpolates between $n$-step returns via a parameter $\lambda \in [0, 1]$: $\hat A^{\text{GAE}(\lambda)}_t = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}$ where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. $\lambda = 1$ recovers Monte-Carlo, $\lambda = 0$ the 1-step TD. $\lambda \approx 0.95$ is a common sweet spot.

Why entropy bonus: Without the $-\beta H[\pi]$ term, the policy tends to collapse deterministically too quickly — locking in whichever action looked best early and refusing to explore. The entropy bonus keeps the policy soft until enough evidence has accumulated.

A2C vs A3C: A3C (Mnih 2016) ran many actors asynchronously on CPU, each updating shared parameters. A2C is the synchronous version: wait for all actors to produce a batch, then update on GPU. A2C is usually as fast in wall-clock and far easier to debug.
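The GAE recursion runs backwards over a rollout; a minimal sketch (assumes `values` carries one extra bootstrap entry $V(s_T)$ at the end):

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """values has length T+1: V(s_0)..V(s_T), the last entry a bootstrap."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last   # A_t = delta_t + (gamma*lam) A_{t+1}
        adv[t] = last
    return adv
```

With `lam=0` this collapses to the one-step advantage used in the update below; `lam=1` gives the Monte-Carlo advantage.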

Exercises
  1. warm-up: Implement synchronous A2C on CartPole with 8 parallel envs and 5-step rollouts. Hit >450 return consistently.
    Show solution

    Use gym.vector.SyncVectorEnv with 8 copies. Collect a $(T=5, N=8)$ rollout, bootstrap from $V(s_T)$ for non-terminal envs, compute 5-step returns backwards, then one gradient step on the full batch.

    Key detail: reset rollout buffers after each update, not after each episode. Parallel envs reset independently when each terminates.
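The bootstrapped-return computation in that recipe can be sketched as follows (a minimal sketch; `rewards` and `dones` are $(T, N)$ tensors and `v_last` holds $V(s_T)$ per env — names are illustrative):

```python
import torch

def nstep_returns(rewards, dones, v_last, gamma=0.99):
    """Backward n-step returns over a (T, N) rollout, bootstrapping from V(s_T).

    A done at step t cuts the bootstrap for that env's trajectory.
    """
    T, N = rewards.shape
    R = v_last.clone()                      # (N,) bootstrap values
    out = torch.zeros(T, N)
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R * (1 - dones[t])
        out[t] = R
    return out
```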

  2. build: Swap the 1-step advantage for GAE with $\lambda = 0.95$. Sweep $\lambda \in \{0.0, 0.9, 0.95, 0.99, 1.0\}$ on LunarLander-v2 and identify the sweet spot.
    Show solution

    See the GAE recursion in §6.6: $\hat A_t = \delta_t + \gamma\lambda\, \hat A_{t+1}$ computed backwards. Drop-in replacement for the 1-step advantage.

    Expected curve: monotonic improvement from $\lambda=0$ to $\lambda \approx 0.95$, then slight drop at $\lambda=1$ (pure MC, high variance). The sweet spot is usually $0.95$–$0.97$.

  3. stretch: Run A2C on the sparse-reward MountainCar-v0. Without entropy bonus it will likely fail to explore — sweep the entropy coefficient $\beta$ and plot success rate vs $\beta$.
    Show solution

    Add -beta * dist.entropy().mean() to the actor loss. Sweep $\beta \in \{0, 0.001, 0.01, 0.05, 0.1\}$; run many seeds per setting since success on MountainCar is bimodal.

    Expected: $\beta = 0$ almost always fails (policy collapses on random initial action). Sweet spot around $\beta = 0.01$–$0.05$. Too high ($\beta = 0.1+$) and the policy never commits.

PyTorch (single update)
def a2c_update(actor, critic, rollout, opt_a, opt_c,
               gamma=0.99, ent_coef=0.01):
    s, a, r, s2, done = rollout
    with torch.no_grad():
        target = r + gamma * critic(s2).squeeze(-1) * (1 - done)
    v      = critic(s).squeeze(-1)
    adv    = (target - v).detach()

    logits = actor(s)
    dist   = Categorical(logits=logits)
    logp   = dist.log_prob(a)
    entropy = dist.entropy().mean()

    actor_loss  = -(logp * adv).mean() - ent_coef * entropy
    critic_loss = F.mse_loss(v, target)

    opt_a.zero_grad(); actor_loss.backward();  opt_a.step()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
Clipped Surrogate · 3.4

PPO

TRPO's guarantee, with first-order simplicity. The default modern on-policy workhorse.
[ toggle ]
Clipped objective

Let $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}$. Then

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta) A_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\Big]$$

Typically combined with a value loss and entropy bonus, using GAE for $A_t$.

Theory & intuition

The central trick Instead of enforcing a KL constraint exactly, clip the probability ratio so that large policy updates can't produce outsized gradients. When $A_t > 0$ and the ratio $r_t$ has climbed past $1+\epsilon$, the gradient is zero — no incentive to push further. Symmetric treatment for $A_t < 0$ and $r_t < 1-\epsilon$.

Why it approximates a trust region Clipping caps the per-sample influence of any transition on the gradient, which in aggregate bounds how far the policy can move per epoch. It's a crude but remarkably effective substitute for TRPO's KL ball, using only first-order SGD.

The min is doing work The $\min$ in $\min(r_t A_t, \mathrm{clip}(r_t) A_t)$ forms a pessimistic lower bound on the surrogate objective. When clipping would make the objective larger than the unclipped version, the $\min$ rejects the clip — so you're only ever conservative, never optimistic.
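Both behaviours — the zeroed gradient past $1+\epsilon$, and the $\min$ rejecting an optimistic clip — can be checked numerically on a single toy transition (illustrative helper, not part of any standard API):

```python
import torch

def clip_surrogate_grad(ratio_val, adv, eps=0.2):
    """Gradient of the clipped surrogate w.r.t. log pi, at a given ratio/advantage."""
    logp = torch.log(torch.tensor(ratio_val)).requires_grad_(True)
    ratio = torch.exp(logp)                 # old log-prob folded into ratio_val
    adv = torch.tensor(adv)
    surr = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    surr.backward()
    return logp.grad.item()
```

With $A > 0$ and the ratio past $1+\epsilon$ the gradient is exactly zero; with $A < 0$ at the same ratio, the unclipped branch is lower and the full gradient survives — the pessimistic asymmetry described above.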

In practice Combined with GAE-$\lambda$ for advantages, a shared or separate value network, and modest entropy bonus. Standard recipe: collect a batch, do ~10 epochs of mini-batch SGD on it, discard it, repeat. Robust across hyperparameters in a way few RL algorithms are.

Beyond games PPO is the on-policy workhorse behind most large-scale RL, including RLHF for LLMs. Its simplicity, stability, and tolerance for distributed training made it the default.

Exercises
  1. warm-up: Implement PPO on CartPole with the standard recipe: 4 parallel envs, 128-step rollouts, 4 epochs, 4 minibatches, $\epsilon = 0.2$, GAE $\lambda = 0.95$.
    Show solution

    Each outer iteration: collect $4 \times 128 = 512$ transitions, compute GAE and returns, then run 4 epochs × 4 minibatches of size 128 on the clipped loss + value loss + entropy bonus.

    Typical coefficients: value 0.5, entropy 0.01, max grad norm 0.5, Adam lr 3e-4. CartPole should solve in ~20k env steps.

  2. build: Add value clipping: clip $V_{\text{new}}$ around $V_{\text{old}}$ by the same $\epsilon$. Ablate with/without on BipedalWalker-v3.
    Show solution
    v_clipped = v_old + (v_new - v_old).clamp(-eps, eps)
    v_loss_1 = (v_new - returns).pow(2)
    v_loss_2 = (v_clipped - returns).pow(2)
    v_loss   = 0.5 * torch.max(v_loss_1, v_loss_2).mean()

    The max (not min) mirrors the policy-side clip's pessimistic structure. Value clipping mildly helps stability on most tasks but is not always free — Engstrom et al. showed that some environments do better without it.

  3. stretch: Reproduce (a subset of) the Engstrom et al. Implementation Matters ablations: toggle observation normalisation, reward scaling, orthogonal init, and Adam $\epsilon$. Plot how each individually impacts final return.
    Show solution

    Maintain a running mean/std of observations (update only during data collection); normalise before feeding to nets. Scale rewards by $1 / \text{std}(\text{discounted returns})$ — prevents value-net blowup on high-reward tasks.
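    A running normaliser along these lines might look like (a sketch using the standard parallel mean/variance merge; the class name is illustrative):

```python
import numpy as np

class RunningNorm:
    """Running mean/variance of observations via the parallel-variance update."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4                    # tiny prior avoids div-by-zero
        self.eps = eps

    def update(self, x):                     # x: (batch, *shape)
        b_mean, b_var, b_n = x.mean(0), x.var(0), x.shape[0]
        delta = b_mean - self.mean
        tot = self.count + b_n
        self.mean = self.mean + delta * b_n / tot
        self.var = (self.var * self.count + b_var * b_n
                    + delta ** 2 * self.count * b_n / tot) / tot
        self.count = tot

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)
```

    Call `update` only on freshly collected observations, then normalise everywhere — updating during gradient epochs would make the input distribution drift mid-optimisation.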

    Orthogonal init ($\text{gain} = \sqrt{2}$ for hidden, $0.01$ for policy output, $1.0$ for value output). Adam $\epsilon = 10^{-5}$, not default $10^{-8}$. Each trick is worth ~10–50% on final return individually; together, they're what separates "PPO" from "a PPO that works."

PyTorch
def ppo_update(actor, critic, batch, opt_a, opt_c,
               clip=0.2, ent_coef=0.01, vf_coef=0.5, epochs=10):
    s, a, old_logp, adv, ret = batch
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)

    for _ in range(epochs):
        logits = actor(s)
        dist   = Categorical(logits=logits)
        logp   = dist.log_prob(a)
        ratio  = torch.exp(logp - old_logp)

        surr1 = ratio * adv
        surr2 = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
        policy_loss = -torch.min(surr1, surr2).mean()
        entropy     = dist.entropy().mean()

        v = critic(s).squeeze(-1)
        value_loss = F.mse_loss(v, ret)

        loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
        opt_a.zero_grad(); opt_c.zero_grad()
        loss.backward()
        opt_a.step(); opt_c.step()
§ IV

Continuous Control

Off-policy · Deterministic · 4.1

DDPG

DPG + DQN tricks — an actor-critic for continuous action spaces.
[ toggle ]
Losses

Critic: $y = r + \gamma Q_{\phi^-}(s', \mu_{\theta^-}(s'))$, minimise $\big(Q_\phi(s,a) - y\big)^2$.

Actor: $\nabla_\theta J \approx \mathbb{E}_s\big[\nabla_a Q_\phi(s,a)\big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s)\big]$

Exploration via additive noise (Ornstein–Uhlenbeck or plain Gaussian).

Theory & intuition

The problem Q-learning in continuous action spaces is awkward: you can't tabulate $Q(s, a)$ over uncountably many $a$, and the $\max_a Q(s, a)$ step becomes an inner optimisation problem at every transition. DDPG sidesteps this by learning a deterministic policy $\mu_\theta(s)$ and treating it as "the argmax."

Deterministic Policy Gradient Silver et al. (2014) showed that for deterministic policies, the policy gradient has a clean form: $\nabla_\theta J = \mathbb{E}\big[\nabla_a Q_\phi(s, a)\big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s)\big]$. You pass gradients from $Q$ through the action, into the actor — essentially asking "how should I move the action to make $Q$ go up?"

DQN tricks, ported Off-policy replay buffer, plus soft target updates $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$ with $\tau \approx 0.005$ — a continuous Polyak averaging that's gentler than DQN's hard copies.

Exploration A deterministic policy doesn't explore on its own, so noise is added externally: originally Ornstein–Uhlenbeck (temporally correlated, good for physics), in practice often just Gaussian.

Caveat DDPG is notorious for being unstable and hyperparameter-sensitive. Q-values tend to explode, and seemingly minor hyperparameter changes can break it. TD3 (next) fixes most of this with surgical precision.

Exercises
  1. warm-up: Implement DDPG on Pendulum-v1. Use Gaussian exploration noise $\sigma = 0.1$ and Polyak averaging $\tau = 0.005$.
    Show solution

    Actor outputs $\mathrm{tanh}$-scaled action; during training, add $\mathcal{N}(0, \sigma)$ noise clipped to the action range. Replay buffer size 100k, batch 256. Adam with actor lr 1e-4, critic lr 1e-3 (critic gets higher lr).

    Pendulum should solve (return > $-200$) in ~30k env steps.

  2. build: Replace Gaussian noise with Ornstein–Uhlenbeck (temporally correlated) noise. Compare on Pendulum and on HalfCheetah — which domain benefits more from correlated noise?
    Show solution
    $x_t = x_{t-1} + \theta(\mu - x_{t-1})\,\Delta t + \sigma\sqrt{\Delta t}\;\varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, 1)$

    Typical: $\theta = 0.15, \sigma = 0.2, \mu = 0$. OU noise helps for momentum-heavy physics where single-step noise gets averaged out — HalfCheetah, Ant, Humanoid. On Pendulum (simpler), plain Gaussian is usually as good.
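    A minimal OU process with those typical parameters might look like (sketch; the class name is illustrative):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: mean-reverting, temporally correlated noise."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu, dtype=np.float64)

    def reset(self):                         # call at episode boundaries
        self.x[:] = self.mu

    def sample(self):
        self.x = (self.x + self.theta * (self.mu - self.x) * self.dt
                  + self.sigma * np.sqrt(self.dt)
                  * self.rng.standard_normal(self.x.shape))
        return self.x.copy()
```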

  3. stretch: Deliberately break DDPG: remove the target networks, or set $\tau = 1$ (hard copy every step). Plot the running max of $Q(s, \mu(s))$ over training — you should see it explode.
    Show solution

    Without a slowly-moving target, the Bellman equation $Q = r + \gamma Q$ becomes a fixed-point iteration on a moving target. Small biases compound: $\log Q_{\max}(t)$ grows roughly linearly for the first few thousand steps, then super-linearly — classic divergence signature.

    Any stabiliser (target net, or even just a longer replay buffer) suffices to tame it. This exercise is a visceral demonstration of why every off-policy deep RL algorithm since DQN includes target networks.

PyTorch (core loop)
def ddpg_update(actor, critic, actor_t, critic_t, batch,
                opt_a, opt_c, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch
    with torch.no_grad():
        a2 = actor_t(s2)
        y  = r + gamma * critic_t(s2, a2).squeeze(-1) * (1 - done)

    # Critic step
    q = critic(s, a).squeeze(-1)
    c_loss = F.mse_loss(q, y)
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()

    # Actor step (maximise Q)
    a_loss = -critic(s, actor(s)).mean()
    opt_a.zero_grad(); a_loss.backward(); opt_a.step()

    # Polyak averaging on targets
    for p, pt in zip(actor.parameters(),  actor_t.parameters()):
        pt.data.mul_(1 - tau).add_(tau * p.data)
    for p, pt in zip(critic.parameters(), critic_t.parameters()):
        pt.data.mul_(1 - tau).add_(tau * p.data)
Off-policy · 4.2

TD3

Three targeted fixes to DDPG's failure modes.
[ toggle ]
The three tricks
  1. Clipped double Q: target uses $\min(Q_{\phi_1^-}, Q_{\phi_2^-})$ to curb overestimation.
  2. Target-policy smoothing: add clipped noise to $\mu_{\theta^-}(s')$ before evaluating $Q$.
  3. Delayed actor updates: update $\theta$ (and targets) once every $d$ critic steps.
$$y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}\big(s',\; \mathrm{clip}(\mu_{\theta^-}(s') + \epsilon,\, a_{\min}, a_{\max})\big), \quad \epsilon \sim \mathrm{clip}(\mathcal{N}(0,\sigma), -c, c)$$
Theory & intuition

Diagnosis Fujimoto, van Hoof & Meger (2018) diagnosed three specific failure modes in DDPG and patched each. The result is the same sample-efficiency regime with dramatically better stability.

1. Clipped double-Q DDPG inherits Q-learning's overestimation bias, which is worse in continuous control because the actor actively exploits Q-errors. TD3 maintains two critics $Q_{\phi_1}, Q_{\phi_2}$ and uses $\min(Q_{\phi_1^-}, Q_{\phi_2^-})$ in the target — a pessimistic estimate, an analogue of Double Q-learning for continuous actions.

2. Target-policy smoothing The deterministic actor tends to find sharp peaks in the critic that are artifacts of function approximation. Adding clipped Gaussian noise to the target action before evaluating $Q$ acts as a regulariser: the critic must be smooth in a neighbourhood of $\mu_{\theta^-}(s')$, which it always is if the underlying $Q$-function is.

3. Delayed policy updates A bad critic produces a bad policy gradient. Updating the actor (and targets) only once every $d$ critic steps — typically $d = 2$ — lets the critic catch up before the actor exploits its errors. Simple, almost free, very effective.

Takeaway The same tricks that stabilise DQN (double estimators, target smoothing, delayed targets) are needed in continuous control too, sometimes even more urgently.

Exercises
  1. warm-up: Implement TD3 on Pendulum: twin critics, policy noise $\sigma = 0.2$, clip range $c = 0.5$, policy delay $d = 2$. Verify it outperforms your DDPG baseline.
    Show solution

    Two independent critic networks with separate optimisers. Actor update runs once every 2 critic updates. Target action: $\tilde a = \mathrm{clip}(\mu_{\theta^-}(s') + \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c), a_{\min}, a_{\max})$.

    TD3 typically adds 10–20% final performance over DDPG on Pendulum and cuts training-run failure rate roughly in half.

  2. build: Ablate the three tricks one at a time (remove twin critics; remove target-policy smoothing; remove delayed updates). Plot learning curves on Walker2d — which removal hurts most?
    Show solution

    Fujimoto et al. found that twin critics contribute most of the gain — they're what prevents the runaway overestimation DDPG suffers from. Target-policy smoothing matters next. Delayed updates are the cheapest fix but contribute least alone.

    Run 5 seeds per config; results should show: all-three > no-delay > no-smoothing > no-twins ≈ DDPG.

  3. stretch: Add per-dimension learnable exploration noise: make $\sigma$ a learnable parameter trained by gradient descent on the critic loss. Compare against fixed $\sigma$.
    Show solution

    Parameterise $\sigma = \text{softplus}(s)$ where $s$ is a per-action-dim learnable parameter. Add it to the action at collection time and its log-prob to an entropy bonus term in the actor loss (so it doesn't collapse to zero).

    This is halfway to SAC — you're learning a Gaussian policy, just without the reparameterisation trick. Often matches fixed $\sigma$ on easy tasks, wins by a clear margin on task suites where optimal noise scale varies (Humanoid, Ant).

Target computation
with torch.no_grad():
    noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
    a2    = (actor_t(s2) + noise).clamp(act_low, act_high)
    q1_t  = critic1_t(s2, a2).squeeze(-1)
    q2_t  = critic2_t(s2, a2).squeeze(-1)
    y     = r + gamma * torch.min(q1_t, q2_t) * (1 - done)

# update both critics toward y; update actor only every `policy_delay` steps
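Assembled into one update with the delayed actor step, a full TD3 iteration might look like (a sketch; network and optimiser objects follow the DDPG section's conventions, and `opt_q` is assumed to hold both critics' parameters):

```python
import torch
import torch.nn.functional as F

def td3_step(step, actor, actor_t, q1, q2, q1_t, q2_t, batch, opt_a, opt_q,
             gamma=0.99, tau=0.005, policy_noise=0.2, noise_clip=0.5,
             policy_delay=2, act_low=-1.0, act_high=1.0):
    s, a, r, s2, done = batch

    # Trick 2 (smoothed target action) + Trick 1 (clipped double-Q target)
    with torch.no_grad():
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (actor_t(s2) + noise).clamp(act_low, act_high)
        q_t = torch.min(q1_t(s2, a2), q2_t(s2, a2)).squeeze(-1)
        y = r + gamma * q_t * (1 - done)

    q_loss = F.mse_loss(q1(s, a).squeeze(-1), y) \
           + F.mse_loss(q2(s, a).squeeze(-1), y)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # Trick 3: actor + target updates only every `policy_delay` critic steps
    if step % policy_delay == 0:
        a_loss = -q1(s, actor(s)).mean()
        opt_a.zero_grad(); a_loss.backward(); opt_a.step()
        with torch.no_grad():
            for net, net_t in ((actor, actor_t), (q1, q1_t), (q2, q2_t)):
                for p, pt in zip(net.parameters(), net_t.parameters()):
                    pt.data.mul_(1 - tau).add_(tau * p.data)
```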
Off-policy · Stochastic · 4.3

SAC (Soft Actor-Critic)

Maximum-entropy RL — the current go-to for continuous control.
[ toggle ]
Maximum-entropy objective
$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\big[r_t + \alpha \, H(\pi(\cdot | s_t))\big]$$

Soft Q target:

$$y = r + \gamma \big(\min_{i=1,2} Q_{\phi_i^-}(s', \tilde a') - \alpha \log \pi_\theta(\tilde a' | s')\big), \quad \tilde a' \sim \pi_\theta(\cdot | s')$$

Actor loss:

$$\mathcal{L}_\pi = \mathbb{E}_{s, a \sim \pi_\theta}\big[\alpha \log \pi_\theta(a|s) - \min_i Q_{\phi_i}(s, a)\big]$$
Theory & intuition

Maximum entropy RL Instead of maximising just expected return, SAC (Haarnoja 2018) maximises return plus the policy's entropy: $J(\pi) = \sum_t \mathbb{E}\big[r_t + \alpha \, H(\pi(\cdot | s_t))\big]$. The optimal policy is deliberately stochastic wherever multiple actions are near-equally good.

Why entropy is fundamental It solves two problems at once. (1) Exploration is baked into the objective, not bolted on as noise. (2) The policy becomes more robust to perturbations — if two actions have nearly the same $Q$, both get probability mass, so small Q-errors don't cascade into catastrophic deterministic choices.

Soft Bellman equation Under max-entropy RL, the optimal policy turns out to be a Boltzmann distribution of the soft $Q$: $\pi^*(a|s) \propto \exp(Q^*(s, a)/\alpha)$. This is mathematically elegant — it connects RL to energy-based models and to the softmax that's everywhere in ML.
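The closed form can be verified numerically in the discrete case: run gradient ascent on $\mathbb{E}_\pi[Q] + \alpha H(\pi)$ over logits and compare against $\mathrm{softmax}(Q/\alpha)$ (a toy check, not part of SAC itself; values are arbitrary):

```python
import torch

def soft_objective(logits, Q, alpha):
    """E_pi[Q] + alpha * H(pi) for a discrete pi = softmax(logits)."""
    pi = torch.softmax(logits, -1)
    entropy = -(pi * torch.log(pi + 1e-12)).sum()
    return (pi * Q).sum() + alpha * entropy

Q = torch.tensor([1.0, 2.0, 0.5])
alpha = 0.5
logits = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)
for _ in range(3000):
    loss = -soft_objective(logits, Q, alpha)
    opt.zero_grad(); loss.backward(); opt.step()

pi_star = torch.softmax(Q / alpha, -1)   # Boltzmann closed form
```

After optimisation, `torch.softmax(logits, -1)` should match `pi_star` — the max-entropy optimum really is the Boltzmann distribution over $Q/\alpha$.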

Reparameterisation trick SAC's actor outputs parameters of a squashed-Gaussian ($\tanh$ of a Normal sample). Using $a = \tanh(\mu + \sigma \cdot \epsilon)$ with $\epsilon \sim \mathcal{N}(0, I)$ lets gradients flow through the sample — far lower variance than the score function estimator.

Automatic $\alpha$ Hand-tuning the temperature $\alpha$ is brittle. The second SAC paper learns it by dual descent on the constraint $\bar H(\pi) \ge H_{\text{target}}$ — choosing a target entropy (e.g. $-|\mathcal{A}|$) is far more intuitive than choosing $\alpha$ directly. This is why modern SAC "just works" out of the box.

Status SAC has largely displaced DDPG/TD3 as the default continuous-control algorithm: comparable or better sample efficiency, higher asymptotic performance, and vastly more stable across tasks.

Exercises
  1. warm-up: Implement SAC on Pendulum with a fixed $\alpha = 0.2$ and a tanh-squashed Gaussian policy. Remember the $\log(1 - \tanh^2)$ correction in the log-prob.
    Show solution
    mean, log_std = actor(s).split(act_dim, dim=-1)
    std = log_std.clamp(-20, 2).exp()
    normal = Normal(mean, std)
    z = normal.rsample()                      # reparameterised
    action = torch.tanh(z)
    log_prob = normal.log_prob(z) - torch.log(1 - action**2 + 1e-6)
    log_prob = log_prob.sum(-1, keepdim=True)

    Without the tanh-Jacobian correction, $\log \pi$ is wrong and SAC diverges silently. The $+10^{-6}$ avoids log(0) at the boundary.

  2. build: Add automatic $\alpha$ tuning via dual gradient descent with target entropy $-|\mathcal{A}|$. Plot $\alpha$ over training — it should decrease as the policy sharpens.
    Show solution
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)
    # inside update:
    alpha = log_alpha.exp()
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()

    Target entropy convention: $-|\mathcal{A}|$ for continuous actions. $\alpha$ starts near 1 (exploratory), drops to $10^{-2}$–$10^{-3}$ as the policy commits. The dual form ensures $\bar H[\pi] \ge H_{\text{target}}$ as a soft constraint.

  3. stretch: Derive and implement Discrete SAC (for CartPole): replace the tanh-Gaussian with a Categorical policy, re-derive the actor loss in closed form (no reparameterisation needed).
    Show solution

    For discrete actions, $\pi(a|s)$ is a softmax, so you can compute the expectation in closed form without sampling:

    probs = softmax(actor(s))                # (B, |A|)
    log_probs = log_softmax(actor(s))
    q = torch.min(q1(s), q2(s))              # (B, |A|)
    pi_loss  = (probs * (alpha * log_probs - q)).sum(-1).mean()
    # target: V(s') = (probs' * (q_target - alpha * log_probs')).sum(-1)

    No reparameterisation, no log-det-Jacobian, cleaner code. Target entropy: $0.98 \cdot (-\log(1/|\mathcal{A}|))$ — i.e. 98% of the maximum entropy $\log|\mathcal{A}|$ — is a typical choice.

PyTorch (update)
def sac_update(actor, q1, q2, q1_t, q2_t, alpha, batch,
               opt_q, opt_pi, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch

    # --- critic target (with entropy)
    with torch.no_grad():
        a2, logp2 = actor.sample(s2)             # reparameterised sample + log-prob
        q_t = torch.min(q1_t(s2, a2), q2_t(s2, a2)).squeeze(-1)
        y   = r + gamma * (q_t - alpha * logp2.squeeze(-1)) * (1 - done)

    # --- critics
    q_loss = F.mse_loss(q1(s, a).squeeze(-1), y) \
           + F.mse_loss(q2(s, a).squeeze(-1), y)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # --- actor (soft policy improvement)
    a_pi, logp = actor.sample(s)
    q_pi = torch.min(q1(s, a_pi), q2(s, a_pi)).squeeze(-1)
    pi_loss = (alpha * logp.squeeze(-1) - q_pi).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()

    # polyak update of q1_t, q2_t as in DDPG
    return q_loss.item(), pi_loss.item()
§ V

Model-Based RL

Planning + Learning · 5.1

Dyna-Q

Sutton's 1990 template — learn a model, replay simulated transitions between real ones.
[ toggle ]
Procedure

Standard Q-learning update on the real transition:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]$$

Also store it in a model: $\mathrm{Model}(s,a) \leftarrow (r, s')$. Then repeat $n$ planning steps:

  • Sample a previously-observed $(s, a)$
  • Look up $(r, s') \leftarrow \mathrm{Model}(s, a)$
  • Apply the same Q-learning update
Theory & intuition

Two kinds of RL Sutton distinguishes direct RL (update $Q$ from real experience) and indirect RL (update $Q$ from simulated experience via a model). Dyna-Q does both, on the same $Q$-table, in a tight loop.

Why planning helps Each real transition is expensive (environment step); each planning step is cheap (model lookup). With $n = 50$ planning steps per real step, you get effectively 51× the $Q$-updates per environment interaction — an enormous sample-efficiency multiplier.

The model is trivial In the tabular case, the model is a dictionary — memorising each observed $(s, a) \to (r, s')$. This only works because tabular Dyna-Q assumes deterministic dynamics; for stochastic environments you'd store counts or distributions.
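A count-based model for the stochastic case might be sketched as (hypothetical class, tabular states assumed):

```python
import random
from collections import defaultdict

class CountModel:
    """Tabular model for stochastic dynamics: store outcome counts per (s, a)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {(r,s'): n}

    def update(self, s, a, r, s2):
        self.counts[(s, a)][(r, s2)] += 1

    def sample(self, s, a):
        """Draw (r, s') with probability proportional to observed frequency."""
        outcomes = self.counts[(s, a)]
        keys = list(outcomes)
        weights = [outcomes[k] for k in keys]
        return random.choices(keys, weights=weights)[0]
```

Planning steps then call `sample` instead of a plain dictionary lookup; everything else in Dyna-Q is unchanged.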

Dyna-Q+ Adds an exploration bonus $\kappa \sqrt{\tau(s,a)}$ where $\tau$ is time since last visit — encourages revisiting stale parts of the model, which matters in non-stationary environments where the true dynamics drift.

Legacy Dyna-Q is the conceptual ancestor of everything that follows. Replace "lookup table model" with "neural network model" and "tabular $Q$" with "policy gradient," and you have MBPO. The template hasn't changed in 35 years.

Exercises
  1. warm-up: Run Dyna-Q on a simple gridworld. Plot episodes to solve vs number of planning steps $n \in \{0, 5, 50\}$ — more planning should give dramatically faster learning.
    Show solution

    $n = 0$ is plain Q-learning. Expect ~50 episodes to solve. $n = 5$: ~15 episodes. $n = 50$: ~5 episodes. The gains saturate because a tabular model can only replay what's been seen — you still need real experience to find the goal.

    This is Figure 8.2 in Sutton & Barto, chapter 8 — a nice visual proof that free planning is essentially always worth it when a model is available.

  2. build: Implement Dyna-Q+ with the exploration bonus $\kappa \sqrt{\tau(s,a)}$. Build a gridworld where a wall opens up after 1000 steps and compare Dyna-Q vs Dyna-Q+ — only the latter should notice.
    Show solution

    Track $\tau(s, a)$ = steps since $(s, a)$ was last tried. Augment planning reward: $r \leftarrow r + \kappa \sqrt{\tau(s, a)}$ with $\kappa \approx 10^{-3}$. In the real environment, use the normal reward.

    Vanilla Dyna-Q exploits the shortcut it found early, never retesting the blocked path. Dyna-Q+'s bonus grows on stale $(s,a)$ pairs, nudging the policy to re-probe — it finds the newly opened shortcut within a few hundred steps.

  3. stretch: Replace the tabular model with a small neural network trained on observed $(s, a, r, s')$ tuples. Use it in the planning loop and apply to CartPole — you've just built a baby MBPO.
    Show solution

    MLP: input $(s, a)$, output $(\Delta s, r)$. Train on a buffer of real transitions with MSE; predict $s' = s + \Delta s$ (residual parameterisation is much more stable than predicting $s'$ directly).

    For planning, sample random seen states, step through the model for $k$ steps, feed updates to DQN's replay. Beware compounding error — keep $k$ small (3–5) at first.
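    The residual model described above could be sketched as (illustrative class name):

```python
import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    """One-step model predicting (state delta, reward); s' = s + delta."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, s_dim + 1))

    def forward(self, s, a):
        out = self.net(torch.cat([s, a], -1))
        delta, r = out[..., :-1], out[..., -1]
        return s + delta, r              # residual parameterisation
```

    Train with MSE against $(s' - s, r)$ targets from the real buffer; the residual form means an untrained model already predicts "nothing changes," which is a much better prior than predicting $s'$ from scratch.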

Python
from collections import defaultdict
import random, numpy as np

def dyna_q(env, episodes=100, n_planning=50,
           alpha=0.1, gamma=0.99, eps=0.1):
    Q     = np.zeros((env.n_states, env.n_actions))
    model = {}                          # (s, a) -> (r, s')
    seen  = defaultdict(list)           # s -> [a, ...]

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # ── direct RL on the real transition
            a = eps_greedy(Q, s, eps)
            s2, r, done = env.step(a)
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

            # ── update model
            model[(s, a)] = (r, s2)
            if a not in seen[s]: seen[s].append(a)

            # ── indirect RL: n simulated updates
            for _ in range(n_planning):
                sp = random.choice(list(seen.keys()))
                ap = random.choice(seen[sp])
                rp, s2p = model[(sp, ap)]
                Q[sp, ap] += alpha * (rp + gamma * Q[s2p].max() - Q[sp, ap])

            s = s2
    return Q
Planning · 5.2

Monte Carlo Tree Search

Four-phase planning loop — select, expand, simulate, back up — guided by UCB.
[ toggle ]
UCB1 / UCT selection rule

At each internal node, pick the action maximising:

$$a^*(s) = \arg\max_a \left[\hat Q(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}}\right]$$

After each simulation returning value $G$, update along the visited path:

$N(s,a) \mathrel{+{=}} 1, \quad \hat Q(s,a) \mathrel{+{=}} (G - \hat Q(s,a)) / N(s,a)$

In AlphaZero, replace UCB1 with PUCT using a learned prior $P(a|s)$:

$$a^*(s) = \arg\max_a \left[Q(s,a) + c \cdot P(a|s) \cdot \frac{\sqrt{N(s)}}{1 + N(s,a)}\right]$$
Theory & intuition

Four phases per iteration (1) Selection — descend the tree via UCB until reaching a leaf. (2) Expansion — add a new child node. (3) Simulation — roll out to a terminal state (or use a value estimator). (4) Backup — propagate the outcome up the visited path, updating $N$ and $\hat Q$.

Why UCB The bonus $c\sqrt{\ln N / N(a)}$ is the classical upper confidence bound from multi-armed bandits. It decays as an action is tried more, so the search eventually focuses on the best moves, but it keeps under-visited branches alive — giving asymptotic optimality guarantees under mild conditions.

Asymmetric search MCTS grows the tree deeper in promising directions and barely visits bad ones — unlike minimax with alpha-beta, which explores more uniformly. This is exactly what made it practical for Go, where the branching factor (~250) crushes exhaustive search.

AlphaGo and beyond AlphaGo combined MCTS with two neural networks: a policy net providing $P(a|s)$ as a search prior, and a value net replacing expensive rollouts at the leaves. AlphaZero simplified further — one network outputting both $(\pi, v)$, trained purely from self-play with no human data. MuZero (below) takes the final step: learn the model too.

When to use MCTS shines when you have a simulator (board games, puzzles, physics) and moderate-to-large branching factors. It's naturally discrete — continuous action spaces require progressive widening or clever parameterisations.

Exercises
  1. warm-up: Implement UCT for Tic-Tac-Toe. With 10k simulations per move from the empty board, verify your engine never loses to random play.
    Show solution

    Represent the board as a 9-vector. Legal-actions = empty cells. Rollout = uniform-random play until someone wins or board fills. Reward $\in \{-1, 0, +1\}$ from the perspective of the root player.

    Key detail for 2-player zero-sum: negate the backup value at every level (min-max reasoning). Easier: store $Q$ from root's perspective and flip based on whose turn it is at each node.

  2. build: Tune the exploration constant $c$ on Connect Four. Run a round-robin where engines with $c \in \{0.5, 1.0, 1.4, 2.0\}$ play each other and report win rates.
    Show solution

    Play each pair 200 games, swapping who goes first. Record wins/losses/draws. $c = \sqrt{2} \approx 1.4$ is the classical UCB1 value and usually wins or ties for first.

    Smaller $c$ is too greedy (misses tactical shots); larger $c$ wastes sims on clearly bad moves. Optimal $c$ depends on reward scale — if you use $\{0, 1\}$ instead of $\{-1, +1\}$ rewards, rescale $c$ accordingly.

  3. stretch: Implement PUCT with a hand-crafted policy prior (e.g. "prefer center moves in Connect Four"). Compare against UCT at equal simulation budget — a decent prior should roughly double playing strength.
    Show solution

    Prior: $P(a | s) = \text{softmax}(\text{column\_centrality})$ with peak on column 3 (center). Selection: $a^* = \arg\max_a [Q(s,a) + c \cdot P(a|s) \cdot \sqrt{N(s)} / (1 + N(s,a))]$.

    A sensible prior dramatically speeds search: the tree grows deep on plausible moves instead of wide on equal-looking ones. This is exactly what AlphaZero's policy-net does, just with a neural-net prior instead of a hand-coded one.
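    The PUCT score itself is a one-liner (sketch; function name illustrative):

```python
import math

def puct_score(q, n_s, n_sa, prior, c=1.5):
    """AlphaZero-style PUCT: exploitation Q plus a prior-weighted exploration bonus."""
    return q + c * prior * math.sqrt(n_s) / (1 + n_sa)
```

    Note the bonus decays as $1/(1 + N(s,a))$ rather than UCB1's $\sqrt{\ln N / N(s,a)}$ — visits, not log-visits, drive the decay, so the prior's influence fades faster on well-explored actions.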

Python (UCT sketch)
import math, random

class Node:
    def __init__(self, parent=None):
        self.parent   = parent
        self.children = {}          # action -> Node
        self.N        = 0           # visit count
        self.W        = 0.0         # total value
    @property
    def Q(self): return self.W / self.N if self.N else 0.0

def ucb(child, parent, c=1.4):
    if child.N == 0: return float('inf')
    return child.Q + c * math.sqrt(math.log(parent.N) / child.N)

def mcts(root_state, env, iterations=1000):
    root = Node()
    for _ in range(iterations):
        # 1. SELECT: descend until a node with unvisited children
        node, state = root, root_state.copy()
        while node.children and all(c.N > 0 for c in node.children.values()):
            a, node = max(node.children.items(),
                          key=lambda kv: ucb(kv[1], node))
            state, _, _ = env.step_from(state, a)

        # 2. EXPAND: create children on first visit, then pick an unvisited one
        if not env.is_terminal(state):
            if not node.children:
                for a in env.legal_actions(state):
                    node.children[a] = Node(parent=node)
            a = random.choice([a for a, c in node.children.items()
                               if c.N == 0])
            node = node.children[a]
            state, _, _ = env.step_from(state, a)

        # 3. SIMULATE: random rollout to termination
        G = env.rollout_return(state)

        # 4. BACKUP: propagate G up the path
        while node is not None:
            node.N += 1
            node.W += G
            node = node.parent

    # choose the most-visited action at the root
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
Planning · MPC · 5.3

PETS

Probabilistic ensembles + trajectory sampling + CEM — the sample-efficient recipe.
[ toggle ]
Components

Probabilistic ensemble: $B$ neural networks, each outputting a Gaussian over next state:

$$\tilde f_{\theta_b}(s_{t+1} | s_t, a_t) = \mathcal{N}\big(\mu_{\theta_b}(s_t, a_t),\; \Sigma_{\theta_b}(s_t, a_t)\big)$$

Trajectory Sampling (TS-$\infty$): propagate $P$ particles, each pinned to a single ensemble member, so trajectories can diverge.

Planning with CEM: at each real step, optimise an action sequence:

$$a^*_{t:t+H-1} = \arg\max_{a_{t:t+H-1}} \mathbb{E}\left[\sum_{\tau=t}^{t+H-1} r(s_\tau, a_\tau)\right]$$

Execute only $a^*_t$, observe the real outcome, replan (MPC style).

Theory & intuition

Two uncertainties Aleatoric uncertainty (inherent stochasticity) is captured by each network outputting a distribution, not a point. Epistemic uncertainty (insufficient data) is captured by the disagreement between ensemble members. Good planning needs both; ignoring either leads to over-confidence.

Why trajectory sampling A naive approach — average predictions across ensemble members at every step — collapses epistemic uncertainty and produces over-confident plans. TS-$\infty$ keeps each particle bound to a single ensemble member for its whole rollout, so trajectories diverge when the ensemble disagrees, correctly representing the uncertainty.

CEM in one paragraph Cross-Entropy Method: sample $N$ action sequences from a Gaussian, score each by running it through the model, keep the top $k$ elites, refit the Gaussian to them, repeat. A gradient-free optimiser that handles noisy, non-differentiable objectives — ideal when the objective goes through a learned model.

Sample efficiency Chua et al. (2018) showed PETS matches SAC's asymptotic performance on standard benchmarks in 10–100× fewer environment steps. The price: each real step runs CEM through the ensemble, which is computationally heavy.

Limitation Pure planning, no learned policy — every action requires running CEM from scratch. Fine for low-dimensional control, painful for high-frequency or vision-based tasks.

Exercises
  1. warm-up · Train an ensemble of 5 probabilistic NNs (each outputting $\mu, \Sigma$) on CartPole transitions collected by a random policy. Plot ensemble mean and std of predicted next-state along a test trajectory.
    Show solution
    class ProbNet(nn.Module):          # assumes torch / torch.nn imported as nn
        def __init__(self, obs_dim, act_dim, hidden=200):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.SiLU())
            self.mu_head     = nn.Linear(hidden, obs_dim)
            self.logvar_head = nn.Linear(hidden, obs_dim)

        def forward(self, s, a):
            h = self.trunk(torch.cat([s, a], -1))
            mu  = self.mu_head(h)
            log_var = self.logvar_head(h).clamp(-10, 0.5)   # bounded for stability
            return mu, log_var.exp()

    Loss: $\tfrac{1}{2}(y - \mu)^\top \Sigma^{-1} (y - \mu) + \tfrac{1}{2}\log |\Sigma|$ — the Gaussian NLL. Train each of the 5 members on a different bootstrap sample of the buffer to decorrelate them.

  2. build · Implement CEM action-sequence planning (horizon $H = 25$, $N = 500$ samples, top-50 elites, 5 iterations). Run one MPC step and visualise the top-$k$ action sequences.
    Show solution
    mu, sig = np.zeros((H, act_dim)), np.ones((H, act_dim))
    for _ in range(5):
        A = np.random.randn(N, H, act_dim) * sig + mu
        A = np.clip(A, act_low, act_high)
        R = ts_infty_returns(ensemble, s, A, P=20)
        elite = A[R.argsort()[-50:]]
        mu, sig = elite.mean(0), elite.std(0) + 1e-3   # floor prevents collapse

    Execute only mu[0], observe real next state, repeat. The $+10^{-3}$ variance floor stops CEM from committing too quickly to a bad mode.

  3. stretch · Compare TS-1 (resample ensemble member at every timestep per particle) vs TS-∞ (pin one member per whole rollout). Measure predictive-uncertainty calibration on HalfCheetah — which stays better-calibrated at long horizons?
    Show solution

    TS-∞ is better-calibrated at long horizons. Intuition: TS-1 averages ensemble disagreement within a trajectory, so it underestimates total uncertainty. TS-∞ keeps each particle committed to one member's worldview, so the ensemble's spread at horizon H correctly reflects epistemic uncertainty.

    Quantify with predictive log-likelihood on held-out transitions — TS-∞ should dominate beyond horizon ~10.
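
For reference, the warm-up's Gaussian NLL for a diagonal covariance, with the constant $\tfrac{d}{2}\log 2\pi$ dropped — a numpy sketch:

```python
import numpy as np

def gaussian_nll(y, mu, var):
    """½ (y−μ)ᵀ Σ⁻¹ (y−μ) + ½ log|Σ| for Σ = diag(var); sums over the last axis."""
    return 0.5 * (((y - mu) ** 2 / var).sum(-1) + np.log(var).sum(-1))

# a perfect unit-variance prediction scores 0 under this convention
nll = gaussian_nll(np.zeros(3), np.zeros(3), np.ones(3))
```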

Pseudocode
def pets(env, B=5, P=20, H=25, cem_iters=5, N=500, elite=50):
    ensemble = [ProbNet() for _ in range(B)]   # each outputs (μ, Σ)
    buffer   = []

    for episode in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # ── CEM plan over action sequences of length H
            mean, std = zeros(H, act_dim), ones(H, act_dim)
            for _ in range(cem_iters):
                A = sample_gaussian(mean, std, N)      # (N, H, act_dim)
                returns = ts_infty_returns(ensemble, s, A, P)
                top = A[returns.argsort()[-elite:]]
                mean, std = top.mean(0), top.std(0)
            a = mean[0]                                # MPC: execute first action

            s2, r, done = env.step(a)
            buffer.append((s, a, s2))
            s = s2

        # ── retrain ensemble with bootstrap resampling
        for net in ensemble:
            net.fit(bootstrap_sample(buffer))

def ts_infty_returns(ensemble, s, A, P):
    """Each particle bound to one ensemble member throughout its rollout."""
    returns = zeros(A.shape[0])
    for i, act_seq in enumerate(A):
        # fresh particles for every candidate sequence, each pinned to one member
        particles = [s] * P
        members   = [random.choice(ensemble) for _ in range(P)]
        for t in range(act_seq.shape[0]):
            for p in range(P):
                mu, sig = members[p](particles[p], act_seq[t])
                particles[p] = mu + sig * randn()
                returns[i]  += reward(particles[p], act_seq[t]) / P
    return returns
Model + SAC · 5.4

MBPO

Short model rollouts branched off real states, fed into an off-policy algorithm.
[ toggle ]
The bound that justifies it

Let $\epsilon_m$ be the model's one-step error and $\epsilon_\pi$ the policy shift between the collector and the current policy. For $k$-step rollouts:

$$\eta^{\text{true}}(\pi) \ge \eta^{\text{model}}(\pi) - \underbrace{\tfrac{2\gamma \, r_{\max}}{(1-\gamma)^2}\,[\,\epsilon_\pi + k \cdot \epsilon_m\,]}_{\text{compounding error}}$$

Small $k$ keeps the bound tight even with an imperfect model.

Theory & intuition

The core insight Model error compounds with rollout length. A 1% per-step error becomes a ~40% error after 50 steps. Long model rollouts drift off the data manifold into regions where the model was never trained — and the policy happily exploits those hallucinations, breaking training.
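
The arithmetic behind that claim, treating per-step fidelity as multiplicative (an idealised model of error compounding, not a formal bound):

```python
# per-step fidelity (1 - eps) compounds to (1 - eps)**k over a k-step rollout
def rollout_error(k, eps=0.01):
    return 1 - (1 - eps) ** k

err_5, err_50 = rollout_error(5), rollout_error(50)   # ≈ 0.049 and ≈ 0.395
```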

Branched rollouts Instead of long rollouts from the initial state, MBPO starts rollouts from real states in the replay buffer and runs them for only $k$ steps (typically 1–5). You stay close to the data manifold, generate many short rollouts instead of a few long ones, and the error bound stays tight.

SAC underneath The policy optimiser is SAC. MBPO fills a separate buffer with $k$-step model rollouts; SAC samples a mixture (often 95% model / 5% real) from the two buffers. Because SAC is off-policy, it doesn't care that the synthetic data wasn't collected by the current policy.
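
The mixed batch is easy to sketch if both buffers are treated as plain lists (the `real_ratio` default follows the 95/5 split mentioned above; everything else here is illustrative):

```python
import numpy as np

def mix(d_env, d_model, batch=256, real_ratio=0.05, seed=0):
    """Sample a batch that is ~5% real transitions and ~95% synthetic ones."""
    rng = np.random.default_rng(seed)
    n_real = int(round(batch * real_ratio))
    real  = [d_env[i]   for i in rng.integers(0, len(d_env),   n_real)]
    synth = [d_model[i] for i in rng.integers(0, len(d_model), batch - n_real)]
    return real + synth

batch = mix(list(range(100)), list(range(100, 1000)))   # toy "transitions"
```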

Schedule The rollout horizon $k$ is typically annealed upward during training: as the model improves and the policy stabilises, longer rollouts become trustworthy — and more useful.
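
A linear anneal is the usual implementation; a minimal sketch (the epoch range here is illustrative):

```python
def k_schedule(epoch, k_min=1, k_max=5, start=0, end=200):
    """Linearly anneal the rollout horizon k from k_min to k_max over [start, end]."""
    frac = min(max((epoch - start) / (end - start), 0.0), 1.0)
    return int(round(k_min + frac * (k_max - k_min)))

ks = [k_schedule(e) for e in (0, 50, 100, 200, 300)]   # 1, 2, 3, 5, 5
```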

Result Janner et al. (2019) matched SAC's asymptotic performance on MuJoCo benchmarks in ~5× fewer environment steps, with far less compute than PETS. MBPO became the go-to template for pairing learned dynamics with modern off-policy RL.

Exercises
  1. warm-up · Build a 5-member probabilistic ensemble world model. Train it on a replay buffer from random CartPole rollouts; verify one-step prediction MSE decreases with more data.
    Show solution

    Same ensemble architecture as PETS. Re-fit every 250 env steps from a fresh bootstrap of the current buffer. Track MSE of $\mu$-prediction on held-out transitions as the buffer grows — it should drop smoothly toward the ~irreducible noise floor.

  2. build · Implement $k$-step branched rollouts: start from real states in the replay buffer, step forward $k$ times through random ensemble members, stash into SAC's buffer. Plug into your SAC implementation.
    Show solution
    for _ in range(M):                    # M ≈ 400 branched rollouts / env step
        s = D_env.sample_state()
        for _ in range(k):
            a = actor(s)
            net = random.choice(ensemble)
            mu, sig = net(s, a)
            s2 = mu + sig * torch.randn_like(sig)
            D_model.add((s, a, reward_fn(s, a), s2, False))
            s = s2

    SAC updates sample from D_model with real-ratio ~0.05 — i.e., 95% of each batch is synthetic. This is the MBPO recipe in miniature.

  3. stretch · Anneal the rollout horizon $k$ from 1 to 5 over training. Compare against a fixed $k$ on Hopper — the annealed schedule should converge faster and reach higher asymptotic return.
    Show solution

    Schedule from Janner et al.: k = min(max_k, 1 + (epoch / 50)). Early in training the model is bad, so short rollouts keep error in check; as it improves, longer rollouts extract more value.

    Fixed $k=5$ from scratch: unstable early, sometimes diverges. Fixed $k=1$: stable but asymptotically capped. Annealed: best of both.

Pseudocode
def mbpo(env, sac, ensemble, k_schedule, num_steps,
         M=400, G=20, model_update_freq=250):
    D_env, D_model = ReplayBuffer(), ReplayBuffer()
    s = env.reset()

    for step in range(num_steps):
        # ── real step
        a = sac.act(s)
        s2, r, done = env.step(a)
        D_env.add((s, a, r, s2, done))
        s = env.reset() if done else s2

        # ── refit ensemble periodically
        if step % model_update_freq == 0:
            ensemble.fit(D_env)

        # ── generate M branched k-step model rollouts
        k = k_schedule(step)
        for _ in range(M):
            s_branch = D_env.sample_state()
            for _ in range(k):
                a_m  = sac.act(s_branch)
                net  = random.choice(ensemble)
                s_m2, r_m, done_m = net.step(s_branch, a_m)
                D_model.add((s_branch, a_m, r_m, s_m2, done_m))
                if done_m: break
                s_branch = s_m2

        # ── SAC updates on mixed batch (heavily synthetic)
        for _ in range(G):
            batch = mix(D_env, D_model, real_ratio=0.05)
            sac.update(batch)
Latent World Model · 5.5

Dreamer (V1 / V2 / V3)

Learn a compact latent dynamics model; do all policy learning in imagination.
[ toggle ]
Recurrent State-Space Model (RSSM)

Each latent splits into deterministic $h_t$ and stochastic $z_t$:

$$h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1})$$

Prior (prediction): $p_\theta(\hat z_t \mid h_t)$

Posterior (inference from obs): $q_\phi(z_t \mid h_t, o_t)$

Decoders: $p_\theta(o_t \mid h_t, z_t),\quad p_\theta(r_t \mid h_t, z_t)$

Trained jointly by maximising the ELBO:

$$\mathcal{L} = \mathbb{E}_q\big[\log p(o_t \mid h_t, z_t) + \log p(r_t \mid h_t, z_t)\big] - \mathrm{KL}\big(q_\phi(z_t \mid \cdot)\,\|\,p_\theta(\hat z_t \mid h_t)\big)$$
Theory & intuition

Learn once, dream forever Train a world model on collected experience; then do all policy learning in the latent imagination. Sample $h_0$ from the replay buffer, unroll the learned dynamics in latent space for $H$ steps, compute imagined rewards and values, backprop through everything. No environment steps wasted on policy gradient noise.

Why the split latent The deterministic $h_t$ carries long-term information (a GRU state); the stochastic $z_t$ captures per-step uncertainty. Together they're expressive enough to model complex pixel environments, but compact enough to roll out hundreds of imagined steps cheaply.

Actor-critic in imagination Two small MLPs: an actor $\pi_\psi(a \mid h, z)$ and a value $v_\xi(h, z)$. Because the latent dynamics are fully differentiable, the policy gradient can flow back through the imagined trajectory — no score function estimator needed, variance is drastically lower than model-free policy gradient.

Dreamer's evolution V1 (2020) — pixel control suite. V2 (2021) — replaced Gaussian latents with categorical ones, huge stability gains, mastered Atari. V3 (2023) — added symlog transforms, tuned hyperparameters; the same network and settings solved 150+ tasks from Atari to Minecraft, including the first agent to mine diamonds from scratch.

Why it matters Dreamer is today's reference point for sample-efficient learning from pixels. The paradigm — learn a latent dynamics model, then do all RL in imagination — has become the dominant template for model-based deep RL.
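
Behaviour learning leans on the λ-return recursion $V^\lambda_t = r_t + \gamma\big[(1-\lambda)\,v_{t+1} + \lambda\,V^\lambda_{t+1}\big]$, computed backwards from the end of the imagined rollout. A plain-Python sketch (the list layout — $H$ rewards, $H{+}1$ values with the last one bootstrapping — is an assumption):

```python
def lambda_return(rew, val, gamma=0.99, lam=0.95):
    """rew: H rewards; val: H+1 values, val[-1] bootstraps the tail."""
    H = len(rew)
    G = [0.0] * H
    g = val[H]                       # V^λ_H = v_H
    for t in reversed(range(H)):
        g = rew[t] + gamma * ((1 - lam) * val[t + 1] + lam * g)
        G[t] = g
    return G

# λ = 1, γ = 1 recovers Monte Carlo: sum of rewards plus the bootstrap value
G = lambda_return([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 10.0], gamma=1.0, lam=1.0)
```

At $\lambda = 0$ each entry collapses to the one-step target $r_t + \gamma v_{t+1}$, the other limit of the dial.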

Exercises
  1. warm-up · Implement the RSSM: GRU for $h_t$, Gaussian prior $p(\hat z | h)$, encoder posterior $q(z | h, o)$, and decoders for observation and reward. Train on a low-dim DMC task; verify reconstruction looks sane.
    Show solution

    Four components: gru(h, z, a) -> h', prior(h) -> (μ_hat, σ_hat), encoder(h, o) -> (μ, σ), decoder(h, z) -> o_hat. Train on sequences of length 50 with $\text{ELBO} = \text{recon} - \beta\, \text{KL}(\text{posterior} \| \text{prior})$.

    Sanity check: pass an observation through encoder→decoder with a fixed prior-sampled $z$; reconstruction should look like a blurry version of the input.

  2. build · Train the actor and critic entirely in imagination: unroll your RSSM for $H = 15$ steps, compute $\lambda$-returns, backprop through the learned dynamics. Verify policy performance on the real env improves.
    Show solution

    Sample starting states from the encoded real trajectories. Unroll without the encoder (use prior samples for $z_t$) for 15 steps, record imagined $r_t, v_t$. Compute $V^\lambda_t$ backwards with $\lambda \approx 0.95$. Actor loss maximises $V^\lambda$; critic loss regresses $V_\psi$ onto $V^\lambda$.

    Because the entire unroll is differentiable, actor gradients flow through dynamics — much lower variance than REINFORCE.

  3. stretch · Replace the Gaussian latent with 32 categoricals of 32 dims (DreamerV2 style), using straight-through gradients. Compare training stability against the Gaussian version — the categorical should be dramatically more robust.
    Show solution

    Latent $z$: 32 independent categorical variables, each 32-way. Sample via Gumbel-softmax or straight-through: $z = \text{onehot}(\arg\max(\text{logits}))$ forward, $z = \text{softmax}(\text{logits})$ backward.

    Advantages: bounded information content (prevents posterior collapse), discrete "concepts" are easier for the dynamics to manipulate, gradients don't flow through variance parameters. Hafner et al. (DreamerV2) report this change alone doubled Atari performance.

Pseudocode
def dreamer_step(world_model, actor, critic, buffer,
                 imag_horizon=15, lam=0.95):
    # ── 1. WORLD-MODEL LEARNING on real sequences
    seqs = buffer.sample_sequences(batch=B, length=L)
    h_t, z_prev = zeros(B, h_dim), zeros(B, z_dim)
    H, Z, loss_wm = [], [], 0

    for t in range(L):
        h_t  = world_model.gru(h_t, z_prev, seqs.act[t])
        pri  = world_model.prior(h_t)                          # p(ẑ | h)
        post = world_model.encode(h_t, seqs.obs[t])            # q(z  | h, o)
        z_t  = post.sample()
        o_pred = world_model.decode_obs(h_t, z_t)
        r_pred = world_model.decode_rew(h_t, z_t)
        loss_wm += recon(o_pred, seqs.obs[t]) + recon(r_pred, seqs.rew[t]) \
                 + kl(post, pri)
        H.append(h_t); Z.append(z_t); z_prev = z_t
    world_model.optimize(loss_wm)

    # ── 2. BEHAVIOUR LEARNING entirely in imagination
    # Start imagination from every real latent state above.
    h_imag, z_imag = stack(H).detach(), stack(Z).detach()
    rew, val = [], []
    for _ in range(imag_horizon):
        a      = actor(h_imag, z_imag)
        h_imag = world_model.gru(h_imag, z_imag, a)
        z_imag = world_model.prior(h_imag).sample()
        rew.append(world_model.decode_rew(h_imag, z_imag))
        val.append(critic(h_imag, z_imag))

    returns = lambda_return(rew, val, lam=lam)
    # the whole unroll is differentiable: pathwise gradients, no score function
    actor.optimize(-returns.mean())
    critic.optimize(mse(stack(val), returns.detach()))
Learned Model + MCTS · 5.6

MuZero

Mastering Atari, Go, chess & shogi with a model that never reconstructs a pixel.
[ toggle ]
Three learned functions

Representation (observations → latent):

$$s^0 = h_\theta(o_1, o_2, \ldots, o_t)$$

Dynamics (latent transition + reward):

$$s^{k+1},\, r^{k+1} = g_\theta(s^k, a^{k+1})$$

Prediction (policy + value from latent):

$$\pi^k,\, v^k = f_\theta(s^k)$$

Plan with MCTS in the learned latent space, using PUCT with the network's $\pi^k$ as prior.

Training loss, over a $K$-step unroll:

$$\mathcal{L} = \sum_{k=0}^{K} \underbrace{\mathrm{CE}(\pi^k, \hat\pi^k)}_{\text{policy: match MCTS}} + \underbrace{(v^k - z_k)^2}_{\text{value: n-step bootstrap}} + \underbrace{(r^k - u_k)^2}_{\text{reward: observed}}$$
Theory & intuition

AlphaZero's missing piece AlphaZero still needed to know the rules of the game — its dynamics function was provided by the environment. MuZero learns an implicit model: $g_\theta$ predicts next latent states and rewards without ever reconstructing observations. No rules, no simulator, no hand-coded transitions.

Value-equivalent modelling This is MuZero's key philosophical break. Dreamer learns a model that predicts what the next frame looks like. MuZero learns a model that predicts only what matters for planning — reward, value, and policy — and nothing else. The latent space has no obligation to correspond to physical state; it only has to support good predictions of these three quantities.

Why it scales The same architecture mastered Go (perfect information, huge branching), Atari (stochastic, pixel-based), chess, and shogi — domains that previously each required bespoke, hand-crafted models. The clean separation into $(h, g, f)$ lets each function focus on one narrow task.

MCTS in latent space Planning unrolls $g_\theta$ through the search tree, using $f_\theta$ at every node for the policy prior and value estimate. Policy improvement comes from the MCTS visit distribution $\hat\pi$ being a better policy than the raw $\pi^0$ output by the network — so the network is then trained to match $\hat\pi$, a form of policy distillation.
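
The PUCT rule used during selection scores each child as $Q(s,a) + c\,\pi(a)\,\frac{\sqrt{N_{\text{parent}}}}{1 + N(s,a)}$. A sketch, assuming a node exposes `children` (action → child) and `pi` (priors), and each child tracks visit count `N` and total value `W` — attribute names chosen here for illustration, not from a specific codebase:

```python
import math

def select_puct(node, c_puct=1.25):
    """Pick the (action, child) pair maximising Q + U."""
    n_parent = sum(ch.N for ch in node.children.values())
    def score(item):
        a, ch = item
        q = ch.W / ch.N if ch.N > 0 else 0.0               # mean value so far
        u = c_puct * node.pi[a] * math.sqrt(n_parent + 1) / (1 + ch.N)
        return q + u
    return max(node.children.items(), key=score)
```

Unvisited children are ranked by prior alone; as visits accumulate, the Q term takes over — exploration fades into exploitation.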

Descendants EfficientZero (Ye 2021) matched MuZero's Atari performance with 500× less data by adding self-supervised latent consistency. Stochastic MuZero handles stochastic environments. Sampled MuZero handles continuous actions. The template continues to expand.

Exercises
  1. warm-up · Implement the three functions $(h_\theta, g_\theta, f_\theta)$ for a 4×4 tile-swap puzzle. Train jointly with policy, value, and reward losses on offline trajectories.
    Show solution

    $h$: board → latent (MLP). $g$: (latent, action) → (next latent, reward). $f$: latent → (policy logits, value). Use the joint loss from §5.6, unrolled $K = 5$ steps from each training sample.

    Targets: $\pi^{\text{MCTS}}$ (from a search — at first just random), $z_k$ = n-step bootstrap return, $u_k$ = observed reward. Track each loss individually; they should all decrease in tandem.

  2. build · Add MCTS-in-latent-space: use PUCT with $f_\theta$'s policy prior, unroll $g_\theta$ at each child node, back up predicted rewards and values. Show the search-derived policy beats the raw network policy.
    Show solution

    At root: $s_0 = h(o)$. Descend via PUCT; at each new child, apply $g$ to get $(s_\text{next}, r)$; evaluate $f(s_\text{next}) = (\pi, v)$; back up $r + \gamma v$ along the path (then $r + \gamma(r' + \gamma v')$ one level up, etc.).

    After 50 simulations, the visit-count distribution $\hat\pi$ should substantially out-perform the raw $\pi^0 = f(s_0)$. That gap is what you then distill back into the network — the policy improvement loop.

  3. stretch · Train MuZero on CartPole (treating the simulator as a black box). Verify: (a) the agent learns without any access to true dynamics, and (b) the learned $g_\theta$'s reward predictions track real rewards when inspected along a rollout.
    Show solution

    CartPole's action space is discrete ($\{0, 1\}$), so vanilla MuZero applies directly — enumerate both actions at every node. For a continuous-control variant, use Sampled MuZero: sample $K$ candidate actions at each node instead of enumerating.

    Sanity check (b): after training, roll out $g_\theta$ for 20 steps in latent space; compare predicted rewards to actual rewards from the real env with the same action sequence. They should match closely (<10% error) within the learned model's "in-distribution" region.

Pseudocode (training + planning sketches)
def muzero_train_step(net, replay, K=5):
    # Sample a trajectory segment + MCTS targets
    obs_hist, actions, targets = replay.sample(unroll_steps=K)
    # targets[k] = (π̂_k from MCTS, z_k = n-step bootstrap return, u_k = reward)

    s = net.representation(obs_hist)                  # h_θ
    total_loss = 0
    for k in range(K + 1):
        pi_k, v_k = net.prediction(s)                 # f_θ
        pi_hat, z_target, u_target = targets[k]

        total_loss += ce(pi_k, pi_hat) + (v_k - z_target).pow(2)
        if k > 0:
            total_loss += (r_k - u_target).pow(2)
        if k < K:
            s, r_k = net.dynamics(s, actions[k])      # g_θ

    total_loss.backward()
    net.optimize()

def muzero_act(net, obs, num_sims=50):
    """MCTS in the *learned* latent space."""
    root = Node()
    root.s          = net.representation(obs)
    root.pi, root.v = net.prediction(root.s)
    root.children   = {a: Node(action=a) for a in legal_actions}   # expand root

    for _ in range(num_sims):
        # SELECT via PUCT using node.pi as prior
        node, path = root, [root]
        while node.children:
            a, node = select_puct(node)
            path.append(node)

        # EXPAND using learned dynamics
        parent = path[-2]
        s_new, r_new = net.dynamics(parent.s, node.action)
        node.s, node.r = s_new, r_new
        node.pi, node.v = net.prediction(s_new)
        node.children = {a: Node(action=a) for a in legal_actions}

        # BACKUP using predicted reward + value
        backup(path, r_new, node.v)

    return sample_from_visits(root)
§ VI

Eligibility Traces & TD(λ)

Bootstrapping · 6.1

n-step TD

The bridge between one-step TD and Monte Carlo — a single dial for bias vs variance.
[ toggle ]
n-step return

Accumulate $n$ real rewards, then bootstrap off the value estimate:

$$G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$$

Update:

$$V(s_t) \leftarrow V(s_t) + \alpha\big[G_{t:t+n} - V(s_t)\big]$$

Two limits: $n = 1$ recovers TD(0); $n \to \infty$ recovers Monte Carlo.

Theory & intuition

Bias–variance dial Small $n$: heavy bootstrapping, biased by the current value estimate but low variance because the target depends on few random transitions. Large $n$: less bootstrapping, unbiased in expectation but high variance because the target depends on many stochastic rewards. The best $n$ is problem-dependent — there's no free lunch.

Why it matters In practice, intermediate $n$ (typically 4–20) often dominates both endpoints. TD(0) propagates information one step at a time and can be agonizingly slow on long horizons; Monte Carlo has to wait for episode end and suffers from variance.

Delay The update for state $s_t$ can't be computed until $n$ more steps have been observed — you have to delay the update until $t + n$. This is why online implementations need a small FIFO buffer of recent transitions.

Gateway drug n-step TD is the conceptual stepping stone to TD(λ): instead of picking one value of $n$, average them all with exponentially decaying weights.
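
The target itself is a small helper; a sketch where `rewards` holds the rewards observed after time $t$, truncated if the episode terminated early:

```python
def n_step_return(rewards, v_boot, n, gamma=0.99):
    """G_{t:t+n}: accumulate up to n rewards; bootstrap only if s_{t+n} was reached."""
    G = sum(gamma**i * r for i, r in enumerate(rewards[:n]))
    if len(rewards) >= n:            # episode survived the full window
        G += gamma**n * v_boot
    return G

# γ = 0.5, n = 2: 1 + 0.5·1 + 0.25·4 = 2.5
g = n_step_return([1.0, 1.0, 1.0], v_boot=4.0, n=2, gamma=0.5)
```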

Exercises
  1. warm-up · Implement n-step SARSA. On Sutton & Barto's 19-state random walk, sweep $n \in \{1, 2, 4, 8, 16\}$ and plot RMS value error vs $n$.
    Show solution

    Env: 19 non-terminal states in a chain, left-terminal gives $-1$, right-terminal gives $+1$, random walk policy. Ground-truth values are linear from $-0.9$ to $+0.9$.

    Expect a U-shape: error high at $n=1$ (slow propagation), falls to a minimum around $n=4$–$8$, then rises at $n=16$ (too much variance for this horizon). Sutton & Barto Figure 7.2 is the target shape.

  2. build · Reproduce the bias–variance story numerically: for each $n$, run many seeds and plot variance of the n-step target and bias (signed error against Monte-Carlo) as a function of $n$.
    Show solution

    For a fixed state $s$, collect many n-step returns from independent rollouts. Variance = sample variance of these returns. Bias = their mean minus the true Monte-Carlo value (which you estimate with very long rollouts).

    Expect: variance grows with $n$; |bias| shrinks with $n$. The optimal $n$ balances both — different at different stages of training as the value estimate improves.

  3. stretch · Implement n-step Tree Backup (off-policy, no importance sampling). Compare data efficiency against Watkins's Q(λ) on a small stochastic gridworld.
    Show solution

    Tree Backup return uses expected-value bootstraps at every level: the n-step target averages over all non-taken actions at each intermediate state using the target policy $\pi$. No importance ratios are needed because we never sample along off-policy branches — we integrate.

    Advantage over Watkins's Q(λ): no trace cuts, so the full n-step signal always propagates. Precup (2000) is the original reference.

Python (n-step SARSA)
import numpy as np
from collections import deque

def n_step_sarsa(env, n=4, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(Q, s, eps)
        buf = deque()           # stores (s, a, r)
        T = float('inf'); t = 0

        while True:
            if t < T:
                s2, r, done = env.step(a)
                buf.append((s, a, r))
                if done:
                    T = t + 1
                else:
                    a2 = eps_greedy(Q, s2, eps)

            tau = t - n + 1     # time whose update we're computing
            if tau >= 0:
                # n-step return: accumulate rewards, bootstrap if not terminal
                G = sum(gamma**i * buf[i][2] for i in range(min(n, T - tau)))
                if tau + n < T:
                    G += gamma**n * Q[s2, a2]
                s_tau, a_tau, _ = buf.popleft()
                Q[s_tau, a_tau] += alpha * (G - Q[s_tau, a_tau])

            if tau == T - 1: break
            if t + 1 < T: s, a = s2, a2   # only advance while the episode is live
            t += 1
    return Q
Eligibility Traces · 6.2

TD(λ)

All n-step returns, averaged exponentially — implemented by a backward-running trace.
[ toggle ]
Forward view — the λ-return

Weight every n-step return by $\lambda^{n-1}$ and normalise:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$$

Update: $V(s_t) \leftarrow V(s_t) + \alpha[G_t^\lambda - V(s_t)]$.

Limits: $\lambda = 0$ ⇒ TD(0); $\lambda = 1$ ⇒ Monte Carlo (offline).
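
The $(1-\lambda)$ prefactor is what makes the weights a proper average — in a continuing task they sum to one, decaying geometrically with effective horizon roughly $1/(1-\lambda)$. A quick numeric check:

```python
lam = 0.8
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 400)]
total = sum(weights)                    # geometric series: (1-λ) Σ λ^{n-1} = 1

cum, half_life = 0.0, None              # how many n-step returns carry half the weight?
for n, w in enumerate(weights, start=1):
    cum += w
    if cum > 0.5:
        half_life = n
        break
```

In an episodic task the infinite tail is truncated and its remaining mass falls on the final Monte-Carlo return.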

Backward view — eligibility traces

Maintain an eligibility trace $e_t(s)$ for every state:

$$e_t(s) = \gamma\lambda \, e_{t-1}(s) + \mathbb{1}\{s_t = s\}$$

Compute the one-step TD error:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

Update every state in proportion to its trace:

$$V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s), \quad \forall s$$
Theory & intuition

The equivalence The forward and backward views produce (approximately) the same updates — they're two ways to describe the same algorithm. Forward says "look ahead and average n-step returns"; backward says "propagate this step's error back through a decaying memory of visited states." The backward view is what you actually implement, because it's online and local.

Credit assignment Traces solve RL's hardest problem — which earlier action caused this reward? — by keeping a fading record of recent states. When a reward finally arrives, every state with non-zero trace shares the credit, weighted by how recently it was visited.

Two trace flavours Accumulating traces add 1 each visit (can exceed 1 if a state is re-visited). Replacing traces clip to 1 — often more stable in tabular settings, especially with revisits during exploration.

Why it's fast Information propagates backward by $\gamma\lambda$ per step — so after $n$ steps a reward has already updated states $n$ back. Compared to TD(0), where each value update takes one bootstrap step to propagate one hop, traces give a qualitative speedup in long-horizon problems.

Caveat The backward-view update is approximately, not exactly, the forward-view update online — they only match at the end of an episode or offline. True Online TD(λ) (below) fixes this.

Exercises
  1. warm-up · Implement the backward view with accumulating traces on the 19-state random walk. Sweep $\lambda \in \{0, 0.4, 0.8, 0.95, 1.0\}$ and plot RMS error.
    Show solution

    Traces $e \in \mathbb{R}^{19}$ reset every episode. Update all states in parallel at each step: e[s] += 1; V += alpha * delta * e; e *= gamma * lam — bump the current state's trace, apply the TD error everywhere, then decay.

    Similar U-shape to n-step: low-$\lambda$ slow, very-high-$\lambda$ high-variance. Sweet spot usually $\lambda \in [0.8, 0.95]$. This is Sutton & Barto Figure 12.3.

  2. build · Add replacing traces ($e[s] \leftarrow 1$ instead of $e[s] \mathrel{+{=}} 1$). Show empirically that accumulating and replacing differ whenever states are revisited before their traces decay.
    Show solution

    On the random walk with $\lambda = 0.9$, run both variants. Sampling means states do get revisited with non-zero trace — accumulating lets $e[s] > 1$ briefly, replacing clips to 1.

    Differences are small on simple problems (under 5% RMS). They grow with: longer trace horizons ($\lambda \to 1$), $\gamma \to 1$, and high-revisit policies. Replacing is usually slightly more stable.

  3. stretch · Verify the forward-vs-backward equivalence: compute the $\lambda$-return $G_t^\lambda$ at episode end and compare to the offline-accumulated backward updates. They should match to numerical precision.
    Show solution

    Run an episode without updating $V$ (freeze). For each visited state $s_t$, compute $G_t^\lambda = (1-\lambda)\sum_n \lambda^{n-1} G_{t:t+n}$ explicitly (finite sum, since the episode terminates).

    Separately, accumulate the backward-view updates offline (also without applying). Sum the offline updates per state and compare: $\sum_t \alpha \delta_t e_t(s) = \alpha \sum_{t: s_t = s} (G_t^\lambda - V(s))$, exactly. Match should be to ~$10^{-12}$.

Python (tabular, backward view)
import numpy as np

def td_lambda(env, episodes=500, alpha=0.1, gamma=0.99, lam=0.9):
    V = np.zeros(env.n_states)
    for _ in range(episodes):
        e = np.zeros(env.n_states)          # fresh traces per episode
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(policy(s))
            delta = r + gamma * V[s2] * (not done) - V[s]

            # accumulating trace (use e[s] = 1 for replacing traces)
            e[s] += 1
            V += alpha * delta * e          # update *every* state
            e *= gamma * lam                # decay all traces
            s = s2
    return V
On-policy Control · 6.3

SARSA(λ)

SARSA with traces over state–action pairs — often the fastest tabular control method.
[ toggle ]
Update

TD error (on-policy, as in SARSA):

$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

Trace over state–action pairs:

$$e_t(s, a) = \gamma\lambda \, e_{t-1}(s, a) + \mathbb{1}\{(s, a) = (s_t, a_t)\}$$

Update all $(s, a)$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, e_t(s, a)$$
Theory & intuition

SARSA + traces Exactly SARSA's on-policy update, but with an eligibility trace over state–action pairs instead of just states. Reward signal propagates back along the entire recent trajectory in one update, not one step at a time.

Mountain Car demo The textbook case. Vanilla SARSA struggles here — reaching the flag requires a long sequence of correct actions, and reward only arrives at the end. With $\lambda \approx 0.9$, traces propagate that terminal reward back through the entire climb, and learning is roughly an order of magnitude faster.

Why on-policy is natural here Because the trace records actions actually taken, and the TD target uses the next actually chosen action, SARSA(λ) is internally consistent — no correction terms needed, unlike off-policy $Q(\lambda)$ below.

Function approximation With linear function approximation, traces become vectors of the same shape as the weights: $\vec e_t = \gamma\lambda \vec e_{t-1} + \vec \phi(s_t, a_t)$. With non-linear nets, traces are more awkward — modern deep RL uses GAE (below) to get the same effect in a batched way.
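
A single backward-view step with linear features — the trace becomes a vector the same shape as the weights (a sketch; `phi`/`phi_next` stand for the feature vectors of $(s_t, a_t)$ and $(s_{t+1}, a_{t+1})$):

```python
import numpy as np

def linear_sarsa_lambda_step(w, e, phi, phi_next, r,
                             alpha=0.1, gamma=0.99, lam=0.9):
    """e ← γλ e + φ;   δ = r + γ w·φ' − w·φ;   w ← w + α δ e."""
    e = gamma * lam * e + phi
    delta = r + gamma * (w @ phi_next) - w @ phi
    w = w + alpha * delta * e
    return w, e

w, e = np.zeros(4), np.zeros(4)
phi, phi_next = np.eye(4)[0], np.eye(4)[1]     # one-hot stand-ins for tile features
w, e = linear_sarsa_lambda_step(w, e, phi, phi_next, r=1.0)
```

With tile coding, `phi` is the multi-hot vector of active tiles, so the trace decays over exactly the features that were recently active.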

Exercises
  1. warm-up · Implement SARSA(λ) with tile coding on MountainCar-v0. Tune tilings and width so that the agent reaches the goal in under 200 steps within a few hundred episodes.
    Show solution

    Standard recipe: 8 tilings, 8×8 tiles per tiling → 512 features. $\alpha = 0.5 / 8 = 0.0625$, $\lambda = 0.9$, $\gamma = 1$, $\varepsilon = 0$ (tile coding's generalisation is enough).

    Reference: Sutton & Barto Figure 12.10 — MountainCar with tile coding + SARSA(λ) is the canonical "traces make things work" demo. Should solve in ~100 episodes to under 200 steps avg.

  2. build · Sweep $\lambda \in \{0.0, 0.5, 0.9, 0.99, 1.0\}$ on Mountain Car. Plot learning curves — the improvement going from 0 to ~0.9 should be dramatic.
    Show solution

    At $\lambda = 0$ (plain SARSA), you might never solve within a budget of 500 episodes — the terminal reward has to crawl back one step per episode. At $\lambda = 0.9$ it hits the goal in ~100 episodes. At $\lambda = 1$ (pure MC) it's unstable.

  3. stretch: Implement True Online SARSA(λ) (dutch traces + correction). Compare to standard SARSA(λ) at aggressive step sizes ($\alpha = 0.5 / \text{tilings}$) — the true-online version should be noticeably more stable.
    Show solution

    Maintain a running Q_old (the $Q(s_t, a_t)$ value from before the parameter update). Dutch trace formula for linear features is in §6.5's code block. Correction term: $-\alpha (Q - Q_\text{old}) \phi_t$.

    At $\alpha / \text{tilings} \approx 0.5$, standard SARSA(λ) sometimes diverges while True Online holds steady. See van Seijen & Sutton (2014) Figure 2.

Python
import numpy as np

def eps_greedy(Q, s, eps):
    """ε-greedy action selection over Q[s]."""
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_lambda(env, episodes=500, alpha=0.1, gamma=0.99,
                 lam=0.9, eps=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)                # fresh traces
        s = env.reset()
        a = eps_greedy(Q, s, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(Q, s2, eps) if not done else 0
            delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]

            e[s, a] += 1                    # accumulating trace
            Q += alpha * delta * e
            e *= gamma * lam

            s, a = s2, a2
    return Q
Off-policy Control · 6.4

Watkins's Q(λ)

Q-learning with traces — but cut the trace whenever you explore.
[ toggle ]
Update

Off-policy TD error (max over next actions, as in Q-learning):

$$\delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$$

Trace evolves normally only if the action just taken was greedy. Otherwise, the trace is reset:

$$e_t(s, a) = \begin{cases} \gamma\lambda \, e_{t-1}(s, a) + \mathbb{1}\{(s,a) = (s_t, a_t)\} & \text{if } a_t = \arg\max_{a'} Q(s_t, a') \\ 0 & \text{otherwise} \end{cases}$$
Theory & intuition

The tension Q-learning is off-policy: it targets the greedy policy while behaving $\varepsilon$-greedily. Traces are a way of propagating n-step returns along the trajectory you actually took. These are at odds — any n-step segment containing an exploratory action represents behaviour, not the greedy target policy, so propagating rewards back through it is biased.

Watkins's fix Cut the trace whenever a non-greedy (exploratory) action is taken. From that point on, the next trace segment starts fresh. This is provably sound but wasteful — traces are often cut before they can do much work, especially under high $\varepsilon$.

Peng's Q(λ) An alternative that never cuts traces — uses a mixed-policy target. It usually works better empirically but has no formal convergence guarantees. The practical divide between "theoretically sound but limited" and "works well but fragile" runs right through this algorithm.

Modern heirs Tree Backup (Precup 2000) handles off-policy multi-step returns without importance sampling. Retrace(λ) (Munos et al. 2016) uses clipped importance ratios — safer than naïve IS, more data-efficient than Watkins. V-trace (IMPALA, 2018) applies the same idea at massive scale. All descend from the same problem Watkins identified.
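The common thread in these heirs is the per-step trace coefficient. A sketch of Retrace's clipped coefficient, assuming an ε-greedy target policy π and a known behaviour probability μ(a|s) (names are illustrative):

```python
import numpy as np

def retrace_coef(Q, s, a, mu_prob, lam=0.9, eps=0.1):
    """Retrace's clipped trace coefficient c_t = λ · min(1, π(a|s)/μ(a|s)).

    π is ε-greedy with respect to Q; mu_prob = μ(a|s) is the behaviour
    policy's probability of the action actually taken.
    """
    n_actions = Q.shape[1]
    greedy = int(np.argmax(Q[s]))
    pi_prob = eps / n_actions + (1.0 - eps) * (a == greedy)
    return lam * min(1.0, pi_prob / mu_prob)
```

With ε = 0 (pure-greedy target) the coefficient collapses to Watkins's rule: 0 after an exploratory action, λ otherwise. With ε > 0 it interpolates, shrinking the trace instead of cutting it outright.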

Exercises
  1. warm-up: Implement Watkins's Q(λ) on a small gridworld with $\varepsilon = 0.3$. Instrument the loop to count how many times per episode the trace gets cut by an exploratory action.
    Show solution

    Counter: if a2 != a_star: cuts += 1. Over an episode, expect ~$\varepsilon \cdot T$ cuts ($T$ = episode length). At $\varepsilon = 0.3$ that's a cut roughly every 3 steps — traces are severely truncated.

    This is the diagnostic that motivates Peng's and Retrace's alternatives: Watkins's trace rarely accumulates enough to do meaningful credit assignment.

  2. build: Implement Peng's Q(λ) (no trace cutting, mixed-policy target). On the same gridworld, compare convergence speed and variance against Watkins's.
    Show solution

    TD error: $\delta_t = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ (same as Q-learning). Trace update: never cut — just e *= gamma * lam; e[s, a] += 1.

    Empirically: Peng's converges faster and more smoothly. It's biased (target is neither pure-greedy nor pure-behaviour) but the bias is small relative to the variance reduction from not cutting traces.

  3. stretch: Implement Retrace(λ) with clipped importance ratios $c_t = \lambda \min(1, \pi/\mu)$. Compare all three (Watkins, Peng, Retrace) in an off-policy setting where behaviour $\mu$ is uniform-random and target $\pi$ is $\varepsilon$-greedy.
    Show solution

    Trace update: e[s,a] *= gamma * c_t, where $c_t = \lambda \cdot \min(1, \pi(a|s)/\mu(a|s))$. Clipping at 1 prevents importance ratio blow-up while keeping the trace mostly intact for reasonably-on-policy samples.

    Retrace(λ) dominates both on off-policy benchmarks. Munos et al. (2016) proved it converges for arbitrary behaviour policies — a formal guarantee Peng lacks.

Python
import numpy as np

def eps_greedy(Q, s, eps):
    """ε-greedy action selection over Q[s]."""
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def watkins_q_lambda(env, episodes=500, alpha=0.1, gamma=0.99,
                     lam=0.9, eps=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)
        s = env.reset()
        a = eps_greedy(Q, s, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a_star = int(np.argmax(Q[s2])) if not done else 0
            a2     = eps_greedy(Q, s2, eps) if not done else 0

            # off-policy target (greedy)
            delta = r + gamma * Q[s2, a_star] * (not done) - Q[s, a]

            e[s, a] += 1                       # accumulate for this step
            Q += alpha * delta * e

            # cut the trace if the next action chosen is exploratory (non-greedy)
            if a2 == a_star: e *= gamma * lam
            else:            e[:] = 0

            s, a = s2, a2
    return Q
Advantage Estimation · 6.6

GAE — Generalised Advantage Estimation

TD(λ), reborn for deep policy gradients. The λ in every modern PPO implementation.
[ toggle ]
Definition

Let $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ be the one-step TD residual. Then:

$$\hat A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}$$

Equivalently, a weighted sum of n-step advantages:

$$\hat A_t^{\text{GAE}(\gamma,\lambda)} = (1 - \lambda)\sum_{n=1}^\infty \lambda^{n-1} \hat A_t^{(n)}$$

where $\hat A_t^{(n)} = \sum_{k=0}^{n-1}\gamma^k r_{t+k} + \gamma^n V(s_{t+n}) - V(s_t)$.
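The two forms are algebraically identical: each n-step advantage telescopes into a sum of TD residuals, $\hat A_t^{(n)} = \sum_{k=0}^{n-1}\gamma^k \delta_{t+k}$. A quick numerical check on a finite episode (assuming $\delta = 0$ past termination, so $\hat A^{(n)}$ saturates at the boundary):

```python
import numpy as np

def gae_delta_form(deltas, gamma, lam):
    """Σ_l (γλ)^l δ_{t+l} over a finite episode."""
    return sum((gamma * lam) ** l * d for l, d in enumerate(deltas))

def gae_nstep_form(deltas, gamma, lam):
    """(1−λ) Σ_n λ^{n−1} Â^{(n)}, with Â^{(n)} = Σ_{k<n} γ^k δ_{t+k}.
    Â^{(n)} saturates after T steps, so the tail sums to λ^{T−1} Â^{(T)}."""
    T = len(deltas)
    a = [sum(gamma ** k * deltas[k] for k in range(n)) for n in range(1, T + 1)]
    head = (1 - lam) * sum(lam ** (n - 1) * a[n - 1] for n in range(1, T))
    return head + lam ** (T - 1) * a[-1]
```

Both functions agree to machine precision for any δ sequence and any λ, including the limits λ = 0 (one-step) and λ = 1 (Monte Carlo).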

Theory & intuition

TD(λ), but for policy gradient GAE is literally TD(λ) applied to the advantage rather than the value. The policy gradient $\mathbb{E}[\nabla \log \pi(a|s)\, \hat A]$ is unbiased whenever $\hat A$ is; the question is how much bias in $\hat A$ to accept in exchange for lower variance. GAE's answer: average all n-step advantages with exponentially decaying weights — just as TD(λ) averages all n-step value targets.

The two dials $\gamma$ controls how far into the future we care. $\lambda$ controls the bias–variance trade in advantage estimation: $\lambda = 1$ is full Monte-Carlo advantage (unbiased, high-variance); $\lambda = 0$ is the one-step advantage $\delta_t$ (maximally biased, lowest variance). Unlike $\gamma$, $\lambda$ doesn't change the objective — it only changes the estimator's statistical properties.

Why $\lambda \approx 0.95$ Schulman et al. (2015) found empirically that $\lambda$ close to 1 but not exactly 1 performs best across MuJoCo tasks. Values around 0.95–0.97 are the standard default in PPO, A2C, and TRPO implementations — including OpenAI's original PPO codebase and every modern RLHF pipeline.

The clean recursion You don't need to compute the infinite sum; work backward through a trajectory with the one-line recursion $\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}$. That tight little loop is probably the most-executed piece of RL code in the world right now.

Full circle GAE is why eligibility traces still matter even when "traces" as a concept have disappeared from deep RL code. The math is identical; it's just been vectorised, batched, and renamed.

Exercises
  1. warm-up: Implement the backward-recursion GAE given rewards, values, dones, gamma, lam. Verify on a toy 5-step trajectory that the output matches the explicit sum $\sum_l (\gamma\lambda)^l \delta_{t+l}$.
    Show solution
    adv = np.zeros(T); gae = 0
    for t in reversed(range(T)):
        next_v = last_v if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_v * (1 - dones[t]) - values[t]
        gae   = delta + gamma * lam * (1 - dones[t]) * gae
        adv[t] = gae

    Unit test: with $\gamma = 0.99, \lambda = 0.95$, fixed $\delta$ sequence, the loop output should equal sum((gamma*lam)**l * delta[t+l] for l in range(T-t)) for every $t$ — to machine precision.

  2. build: Sweep $\lambda \in \{0.9, 0.95, 0.97, 0.99, 1.0\}$ inside your PPO implementation on HalfCheetah-v4. Plot final return vs $\lambda$ — the shape should peak around $\lambda = 0.95$ as in the original paper.
    Show solution

    Use 5+ seeds per $\lambda$ value; error bars matter on MuJoCo. Keep everything else fixed including $\gamma = 0.99$. Expected peak is $\lambda \in [0.95, 0.97]$.

    At $\lambda = 1.0$, returns are high-variance and unbiased (pure MC) — often learning stalls entirely. At $\lambda = 0.9$, the estimate is too biased toward the imperfect value net. Schulman et al. (2015) §6 shows exactly this curve.

  3. stretch: Verify the two limits empirically: $\lambda = 1$ should reproduce Monte-Carlo advantages (high variance, unbiased relative to your value fn); $\lambda = 0$ should reproduce the one-step TD advantage $\delta_t$. Plot the variance of advantages as a function of $\lambda$.
    Show solution

    At $\lambda = 1$: $\hat A_t = \sum_{k \ge 0} \gamma^k r_{t+k} - V(s_t) = G_t - V(s_t)$. At $\lambda = 0$: $\hat A_t = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t$.

    Plot np.var(advantages) for fixed trajectories as you sweep $\lambda$. Variance grows roughly monotonically in $\lambda$ — concrete numerical evidence of the bias-variance dial.

PyTorch / NumPy
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """
    rewards, values, dones : 1-D arrays of length T
        values[t] = V(s_t);   dones[t] indicates terminal at step t
    last_value : bootstrap V(s_T) for the final state (0 if terminal)
    Returns advantages and λ-returns (both length T).
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae_t = 0.0
    for t in reversed(range(T)):
        next_v = last_value if t == T - 1 else values[t + 1]
        mask   = 1.0 - dones[t]
        delta  = rewards[t] + gamma * next_v * mask - values[t]
        gae_t  = delta + gamma * lam * mask * gae_t      # one-line recursion
        advantages[t] = gae_t
    returns = advantages + values        # λ-returns used as value targets
    return advantages, returns
§ VII

RLHF & LLM Post-Training

The Framework · 7.1

RLHF Pipeline

The three-stage recipe — SFT → reward model → RL — that turned base LLMs into assistants.
[ toggle ]
The KL-regularised objective

The RL step maximises expected reward while staying close to a frozen reference model $\pi_{\text{ref}}$:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot|x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\big(\pi_\theta(\cdot|x)\;\|\;\pi_{\text{ref}}(\cdot|x)\big)$$

Three stages:

  • SFT — supervised fine-tuning on (prompt, high-quality response) pairs. Teaches format, instruction-following, conversational style.
  • RM — train reward model $r_\phi$ on human preference pairs (see 7.2).
  • RL — optimise $\pi_\theta$ against $r_\phi$ with the KL penalty above (see 7.3).
Theory & intuition

Why not just SFT Supervised fine-tuning needs target outputs. But for open-ended generation ("write a helpful response"), there's no single correct answer — and experts can't enumerate every good response. Preferences are easier to collect (rank A vs B) and carry relative quality information SFT can't express.

The KL penalty This is the load-bearing detail. Without it, the policy quickly drifts into gibberish that happens to score high on $r_\phi$ (reward hacking). With it, the policy stays close to the distribution the reward model was trained on — where its judgments are reliable. $\beta$ controls how tightly: too high and learning stalls, too low and the policy diverges.

Lineage Christiano et al. (2017) introduced deep RL from human preferences. Stiennon et al. (2020) applied it to summarisation. Ouyang et al. (2022) — InstructGPT — made it the core of ChatGPT. Bai et al. (2022) scaled it at Anthropic. The entire commercial LLM era rests on this 3-stage recipe.

Variants of the data source RLHF uses human labellers. RLAIF (Bai et al. 2022, Lee et al. 2023) uses another LLM as a judge. Constitutional AI generates preferences via explicit written principles. RLVR (see 7.7) replaces the reward model with programmatic verifiers.

The great unbundling In 2023–2026 every piece of this pipeline got challenged. DPO (7.4) removed the RM. GRPO (7.6) removed the value network. RLVR (7.7) removed the reward model. ORPO combined SFT and preference stages. The three-stage recipe is still the mental model, but in production only the first stage is untouched.

Exercises
  1. warm-up: Pick a small base LM (e.g. GPT-2 small or TinyLlama). Run SFT on an instruction dataset like alpaca-cleaned. Qualitatively compare generations before and after.
    Show solution

    Use transformers' Trainer or trl's SFTTrainer. Format: <instruction>{prompt}<response>{answer}. Mask the instruction tokens in the loss — you only want to train on the response. 1–3 epochs, lr 2e-5 is a safe default.

    Before SFT: the base LM continues prompts in weirdly formatted, often off-topic ways. After: it follows the instruction template and gives direct answers. This is the biggest single quality jump in the pipeline.

  2. build: Write out the end-to-end pipeline as a config: for each of the three stages, specify data type, loss function, which models are trainable vs frozen, and which are loaded in memory.
    Show solution

    Stage 1 (SFT): data = (instruction, response), loss = CE on response tokens, trainable = base LM, in-memory = 1 model. Stage 2 (RM): data = (prompt, chosen, rejected), loss = -log σ(r_w − r_l), trainable = RM (init from SFT), in-memory = 1. Stage 3 (RL): data = prompts, loss = clipped PPO + KL, trainable = policy + value, frozen = ref + RM, in-memory = 4.

  3. stretch: Read Ouyang et al. (2022, InstructGPT). Reproduce Figure 1 as a whiteboard diagram with your own annotations — loss functions, data types, and model sizes for all three stages. The point is to internalise the full recipe.
    Show solution

    Key details that often get glossed over: (a) SFT and RM were both 175B models in InstructGPT; (b) the RM saw ~30k–50k preference comparisons; (c) the RL stage used PPO with the per-token KL penalty and a small bit of pretraining-loss mixed in (PPO-ptx) to avoid catastrophic forgetting of generic language capabilities.

Pipeline sketch
# ── Stage 1: SFT (teach format + instruction following) ────────────
sft_model = train_next_token(base_lm, demonstrations,
                             loss=cross_entropy)

# ── Stage 2: Reward Modeling (see 7.2) ─────────────────────────────
rm = LM_with_scalar_head(sft_model)     # init from SFT
for batch in preference_pairs:          # (prompt, y_w, y_l)
    r_w = rm(batch.x, batch.y_w)
    r_l = rm(batch.x, batch.y_l)
    loss = -torch.log(torch.sigmoid(r_w - r_l)).mean()
    loss.backward(); optim.step()

# ── Stage 3: RL against rm, regularised toward sft_model ───────────
pi_theta = copy(sft_model)              # trainable policy
pi_ref   = freeze(sft_model)            # reference (frozen)

for batch in prompts:
    y = pi_theta.generate(batch.x)
    reward    = rm(batch.x, y)
    kl_token  = log_ratio(pi_theta, pi_ref, batch.x, y)   # per-token
    per_tok_r = reward_at_end(reward) - beta * kl_token
    # (see 7.3 for the full PPO update)
    ppo_update(pi_theta, value_net, batch.x, y, per_tok_r)
Preference → Scalar · 7.2

Reward Modeling (Bradley–Terry)

A 1952 statistical model turns "A is better than B" into a scalar reward function.
[ toggle ]
Bradley–Terry model

Assume there exists a latent reward $r_\phi(x, y)$ such that preferences follow:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Fit $\phi$ by maximum likelihood on preference triples $(x, y_w, y_l)$:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$
Theory & intuition

Where it comes from The Bradley–Terry–Luce model is 70 years old and underlies everything from chess Elo to tournament seeding. Its insight: you can't directly observe quality, but you can observe pairwise outcomes, and those outcomes reveal a latent scalar score up to an additive constant.

Why pairs, not ratings Asking humans "rate this response 1–10" produces noisy, uncalibrated, rater-dependent data — one person's 7 is another's 5. Asking "which of these two is better?" is cognitively easier, less noisy, and automatically normalised per-rater. Preference pairs are the data source RLHF was designed around.

Architecture In practice, $r_\phi$ is an LLM (usually initialised from the SFT model) with a scalar head replacing the LM head. The final-token hidden state is projected to a single number. Training is a few epochs on preference data — cheap compared to everything else.

Reward hacking The RM is a proxy, not a ground-truth measurement of quality. As the policy optimises against it, it finds regions of output space where the RM is wrong — responses humans don't prefer but that score high. Gao, Schulman & Hilton (2023) derived scaling laws for reward model over-optimisation: there's a KL-budget past which true quality starts to decrease.

Outcome vs process rewards An outcome RM scores a whole response. A process RM scores each step (useful for reasoning chains) — typically trained on human-annotated correctness of intermediate steps. Process RMs give denser signal and reduce credit-assignment difficulty, at the cost of much more expensive data.
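The latent-score claim is easy to demonstrate on toy data: simulate comparisons from known qualities, fit scores by maximum likelihood, and check the ordering comes back. A self-contained sketch in plain numpy (nothing here is from the RM pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
true_r = np.array([0.0, 1.0, 2.5])        # latent qualities of 3 items

# simulate 5000 pairwise comparisons from the Bradley–Terry model
W, L = [], []
for _ in range(5000):
    i, j = rng.choice(3, size=2, replace=False)
    p_i_wins = 1.0 / (1.0 + np.exp(-(true_r[i] - true_r[j])))
    w, l = (i, j) if rng.random() < p_i_wins else (j, i)
    W.append(w); L.append(l)
W, L = np.array(W), np.array(L)

# maximum likelihood: gradient ascent on Σ log σ(r_w − r_l)
r = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(r[W] - r[L])))   # P(winner beats loser)
    g = 1.0 - p                                 # ∂ log σ / ∂ r_w
    grad = (np.bincount(W, weights=g, minlength=3)
            - np.bincount(L, weights=g, minlength=3))
    r += grad / len(W)
r -= r.mean()                                   # identified only up to a constant
```

The fitted `r` recovers the true ordering, and pairwise gaps match the generating scores up to sampling noise — the same mechanism, scaled up, is what the RM loss exploits.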

Exercises
  1. warm-up: Train a reward model on Anthropic/hh-rlhf starting from a small SFT model. Evaluate pairwise accuracy on a held-out split — target >70%.
    Show solution

    Attach a scalar head on top of the last non-pad-token's hidden state. Loss: -F.logsigmoid(r_w - r_l).mean(). Train 1 epoch, lr 5e-6, batch size 8–16 pairs.

    Accuracy metric: fraction of held-out pairs where $r_w > r_l$. Chance is 50%; 70%+ is a reasonable small-model baseline. State-of-the-art RMs on hh-rlhf reach ~75–80%.

  2. build: Swap the Bradley–Terry logistic loss for a margin loss $\max(0, m - (r_w - r_l))$. Compare convergence and final accuracy; sweep $m$.
    Show solution

    Hinge loss pushes hard on confusing pairs and stops entirely on easy ones (once $r_w - r_l > m$). It often converges faster but plateaus lower than BT — the BT loss keeps pushing even on easy pairs, extracting a bit more signal.

    Sweet spot for $m$: 0.5–1.0. Too small: no push on confusing pairs. Too large: gradient saturates late in training.

  3. stretch: Probe your RM for reward hacking: run best-of-N sampling from your SFT model, scored by the RM. Inspect the top-scoring completions — are they actually good, or did the RM get fooled by length / formatting tics?
    Show solution

    Generate $N = 64$ candidates per prompt, score each, eyeball the top-1 vs the median. Classic failure modes: RM favours longer answers ("more words = more correct"), answers with bullet points, or answers that repeat the question.

    A principled diagnostic: regress $r_\phi(y)$ on len(y) over held-out completions. If $R^2 > 0.3$, your RM has learned length as a major feature — hacking waiting to happen during PPO.
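The length-regression diagnostic above amounts to a few lines of numpy. A sketch (function name is illustrative):

```python
import numpy as np

def length_bias_r2(rewards, lengths):
    """R² of the linear fit reward ~ length: the fraction of the RM's
    score variance explained by response length alone."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    slope, intercept = np.polyfit(lengths, rewards, 1)
    resid = rewards - (slope * lengths + intercept)
    return 1.0 - resid.var() / rewards.var()
```

Run it on a few hundred held-out completions; per the rule of thumb above, an R² past ~0.3 means length is a dominant feature of the learned reward.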

PyTorch
import torch, torch.nn as nn, torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone                    # LM encoder
        self.head     = nn.Linear(backbone.hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        # Pool on the final non-pad token of each sequence.
        h = self.backbone(input_ids, attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1        # index of last token
        h_last = h[torch.arange(h.size(0)), last]
        return self.head(h_last).squeeze(-1)         # (B,) scalar reward

def rm_loss(rm, batch):
    """batch: chosen_ids, chosen_mask, rejected_ids, rejected_mask."""
    r_w = rm(batch.chosen_ids,   batch.chosen_mask)
    r_l = rm(batch.rejected_ids, batch.rejected_mask)
    # Bradley–Terry: -log σ(r_w - r_l)
    return -F.logsigmoid(r_w - r_l).mean()
Token-level PPO · 7.3

PPO for LLMs

The RL step of classical RLHF — PPO, with each token treated as an action.
[ toggle ]
Per-token reward with KL penalty

The MDP: state $s_t$ is the prompt plus tokens $y_{<t}$; action $a_t = y_t$ is the next token. Reward is shaped by the RM at sequence end and the per-token KL penalty:

$$r_t = \begin{cases} r_\phi(x, y) - \beta\,\log\dfrac{\pi_\theta(a_t | s_t)}{\pi_{\text{ref}}(a_t | s_t)} & t = T \\[4pt] -\,\beta\,\log\dfrac{\pi_\theta(a_t | s_t)}{\pi_{\text{ref}}(a_t | s_t)} & t < T \end{cases}$$

Compute advantages with GAE over token-level rewards, then apply the usual clipped PPO objective (see 3.4):

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\big[\min(r_t(\theta) A_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t)\big]$$
Theory & intuition

Four models in memory The canonical implementation loads policy $\pi_\theta$, reference $\pi_{\text{ref}}$ (frozen SFT), reward model $r_\phi$ (frozen), and a value network $V_\psi$ (trainable). For a 70B-parameter model, that's ~280B parameters live — the main reason GRPO (below) exists.

Why per-token KL Putting KL in the reward rather than as an outer constraint gives a dense, differentiable signal that steers every token — not just ones near the sequence end. Mathematically equivalent to a trajectory KL only under strong assumptions; the per-token version is what actually ships.

The value head Typically an extra scalar head on the policy backbone, trained on discounted returns. The advantage $\hat A_t = \hat R_t - V_\psi(s_t)$ is computed with GAE ($\lambda \approx 0.95$, $\gamma \approx 1$). The value head is a major memory cost, and in practice is often a source of instability.

Common pathologies Entropy collapse (policy becomes deterministic too fast); KL explosion (policy diverges from reference, reward hacking kicks in); mode collapse (same few responses regardless of prompt); length bias (longer responses get higher reward via spurious correlations). Each has a dedicated mitigation and an unhappy engineer.

Why it's being replaced The engineering burden is enormous — four models, careful $\beta$ tuning, distributed training, sensitivity to hyperparameters. DPO (7.4) skips the RL entirely; GRPO (7.6) skips the value net. Both now dominate what was once PPO territory.

Exercises
  1. warm-up: Using trl's PPOTrainer, fine-tune GPT-2 with a sentiment classifier as the reward model. Task: make any prompt's continuation positive. Plot reward-over-steps.
    Show solution

    This is the TRL "PPO sentiment" tutorial. Reward model = a pretrained sentiment classifier (e.g. lvwerra/distilbert-imdb). The policy quickly learns to append positive-sounding completions. β around 0.02.

    Watch for: reward climbs steeply for ~500 steps, then stalls. If it climbs to the classifier's max and stays there, you've likely reward-hacked — check generations qualitatively.

  2. build: Instrument the four RLHF diagnostic metrics: KL-to-reference, mean reward, reward std, response length. Run one healthy and one collapsed training; overlay the diagnostics to see which signal catches the collapse first.
    Show solution

    Healthy run: KL slowly increases to a few nats, reward smoothly rises, length stable, reward std shrinks. Collapsed run: KL explodes past 10+ nats, length blows up (classic length-hacking), reward std drops to zero (mode collapse).

    Usually KL-to-reference is the earliest warning sign — visible 10–100 steps before reward curves go weird. Set up auto-stopping if KL exceeds a threshold.

  3. stretch: Take a 7B model and fit full-PPO training (all four model copies) on a single A100 using LoRA on policy+value, 4-bit quantisation on reference+RM, and gradient checkpointing. Measure peak GPU memory and compare against full fine-tuning.
    Show solution

    Reference + RM loaded as frozen 4-bit via bitsandbytes (~4 GB each). Policy and value share a backbone with LoRA adapters (only adapters are trainable, ~50 MB extra). Gradient checkpointing trades compute for memory on the forward pass.

    Full-FT 7B PPO needs ~140 GB (4 × 28 GB + activations). With this recipe: peak <50 GB. Fits on a single A100 80GB with headroom.

Core loop
def rlhf_ppo_step(pi_theta, pi_ref, rm, value, batch, beta=0.05):
    x = batch.prompts
    # ── 1. Generate
    y, logp_theta = pi_theta.generate_with_logprobs(x)
    with torch.no_grad():
        logp_ref = pi_ref.logprobs(x, y)
        reward_seq = rm(x, y)                         # scalar per sequence

    # ── 2. Per-token reward: KL penalty + terminal RM reward
    kl_per_tok = logp_theta - logp_ref                # (B, T)
    rewards = -beta * kl_per_tok
    rewards[:, -1] += reward_seq                      # RM reward at end

    # ── 3. Advantages via GAE on the value net
    with torch.no_grad():
        values = value(x, y)                          # (B, T)
    advantages, returns = gae(rewards.detach(), values,
                              dones=torch.zeros_like(rewards),  # one segment, no resets
                              gamma=1.0, lam=0.95)

    # ── 4. PPO update (multiple epochs, mini-batches)
    for _ in range(ppo_epochs):
        logp_new = pi_theta.logprobs(x, y)
        ratio    = torch.exp(logp_new - logp_theta.detach())
        s1 = ratio * advantages
        s2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        policy_loss = -torch.min(s1, s2).mean()
        value_loss  = F.mse_loss(value(x, y), returns)
        (policy_loss + 0.5 * value_loss).backward()
        optim.step(); optim.zero_grad()
Direct Preference Optimization · 7.4

DPO

"Your language model is secretly a reward model." No RM, no RL, no sampling — one closed-form loss.
[ toggle ]
The derivation

Under the KL-constrained RLHF objective, the optimal policy has closed form:

$$\pi^*(y | x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y | x)\, \exp\!\big(r(x, y) / \beta\big)$$

Invert this to express reward in terms of the optimal policy:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Substitute into Bradley–Terry — the partition function $Z(x)$ cancels — and you get a pure supervised loss:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
Theory & intuition

The pivot Rafailov et al. (NeurIPS 2023, best paper) noticed that the RM and the policy share the same information. Training an RM to match preferences, then training a policy to match the RM, is doing the same work twice. Because the KL-constrained RL problem has an analytic solution, you can skip the middleman and fit the policy to preferences directly.

What gets computed No sampling, no rollouts, no value network, no reward model. Just log-probability ratios of chosen vs rejected responses under the current policy and the frozen reference, fed into a sigmoid cross-entropy. Training looks indistinguishable from supervised fine-tuning.

What $\beta$ does Same role as in RLHF — controls how tightly $\pi_\theta$ stays near $\pi_{\text{ref}}$. Larger $\beta$: stronger preference signal, but also stronger regularisation anchoring to the reference. In DPO it's effectively a temperature on the implicit reward.

Failure modes DPO's convenience is also its trap. It can decrease the probability of both winner and loser (the loss only requires that the loser's log-probability fall faster than the winner's). It overfits to static preference data because there's no online exploration. And it doesn't handle on-policy shift — you can't collect fresh preferences and iterate the way you can with PPO.

Descendants The DPO lineage has exploded: IPO (Azar 2023, fixes overfitting), cDPO (conservative), SimPO (Meng 2024, reference-free), ORPO (Hong 2024, folds in SFT), KTO (7.5, binary labels). Each swaps one limitation for another.

Status Now the default first-pass alignment method in most open-weight LLM releases. When preference data is available and the reference model is strong, DPO is faster, cheaper, and more stable than PPO-based RLHF.

Exercises
  1. warm-up: Starting from a small SFT model, run DPO on UltraFeedback. Plot the implicit-reward margin $\mathbb{E}[r_w - r_l]$ and chosen/rejected accuracy over training.
    Show solution

    Use trl's DPOTrainer. $\beta = 0.1$, lr 5e-7 (much lower than SFT — DPO can easily over-train). 1 epoch is usually enough.

    Expected: margin climbs from 0 to ~5 nats; both $\log \pi_\theta(y_w)$ and $\log \pi_\theta(y_l)$ often decrease, but $\log \pi_\theta(y_l)$ drops faster. This is normal — DPO's gradient is about the ratio, not absolute probabilities.

  2. build: Sweep $\beta \in \{0.01, 0.1, 0.5, 1.0\}$. Plot KL-to-reference vs a benchmark score (e.g. AlpacaEval) — you should recover the classic U-shape where both extremes underperform.
    Show solution

    Small $\beta$ (0.01): policy drifts far from reference, KL is huge, model stops behaving like an assistant. Large $\beta$ (1.0): policy barely moves, benchmark score barely above SFT. Sweet spot ~0.1–0.3 for most datasets.

    Plot both axes as a function of $\beta$. The "Pareto front" of (low KL, high benchmark) is what you're tracing out — the classic over-optimisation trade-off from Gao et al. 2023, same as in PPO-RLHF.

  3. stretch: Implement IPO (Azar 2023): replace DPO's logistic loss with $\big(h_\theta - \tfrac{1}{2\beta}\big)^2$ where $h_\theta$ is the log-ratio difference. Inject label noise (flip 10% of preferences) and compare DPO vs IPO robustness.
    Show solution
    h = (logp_w - ref_w) - (logp_l - ref_l)    # log-ratio diff
    ipo_loss = (h - 1 / (2 * beta)).pow(2).mean()

    DPO's logistic loss has no upper bound — it keeps pushing noisy-label pairs indefinitely, memorising them. IPO's quadratic loss caps the update when the margin hits the target, so a few flipped labels can't distort the whole fit.

    Expected: at 10% noise, DPO overfits and benchmark drops; IPO degrades more gracefully. Gap grows with noise rate.

PyTorch (the whole algorithm)
import torch, torch.nn.functional as F

def dpo_loss(pi_theta, pi_ref, batch, beta=0.1):
    """batch.{prompt, chosen, rejected} with attention masks etc."""
    # log-probs of chosen / rejected under current policy
    logp_w = pi_theta.sequence_logprob(batch.prompt, batch.chosen)
    logp_l = pi_theta.sequence_logprob(batch.prompt, batch.rejected)

    # same under the frozen reference (no grad)
    with torch.no_grad():
        ref_w = pi_ref.sequence_logprob(batch.prompt, batch.chosen)
        ref_l = pi_ref.sequence_logprob(batch.prompt, batch.rejected)

    # implicit rewards: β · log π_θ / π_ref
    r_w = beta * (logp_w - ref_w)
    r_l = beta * (logp_l - ref_l)

    # Bradley–Terry on implicit rewards
    loss = -F.logsigmoid(r_w - r_l).mean()

    # useful diagnostics
    with torch.no_grad():
        margin       = (r_w - r_l).mean()
        chosen_acc   = ((r_w > r_l).float()).mean()
    return loss, {"margin": margin, "acc": chosen_acc}
Prospect-Theoretic Preferences · 7.5

KTO

Thumbs up / thumbs down is all you need — and loss aversion is a feature.
[ toggle ]
Utility with asymmetric gain/loss

Implicit reward as in DPO: $r_\theta(x, y) = \beta \log \dfrac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$

Reference point $z_0(x) = \mathbb{E}_{y' \sim \pi_\theta}[r_\theta(x, y')]$ — estimated in batch.

Per-example loss, depending on whether $y$ is desirable or undesirable:

$$\mathcal{L}_{\text{KTO}}(x, y) = \begin{cases} \lambda_D \big(1 - \sigma(r_\theta(x,y) - z_0)\big) & y \text{ desirable} \\ \lambda_U \big(1 - \sigma(z_0 - r_\theta(x,y))\big) & y \text{ undesirable} \end{cases}$$

$\lambda_D, \lambda_U$ are the gain/loss weights; tuning their ratio to the class balance plays the role of prospect theory's asymmetric weighting of gains and losses.

Theory & intuition

The data problem DPO has Preference pairs are expensive and unnatural. Real product feedback is unary: a thumbs-up, a thumbs-down, a regeneration click. Paired data has to be constructed, often synthetically. DPO wastes most of the real-world feedback signal by insisting on pairs.

Prospect theory Kahneman & Tversky showed (Nobel-prize work) that humans don't evaluate outcomes linearly — they feel losses ~2× more strongly than equivalent gains, relative to a reference point. KTO (Ethayarajh et al., 2024) encodes this explicitly: desirable examples pull the policy up, undesirable examples push it down harder.

The reference point $z_0$ is not a single fixed value — it's approximated as the average implicit reward over a batch. Intuitively: "is this response better than a typical one under the current policy?" This makes the utility scale-free and self-normalising.

What you get KTO matches or beats DPO on standard benchmarks while using data that's roughly an order of magnitude cheaper to collect. It's the method of choice when you have natural binary feedback — post-deployment click data, moderation flags, satisfaction signals.

Caveat KTO doesn't handle the explicit comparison signal that preference pairs carry. When you have curated pairs from expert annotators, DPO is still competitive. The two methods solve different data regimes.

Exercises
  1. warm-up: Convert a pairwise preference dataset into a KTO dataset by labelling winners as desirable and losers as undesirable (breaking the pair structure). Train a small KTO model and verify its generations beat the SFT baseline.
    Show solution

    Use trl's KTOTrainer. For each pair $(x, y_w, y_l)$, emit two rows: $(x, y_w, \text{True})$ and $(x, y_l, \text{False})$. Shuffle.

    At first pass, KTO with this converted data should come close to DPO quality even though it discarded the pair coupling. That's the "for free" signal.
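The conversion itself is a few lines (the row schema below is my assumption; check KTOTrainer's docs for the exact column names it expects):

```python
import random

def pairs_to_kto(pairs, seed=0):
    """Flatten (prompt, chosen, rejected) pairs into unary KTO rows."""
    rows = []
    for x, y_w, y_l in pairs:
        rows.append({"prompt": x, "completion": y_w, "label": True})
        rows.append({"prompt": x, "completion": y_l, "label": False})
    random.Random(seed).shuffle(rows)    # break the pair adjacency
    return rows
```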

  2. build: Make the dataset intentionally class-imbalanced (e.g. 20% desirable, 80% undesirable). Tune $\lambda_D / \lambda_U$ to compensate; show a plot of win-rate against SFT as a function of the ratio.
    Show solution

    Ethayarajh et al. recommend $\tfrac{\lambda_D \cdot |D|}{\lambda_U \cdot |U|} \in [1, 4/3]$. For a 20:80 split, use $\lambda_D \approx 4, \lambda_U \approx 1$.

    Plot win-rate vs the ratio $\lambda_D / \lambda_U$; expect a peaked curve. Default equal weights usually underperform badly on imbalanced data because the loss is dominated by whichever class is more numerous.
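The recommended ratio can be solved for $\lambda_D$ directly (helper name is mine):

```python
def lambda_weights(n_desirable, n_undesirable, target=1.0, lam_U=1.0):
    """Pick lam_D so that lam_D*n_D / (lam_U*n_U) hits the target ratio
    (the paper recommends a target in [1, 4/3])."""
    lam_D = target * lam_U * n_undesirable / n_desirable
    return lam_D, lam_U
```

For the 20:80 split, `lambda_weights(20, 80)` gives $\lambda_D = 4$, matching the values above.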

  3. stretch: On the same preference dataset, train both DPO and KTO. Report pairwise-preference accuracy against held-out pairs, plus win rate against SFT as judged by an LLM-judge. Note when and why the two methods disagree.
    Show solution

    Pairwise accuracy tends to slightly favour DPO (it was trained on exactly this signal). LLM-judge win rate can go either way — KTO often wins on tasks where the original pairs were noisy (because KTO's prospect-theoretic loss is less distracted by individual noisy pairs).

    Interesting diagnostic: look at examples where DPO and KTO disagree the most. These tend to be pairs where the two responses are nearly equally good — DPO picks one to prefer, KTO treats both as acceptable.

PyTorch
import torch, torch.nn.functional as F

def kto_loss(pi_theta, pi_ref, batch, beta=0.1,
             lam_D=1.0, lam_U=1.0):
    """
    batch.y is a mix of desirable and undesirable completions;
    batch.label is 1 for desirable, 0 for undesirable.
    """
    logp     = pi_theta.sequence_logprob(batch.prompt, batch.y)
    with torch.no_grad():
        logp_ref = pi_ref.sequence_logprob(batch.prompt, batch.y)

    r = beta * (logp - logp_ref)                  # implicit reward

    # z_0: batch-estimated KL-based reference point
    z0 = r.detach().mean()

    # asymmetric utility (r already carries the β factor)
    desirable    = batch.label.bool()
    loss_desired = lam_D * (1 - torch.sigmoid(r - z0))
    loss_undesir = lam_U * (1 - torch.sigmoid(z0 - r))
    loss = torch.where(desirable, loss_desired, loss_undesir).mean()
    return loss
Group-Relative PPO · 7.6

GRPO

DeepSeek's critic-free variant of PPO — let the group be its own baseline.
[ toggle ]
Group-relative advantage

For each prompt $x$, sample a group of $K$ completions $\{y_i\}_{i=1}^K$. Score each with any reward function $r$ (RM, RLVR verifier, or both).

Advantage uses the group itself as a baseline — no value network:

$$A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_K)}{\mathrm{std}(r_1, \ldots, r_K) + \varepsilon}$$

Every token in completion $i$ inherits the same advantage $A_i$. Apply PPO-style clipped loss, with KL penalty moved into the loss:

$$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_t\big[\min\big(\rho_t A_i,\; \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A_i\big)\big] + \beta\, \mathbb{D}_{\text{KL}}\big(\pi_\theta\,\|\,\pi_{\text{ref}}\big)$$

where $\rho_t = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$.
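A worked instance of the advantage formula, on a toy group of five binary-scored rollouts:

```python
import torch

rewards = torch.tensor([1., 0., 0., 1., 1.])           # K = 5, binary verifier
A = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# correct rollouts get a positive advantage, incorrect ones negative;
# an all-correct or all-wrong group would give A ≈ 0 and carry no gradient
```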

Theory & intuition

Origin Introduced by Shao et al. (2024) in the DeepSeekMath paper, then made famous as the algorithm behind DeepSeek-R1 (2025) — the first open-weight model to match frontier reasoning performance, trained almost entirely with GRPO on verifiable rewards.

Why drop the critic In LLM-scale RLHF, the value network is another multi-billion-parameter model in memory. Removing it cuts memory and compute substantially. The question is whether you can do without a learned baseline — GRPO's answer: yes, if you sample a group per prompt, the group mean is a baseline.

Why group baselines work Any baseline $b(s)$ that doesn't depend on the action leaves the policy gradient unbiased (same reason as REINFORCE). The group mean comes close: strictly it includes $r_i$ itself, so the estimator carries a small bias that a leave-one-out mean would remove; in practice nobody bothers. Being sample-based, it tracks the current policy automatically without any separate training.
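The unbiasedness claim is a one-line computation: subtracting a baseline adds the term below to the gradient estimate, and it vanishes because the policy's probabilities sum to one.

$$\mathbb{E}_{a \sim \pi_\theta}\big[\,b\,\nabla_\theta \log \pi_\theta(a|s)\,\big] = b \sum_a \pi_\theta(a|s)\,\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = b\,\nabla_\theta \sum_a \pi_\theta(a|s) = b\,\nabla_\theta 1 = 0$$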

Same advantage on every token This is a minor heresy. Classical PPO has a per-token advantage; GRPO assigns the whole completion the same value. It works because for discrete, sparse-reward tasks (math/code), a single sequence-level signal is usually what you have anyway. GSPO and TIC-GRPO (2025) formalise this as sequence-level importance sampling.

KL in the loss, not the reward GRPO puts KL directly in the loss as an explicit regulariser, rather than shaping per-token rewards. Differentiable, transparent, and easier to tune. But — the gradient of a sample-based KL estimator is subtle; recent work (Tang & Munos 2025, Shah et al. 2026) shows some popular libraries compute biased gradients here.

When to use it GRPO shines when rewards are cheap and verifiable (math answers, unit tests, logical proofs). Pair it with RLVR (7.7) and you get the recipe behind essentially every reasoning model released since R1.

Exercises
  1. warm-up: Implement GRPO from scratch on GSM8K with a small math-capable LM. Group size $K = 8$, $\beta = 0.04$, binary reward (1 if final answer matches gold). Target a measurable bump in accuracy over SFT.
    Show solution

    Per prompt: sample 8 completions at temperature 0.8–1.0 (too low ⇒ all identical ⇒ no gradient; too high ⇒ all garbage). Score each. Compute (r - r.mean()) / (r.std() + 1e-8). Apply clipped surrogate × advantage, plus KL-to-reference.

    Even a small (1.5B) SFT model tends to gain 5–10 percentage points on GSM8K with well-tuned GRPO. Watch for reward plateauing when most groups are all-correct or all-wrong (dynamic sampling will help, see §7.7).

  2. build: Compare two KL estimators inside the loss: $k_2$ (squared log-ratio, biased but simple) vs $k_3$ (Schulman's unbiased estimator $r - 1 - \log r$). Log gradient norms and policy stability for each.
    Show solution
    log_ratio = logp_ref - logp_new          # per token; samples come from π_θ
    k2 = 0.5 * log_ratio.pow(2).mean()       # biased, commonly used
    k3 = (torch.exp(log_ratio) - 1 - log_ratio).mean()  # unbiased for KL(π_θ‖π_ref), ≥ 0

    $k_3$ is unbiased and non-negative per sample; $k_2$ is also non-negative but biased. Tang & Munos (2025) showed that even unbiased KL estimates can still yield biased gradients, yet $k_3$ tends to behave better in practice. Expect smoother training with $k_3$.

  3. stretch: Sweep group size $K \in \{4, 8, 16, 32, 64\}$ on your reasoning benchmark. Plot accuracy vs total generation tokens, not vs steps — the sweet spot for a given compute budget is often $K = 8$ or $16$, not bigger.
    Show solution

    Larger $K$ gives a better baseline estimate but scales generation cost linearly. Plotting accuracy-vs-FLOPs (rather than vs steps) reveals diminishing returns past $K = 16$.

    Also: as $K$ grows, the fraction of all-correct or all-incorrect groups rises — more samples are "wasted" on easy/hard prompts that provide no gradient. This motivates DAPO's dynamic sampling.
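The wasted-group fraction mentioned above is cheap to log each step (helper name is mine):

```python
import torch

def degenerate_fraction(rewards):
    """rewards: (B, K) group rewards. Returns the fraction of groups whose
    rollouts all got the same reward, i.e. zero advantage and no gradient."""
    return (rewards.std(dim=1) < 1e-6).float().mean().item()
```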

Core update
def grpo_step(pi_theta, pi_ref, reward_fn, optim, batch, K=16,
              beta=0.04, clip=0.2, ppo_epochs=4):
    """
    batch.prompts: (B,) list of prompts.
    reward_fn: maps (x, y) -> scalar reward (e.g. verifier or RM).
    """
    x = batch.prompts
    # ── 1. Rollouts: K completions per prompt
    y, logp_old = pi_theta.generate_group(x, K=K)        # (B, K, T)
    rewards = torch.tensor([[reward_fn(x[b], y[b, i])
                              for i in range(K)]
                             for b in range(len(x))])     # (B, K)

    # ── 2. Group-relative advantages
    mu  = rewards.mean(dim=1, keepdim=True)
    sig = rewards.std(dim=1,  keepdim=True) + 1e-8
    A   = (rewards - mu) / sig                            # (B, K)
    A   = A.unsqueeze(-1).expand_as(logp_old)             # broadcast to tokens

    # ── 3. PPO-clip updates (no value loss — no critic!)
    for _ in range(ppo_epochs):
        logp_new = pi_theta.token_logprobs(x, y)          # (B, K, T)
        ratio    = torch.exp(logp_new - logp_old.detach())
        surr1    = ratio * A
        surr2    = torch.clamp(ratio, 1 - clip, 1 + clip) * A
        policy_loss = -torch.min(surr1, surr2).mean()

        # KL regulariser directly in the loss (simple k1 estimator here;
        # the GRPO paper uses the non-negative k3 form)
        with torch.no_grad():
            logp_ref = pi_ref.token_logprobs(x, y)
        kl = (logp_new - logp_ref).mean()

        loss = policy_loss + beta * kl
        loss.backward(); optim.step(); optim.zero_grad()
Verifiable Rewards · 7.7

RLVR & DAPO

Skip the reward model entirely — let a programmatic verifier grade every rollout.
[ toggle ]
Verifiable reward

Replace the learned reward model $r_\phi$ with a programmatic verifier:

$$r(x, y) = \begin{cases} 1 & \text{if } \mathrm{verify}(x, y) \text{ succeeds} \\ 0 & \text{otherwise} \end{cases}$$

Examples of $\mathrm{verify}$:

  • Math: parse the final answer, check equality with ground truth.
  • Code: run unit tests, binary pass/fail.
  • Formal logic: check proof validity with a theorem prover.
  • Instruction-following: regex/format compliance.

Train with GRPO (7.6) — RLVR is the reward source, GRPO is the optimiser.

DAPO — four stability fixes for long-CoT training

Yu et al. (ByteDance & Tsinghua, 2025) diagnosed failure modes specific to long reasoning traces and patched GRPO with:

  1. Clip-Higher — asymmetric clipping ($1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}$ with $\epsilon_{\text{high}} > \epsilon_{\text{low}}$) so the policy can still push up rare high-reward tokens without collapsing entropy.
  2. Dynamic Sampling — drop prompts where all group rewards are identical (no gradient signal); refill the batch with informative ones.
  3. Token-level policy loss — per-token rather than per-sequence averaging, to keep gradients well-scaled when response lengths vary wildly (as they do in CoT).
  4. Overlong reward shaping — smooth penalty for responses that exceed the length limit, instead of a hard cutoff that introduces reward noise.
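Fix 3 (token-level loss) is just a change in how the masked per-token loss is reduced; a sketch, with tensor shapes as my assumption:

```python
import torch

def reduce_loss(per_token_loss, mask, token_level=True):
    """per_token_loss, mask: (B, T); mask is 1 on real response tokens.
    Sequence-level (vanilla GRPO): every sequence weighs the same, so each
    token in a long CoT trace is down-weighted. Token-level (DAPO): every
    token in the batch weighs the same, regardless of sequence length."""
    if token_level:
        return (per_token_loss * mask).sum() / mask.sum()
    lengths = mask.sum(dim=1).clamp(min=1)
    return ((per_token_loss * mask).sum(dim=1) / lengths).mean()
```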
Theory & intuition

The paradigm shift Classical RLHF assumed quality was subjective — you needed humans (or an RM trained on humans) to judge. RLVR recognises that for a growing set of tasks — math, code, proofs, structured outputs — correctness is objective and automatically checkable. When it is, there's no reason to introduce a fallible human-preference proxy.

R1's surprise DeepSeek-R1-Zero (early 2025) was trained purely with RLVR on math and code, with no SFT stage at all. It developed long chain-of-thought reasoning, self-verification, and backtracking emergently — not from imitation, but because those behaviours got rewarded by the verifier. This shocked the field and launched a wave of reasoning-model work.

No reward hacking The verifier is ground truth, not a proxy. A correct answer is a correct answer; there's no latent quality dimension to exploit, provided the verifier itself is watertight (weak unit tests can still be gamed). This is an enormous simplification: you can train much more aggressively (weak KL penalty, many steps) without fear of the model finding a degenerate solution the RM mislabels.

What DAPO teaches us Long chain-of-thought responses (thousands of tokens) break assumptions built into PPO/GRPO for shorter completions. Variance in sequence length creates gradient imbalances; low-probability-but-correct reasoning tokens get clipped away; empty reward regions waste compute. DAPO's four fixes each address a specific failure mode observed in the DeepSeek-R1 reproduction effort.

Results DAPO trained Qwen2.5-32B to 50 points on AIME 2024 with half the steps of DeepSeek-R1-Zero — and open-sourced the full system. GRPO + RLVR + DAPO-style tricks is now the standard stack for reasoning-model post-training.

Limits of verifiability RLVR works where verification is easier than generation. For open-ended writing, dialogue, or subjective quality, you still need an RM (or DPO/KTO on preferences). The emerging consensus: use RLVR-GRPO for reasoning, DPO/KTO for style/alignment — and chain the two stages.

Exercises
  1. warm-up: Write three RLVR verifiers: math (regex-extract boxed answer, compare to gold), code (execute in subprocess with timeout, run unit tests), and format (regex check for "Answer: X"). Make them deterministic and fast (<100 ms).
    Show solution

    Math: use sympy.simplify on both sides for symbolic equality — plain string compare misses cases like "1/2" vs "0.5". Code: always run with subprocess.run(..., timeout=5); never exec() untrusted code in the main process.

    Build a unit test harness: for each verifier, a handful of (input, expected_reward) cases. Run these every time the verifier code changes — a buggy verifier silently corrupts RL training.
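A stdlib-only fallback for the "1/2" vs "0.5" case (the solution suggests sympy for full symbolic equality; fractions.Fraction covers the purely numeric cases):

```python
from fractions import Fraction

def numeric_equal(a: str, b: str) -> bool:
    """True when both strings parse to the same rational number;
    falls back to exact string comparison otherwise."""
    try:
        return Fraction(a.strip()) == Fraction(b.strip())
    except (ValueError, ZeroDivisionError):
        return a.strip() == b.strip()
```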

  2. build: Implement DAPO's Clip-Higher ($\varepsilon_{\text{low}} = 0.2, \varepsilon_{\text{high}} = 0.28$) inside your GRPO loop. Run vanilla GRPO vs Clip-Higher on a reasoning task; plot policy entropy over training — Clip-Higher should preserve entropy noticeably longer.
    Show solution
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantage
    loss  = -torch.min(surr1, surr2).mean()

    Track policy entropy per step. Vanilla GRPO's entropy typically plummets within 100 steps (entropy collapse); Clip-Higher keeps it an order of magnitude higher for the same training duration. Preserved entropy directly correlates with ability to keep discovering new reasoning paths.
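Entropy can be logged directly from the policy's logits at each update (the shape conventions here are my assumption):

```python
import torch

def policy_entropy(logits, mask):
    """Mean per-token entropy over real tokens. logits: (B, T, V), mask: (B, T)."""
    dist = torch.distributions.Categorical(logits=logits)
    return (dist.entropy() * mask).sum() / mask.sum()
```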

  3. stretch: Implement dynamic sampling: per batch, drop prompts where all $K$ rollouts got the same reward (no gradient signal) and refill from a larger prompt pool. Log % filtered per step — this fraction should climb late in training as the model gets better.
    Show solution
    def filter_informative(prompts, rewards_per_group):
        keep = [p for p, r in zip(prompts, rewards_per_group)
                if r.std() > 1e-6]
        return keep

    Overflow strategy: maintain a prompt pool ~3× the target batch size. After filtering, if the batch is short, draw more prompts from the pool and re-rollout until the target size is met.

    Early training: ~5–10% filtered. Late training on solved tasks: up to 50–70%. Without dynamic sampling, most of your generation compute is wasted on these null-gradient prompts — hence DAPO's 50% step reduction on AIME.

Verifier examples + Clip-Higher
import re, subprocess, torch

# ── Example verifiers for RLVR rewards ───────────────────────
def math_verify(question, response, gold_answer):
    """Extract final boxed answer and check equality."""
    m = re.search(r'\\boxed\{([^}]*)\}', response)
    if not m: return 0.0
    return float(m.group(1).strip() == gold_answer.strip())

def code_verify(problem, response, tests, timeout=5):
    """Run model-written code against unit tests."""
    code = extract_code_block(response)
    try:
        result = subprocess.run(
            ["python", "-c", code + "\n" + tests],
            capture_output=True, timeout=timeout, text=True)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# ── DAPO: Clip-Higher inside the GRPO loss ───────────────────
def dapo_clipped_loss(ratio, advantages,
                      eps_low=0.2, eps_high=0.28):
    """
    Asymmetric clipping: keep push-UP room so rare good tokens
    aren't squashed; clip the downside symmetrically.
    """
    surr1 = ratio * advantages
    # Lower bound tight (1 - eps_low), upper bound looser (1 + eps_high)
    surr2 = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    return -torch.min(surr1, surr2).mean()

# ── DAPO: dynamic sampling (skip degenerate prompts) ─────────
def filter_informative(prompts, rewards_per_group):
    """Drop prompts where all completions got the same reward."""
    keep = []
    for p, r_group in zip(prompts, rewards_per_group):
        if r_group.std() > 1e-6:    # non-degenerate
            keep.append(p)
    return keep