# Markov Decision Process
An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$ — state space
- $\mathcal{A}$ — action space
- $P(s' \mid s, a)$ — transition probabilities
- $R(s, a)$ — reward function
- $\gamma \in [0, 1)$ — discount factor
Return: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
State-value: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$
Action-value: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$
Bellman expectation equation (for a policy $\pi$):
$$V^\pi(s) = \sum_a \pi(a|s) \Big[R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')\Big]$$
Bellman optimality equation (for $Q^*$):
$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s', a')$$
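Iterating the optimality equation to its fixed point is value iteration. A minimal sketch on a made-up 2-state, 2-action MDP (all numbers below are arbitrary, chosen only for illustration):

```python
import numpy as np

# A made-up 2-state, 2-action MDP (all numbers arbitrary, for illustration only).
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# Value iteration: apply the Bellman optimality backup until Q stops changing.
Q = np.zeros((2, 2))
for _ in range(500):
    Q = R + gamma * P @ Q.max(axis=1)   # Q(s,a) <- R(s,a) + gamma * E[max_a' Q(s',a')]

pi_star = Q.argmax(axis=1)              # greedy policy w.r.t. the converged Q
```

Note `P @ Q.max(axis=1)` computes $\sum_{s'} P(s'|s,a)\max_{a'}Q(s',a')$ in one batched matmul.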
**Why Markov.** The Markov property — the future depends only on the present state, not on history — is what makes RL tractable. Without it, value functions wouldn't be well-defined on states alone; you'd have to condition on the entire trajectory, and the Bellman recursions above would collapse.
**Why $\gamma < 1$.** The discount factor serves two purposes. Mathematically, it makes the Bellman operator a contraction, guaranteeing a unique fixed point and geometric convergence. Behaviourally, it encodes how much the agent values the distant future; $\gamma \to 1$ gives a far-sighted agent, $\gamma \to 0$ a myopic one.
**V vs Q.** Both encode the same object, but answer different questions. $V(s)$ asks "how good is this state?"; $Q(s,a)$ asks "how good is this action in this state?" — and the latter is what you need for model-free control, because you can choose actions without simulating transitions.
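Concretely, the two are linked by (restating the definitions above):

$$V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s,a), \qquad Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^\pi(s')$$

so acting greedily, $\pi'(s) = \arg\max_a Q(s,a)$, requires no access to $P$ — whereas improving a policy from $V$ alone would require the one-step lookahead through the model.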
**Two problems.** RL really splits into two tasks: prediction (given $\pi$, find $V^\pi$) and control (find $\pi^*$). Every algorithm that follows is either one or the other — or, like policy iteration, both alternating.
- **Warm-up.** Write `bellman_expectation(V, P, R, gamma, pi)` performing one synchronous sweep of the Bellman expectation operator. Test on a 3-state chain with a uniform-random policy.

**Solution.**
For each state, sum over actions weighted by $\pi(a|s)$, then over next states weighted by $P(s'|s,a)$:
```python
import numpy as np

def bellman_expectation(V, P, R, gamma, pi):
    V_new = np.zeros_like(V)
    for s in range(len(V)):
        for a in range(pi.shape[1]):
            V_new[s] += pi[s, a] * (R[s, a] + gamma * P[s, a] @ V)
    return V_new
```

Sanity check: under a uniform policy, your iterated values should converge to the closed-form solution $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$.
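That closed form is easy to check directly. A sketch with made-up random dynamics (the shapes, seed, and $\gamma$ are my own choices, not fixed by the exercise):

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP with a uniform policy (numbers are made up).
n_s, n_a = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, :] sums to 1
R = rng.standard_normal((n_s, n_a))
pi = np.full((n_s, n_a), 1.0 / n_a)
gamma = 0.9

# Policy-averaged dynamics and rewards.
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = (pi * R).sum(axis=1)

# Closed form: V^pi = (I - gamma P^pi)^{-1} R^pi.
V_closed = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

# Iterating the expectation operator converges to the same vector.
V = np.zeros(n_s)
for _ in range(1000):
    V = (pi * (R + gamma * P @ V)).sum(axis=1)
```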
- **Build.** Verify the contraction property empirically. Initialise two random value functions $V_0, V_0'$, apply the Bellman optimality operator repeatedly to both, and plot $\|V_k - V_k'\|_\infty$. It should shrink by at least a factor of $\gamma$ each iteration.
**Solution.**
Apply $T^*V(s) = \max_a [R(s,a) + \gamma P(\cdot|s,a)^\top V]$ to both vectors. Plot $\log\|V_k - V_k'\|_\infty$ versus $k$ — the slope is at most $\log \gamma$. This is the Banach contraction in action.
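A sketch of the check on a random MDP (sizes, seed, and $\gamma$ are illustrative choices):

```python
import numpy as np

# Random MDP for the empirical contraction check (all numbers illustrative).
rng = np.random.default_rng(1)
n_s, n_a, gamma = 4, 3, 0.8
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
R = rng.standard_normal((n_s, n_a))

def T_star(V):
    # Bellman optimality operator: max over actions of R + gamma * E[V].
    return (R + gamma * P @ V).max(axis=1)

V, W = rng.standard_normal(n_s), rng.standard_normal(n_s)
gaps = []
for _ in range(20):
    gaps.append(np.max(np.abs(V - W)))   # sup-norm distance between the iterates
    V, W = T_star(V), T_star(W)

# Contraction: each application shrinks the gap by at least a factor gamma.
ratios = np.array(gaps[1:]) / np.array(gaps[:-1])
```

Plotting `np.log(gaps)` against the iteration index (e.g. with matplotlib) then shows the linear decay.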
Subtle point: $T^*$ is a contraction in the sup-norm, not in L2. Use `np.max(np.abs(...))`, not `np.linalg.norm`.
- **Stretch.** Construct a 5-state MDP whose optimal policy under $\gamma = 0.5$ differs from the optimal policy under $\gamma = 0.99$. Solve both via policy iteration and inspect the policies.
**Solution.**
Idea: two paths from the start state. One gives a small immediate reward and then loops at zero; the other pays a small cost now and a big reward several steps later. Example: action A yields $+1$ now and transitions to a self-loop at $0$; action B yields $-0.5$ and reaches a $+10$ terminal state after 3 steps.
Under $\gamma = 0.5$: B's discounted return is $0.5^3 \cdot 10 - 0.5 = 0.75$, while A's is $1$ — A wins. Under $\gamma = 0.99$: B's discounted return $\approx 9.7 - 0.5$ dwarfs A's $1$ — B wins.
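Policy iteration on this construction can be sketched as follows; the state and action indexing is my own choice (the text fixes only the structure):

```python
import numpy as np

# States: 0 = start, 1-3 = delay chain, 4 = absorbing (the zero-reward loop).
# Actions: 0 = A (+1 now, then absorb), 1 = B (-0.5 now, +10 after 3 steps).
n_s, n_a = 5, 2
P = np.zeros((n_s, n_a, n_s))
R = np.zeros((n_s, n_a))
P[0, 0, 4] = 1.0; R[0, 0] = 1.0    # A: +1, jump straight to the absorbing state
P[0, 1, 1] = 1.0; R[0, 1] = -0.5   # B: pay -0.5, enter the chain
for s in (1, 2):                   # chain marches forward; both actions identical
    P[s, :, s + 1] = 1.0
P[3, :, 4] = 1.0; R[3, :] = 10.0   # last chain step pays +10
P[4, :, 4] = 1.0                   # absorbing, zero reward forever

def policy_iteration(gamma):
    pi = np.zeros(n_s, dtype=int)
    while True:
        # Evaluation: solve (I - gamma P^pi) V = R^pi exactly.
        P_pi = P[np.arange(n_s), pi]
        R_pi = R[np.arange(n_s), pi]
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        # Improvement: act greedily w.r.t. the one-step lookahead Q.
        Q = R + gamma * P @ V
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi

for gamma in (0.5, 0.99):
    pi, V = policy_iteration(gamma)
    print(gamma, "action at start:", "A" if pi[0] == 0 else "B")
```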