- From Imitation to Reinforcement
- Deriving the Policy Gradient
- The REINFORCE Algorithm
- The Variance Problem
- Variance Reduction I: Reward-to-Go (Causality)
- Variance Reduction II: Baselines
- Implementing Policy Gradients in Practice
- Off-Policy Policy Gradients
- Summary
From Imitation to Reinforcement
In the previous post, we trained a policy by imitating expert demonstrations. The training loop was straightforward: collect expert data, compute a gradient that pushes the policy toward every expert action, update the parameters, repeat.
Policy gradients follow the exact same loop — with two changes:
- The agent collects its own trajectories instead of copying an expert
- The gradient is weighted by reward — good actions get stronger updates, bad actions get weaker ones (or get pushed away)
This is a small change to the math, but a huge change in capability. Imitation learning has a ceiling — you can never outperform the demonstrator. With policy gradients, the agent explores on its own and can discover strategies the expert never showed. We don’t need an expert anymore. We just need a reward signal.
The Formal Objective
We want to find a policy $\pi_\theta$ that maximizes expected cumulative reward:
\[J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right]\]where a trajectory $\tau = (\mathbf{s}_1, \mathbf{a}_1, \mathbf{s}_2, \mathbf{a}_2, \ldots, \mathbf{s}_T, \mathbf{a}_T)$ is generated by the policy interacting with the environment.
The training loop:
- Run the current policy to collect a batch of trajectories
- Estimate the gradient $\nabla_\theta J(\theta)$
- Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
- Repeat
The catch is step 2. How do we actually compute this gradient?
Deriving the Policy Gradient
We want $\nabla_\theta J(\theta)$, but $\theta$ influences $J$ through the distribution of trajectories $p_\theta(\tau)$. We can’t differentiate through the environment’s dynamics — we don’t even know what they are.
The Log-Derivative Trick
Start with the integral form:
\[\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) \, r(\tau) \, d\tau = \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau\]Now use the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$:
\[\nabla_\theta J(\theta) = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right]\]This is the key move — we’ve turned a gradient-of-an-expectation into an expectation-of-a-gradient, which means we can estimate it with samples.
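A quick numerical sanity check (my own toy example, not from the post): for $x \sim \mathcal{N}(\theta, 1)$ we have $\mathbb{E}[x^2] = \theta^2 + 1$, so the true gradient is $2\theta$, and the score function is $\nabla_\theta \log p_\theta(x) = x - \theta$. The sample-based estimator recovers the gradient without differentiating through the sampling process:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0
x = rng.normal(theta, 1.0, size=100_000)   # samples from p_theta = N(theta, 1)

# Score function of N(theta, 1): grad_theta log p_theta(x) = x - theta
score = x - theta
grad_estimate = np.mean(score * x**2)

print(grad_estimate)   # close to the true gradient 2 * theta = 2.0
```

The estimator only needs samples and the score — exactly the property that lets policy gradients sidestep the unknown environment dynamics.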
Eliminating the Dynamics
The trajectory probability decomposes as:
\[p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\]Taking the log:
\[\log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^{T} \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\]When we take $\nabla_\theta$, the initial state distribution and the dynamics terms vanish — they don’t depend on $\theta$. We’re left with:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]\]We never need to know the dynamics. We only need to be able to sample trajectories and evaluate $\log \pi_\theta(\mathbf{a} \mid \mathbf{s})$ — which is just the neural network’s output.
The REINFORCE Algorithm
Approximating the expectation with $N$ sampled trajectories gives us the vanilla policy gradient (a.k.a. REINFORCE):
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{t}^i | \mathbf{s}_{t}^i) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{t}^i, \mathbf{a}_{t}^i) \right)\]The full algorithm:
- Sample $\lbrace\tau^i\rbrace$ from $\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$
- Compute $\nabla_\theta J(\theta)$ using the trajectories
- $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
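To make the loop concrete, here is a self-contained toy run (my own illustration: a two-armed bandit stands in for the environment, and the arm payoffs, batch size, and step size are arbitrary choices). Each pull is a length-1 trajectory, and the policy is a softmax over two logits:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy "environment": a two-armed bandit where arm 0 pays 1.0 and arm 1 pays 0.0.
# Each pull is a length-1 trajectory, so R(tau) is just the immediate reward.
def reward(a):
    return 1.0 if a == 0 else 0.0

logits = np.zeros(2)   # policy parameters theta (softmax over two arms)
alpha = 0.1            # learning rate

for step in range(500):
    probs = softmax(logits)
    # 1. Sample a batch of trajectories from pi_theta
    actions = rng.choice(2, size=16, p=probs)
    # 2. Estimate grad J = mean of grad log pi(a) * R(tau);
    #    for a softmax policy, grad log pi(a) w.r.t. logits = onehot(a) - probs.
    grad = np.zeros(2)
    for a in actions:
        grad += (np.eye(2)[a] - probs) * reward(a)
    grad /= len(actions)
    # 3. Gradient ascent step
    logits += alpha * grad

print(softmax(logits))  # most of the probability mass on the rewarding arm
```

The policy quickly concentrates on the rewarding arm — a minimal instance of "do more of the good stuff."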
What Does This Actually Mean?
Compare the REINFORCE gradient to the imitation learning gradient:
| | Imitation Learning | REINFORCE |
|---|---|---|
| Data | Expert demonstrations | Self-generated rollouts |
| Gradient | $\sum \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s})$ | $\sum \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s}) \cdot R(\tau)$ |
| Intuition | Copy all actions equally | Copy actions, weighted by reward |
REINFORCE is just imitation learning on your own data, where you “imitate yourself more” when things go well and “imitate yourself less” when things go badly. Do more of the good stuff, less of the bad stuff.
The Variance Problem
Example: Learning to Walk
Consider a simulated humanoid learning to walk, where $r(\mathbf{s}, \mathbf{a})$ = forward velocity.
Suppose we sample 5 trajectories:
| Trajectory | What Happens | Reward |
|---|---|---|
| $\tau^1$ | Falls backwards | Negative |
| $\tau^2$ | Falls forwards | Positive |
| $\tau^3$ | Stands still | Zero |
| $\tau^4$ | One step forward, then falls back | Negative |
| $\tau^5$ | Large step back, then small step forward | Negative |
The gradient will increase the probability of the actions in $\tau^2$ (falling forward, the only positive-reward trajectory) and decrease the probability of the actions in the negative-reward trajectories; $\tau^3$ contributes nothing, since its reward is zero. The agent learns to faceplant instead of walk, because with only a few samples, “falling forward” looks like the best strategy.
This is the core issue: policy gradients are noisy and high-variance. With limited samples, the gradient is a rough approximation that can lead to strange local optima.
Variance Reduction I: Reward-to-Go (Causality)
The vanilla gradient weights action $\mathbf{a}_t$ by the total trajectory reward — including rewards from timesteps before $t$. But an action at $t=100$ can’t possibly affect the reward at $t=10$. This violates causality and adds pure noise.
The fix: only weight each action by rewards from that point onward (the reward-to-go):
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t}) \left( \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}) \right)\]Notice the inner sum now starts at $t' = t$ instead of $t' = 1$. This is still an unbiased estimator, but with significantly lower variance.
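Computing the reward-to-go for every timestep of a trajectory is just a reverse cumulative sum; a minimal sketch:

```python
def rewards_to_go(rewards):
    """out[t] = r[t] + r[t+1] + ... + r[T-1], via a reverse cumulative sum."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

print(rewards_to_go([1.0, 2.0, 3.0]))  # [6.0, 5.0, 3.0]
```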
Going back to the walking example — trajectory $\tau^5$ (“large step backwards then small step forwards”) had a negative total reward. With vanilla REINFORCE, the gradient would push against the final small-step-forward action, even though that action was actually good. With reward-to-go, the forward step only sees the rewards it caused, and gets properly reinforced.
Variance Reduction II: Baselines
The All-Positive Reward Problem
Consider a robot learning to fold a jacket, with rewards:
\[r(\mathbf{s}, \mathbf{a}) = \begin{cases} 1.0 & \text{neatly folded} \\ 0.5 & \text{folded with wrinkles} \\ 0 & \text{not folded} \end{cases}\]If all sampled trajectories get reward $\geq 0$, the gradient will increase the likelihood of everything — even the “do nothing” trajectories. It will just push up the good actions slightly more than the bad ones. This is slow and wasteful.
Subtracting a Baseline
We subtract a baseline $b$ from the reward:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \left( r(\tau) - b \right) \right]\]A natural choice: $b = \frac{1}{N} \sum_{i=1}^{N} r(\tau^i)$ (the average reward across the batch).
Now actions are judged as “better than average” or “worse than average.” Trajectories that score below the mean get negative gradients, actively pushing the policy away from those actions.
Is this valid? Yes — subtracting any constant baseline leaves the gradient unbiased:
\[\mathbb{E}\left[\nabla_\theta \log p_\theta(\tau) \cdot b\right] = b \int \nabla_\theta p_\theta(\tau) \, d\tau = b \, \nabla_\theta \underbrace{\int p_\theta(\tau) d\tau}_{= 1} = 0\]The baseline contributes zero in expectation, so the estimator remains unbiased while reducing variance.
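A toy numerical check of this claim (the distribution and reward values are my own illustration): one-step “trajectories” $x \sim \mathcal{N}(\theta, 1)$ with all-positive rewards $r(x) = 10 + x$. Both estimators average to the true gradient, but the baselined one has far lower variance:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
x = rng.normal(theta, 1.0, size=100_000)   # one-step "trajectories"

score = x - theta    # grad_theta log N(x; theta, 1)
r = 10.0 + x         # all-positive rewards; true gradient of E[r] is 1.0
b = r.mean()         # batch-average baseline

g_raw = score * r           # vanilla estimator
g_base = score * (r - b)    # baselined estimator

print(g_raw.mean(), g_base.mean())   # both near 1.0
print(g_raw.var(), g_base.var())     # variance drops by roughly 50x
```

Same expected gradient, drastically less noise — which is the whole point of the baseline.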
Implementing Policy Gradients in Practice
The Surrogate Objective
Computing $N \times T$ individual backward passes is expensive. Instead, we define a surrogate objective whose gradient matches our policy gradient:
\[\tilde{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t}) \left( \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}) - b \right)\]This is just a weighted maximum likelihood objective. For discrete actions, $\log \pi_\theta$ is a (negated) cross-entropy loss; for continuous actions with a Gaussian policy, it's a weighted squared error up to constants. The “weight” $\hat{A}_{i,t} = \text{reward-to-go} - b$ is treated as a constant (detached from the computation graph).
One call to .backward() in PyTorch gives you the policy gradient. No custom gradient code needed.
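A minimal PyTorch sketch of this for discrete actions (the batch shapes, sampled actions, and advantage values are made up for illustration): the advantages enter as plain constants, and one backward call produces the gradient of the surrogate:

```python
import torch

torch.manual_seed(0)

# Made-up batch: 4 (state, action) pairs, 3 discrete actions.
logits = torch.randn(4, 3, requires_grad=True)    # what a policy net would output
actions = torch.tensor([0, 2, 1, 2])              # sampled actions a_{i,t}
advantages = torch.tensor([1.0, -0.5, 2.0, 0.3])  # reward-to-go minus baseline (constants)

# Weighted log-likelihood surrogate; negate so minimizing the loss ascends J.
log_probs = torch.log_softmax(logits, dim=-1)
chosen = log_probs[torch.arange(4), actions]
loss = -(chosen * advantages).mean()
loss.backward()   # logits.grad now holds the (negated, batch-averaged) policy gradient
```

For a softmax policy, autograd reproduces the analytic per-sample gradient $\hat{A}_{i,t}(\text{onehot}(a) - \text{softmax}(\text{logits}))$ exactly, with no custom gradient code.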
Off-Policy Policy Gradients
The Data Efficiency Problem
Vanilla policy gradient is on-policy: after every gradient step, we change $\theta$, which invalidates all our collected data. We have to throw it away and collect fresh trajectories. This is extremely wasteful.
Importance Sampling
What if we could reuse data collected under the current policy $\pi_\theta$ while evaluating an updated policy $\pi_{\theta'}$? Importance sampling lets us correct for the mismatch:
\[\mathbb{E}_{x \sim p(x)}[f(x)] = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]\]Applied to trajectories, the ratio $\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}$ simplifies nicely because the dynamics cancel:
\[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \frac{\prod_{t=1}^{T} \pi_{\theta'}(\mathbf{a}_t | \mathbf{s}_t)}{\prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t | \mathbf{s}_t)}\]The Practical “Hitch”
The problem: this product of ratios over $T$ timesteps can explode or vanish for long horizons. One ratio slightly above 1.0 compounded over hundreds of steps becomes enormous.
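The compounding is easy to see numerically (toy numbers of my choosing, not from the post):

```python
# Per-step ratios only slightly away from 1.0 compound exponentially
# over a long horizon, so the full-trajectory ratio explodes or vanishes.
T = 200
print(1.05 ** T)   # ratio slightly above 1: astronomically large
print(0.95 ** T)   # ratio slightly below 1: effectively zero
```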
The practical solution: approximate using per-timestep ratios instead:
\[\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\pi_{\theta'}(\mathbf{a}_{i,t} | \mathbf{s}_{i,t})}{\pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t})} \nabla_{\theta'} \log \pi_{\theta'}(\mathbf{a}_{i,t} | \mathbf{s}_{i,t}) \left( \left( \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}) \right) - b \right)\]This is an approximation (we’re ignoring changes in the state distribution), but it’s much more stable. It lets us take multiple gradient steps on the same batch of data.
Constraining the Policy Update
If $\pi_{\theta'}$ drifts too far from $\pi_\theta$, the importance sampling correction becomes unreliable. A common safeguard: constrain the KL divergence between the new and old policies:
\[\mathbb{E}_{\mathbf{s} \sim \pi_\theta} \left[ D_{KL}(\pi_{\theta'}(\cdot | \mathbf{s}) \| \pi_\theta(\cdot | \mathbf{s})) \right] \leq \delta\]This idea leads directly to algorithms like TRPO and PPO — but that’s a story for another post.
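In practice the constraint is estimated over sampled states; for discrete actions the per-state KL term has a closed form. A small sketch (the action distributions are made up):

```python
import numpy as np

def kl_categorical(p_new, p_old):
    """D_KL(p_new || p_old) for discrete action distributions at one state."""
    return float(np.sum(p_new * np.log(p_new / p_old)))

# Made-up old and new action distributions at a single state
p_old = np.array([0.5, 0.3, 0.2])
p_new = np.array([0.6, 0.25, 0.15])
print(kl_categorical(p_new, p_old))  # small positive number
```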
Summary
| Concept | What It Does |
|---|---|
| Log-derivative trick | Converts gradient-of-expectation to expectation-of-gradient |
| REINFORCE | Estimates policy gradient using sampled trajectories |
| Reward-to-go | Only weights actions by future rewards (causality) |
| Baseline | Centers rewards around average, reducing variance without bias |
| Surrogate objective | Enables efficient implementation via autograd |
| Importance sampling | Reuses old data for multiple gradient steps (off-policy) |
| KL constraint | Prevents policy from drifting too far between updates |
The core intuition: do more of the good stuff, less of the bad stuff. Policy gradients are noisy, high-variance, and sensitive to reward scale — but with causality tricks, baselines, and large batches, they form the backbone of modern RL. They’re how robots learn to walk and how LLMs get aligned via RLHF.
Based on notes from Stanford CS224R (Spring 2025) — Deep Reinforcement Learning.