- From Imitation to Reinforcement
- Deriving the Policy Gradient
- The REINFORCE Algorithm
- The Variance Problem
- Variance Reduction I: Reward-to-Go (Causality)
- Variance Reduction II: Baselines
- Implementing Policy Gradients in Practice
- Off-Policy Policy Gradients
- Summary
From Imitation to Reinforcement
In the previous post, we trained a policy by imitating expert demonstrations. The training loop was straightforward: collect expert data, compute a gradient that pushes the policy toward every expert action, update the parameters, repeat.
Policy gradients follow the exact same loop — with two changes:
- The agent collects its own trajectories instead of copying an expert
- The gradient is weighted by reward — good actions get stronger updates, bad actions get weaker ones (or get pushed away)
This is a small change to the math, but a huge change in capability. Imitation learning has a ceiling — you can never outperform the demonstrator. With policy gradients, the agent explores on its own and can discover strategies the expert never showed. We don’t need an expert anymore. We just need a reward signal.
The Formal Objective
We want to find a policy $\pi_\theta$ that maximizes expected cumulative reward:
\[J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right]\]where a trajectory $\tau = (\mathbf{s}_1, \mathbf{a}_1, \mathbf{s}_2, \mathbf{a}_2, \ldots, \mathbf{s}_T, \mathbf{a}_T)$ is generated by the policy interacting with the environment.
The training loop:
- Run the current policy to collect a batch of trajectories
- Estimate the gradient $\nabla_\theta J(\theta)$
- Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
- Repeat
The catch is step 2. How do we actually compute this gradient?
Deriving the Policy Gradient
We want $\nabla_\theta J(\theta)$, but $\theta$ influences $J$ through the distribution of trajectories $p_\theta(\tau)$. We can’t differentiate through the environment’s dynamics — we don’t even know what they are.
The Log-Derivative Trick
Start with the integral form:
\[\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) \, r(\tau) \, d\tau = \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau\]Now use the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$:
\[\nabla_\theta J(\theta) = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right]\]This is the key move — we’ve turned a gradient-of-an-expectation into an expectation-of-a-gradient, which means we can estimate it with samples.
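A quick numerical sanity check (my own toy example, not from the post): for $x \sim \mathcal{N}(\theta, 1)$ we have $\mathbb{E}[x^2] = \theta^2 + 1$, so the true gradient is $2\theta$, and the score function is $\nabla_\theta \log p_\theta(x) = x - \theta$. The sample-based estimator recovers the gradient without differentiating through the sampling process:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0
x = rng.normal(theta, 1.0, size=100_000)   # samples from p_theta = N(theta, 1)

# Score function of N(theta, 1): grad_theta log p_theta(x) = x - theta
score = x - theta
grad_estimate = np.mean(score * x**2)

print(grad_estimate)   # close to the true gradient 2 * theta = 2.0
```

The estimator only needs samples and the score — exactly the property that lets policy gradients sidestep the unknown environment dynamics.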
Eliminating the Dynamics
The trajectory probability decomposes as:
\[p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\]Taking the log:
\[\log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^{T} \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\]When we take $\nabla_\theta$, the initial state distribution and the dynamics terms vanish — they don’t depend on $\theta$. We’re left with:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]\]We never need to know the dynamics. We only need to be able to sample trajectories and evaluate $\log \pi_\theta(\mathbf{a} \mid \mathbf{s})$ — which is just the neural network’s output.
The REINFORCE Algorithm
Approximating the expectation with $N$ sampled trajectories gives us the vanilla policy gradient (a.k.a. REINFORCE):
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{t}^i | \mathbf{s}_{t}^i) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{t}^i, \mathbf{a}_{t}^i) \right)\]The full algorithm:
- Sample $\lbrace\tau^i\rbrace$ from $\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$
- Compute $\nabla_\theta J(\theta)$ using the trajectories
- $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
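To make the loop concrete, here is a self-contained toy run (my own illustration: a two-armed bandit stands in for the environment, and the arm payoffs, batch size, and step size are arbitrary choices). Each pull is a length-1 trajectory, and the policy is a softmax over two logits:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy "environment": a two-armed bandit where arm 0 pays 1.0 and arm 1 pays 0.0.
# Each pull is a length-1 trajectory, so R(tau) is just the immediate reward.
def reward(a):
    return 1.0 if a == 0 else 0.0

logits = np.zeros(2)   # policy parameters theta (softmax over two arms)
alpha = 0.1            # learning rate

for step in range(500):
    probs = softmax(logits)
    # 1. Sample a batch of trajectories from pi_theta
    actions = rng.choice(2, size=16, p=probs)
    # 2. Estimate grad J = mean of grad log pi(a) * R(tau);
    #    for a softmax policy, grad log pi(a) w.r.t. logits = onehot(a) - probs.
    grad = np.zeros(2)
    for a in actions:
        grad += (np.eye(2)[a] - probs) * reward(a)
    grad /= len(actions)
    # 3. Gradient ascent step
    logits += alpha * grad

print(softmax(logits))  # most of the probability mass on the rewarding arm
```

The policy quickly concentrates on the rewarding arm — a minimal instance of "do more of the good stuff."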
What Does This Actually Mean?
Compare the REINFORCE gradient to the imitation learning gradient:
| | Imitation Learning | REINFORCE |
|---|---|---|
| Data | Expert demonstrations | Self-generated rollouts |
| Gradient | $\sum \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s})$ | $\sum \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s}) \cdot R(\tau)$ |
| Intuition | Copy all actions equally | Copy actions, weighted by reward |
REINFORCE is just imitation learning on your own data, where you “imitate yourself more” when things go well and “imitate yourself less” when things go badly. Do more of the good stuff, less of the bad stuff.
The Variance Problem
Example: Learning to Walk
Consider a simulated humanoid learning to walk, where $r(\mathbf{s}, \mathbf{a})$ = forward velocity.
Suppose we sample 5 trajectories:
| Trajectory | What Happens | Reward |
|---|---|---|
| $\tau^1$ | Falls backwards | Negative |
| $\tau^2$ | Falls forwards | Positive |
| $\tau^3$ | Stands still | Zero |
| $\tau^4$ | One step forward, then falls back | Negative |
| $\tau^5$ | Large step back, then small step forward | Negative |
The gradient will increase the probability of the actions in $\tau^2$ (falling forward, the only positive-reward trajectory) and decrease the probability of the actions in the negative-reward trajectories; $\tau^3$ contributes nothing, since its reward is zero. The agent learns to faceplant instead of walk, because with only a few samples, “falling forward” looks like the best strategy.
This is the core issue: policy gradients are noisy and high-variance. With limited samples, the gradient is a rough approximation that can lead to strange local optima.
Variance Reduction I: Reward-to-Go (Causality)
The vanilla gradient weights action $\mathbf{a}_t$ by the total trajectory reward — including rewards from timesteps before $t$. But an action at $t=100$ can’t possibly affect the reward at $t=10$. This violates causality and adds pure noise.
The fix: only weight each action by rewards from that point onward (the reward-to-go):
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t}) \left( \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}) \right)\]Notice the inner sum now starts at $t' = t$ instead of $t' = 1$. This is still an unbiased estimator, but with significantly lower variance.
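Computing the reward-to-go for every timestep of a trajectory is just a reverse cumulative sum; a minimal sketch:

```python
def rewards_to_go(rewards):
    """out[t] = r[t] + r[t+1] + ... + r[T-1], via a reverse cumulative sum."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

print(rewards_to_go([1.0, 2.0, 3.0]))  # [6.0, 5.0, 3.0]
```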
Going back to the walking example — trajectory $\tau^5$ (“large step backwards then small step forwards”) had a negative total reward. With vanilla REINFORCE, the gradient would push against the final small-step-forward action, even though that action was actually good. With reward-to-go, the forward step only sees the rewards it caused, and gets properly reinforced.
Variance Reduction II: Baselines
The All-Positive Reward Problem
Consider a robot learning to fold a jacket, with rewards:
\[r(\mathbf{s}, \mathbf{a}) = \begin{cases} 1.0 & \text{neatly folded} \\ 0.5 & \text{folded with wrinkles} \\ 0 & \text{not folded} \end{cases}\]If all sampled trajectories get reward $\geq 0$, the gradient will increase the likelihood of everything — even the “do nothing” trajectories. It will just push up the good actions slightly more than the bad ones. This is slow and wasteful.
Subtracting a Baseline
We subtract a baseline $b$ from the reward:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \left( r(\tau) - b \right) \right]\]A natural choice: $b = \frac{1}{N} \sum_{i=1}^{N} r(\tau^i)$ (the average reward across the batch).
Now actions are judged as “better than average” or “worse than average.” Trajectories that score below the mean get negative gradients, actively pushing the policy away from those actions.
Is this valid? Yes — subtracting any constant baseline leaves the gradient unbiased:
\[\mathbb{E}\left[\nabla_\theta \log p_\theta(\tau) \cdot b\right] = b \int \nabla_\theta p_\theta(\tau) \, d\tau = b \, \nabla_\theta \underbrace{\int p_\theta(\tau) d\tau}_{= 1} = 0\]The baseline contributes zero in expectation, so the estimator remains unbiased while reducing variance.
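A toy numerical check of this claim (the distribution and reward values are my own illustration): one-step “trajectories” $x \sim \mathcal{N}(\theta, 1)$ with all-positive rewards $r(x) = 10 + x$. Both estimators average to the true gradient, but the baselined one has far lower variance:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
x = rng.normal(theta, 1.0, size=100_000)   # one-step "trajectories"

score = x - theta    # grad_theta log N(x; theta, 1)
r = 10.0 + x         # all-positive rewards; true gradient of E[r] is 1.0
b = r.mean()         # batch-average baseline

g_raw = score * r           # vanilla estimator
g_base = score * (r - b)    # baselined estimator

print(g_raw.mean(), g_base.mean())   # both near 1.0
print(g_raw.var(), g_base.var())     # variance drops by roughly 50x
```

Same expected gradient, drastically less noise — which is the whole point of the baseline.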
Implementing Policy Gradients in Practice
The Surrogate Objective
Computing $N \times T$ individual backward passes is expensive. Instead, we define a surrogate objective whose gradient matches our policy gradient:
\[\tilde{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t}) \left( \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}) - b \right)\]This is just a weighted maximum likelihood objective. For discrete actions, $\log \pi_\theta$ is a (negated) cross-entropy loss; for continuous actions with a Gaussian policy, it's a weighted squared error up to constants. The “weight” $\hat{A}_{i,t} = \text{reward-to-go} - b$ is treated as a constant (detached from the computation graph).
One call to .backward() in PyTorch gives you the policy gradient. No custom gradient code needed.
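A minimal PyTorch sketch of this for discrete actions (the batch shapes, sampled actions, and advantage values are made up for illustration): the advantages enter as plain constants, and one backward call produces the gradient of the surrogate:

```python
import torch

torch.manual_seed(0)

# Made-up batch: 4 (state, action) pairs, 3 discrete actions.
logits = torch.randn(4, 3, requires_grad=True)    # what a policy net would output
actions = torch.tensor([0, 2, 1, 2])              # sampled actions a_{i,t}
advantages = torch.tensor([1.0, -0.5, 2.0, 0.3])  # reward-to-go minus baseline (constants)

# Weighted log-likelihood surrogate; negate so minimizing the loss ascends J.
log_probs = torch.log_softmax(logits, dim=-1)
chosen = log_probs[torch.arange(4), actions]
loss = -(chosen * advantages).mean()
loss.backward()   # logits.grad now holds the (negated, batch-averaged) policy gradient
```

For a softmax policy, autograd reproduces the analytic per-sample gradient $\hat{A}_{i,t}(\text{onehot}(a) - \text{softmax}(\text{logits}))$ exactly, with no custom gradient code.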
Off-Policy Policy Gradients
The Data Efficiency Problem
Vanilla policy gradient is on-policy: after every gradient step, we change $\theta$, which invalidates all our collected data. We have to throw it away and collect fresh trajectories. This is extremely wasteful.
Importance Sampling
What if we could reuse data collected under the current policy $\pi_\theta$ while evaluating an updated policy $\pi_{\theta'}$? Importance sampling lets us correct for the mismatch:
\[\mathbb{E}_{x \sim p(x)}[f(x)] = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]\]Applied to trajectories, the ratio $\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}$ simplifies nicely because the dynamics cancel:
\[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \frac{\prod_{t=1}^{T} \pi_{\theta'}(\mathbf{a}_t | \mathbf{s}_t)}{\prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t | \mathbf{s}_t)}\]The Practical “Hitch”
The problem: this product of ratios over $T$ timesteps can explode or vanish for long horizons. One ratio slightly above 1.0 compounded over hundreds of steps becomes enormous.
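The compounding is easy to see numerically (toy numbers of my choosing, not from the post):

```python
# Per-step ratios only slightly away from 1.0 compound exponentially
# over a long horizon, so the full-trajectory ratio explodes or vanishes.
T = 200
print(1.05 ** T)   # ratio slightly above 1: astronomically large
print(0.95 ** T)   # ratio slightly below 1: effectively zero
```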
The practical solution: approximate using per-timestep ratios instead:
\[\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\pi_{\theta'}(\mathbf{a}_{i,t} | \mathbf{s}_{i,t})}{\pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t})} \nabla_{\theta'} \log \pi_{\theta'}(\mathbf{a}_{i,t} | \mathbf{s}_{i,t}) \left( \left( \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}) \right) - b \right)\]This is an approximation (we’re ignoring changes in the state distribution), but it’s much more stable. It lets us take multiple gradient steps on the same batch of data.
Constraining the Policy Update
If $\pi_{\theta'}$ drifts too far from $\pi_\theta$, the importance sampling correction becomes unreliable. A common safeguard: constrain the KL divergence between the new and old policies:
\[\mathbb{E}_{\mathbf{s} \sim \pi_\theta} \left[ D_{KL}(\pi_{\theta'}(\cdot | \mathbf{s}) \| \pi_\theta(\cdot | \mathbf{s})) \right] \leq \delta\]This idea leads directly to algorithms like TRPO and PPO — but that’s a story for another post.
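In practice the constraint is estimated over sampled states; for discrete actions the per-state KL term has a closed form. A small sketch (the action distributions are made up):

```python
import numpy as np

def kl_categorical(p_new, p_old):
    """D_KL(p_new || p_old) for discrete action distributions at one state."""
    return float(np.sum(p_new * np.log(p_new / p_old)))

# Made-up old and new action distributions at a single state
p_old = np.array([0.5, 0.3, 0.2])
p_new = np.array([0.6, 0.25, 0.15])
print(kl_categorical(p_new, p_old))  # small positive number
```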
Summary
| Concept | What It Does |
|---|---|
| Log-derivative trick | Converts gradient-of-expectation to expectation-of-gradient |
| REINFORCE | Estimates policy gradient using sampled trajectories |
| Reward-to-go | Only weights actions by future rewards (causality) |
| Baseline | Centers rewards around average, reducing variance without bias |
| Surrogate objective | Enables efficient implementation via autograd |
| Importance sampling | Reuses old data for multiple gradient steps (off-policy) |
| KL constraint | Prevents policy from drifting too far between updates |
The core intuition: do more of the good stuff, less of the bad stuff. Policy gradients are noisy, high-variance, and sensitive to reward scale — but with causality tricks, baselines, and large batches, they form the backbone of modern RL. They’re how robots learn to walk and how LLMs get aligned via RLHF.
Based on notes from Stanford CS224R (Spring 2025) — Deep Reinforcement Learning.