From Imitation to Reinforcement
- The Formal Objective
Deriving the Policy Gradient
- The Log-Derivative Trick
- Eliminating the Dynamics
The REINFORCE Algorithm
- What Does This Actually Mean?
The Variance Problem
Off-Policy Policy Gradients - The Data Efficiency Problem
- Importance Sampling
- Why Per-Timestep Ratios, Not Trajectory Ratios?
Summary

From Imitation to Reinforcement

In the previous post, we trained a policy by imitating expert demonstrations. The training loop was straightforward: collect expert data, compute a gradient that pushes the policy toward every expert action, update the parameters, repeat.

Policy gradients follow the exact same loop — with two changes:

The agent collects its own trajectories instead of copying an expert
The gradient is weighted by reward — good actions get stronger updates, bad actions get weaker ones (or get pushed away)

IL vs Policy Gradients

This is a small change to the math, but a huge change in capability. Imitation learning has a ceiling — you can never outperform the demonstrator. With policy gradients, the agent explores on its own and can discover strategies the expert never showed. We don’t need an expert anymore. We just need a reward signal.

The Formal Objective

We want to find a policy $\pi_\theta$ that maximizes expected cumulative reward:

\[J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right]\]

where a trajectory $\tau = (\mathbf{s}_1, \mathbf{a}_1, \mathbf{s}_2, \mathbf{a}_2, \ldots, \mathbf{s}_T, \mathbf{a}_T)$ is generated by the policy interacting with the environment.

The training loop:

Run the current policy to collect a batch of trajectories
Estimate the gradient $\nabla_\theta J(\theta)$
Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Repeat

Training loop

The catch is step 2. How do we actually compute this gradient?

Deriving the Policy Gradient

We want $\nabla_\theta J(\theta)$, but $\theta$ influences $J$ through the distribution of trajectories $p_\theta(\tau)$. We can’t differentiate through the environment’s dynamics — we don’t even know what they are.

The Log-Derivative Trick

Start with the integral form:

\[\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) \, r(\tau) \, d\tau = \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau\]

Now use the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$:

\[\nabla_\theta J(\theta) = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right]\]

This is the key move — we’ve turned a gradient-of-an-expectation into an expectation-of-a-gradient, which means we can estimate it with samples.

Notice how similar this is to the imitation learning gradient. The only difference is the reward weight:

	Gradient
Imitation Learning	$\nabla_\theta J \approx \sum \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s}) \times \mathbf{1}$
Policy Gradient	$\nabla_\theta J \approx \sum \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s}) \times \mathbf{r(\tau)}$

Same log-probability gradient underneath — just multiply by reward.

Eliminating the Dynamics

The trajectory probability decomposes as:

\[p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\]

Taking the log:

\[\log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^{T} \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\]

When we take $\nabla_\theta$, the initial state distribution and the dynamics terms vanish — they don’t depend on $\theta$. We’re left with:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]\]

We never need to know the dynamics. We only need to be able to sample trajectories and evaluate $\log \pi_\theta(\mathbf{a} \mid \mathbf{s})$ — which is just the neural network’s output.

The REINFORCE Algorithm

Approximating the expectation with $N$ sampled trajectories gives us the vanilla policy gradient (a.k.a. REINFORCE):

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{t}^i | \mathbf{s}_{t}^i) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{t}^i, \mathbf{a}_{t}^i) \right)\]

The full algorithm:

Sample $\lbrace\tau^i\rbrace$ from $\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$

Compute $\nabla_\theta J(\theta)$ using the trajectories

$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

But wait — in imitation learning, we checked whether our distribution put high probability where the expert acted. Without an expert, what do we compare against?

Nothing. The agent samples from its own distribution, takes the action, and observes a reward. The log-probability computation is identical — the only difference is whose action we evaluate, and what we do with the gradient:

The agent asks “what probability did I assign to the action I just took?” — same cross-entropy as IL. But instead of always pushing that probability up (as IL does for expert actions), the reward decides: positive reward pushes it up, negative reward pushes it down.

What Does This Actually Mean?

REINFORCE vs IL gradient

Compare the REINFORCE gradient to the imitation learning gradient:

	Imitation Learning	REINFORCE
Data	Expert demonstrations	Self-generated rollouts
Gradient	$\sum \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s})$	$\sum \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s}) \cdot R(\tau)$
Intuition	Copy all actions equally	Copy actions, weighted by reward

REINFORCE is just imitation learning on your own data, where you “imitate yourself more” when things go well and “imitate yourself less” when things go badly. Do more of the good stuff, less of the bad stuff.

Here’s the full training loop in PyTorch. We’ll keep building on this code throughout the post:

# ---- Vanilla REINFORCE ----
for epoch in range(num_epochs):
    # 1. collect N trajectories, each T timesteps long
    states, actions, rewards = collect_trajectories(env, policy, N, T)
    # states: (N, T, state_dim), actions: (N, T), rewards: (N, T)

    # 2. total reward per trajectory
    total_rewards = rewards.sum(dim=1)  # (N,)

    # 3. log probs for every (s, a) pair
    dist = policy(states)                                     # batched forward pass
    log_probs = dist.log_prob(actions)                        # (N, T)

    # 4. weight each timestep by TOTAL trajectory reward
    weights = total_rewards[:, None].expand_as(log_probs)     # (N, T)

    # 5. policy gradient loss
    loss = -(log_probs * weights).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Every action in a trajectory gets the same weight — the total reward of that trajectory. Good trajectory? Push all its actions up. Bad trajectory? Push them all down.

The Variance Problem

Example: Learning to Walk

Humanoid walking trajectories

Consider a simulated humanoid learning to walk, where $r(\mathbf{s}, \mathbf{a})$ = forward velocity.

Suppose we sample 5 trajectories:

Trajectory	What Happens	Reward
$\tau^1$	Falls backwards	Negative
$\tau^2$	Falls forwards	Positive
$\tau^3$	Stands still	Zero
$\tau^4$	One step forward, then falls back	Negative
$\tau^5$	Large step back, then small step forward	Negative

The gradient will increase the probability of falling forward (the highest-reward trajectory) and decrease the rest. The agent learns to faceplant instead of walk, because with only a few samples, “falling forward” looks like the best strategy.

This is the core issue: policy gradients are noisy and high-variance. With limited samples, the gradient is a rough approximation that can lead to strange local optima.

Variance Reduction I: Reward-to-Go (Causality)

The vanilla gradient weights action $\mathbf{a}_t$ by the total trajectory reward — including rewards from timesteps before $t$. But an action at $t=100$ can’t possibly affect the reward at $t=10$. This violates causality and adds pure noise.

The fix: only weight each action by rewards from that point onward (the reward-to-go):

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t}) \left( \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}) \right)\]

Notice the inner sum now starts at $t’ = t$ instead of $t’ = 1$. This is still an unbiased estimator, but with significantly lower variance.

In code, the reward-to-go is a reverse cumulative sum. Let’s build it step by step:

def compute_rewards_to_go(rewards):
    """
    rewards come from the environment (simulator) after each action.
    Shape: (N, T) — N trajectories, each T timesteps long.

    We want rtg[i, t] = sum of rewards[i, t:] (future rewards from t onward).
    """
    # example rewards single batch (single trajectory): [1, 0, 3, 2]

    # step 1: reverse the time axis
    reversed_rewards = torch.flip(rewards, dims=[1])
    # [2, 3, 0, 1]

    # step 2: cumulative sum along time
    reversed_cumsum = torch.cumsum(reversed_rewards, dim=1)
    # [2, 5, 5, 6]

    # step 3: flip back to original order
    rtg = torch.flip(reversed_cumsum, dims=[1])
    # [6, 5, 5, 2]  ← reward-to-go!
    #  t=0: 1+0+3+2=6  (sees everything)
    #  t=3:       2=2   (only the last reward)

    return rtg

The training loop changes by one line:

    # 4. weight each timestep by FUTURE rewards only (not total)
    weights = compute_rewards_to_go(rewards)                  # (N, T)

Variance Reduction II: Baselines

The All-Positive Reward Problem

Baseline example

Consider a robot learning to fold a jacket, with rewards:

\[r(\mathbf{s}, \mathbf{a}) = \begin{cases} 1.0 & \text{neatly folded} \\ 0.5 & \text{folded with wrinkles} \\ 0 & \text{not folded} \end{cases}\]

If all sampled trajectories get reward $\geq 0$, the gradient will increase the likelihood of everything — even the “do nothing” trajectories. It will just push up the good actions slightly more than the bad ones. This is slow and wasteful.

Subtracting a Baseline

We subtract a baseline $b$ from the reward:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \left( r(\tau) - b \right) \right]\]

A natural choice: $b = \frac{1}{N} \sum_{i=1}^{N} r(\tau^i)$ (the average reward across the batch).

Now actions are judged as “better than average” or “worse than average.” Trajectories that score below the mean get negative gradients, actively pushing the policy away from those actions.

Is this valid? Yes — subtracting any constant baseline leaves the gradient unbiased:

\[\mathbb{E}\left[\nabla_\theta \log p_\theta(\tau) \cdot b\right] = b \int \nabla_\theta p_\theta(\tau) \, d\tau = b \, \nabla_\theta \underbrace{\int p_\theta(\tau) d\tau}_{= 1} = 0\]

The baseline contributes zero in expectation, so the estimator remains unbiased while reducing variance.

Combining reward-to-go with a baseline gives us the advantage — “how much better was this action than the average action at this timestep?”:

\[\hat{A}_{i,t} = \underbrace{\sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'})}_{\text{reward-to-go}} - \underbrace{b_t}_{\text{baseline}}\]

In our training loop, step 4 becomes:

    # 4. advantages = reward-to-go minus baseline
    rtg = compute_rewards_to_go(rewards)                      # (N, T)
    baseline = rtg.mean(dim=0)                                # (T,) — average across batch
    advantages = (rtg - baseline).detach()                    # (N, T)
    weights = advantages

The ‘Surrogate Objective’

These are just fancy words. The concept is straightforward!

Look back at our training loop. We’ve been writing:

    # 5. policy gradient loss
    loss = -(log_probs * weights).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Why can’t we just call .backward() on -(log_probs * advantages) directly? That tensor has shape (N, T) — it’s not a scalar. PyTorch’s autograd can only backpropagate from a scalar. You need to reduce the tensor to a single number first, and .mean() or .sum() both work. The difference is just a constant factor of $\frac{1}{N \cdot T}$, which the learning rate absorbs anyway. .mean() is conventional because it makes the loss scale independent of batch size.

The deeper question: the policy gradient formula says “for each (state, action) pair, compute $\nabla_\theta \log \pi_\theta(a \mid s)$ and scale by the advantage.” Read literally, that’s a separate backward pass per sample:

# literal reading: one backward pass per (s,a) pair — N*T backward passes!
for i in range(N):
    for t in range(T):
        log_prob = policy(states[i,t]).log_prob(actions[i,t])
        log_prob.backward()
        for p in policy.parameters():
            p.grad *= advantages[i,t]

But autograd can differentiate through .mean() (or .sum()) in one pass. Since the advantages are constants (.detach()), the gradient distributes over the sum:

\[\nabla_\theta \left[\sum_{i,t} \log \pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t}) \cdot \hat{A}_{i,t}\right] = \sum_{i,t} \hat{A}_{i,t} \cdot \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} | \mathbf{s}_{i,t})\]

That’s exactly the policy gradient. One .backward(), done. The loss value itself is meaningless — it doesn’t tell you how good the policy is. Only its gradient matters.

Off-Policy Policy Gradients - The Data Efficiency Problem

Vanilla policy gradient is on-policy: after every gradient step, we change $\theta$, which invalidates all our collected data. We have to throw it away and collect fresh trajectories. This is extremely wasteful.

On-policy data problem

Importance Sampling

What if we could reuse data from a previous policy $\pi_{\theta_\text{old}}$? The data was sampled from the old policy, but we want to estimate what the gradient would look like under the new policy. Importance sampling says: you can do this by reweighting each sample.

The intuition: suppose the old policy took action $a$ with probability 0.3, and the new policy would take that same action with probability 0.6. Then under the new policy, this action is twice as likely — so we should count it twice as much. The weight is just the ratio:

\[\text{weight} = \frac{\pi_{\theta_\text{new}}(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} = \frac{0.6}{0.3} = 2.0\]

If the new policy assigns less probability to an action than the old one did, the ratio is less than 1 and that sample counts less. This rebalances old samples to look as if they came from the new policy.

Importance weights rebalance

In code, the key idea: collect one batch, then take multiple gradient steps on it. The ratio corrects for the fact that the policy changes between steps. Here’s our training loop updated:

# ---- Off-policy: reuse old data with importance sampling ----
for epoch in range(num_epochs):
    states, actions, rewards = collect_trajectories(env, policy, N, T)

    rtg = compute_rewards_to_go(rewards)
    advantages = (rtg - rtg.mean(dim=0)).detach()

    # snapshot the log probs BEFORE we start updating
    old_log_probs = policy(states).log_prob(actions).detach()  # (N, T)

    for k in range(K):  # K gradient steps on the SAME data
        new_log_probs = policy(states).log_prob(actions)       # (N, T)
        ratios = (new_log_probs - old_log_probs).exp()         # π_new / π_old

        loss = -(ratios * advantages).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The outer loop (for epoch) collects fresh data. The inner loop (for k in range(K)) squeezes multiple gradient steps out of that same data. Without importance sampling we could only do K=1 — one step, then throw the data away. Now we get $K$ steps per collection, which is $K\times$ more sample-efficient.

The ratio π_new(a|s) / π_old(a|s) stays close to 1.0 in the first few inner steps, then drifts as the policy changes. This is what makes the reuse valid — and also what limits it.

Why Per-Timestep Ratios, Not Trajectory Ratios?

You might wonder: shouldn’t the importance weight correct the entire trajectory probability? The trajectory ratio is a product of per-step ratios:

\[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(\mathbf{a}_t | \mathbf{s}_t)}{\pi_\theta(\mathbf{a}_t | \mathbf{s}_t)}\]

Each individual ratio is close to 1.0 — but the product compounds:

# each per-step ratio is close to 1.0...
ratios_per_step = [1.08, 1.12, 0.95, 1.10, 1.06, 1.15, 1.03, ...]

# ...but the product explodes
import math
math.prod([1.08] * 50)   # 1.08^50 = 46.9 — one sample dominates!
math.prod([0.95] * 50)   # 0.95^50 = 0.077 — this sample vanishes

Our code above already avoids this — it uses per-timestep ratios, not the trajectory product. Each ratio stays bounded near 1.0. This is an approximation (we’re ignoring changes in the state distribution), but it’s far more stable in practice.

Summary

The core intuition: do more of the good stuff, less of the bad stuff. Policy gradients are noisy, high-variance, and sensitive to reward scale — but with causality tricks, baselines, and large batches, they work!

Based on notes from Stanford CS224R (Spring 2025) — Deep Reinforcement Learning.