- Neural Nets are the ‘Policy’
- Discrete Actions: The Simple Case
- Continuous Actions: Where Things Get Interesting
- Attempt 1: Predict a Single Value - A Deterministic Policy
- Attempt 2: Predicting a Simple Distribution (Gaussian) - A Stochastic Policy
- Attempt 3: Gaussian Mixture Model - Slightly More Expressive Stochastic Policy
- Attempt 4: Autoregressive Models: Even More Expressive Stochastic Policies
- Training Stochastic Policies with Maximum Likelihood
- Extension 2: Teacher Forcing (used the same way in autoregressive language models)
- Summary: The Expressiveness Ladder
- Appendix: The Reparameterization Trick
Neural Nets are the ‘Policy’
In imitation learning (behaviour cloning), the question is: given expert demonstrations, can we train a neural network (the policy) to mimic the expert’s actions? This turns out to be surprisingly nuanced: the way we design the model’s outputs and loss function largely determines how “expressive” the learned policy can be.
This post walks through the progression from naive approaches to more expressive generative policies.
Discrete Actions: The Simple Case
Consider training a network to play Pacman, where the action is one of 4 buttons.
This is just a classification problem. The model outputs 4 logits (one per action), and we train with cross-entropy loss. Simple, clean, and maximally expressive: the softmax output can represent any distribution over those 4 buttons.
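As a minimal sketch of the classification setup above, the snippet below turns 4 logits into action probabilities and computes the cross-entropy loss for one expert action. The button names and the logit values are assumptions made up for illustration; in practice the logits would come from the network.

```python
import math

# Assumed button layout, for illustration only
ACTIONS = ["up", "down", "left", "right"]

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, expert_action):
    """Negative log-probability the policy assigns to the expert's action."""
    probs = softmax(logits)
    return -math.log(probs[ACTIONS.index(expert_action)])

logits = [2.0, 0.5, -1.0, 0.1]  # hypothetical network output for one state
probs = softmax(logits)
loss_up = cross_entropy(logits, "up")      # expert pressed the likeliest button
loss_left = cross_entropy(logits, "left")  # expert pressed an unlikely button
```

The loss is small when the policy already puts high probability on the expert's button and large otherwise, so minimizing it pushes probability mass toward what the expert did.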
But what happens when actions are continuous…?
Continuous Actions: Where Things Get Interesting
Now suppose we want to train a policy to estimate steering angle for a self-driving car.
Attempt 1: Predict a Single Value - A Deterministic Policy
The simplest approach: make the model output a single angle and train with L2 (MSE) loss.
The problem: mean-averaging. Imagine we collect data from four expert drivers, all in the same state:
| Robot State | Expert Driver | Action (angle $\theta$) |
|---|---|---|
| s | Driver_1 | -10° |
| s | Driver_2 | +10° |
| s | Driver_3 | -20° |
| s | Driver_4 | +20° |
With L2 loss, the model will learn to predict 0° — the mean of all expert actions.
That’s neither left nor right, which could be catastrophic if the experts were, say, swerving to avoid an obstacle.
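The averaging effect can be checked numerically. The sketch below takes the four expert angles from the table and searches a grid of candidate predictions for the one with the lowest MSE; the grid range and step size are arbitrary choices for illustration.

```python
# Expert actions from the table above, in degrees
angles = [-10.0, 10.0, -20.0, 20.0]

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Candidate predictions from -30 to +30 degrees in 1-degree steps
candidates = [float(c) for c in range(-30, 31)]
best = min(candidates, key=lambda c: mse(c, angles))

loss_at_mean = mse(0.0, angles)    # 250.0
loss_at_left = mse(-10.0, angles)  # 350.0
```

Predicting 0° gives a lower loss than committing to any angle an expert actually chose, which is exactly why L2 training lands there.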
Beyond mean-averaging, deterministic policies have a compounding error problem:
- The model is trained on expert data, so it learns to predict well on states that the expert visits.
- But at test time, if the model makes a small mistake and ends up in a state that the expert never visited, it has no idea what to do.
Attempt 2: Predicting a Simple Distribution (Gaussian) - A Stochastic Policy
What if instead of predicting a single value, the model outputs the parameters of a Gaussian distribution — a mean $\mu$ and variance $\sigma^2$?
- The expert data distribution for our steering example was bimodal (some drivers go left, some go right)
- A single Gaussian can only capture one mode
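- If we train with L2 loss (which is equivalent to maximum likelihood for a Gaussian with fixed variance), we again predict the average, right between the two modes. Learning the variance as well does not help: the fitted mean stays at the average and the variance simply grows to cover both modes.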
Even if we somehow get the Gaussian to match one of the modes, we’re constraining the model to represent only a unimodal distribution. Not expressive enough.
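To make the unimodal limitation concrete, here is a small sketch that fits a single Gaussian to the four expert angles by maximum likelihood (closed form: sample mean and sample variance) and then compares the density at 0° with the density at the angles the experts chose. The helper name `gaussian_pdf` is introduced here for illustration.

```python
import math

angles = [-10.0, 10.0, -20.0, 20.0]  # expert actions, in degrees

# Closed-form maximum-likelihood estimates for a single Gaussian
mu = sum(angles) / len(angles)
var = sum((a - mu) ** 2 for a in angles) / len(angles)
sigma = math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

density_straight = gaussian_pdf(0.0, mu, sigma)   # action no expert took
density_expert = gaussian_pdf(10.0, mu, sigma)    # action an expert took
```

The fitted mean is 0° with a standard deviation of roughly 15.8°, and the density peaks at 0°. Sampling from this policy would most often steer close to straight ahead, the one direction none of the experts picked.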
Attempt 3: Gaussian Mixture Model - Slightly More Expressive Stochastic Policy
What if we ask the model to predict a mixture of Gaussians?
This is more expressive, but still limited by the number of mixture components we choose. If we pick 2 components, we can capture the bimodal distribution. But what if the expert data has 3 modes? Or 10 modes? We would need to arbitrarily choose a large number of components, which is inefficient and still may not capture the true distribution well.
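The gain from a second component can be sketched by comparing average negative log-likelihood on the four expert angles. The mixture parameters below (means at ±15°, standard deviation 5°, equal weights) are hand-picked for illustration rather than fitted; a real policy network would output them per state.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """components: list of (weight, mu, sigma) tuples whose weights sum to 1."""
    return sum(w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in components)

def mean_nll(data, components):
    """Average negative log-likelihood of the data under the mixture."""
    return -sum(math.log(mixture_pdf(x, components)) for x in data) / len(data)

angles = [-10.0, 10.0, -20.0, 20.0]

# Maximum-likelihood single Gaussian for this data (mean 0, variance 250)
single = [(1.0, 0.0, math.sqrt(250.0))]
# Hand-picked two-component mixture: one component per steering direction
mixture = [(0.5, -15.0, 5.0), (0.5, 15.0, 5.0)]

single_nll = mean_nll(angles, single)    # about 4.18
mixture_nll = mean_nll(angles, mixture)  # about 3.72
```

The mixture explains the data better and assigns far less density to 0° than the single Gaussian does. The catch described above remains: the number of components is fixed in advance.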
Attempt 4: Autoregressive Models: Even More Expressive Stochastic Policies
Here we discretize the action space into ‘bins’ small enough that the result is as close to continuous as we can get (our best approximation).
The model then outputs a probability for each bin, which is again multi-class classification. More bins = more expressiveness, just like a larger vocabulary size in LLMs.
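A minimal sketch of the discretization step, assuming a steering range of -30° to +30° split into 12 equal bins (both numbers are made up for illustration). It maps each expert angle to a bin index, builds the empirical target distribution over bins, and scores two candidate predictions with cross-entropy.

```python
import math
from collections import Counter

# Assumed steering range and bin count, for illustration only
LOW, HIGH, NUM_BINS = -30.0, 30.0, 12
WIDTH = (HIGH - LOW) / NUM_BINS  # 5 degrees per bin

def angle_to_bin(angle):
    """Map a continuous angle to the index of the bin that contains it."""
    idx = int((angle - LOW) // WIDTH)
    return min(max(idx, 0), NUM_BINS - 1)  # clamp values at or beyond the edges

def bin_to_angle(idx):
    """Decode a bin index back to a continuous action (the bin centre)."""
    return LOW + (idx + 0.5) * WIDTH

def cross_entropy(pred_probs, target_probs):
    """Cross-entropy of predicted bin probabilities against a target distribution."""
    return -sum(t * math.log(p) for p, t in zip(pred_probs, target_probs) if t > 0)

expert_angles = [-10.0, 10.0, -20.0, 20.0]
expert_bins = [angle_to_bin(a) for a in expert_angles]  # [4, 8, 2, 10]

# Empirical distribution over bins: 0.25 on each expert bin, 0 elsewhere
counts = Counter(expert_bins)
target = [counts[i] / len(expert_bins) for i in range(NUM_BINS)]

uniform = [1.0 / NUM_BINS] * NUM_BINS
loss_uniform = cross_entropy(uniform, target)  # log(12)
loss_matched = cross_entropy(target, target)   # log(4), the lowest achievable here
```

A categorical distribution over bins can match the four-mode target exactly and put zero mass on the bin containing 0°, something neither the single Gaussian nor L2 regression could do. The price is resolution: decoding a bin returns its centre, so finer control needs more bins.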
Training Stochastic Policies with Maximum Likelihood
Extension 1: Predicting Multiple Actions per State
Extension 2: Teacher Forcing (used the same way in autoregressive language models)
Teacher forcing means that during training, we feed the ground truth $a_1$ (from the expert data) as input when computing $a_2$’s loss, rather than the model’s own predicted $a_1$. This stabilizes training.
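The idea can be sketched with a toy stand-in for the network. Everything here is hypothetical: `toy_model` is a hand-written function returning fixed probabilities over 3 bins, and the state is unused. The point is only to show where the ground-truth prefix enters the loss.

```python
import math

def toy_model(state, prefix):
    """Hypothetical stand-in for a policy network.

    Returns probabilities over 3 bins for the next action token,
    given the state and the action tokens produced so far.
    """
    if not prefix:  # predicting a_1
        return [0.6, 0.3, 0.1]
    # Predicting a_2: this toy model favours repeating the previous bin
    probs = [0.1, 0.1, 0.1]
    probs[prefix[-1]] = 0.8
    return probs

def teacher_forced_nll(state, expert_actions):
    """Sum of -log p(a_t | state, expert a_1..a_{t-1}) over the sequence."""
    nll = 0.0
    for t, target in enumerate(expert_actions):
        prefix = expert_actions[:t]  # ground-truth prefix, not the model's own output
        probs = toy_model(state, prefix)
        nll -= math.log(probs[target])
    return nll

expert = [1, 1]  # expert's a_1 and a_2 as bin indices
loss = teacher_forced_nll(state=None, expert_actions=expert)

# Without teacher forcing, the model's own (argmax) a_1 would be fed instead
model_a1 = max(range(3), key=lambda i: toy_model(None, [])[i])  # bin 0
p_a2_free_running = toy_model(None, [model_a1])[expert[1]]
```

With the expert's $a_1$ as input, the toy model gives the expert's $a_2$ probability 0.8. Fed its own guess for $a_1$ instead, that probability drops to 0.1, so the loss on $a_2$ would be penalizing an error that was really made at $a_1$.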
Summary: The Expressiveness Ladder
| Approach | Output | Loss | Expressiveness |
|---|---|---|---|
| Deterministic | Single value | L2 (MSE) | Lowest — predicts the mean |
| Single Gaussian | $\mu, \sigma$ | Gaussian NLL | Unimodal only |
| Mixture of Gaussians | $\{\mu_i, \sigma_i, w_i\}$ | Mixture NLL | Multi-modal, fixed components |
| Autoregressive (discretized) | Bin probabilities | Cross-entropy | Most expressive, up to bin resolution |
The formal objective for expressive imitation learning:
\[\min_\theta \; -\mathbb{E}_{(\mathbf{s},\mathbf{a}) \sim \mathcal{D}}\left[\log \pi_\theta(\mathbf{a} \mid \mathbf{s})\right]\]
with an expressive distribution $\pi_\theta(\cdot \mid \mathbf{s})$. The more expressive the policy class, the better it can capture the full distribution of expert behavior.
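One way to connect this objective back to Attempt 1 is a worked special case. Assume a scalar action $a$ and a Gaussian policy with mean $\mu_\theta(\mathbf{s})$ and a fixed, state-independent $\sigma$ (notation introduced here for illustration). The per-sample negative log-likelihood is then:

\[-\log \pi_\theta(a \mid \mathbf{s}) = \frac{\left(a - \mu_\theta(\mathbf{s})\right)^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right)\]

The second term does not depend on $\theta$, so under this assumption minimizing the negative log-likelihood is the same as minimizing the squared error. That is why a fixed-variance Gaussian inherits the averaging behaviour of L2 loss.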
Appendix: The Reparameterization Trick
When a training objective depends on an action sampled from a Gaussian policy, we need gradients to flow through the sampling step. But sampling is stochastic, and you can’t backpropagate through a random draw directly.
The reparameterization trick solves this by rewriting:
\[z \sim \mathcal{N}(\mu, \sigma^2)\]
as a deterministic function of the parameters plus external noise:
\[z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)\]
Now $z$ is a deterministic function of $\mu$ and $\sigma$ in the computation graph, and gradients flow through them cleanly:
\[\frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon\]
In PyTorch, this is the difference between `dist.sample()` (no gradients) and `dist.rsample()` (reparameterized, gradients flow).
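The two partial derivatives above are enough to estimate gradients by hand. The sketch below uses only the Python standard library and a made-up loss, $z^2$, chosen because its expected value $\mu^2 + \sigma^2$ has known gradients ($2\mu$ and $2\sigma$) to compare against. The function name and sample count are assumptions for illustration.

```python
import random

def reparam_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2], z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)     # external noise, independent of mu and sigma
        z = mu + sigma * eps          # reparameterized sample
        dloss_dz = 2.0 * z            # derivative of the illustrative loss z^2
        grad_mu += dloss_dz * 1.0     # chain rule with dz/dmu = 1
        grad_sigma += dloss_dz * eps  # chain rule with dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparam_gradients(mu=1.0, sigma=2.0)
# Analytic values for comparison: 2 * mu = 2.0 and 2 * sigma = 4.0
```

The estimates land close to the analytic values, up to Monte Carlo noise. An autodiff framework does the same chain-rule bookkeeping automatically once the sample is written as $\mu + \sigma \cdot \epsilon$.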
Based on notes from Stanford CS224R (Spring 2025) — Deep Reinforcement Learning.