Resources

Policy gradient methods for robotics
arg min blog: a blog of minimum value

Notation

| variable | dimension | name |
| --- | --- | --- |
| $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$ | | trajectory |
| $p(s_{t+1} \mid s_t, a_t)$ | | state transition dynamical model |
| $\pi_\theta(a_t \mid s_t)$ | | policy |
| $p_\theta(\tau)$ | | probability of a trajectory |
| $R(\tau)$ | | utility of a trajectory |

Policy gradient likelihood ratio

We seek to maximise the expected reward:
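with $p_\theta(\tau)$ and $R(\tau)$ as in the notation table, and writing $J(\theta)$ for this expected reward (a label assumed here for convenience),

$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ R(\tau) \right] = \int p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
$$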

We can derive an expression for the gradient of the above objective that is independent of the state transition model and depends only on the policy.
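As a sketch of the standard likelihood-ratio argument, assume the trajectory distribution factorises as $p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. Then

$$
\nabla_\theta J(\theta)
= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
= \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right]
$$

and, because only the policy factors of $\log p_\theta(\tau)$ depend on $\theta$, the initial-state and transition terms vanish under the gradient:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right]
$$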

Since this is an expectation over trajectories, we can approximate the gradient by sampling trajectories from a fixed policy and state transition function:
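concretely, with $N$ sampled trajectories $\tau^{(1)}, \ldots, \tau^{(N)}$ (the superscript indexing the samples), a Monte Carlo estimate of the gradient is

$$
\nabla_\theta J(\theta) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \right) R\!\left(\tau^{(i)}\right)
$$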

In this estimate, each trajectory's gradient is weighted by the reward of the full trajectory. A useful observation is that an action taken at time $t$ cannot influence the rewards received before time $t$, so each $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term only needs to be weighted by the rewards from step $t$ onwards.
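Writing the utility as a sum of per-step rewards, $R(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$ (an assumed decomposition), the estimator with this "reward-to-go" weighting becomes

$$
\nabla_\theta J(\theta) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \left( \sum_{t'=t}^{T} r\!\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right) \right)
$$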

This has the effect of reducing the variance of the gradient estimate.
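To make the sampled estimator concrete, here is a minimal sketch in Python/NumPy of the reward-to-go gradient estimate. The linear-softmax policy, the toy environment dynamics, and all names in the snippet are illustrative assumptions, not something fixed by the derivation above.

```python
# Minimal REINFORCE-style gradient estimate with reward-to-go weighting.
# The policy, environment, and dimensions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, HORIZON = 4, 3, 20


def policy_probs(theta, s):
    """Softmax policy pi_theta(a | s) with linear scores theta^T s."""
    logits = theta.T @ s
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def grad_log_policy(theta, s, a):
    """Gradient of log pi_theta(a | s) with respect to theta (softmax policy)."""
    p = policy_probs(theta, s)
    grad = -np.outer(s, p)                 # -s * pi(a' | s) for every action a'
    grad[:, a] += s                        # + s for the action actually taken
    return grad


def sample_trajectory(theta):
    """Roll out the policy in a toy environment (assumed dynamics and reward)."""
    states, actions, rewards = [], [], []
    s = rng.normal(size=STATE_DIM)
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
        r = -np.abs(s).sum() + (1.0 if a == 0 else 0.0)           # toy reward
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = 0.9 * s + 0.1 * a + 0.1 * rng.normal(size=STATE_DIM)  # toy dynamics
    return states, actions, rewards


def policy_gradient_estimate(theta, n_trajectories=100):
    """Monte Carlo estimate of grad J(theta): each grad log pi(a_t | s_t)
    is weighted by the reward-to-go (rewards from step t onwards)."""
    grad = np.zeros_like(theta)
    for _ in range(n_trajectories):
        states, actions, rewards = sample_trajectory(theta)
        reward_to_go = np.cumsum(rewards[::-1])[::-1]
        for s, a, g in zip(states, actions, reward_to_go):
            grad += grad_log_policy(theta, s, a) * g
    return grad / n_trajectories


theta = np.zeros((STATE_DIM, N_ACTIONS))   # uniform initial policy
print(policy_gradient_estimate(theta))
```

A gradient-ascent loop would then repeatedly call `policy_gradient_estimate` and take a small step `theta += learning_rate * grad`.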