## Resources

- Policy gradient methods for robotics
- arg min blog: a blog of minimum value

## Notation

variable | dimension | name |
---|---|---|
$\tau = (s_0, a_0, \ldots, s_H, a_H)$ | | trajectory |
$P(s_{t+1} \mid s_t, a_t)$ | | state transition dynamical model |
$\pi_\theta(a \mid s)$ | | policy |
$P(\tau; \theta)$ | | probability of a trajectory |
$R(\tau)$ | | utility of a trajectory |

## Policy gradient likelihood ratio

We seek to maximise the expected reward:

$$U(\theta) = \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t, a_t)\,;\, \pi_\theta\Big] = \sum_{\tau} P(\tau; \theta)\, R(\tau)$$

We can derive an expression for the gradient of the above objective which
is **independent** of the state transition model and depends only
on the policy. Applying the likelihood ratio trick, $\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)$, gives

$$\nabla_\theta U(\theta) = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)$$

and since $P(\tau;\theta) = \prod_{t=0}^{H} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$, the dynamics terms are constant in $\theta$ and vanish under the gradient:

$$\nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
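As a quick numerical sanity check of the score function $\nabla_\theta \log \pi_\theta$, the sketch below (assuming a softmax policy parameterisation, which the notes do not specify) compares the analytic gradient against a central finite difference:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_softmax(theta, a):
    """grad_theta log softmax(theta)[a] = e_a - softmax(theta)."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.array([0.5, -0.3, 1.2])
a = 2
analytic = grad_log_softmax(theta, a)

# Central finite-difference approximation of the same gradient.
eps = 1e-5
fd = np.array([
    (np.log(softmax(theta + eps * np.eye(3)[i])[a])
     - np.log(softmax(theta - eps * np.eye(3)[i])[a])) / (2 * eps)
    for i in range(3)
])
```

The two gradients agree to numerical precision, confirming that differentiating the log-probability of the chosen action is straightforward for this parameterisation.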

As this is an expectation over trajectories, we can approximate the gradient by sampling $m$ trajectories $\tau^{(1)}, \ldots, \tau^{(m)}$ from a fixed policy and state transition function:

$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta)\, R(\tau^{(i)}) = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, R(\tau^{(i)})$$
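A minimal sketch of this Monte Carlo estimator, assuming a hypothetical tabular toy MDP and a softmax policy (neither is specified in the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP for illustration: 2 states, 2 actions.
n_states, n_actions, horizon = 2, 2, 5
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.standard_normal((n_states, n_actions))                    # reward R(s, a)

def policy_probs(theta, s):
    """Tabular softmax policy pi_theta(a | s)."""
    z = theta[s] - theta[s].max()
    e = np.exp(z)
    return e / e.sum()

def sample_trajectory(theta):
    """Roll out one trajectory tau = ((s_0, a_0), ..., (s_H, a_H))."""
    s, traj, ret = 0, [], 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy_probs(theta, s))
        traj.append((s, a))
        ret += R[s, a]
        s = rng.choice(n_states, p=P[s, a])
    return traj, ret

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a | s) for the softmax parameterisation."""
    g = np.zeros_like(theta)
    g[s] = -policy_probs(theta, s)
    g[s, a] += 1.0
    return g

def policy_gradient_estimate(theta, m=500):
    """Likelihood-ratio estimate: (1/m) sum_i grad log P(tau_i; theta) R(tau_i)."""
    g_hat = np.zeros_like(theta)
    for _ in range(m):
        traj, ret = sample_trajectory(theta)
        # The dynamics P never enter: only grad log pi terms survive.
        g_hat += sum(grad_log_pi(theta, s, a) for s, a in traj) * ret
    return g_hat / m

theta = np.zeros((n_states, n_actions))
g_hat = policy_gradient_estimate(theta)
```

Note that `policy_gradient_estimate` uses the sampled transitions only through the visited `(s, a)` pairs; the transition probabilities themselves are never differentiated, mirroring the derivation above.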

Each trajectory's gradient term is weighted by the reward of the full trajectory. A key observation is that an action taken at time $t$ cannot influence rewards received before time $t$, so those past rewards contribute only noise to the estimate. Replacing the full-trajectory reward with the reward-to-go gives

$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H} R(s_k^{(i)}, a_k^{(i)}) \right)$$

This has the effect of reducing the variance of the gradient estimate without introducing bias.
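The reward-to-go weights can be computed in one pass with a reversed cumulative sum; a small sketch (the helper name is ours, not from the notes):

```python
import numpy as np

def reward_to_go(rewards):
    """rtg[t] = sum of rewards from time t to the end of the trajectory."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, -1.0])
print(reward_to_go(rewards))  # [ 2.  1.  1. -1.]
```

In the sampled-gradient estimator, each term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is then multiplied by `reward_to_go(rewards)[t]` instead of the constant full-trajectory reward.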