## Notation

| variable | dimension / definition | name |
| --- | --- | --- |
| $\tau$ | $s_0, u_0, \dots, s_{H-1}, u_{H-1}$ | trajectory |
| $p(s_{t+1} \mid s_t, u_t)$ | $\mathbb{R}$ | state transition dynamical model |
| $\pi_{\theta}(u_t \mid s_t)$ | $\mathbb{R}$ | policy |
| $P_{\theta}(\tau)$ | $p(s_0)\prod_{t=0}^{H-1} p(s_{t+1} \mid s_t, u_t)\, \pi_{\theta}(u_t \mid s_t)$ | probability of a trajectory |
| $R(\tau)$ | $\sum_{t=0}^{H-1} r(s_t, u_t)$ | utility of a trajectory |

## Policy gradient likelihood ratio

We seek to maximise the expected reward:
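
$$\max_{\theta} U(\theta) = \max_{\theta} \mathbb{E}_{\tau \sim P_{\theta}(\tau)}\left[ R(\tau) \right] = \max_{\theta} \sum_{\tau} P_{\theta}(\tau)\, R(\tau)$$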

We can derive an expression for the gradient of the above objective function that is independent of the state transition model $p(s_{t+1}|s_t, u_t)$ and depends only on the policy:
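
$$\nabla_{\theta} U(\theta) = \sum_{\tau} \nabla_{\theta} P_{\theta}(\tau)\, R(\tau) = \sum_{\tau} P_{\theta}(\tau)\, \nabla_{\theta} \log P_{\theta}(\tau)\, R(\tau) = \mathbb{E}_{\tau \sim P_{\theta}(\tau)}\left[ \nabla_{\theta} \log P_{\theta}(\tau)\, R(\tau) \right]$$

Since $\log P_{\theta}(\tau) = \log p(s_0) + \sum_{t=0}^{H-1} \log p(s_{t+1}|s_t, u_t) + \sum_{t=0}^{H-1} \log \pi_{\theta}(u_t|s_t)$ and only the last sum depends on $\theta$, the dynamics terms vanish under $\nabla_{\theta}$:

$$\nabla_{\theta} U(\theta) = \mathbb{E}_{\tau \sim P_{\theta}(\tau)}\left[ \left( \sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(u_t|s_t) \right) R(\tau) \right]$$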

As this is an expectation over trajectories, we can approximate the gradient by sampling $m$ trajectories $\tau^{(1)}, \dots, \tau^{(m)}$ under a fixed policy and the state transition function:
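
$$\nabla_{\theta} U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(u_t^{(i)}|s_t^{(i)}) \right) R(\tau^{(i)})$$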

Each trajectory's gradient is weighted by the reward of the full trajectory. An observation is that rewards obtained before time $t$ do not depend on the action taken at time $t$; those terms contribute nothing in expectation, so each $\nabla_{\theta} \log \pi_{\theta}(u_t^{(i)}|s_t^{(i)})$ can instead be weighted only by the reward to go:
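
$$\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(u_t^{(i)}|s_t^{(i)}) \left( \sum_{k=t}^{H-1} r(s_k^{(i)}, u_k^{(i)}) \right)$$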

This has the effect of reducing the variance of the gradient estimate.
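
A minimal sketch of this reward-to-go estimator, assuming a tabular softmax policy on a randomly generated toy MDP (the sizes `n_states`, `n_actions`, the dynamics `P`, and the reward table `r` below are hypothetical stand-ins, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: tabular dynamics p(s'|s,u) and rewards r(s,u).
n_states, n_actions, H = 4, 2, 10
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

def policy(theta, s):
    """Tabular softmax policy pi_theta(u|s); theta has shape (n_actions, n_states)."""
    logits = theta[:, s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, u):
    """Analytic gradient of log pi_theta(u|s) for the softmax policy."""
    g = np.zeros_like(theta)
    g[:, s] = -policy(theta, s)
    g[u, s] += 1.0
    return g

def sample_trajectory(theta):
    """Roll out the policy against the (fixed) state transition model."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(H):
        u = rng.choice(n_actions, p=policy(theta, s))
        states.append(s); actions.append(u); rewards.append(r[s, u])
        s = rng.choice(n_states, p=P[s, u])
    return states, actions, np.array(rewards)

def policy_gradient(theta, m=100):
    """Monte Carlo gradient estimate with reward-to-go weighting."""
    grad = np.zeros_like(theta)
    for _ in range(m):
        states, actions, rewards = sample_trajectory(theta)
        reward_to_go = np.cumsum(rewards[::-1])[::-1]  # sum_{k>=t} r(s_k, u_k)
        for t in range(H):
            grad += reward_to_go[t] * grad_log_pi(theta, states[t], actions[t])
    return grad / m

theta = np.zeros((n_actions, n_states))
for _ in range(50):
    theta += 0.01 * policy_gradient(theta)  # gradient ascent on U(theta)
```

Replacing `reward_to_go[t]` with `rewards.sum()` recovers the full-trajectory weighting above, which typically gives a noisier estimate of the same gradient.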