## Notation

| Variable | Dimension / definition | Name |
|---|---|---|
| $\tau$ | $s_0, u_0, \dots, s_{H-1}, u_{H-1}$ | trajectory |
| $p(s_{t+1} \mid s_t, u_t)$ | $\mathbb{R}$ | state transition dynamical model |
| $\pi_{\theta}(u_t \mid s_t)$ | $\mathbb{R}$ | policy |
| $P_{\theta}(\tau)$ | $p(s_0)\prod_{t=0}^{H-1} p(s_{t+1} \mid s_t, u_t)\, \pi_{\theta}(u_t \mid s_t)$ | probability of a trajectory |
| $R(\tau)$ | $\sum_{t=0}^{H-1} r(s_t, u_t)$ | utility of a trajectory |

We can derive an expression for the gradient of the above objective function that is independent of the state transition model $p(s_{t+1}|s_t, u_t)$ and depends only on the policy.
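The reason the dynamics drop out can be checked numerically: in $\log P_{\theta}(\tau)$ the transition terms do not depend on $\theta$, so only the policy terms contribute to the gradient. The sketch below is a toy illustration, not part of the original derivation. It assumes a two-state, two-action problem with a Bernoulli policy that has one logit per state; all names (`log_traj_prob`, `score`, `dynamics_a`, `dynamics_b`, and the numbers used) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_policy(theta, s, u):
    """log pi_theta(u | s) for an assumed Bernoulli policy, one logit per state."""
    p1 = sigmoid(theta[s])
    return math.log(p1 if u == 1 else 1.0 - p1)

def log_traj_prob(theta, states, actions, p0, dynamics):
    """log P_theta(tau): log p(s_0) plus, for each t, the log transition
    probability and the log policy probability."""
    total = math.log(p0[states[0]])
    for t, u in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        total += math.log(dynamics[(s, u)][s_next])  # does not depend on theta
        total += log_policy(theta, s, u)             # depends on theta
    return total

def score(theta, states, actions):
    """Analytic gradient of log P_theta(tau): sum over t of
    grad_theta log pi_theta(u_t | s_t). The dynamics are never used."""
    grad = [0.0] * len(theta)
    for t, u in enumerate(actions):
        s = states[t]
        grad[s] += u - sigmoid(theta[s])  # derivative of the Bernoulli log-likelihood
    return grad

def finite_diff_grad(theta, states, actions, p0, dynamics, eps=1e-6):
    """Central finite-difference gradient of log P_theta(tau) w.r.t. theta."""
    grad = []
    for i in range(len(theta)):
        hi = list(theta)
        lo = list(theta)
        hi[i] += eps
        lo[i] -= eps
        diff = (log_traj_prob(hi, states, actions, p0, dynamics)
                - log_traj_prob(lo, states, actions, p0, dynamics))
        grad.append(diff / (2.0 * eps))
    return grad

# One fixed trajectory with H = 3: states s_0..s_3 and actions u_0..u_2.
theta = [0.3, -0.8]
states = [0, 1, 1, 0]
actions = [1, 0, 1]
p0 = [0.6, 0.4]

# Two different (made-up) transition models p(s' | s, u).
dynamics_a = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
              (1, 0): [0.5, 0.5], (1, 1): [0.7, 0.3]}
dynamics_b = {(0, 0): [0.4, 0.6], (0, 1): [0.5, 0.5],
              (1, 0): [0.1, 0.9], (1, 1): [0.3, 0.7]}

grad_a = finite_diff_grad(theta, states, actions, p0, dynamics_a)
grad_b = finite_diff_grad(theta, states, actions, p0, dynamics_b)
grad_policy_only = score(theta, states, actions)

print("finite differences, dynamics A:", grad_a)
print("finite differences, dynamics B:", grad_b)
print("policy-only score function:    ", grad_policy_only)
```

Under these assumptions, the three printed gradients should agree up to finite-difference error, even though the two transition models assign different probabilities to the trajectory itself.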