Talks and tutorials on RL

Value and Q-value recursion

There are two forms in which the expected reward is encoded:

  • v-function: v(s)
  • q-function: q(s, a)

The v-function is the expected reward given a state, whilst the q-function is the expected reward given a state and an action. The recursive form of both functions can be derived from first principles, and it can be shown that the v-function is a function of the q-function.
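For orientation, the two recursions and the link between them can be written as follows. This is only a sketch in the common textbook notation, which is an assumption here and may differ from the notation used in the PDFs: \pi(a|s) is the policy, p(s', r | s, a) the environment dynamics, \gamma the discount factor and G_t the discounted return.

```latex
\begin{align*}
v_\pi(s)   &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
            = \sum_a \pi(a \mid s)\, q_\pi(s, a) \\
q_\pi(s,a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]
            = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right] \\
v_\pi(s)   &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right] \\
q_\pi(s,a) &= \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Big]
\end{align*}
```

The first line is the link between the two forms; substituting each definition into the other gives the two recursions in the last two lines.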

See RVQ.pdf for the derivation of the recursion and the link between the two functional forms.

See RL_Solutions_Chap3.pdf for the effect of sign and constants in the reward function.
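One well-known instance of this effect can be checked numerically. The sketch below is not taken from the PDF: it assumes a continuing task with discount factor gamma and a hypothetical 3-state Markov chain (the matrix `P` and reward vector `r` are made up for illustration). Under those assumptions, adding a constant c to every reward shifts every state value by c / (1 - gamma), while flipping the sign of the rewards flips the sign of the values.

```python
import numpy as np

def evaluate_policy(P, r, gamma):
    """Solve the linear system v = r + gamma * P v for a fixed policy."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)

# Hypothetical state-transition matrix induced by some fixed policy
# (each row sums to 1) and expected one-step reward per state.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
r = np.array([1.0, -2.0, 0.5])
gamma = 0.9
c = 3.0

v = evaluate_policy(P, r, gamma)
v_shifted = evaluate_policy(P, r + c, gamma)   # constant added to every reward
v_negated = evaluate_policy(P, -r, gamma)      # sign of every reward flipped

print("shift per state:", v_shifted - v)       # same for every state
print("c / (1 - gamma):", c / (1 - gamma))
print("negated values :", v_negated)
```

Because the shift is identical for every state, the relative ordering of states is unchanged in this continuing setting; episodic tasks behave differently because the number of remaining steps varies.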

Policy Gradient Theorem

We want to find an expression for the policy gradient that uses an estimator of the expected reward, such as the action-value or advantage function.

"Policy Gradient Methods for Reinforcement Learning with Function Approximation" proves that the policy gradient can be derived when using a function approximator for either the action-value or the advantage function.
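As a reminder of the form the result takes, here is a sketch in common textbook notation. The notation is an assumption and may differ from the paper's: J(\theta) is the performance objective, d^\pi(s) the state distribution under the policy, and A_\pi = q_\pi - v_\pi the advantage function.

```latex
\begin{align*}
\nabla_\theta J(\theta)
  &\propto \sum_s d^\pi(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s; \theta) \\
  &= \mathbb{E}_{S \sim d^\pi,\, A \sim \pi}\left[ q_\pi(S, A)\, \nabla_\theta \log \pi(A \mid S; \theta) \right] \\
  &= \mathbb{E}_{S \sim d^\pi,\, A \sim \pi}\left[ A_\pi(S, A)\, \nabla_\theta \log \pi(A \mid S; \theta) \right]
\end{align*}
```

The last step holds because \sum_a v_\pi(s)\, \nabla_\theta \pi(a \mid s; \theta) = v_\pi(s)\, \nabla_\theta \sum_a \pi(a \mid s; \theta) = 0, so subtracting a state-dependent baseline leaves the gradient unchanged.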

The key is to be able to find an unbiased estimate of the gradient.
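What "unbiased" means here can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the method from the paper: it uses a hypothetical 3-armed bandit (a single state) with a softmax policy and made-up action values `q`, and compares the exact gradient with a Monte Carlo average of the score-function estimate, with and without a constant baseline.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])   # hypothetical policy parameters
q = np.array([1.0, 2.0, 0.5])        # hypothetical true action values

pi = softmax(theta)
J = pi @ q                           # objective: expected reward under pi
exact_grad = pi * (q - J)            # analytic gradient of J w.r.t. theta

# Sample actions from the policy and observe noisy rewards with mean q[a].
n = 200_000
actions = rng.choice(3, size=n, p=pi)
rewards = q[actions] + rng.normal(size=n)

# For a softmax policy, grad log pi(a) = onehot(a) - pi.
score = np.eye(3)[actions] - pi

# Score-function estimate: average of reward * grad log pi(a).
estimate = (score * rewards[:, None]).mean(axis=0)

# Same estimate with a constant baseline subtracted from the reward.
# J is used for illustration; any constant keeps the estimate unbiased.
estimate_baseline = (score * (rewards - J)[:, None]).mean(axis=0)

print("exact        :", exact_grad)
print("no baseline  :", estimate)
print("with baseline:", estimate_baseline)
```

Both averages approach the exact gradient as the number of samples grows; the baseline changes only the variance of the estimate, which is why the advantage function is a popular choice.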