Unlike the action-value methods of the preceding chapters, policy-based methods use a parameterized function to represent the agent's policy directly, written $\pi(a|s,\bold{w})$ with weight vector $\bold{w}$. Recall the preference function in the bandit setting, where an action is selected according to the soft-max over the preferences of all actions; policy-based methods extend this idea to the full MDP setting and optimize the parameters by stochastic gradient ascent. First, we define

$$ J(\bold{w})=v_{\pi_\bold{w}}(s_0) $$

as the optimization objective of policy-based methods, where $s_0$ is the start state of the episode. Gradient ascent on this objective gives the update

$$ \bold{w}_{t+1}\leftarrow\bold{w}_t+\alpha\nabla J(\bold{w}_t) $$

Now we need an explicit expression for this gradient. The policy gradient theorem states that

$$ \nabla J(\bold{w})\propto \sum_s\mu(s)\sum_a q_\pi(s,a)\nabla\pi(a|s,\bold{w}) $$

where $\mu$ is the on-policy state distribution, with $\eta(s)$ denoting the expected number of time steps spent in state $s$ during an episode:

$$ \mu(s)=\frac{\eta(s)}{\sum_{s'}\eta(s')} $$

The complete proof, covering both the episodic and continuing cases, can be found in Sutton and Barto's book.
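To turn this into something we can sample, rewrite the double sum as an expectation over the on-policy distribution and apply the identity $\nabla\pi=\pi\nabla\ln\pi$ (the log-derivative trick):

$$ \nabla J(\bold{w})\propto\mathbb{E}_\pi[q_\pi(s_t,a_t)\nabla\ln\pi(a_t|s_t,\bold{w})] $$

Since $\mathbb{E}_\pi[G_t\mid s_t,a_t]=q_\pi(s_t,a_t)$, the return observed from time $t$ onward is an unbiased stand-in for $q_\pi$ inside this expectation.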

Now we introduce REINFORCE, a Monte Carlo policy-gradient method that replaces $q_\pi$ with the sampled return $G_t$:

$$ \nabla J(\bold{w})\propto\mathbb{E}_\pi[(G_t-b(s_t))\nabla\ln\pi(a_t|s_t,\bold{w})] $$

where $b(s_t)$ denotes a baseline subtracted to reduce the variance of the estimate; any function of the state alone leaves the gradient unbiased.
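As a concrete illustration, here is a minimal PyTorch sketch of one REINFORCE update from a single completed episode, using the episode's mean return as a simple baseline. The network architecture, sizes, and learning rate are illustrative choices, not part of the algorithm itself.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative sizes and hyperparameters (placeholders, not from the text).
obs_dim, n_actions, gamma = 4, 2, 0.99

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, rewards):
    """One gradient-ascent step on J(w) from a single completed episode."""
    # Monte Carlo returns G_t, computed backwards from the end of the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # Baseline b: here simply the mean return of the episode.
    # Subtracting it lowers variance without biasing the gradient.
    baseline = returns.mean()

    logits = policy(torch.as_tensor(states, dtype=torch.float32))
    log_pi = Categorical(logits=logits).log_prob(torch.as_tensor(actions))

    # Minimizing -(G_t - b) * ln pi(a_t|s_t, w) is gradient ascent on J(w).
    loss = -((returns - baseline) * log_pi).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```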

From that, we can introduce REINFORCE+ (essentially REINFORCE with a learned state-value baseline), which uses the advantage in the update:

$$ A(s_t,a_t)=G_t-v(s_t) $$

$$ \bold{w}_{t+1}\leftarrow\bold{w}_t+\alpha\,A(s_t,a_t)\nabla\ln\pi(a_t|s_t,\bold{w}_t) $$
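The learned baseline needs its own update. Writing the value estimate as $v(s_t;\phi)$ with a separate parameter vector $\phi$, it is fit to the Monte Carlo return by a parallel gradient step, as in Sutton and Barto's REINFORCE with baseline:

$$ \phi_{t+1}\leftarrow\phi_t+\alpha^\phi(G_t-v(s_t;\phi_t))\nabla v(s_t;\phi_t) $$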

This method uses a second parameterized function to approximate the state value. Pushing the idea further, we can replace the Monte Carlo return with a bootstrapped TD target and train the value function by minimizing the squared TD error:

$$ L(\phi)=\left(r_t+\gamma V(s_{t+1};\phi)-V(s_t;\phi)\right)^2 $$

Methods built this way are called actor-critic methods: the policy $\pi(a|s,\bold{w})$ is the actor, and the value function $V(s;\phi)$ is the critic, whose TD error $\delta_t=r_t+\gamma V(s_{t+1};\phi)-V(s_t;\phi)$ serves as a one-step estimate of the advantage.
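A minimal one-step actor-critic sketch in PyTorch, with placeholder sizes and hyperparameters: the TD error drives both the critic's regression and the actor's policy-gradient step.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99  # placeholders

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def actor_critic_step(s, a, r, s_next, done):
    """One update from a single transition (s, a, r, s'), bootstrapping from the critic."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    v_s = critic(s).squeeze(-1)
    with torch.no_grad():                     # the bootstrap target is held fixed
        v_next = 0.0 if done else critic(s_next).squeeze(-1)
        target = r + gamma * v_next

    td_error = target - v_s                   # delta; also the advantage estimate
    critic_loss = td_error.pow(2)             # L(phi) from the equation above
    log_pi = Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -td_error.detach() * log_pi  # gradient ascent on delta * ln pi

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```

In practice the actor and critic often share lower layers and the target uses n-step returns; the two networks are kept separate here for clarity.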

Using this advantage estimate to update the actor gives A2C (Advantage Actor-Critic); running multiple workers in parallel, each collecting experience with its own copy of the network and asynchronously pushing gradients back to a shared one, gives A3C (Asynchronous Advantage Actor-Critic).
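To give a flavor of the asynchronous part, here is a toy skeleton of A3C's parameter-sharing pattern only: random tensors stand in for real environment rollouts and the actor-critic loss, and threads stand in for the processes the original paper uses.

```python
import threading
import torch
import torch.nn as nn

# Toy skeleton: each worker computes gradients on its own data and applies them
# to one shared model. Real A3C runs one environment copy per worker and uses
# the actor-critic loss above instead of this dummy objective.
shared_model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(shared_model.parameters(), lr=1e-3)
lock = threading.Lock()
params = list(shared_model.parameters())

def worker(steps=10):
    for _ in range(steps):
        x = torch.randn(8, 4)                 # stand-in for a batch of states
        loss = shared_model(x).pow(2).mean()  # stand-in for the actor-critic loss
        grads = torch.autograd.grad(loss, params)
        with lock:                            # lock-free ("hogwild") updates also work in practice
            for p, g in zip(params, grads):
                p.grad = g
            optimizer.step()
            optimizer.zero_grad()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```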