Unlike the action-value methods of the preceding chapters, policy-based methods use a parameterized function to approximate the agent's policy directly. Recall the preference function in the bandit setting, where an action is selected according to the soft-max over the action preferences; we now extend that idea beyond bandits and again optimize the parameters with stochastic gradient methods (a concrete sketch of such a soft-max policy follows the update rule below). First, we define
$$ J(\pi)=v_\pi(s_0) $$
as the optimization goal of policy-based methods, where $s_0$ is the start state and $\pi$ is the policy parameterized by $\bold{w}$ (all gradients below are taken with respect to $\bold{w}$). The gradient-ascent update is then
$$ \bold{w}_{t+1}\leftarrow\bold{w}_t+\alpha\nabla J(\pi) $$
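To make the parameterization concrete, here is a minimal sketch of a soft-max policy with a linear preference function over one-hot state-action features; the toy sizes, the feature choice, and all names below are illustrative assumptions, not fixed by the text.

```python
import numpy as np

# Toy setup: 3 states, 2 actions, linear preferences h(s, a, w) = w . x(s, a)
# with one-hot state-action features (an illustrative choice, not from the text).
N_STATES, N_ACTIONS = 3, 2

def features(s, a):
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def softmax_policy(w, s):
    """pi(a|s): soft-max over the action preferences in state s."""
    prefs = np.array([w @ features(s, a) for a in range(N_ACTIONS)])
    prefs -= prefs.max()                      # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

w = np.zeros(N_STATES * N_ACTIONS)            # policy parameters
print(softmax_policy(w, 0))                   # uniform at initialization: [0.5 0.5]

# The update w_{t+1} <- w_t + alpha * grad J(pi) then moves w along an
# estimate of the policy gradient, e.g. w += alpha * grad_J.
```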
Now we need an explicit expression for this gradient. The policy gradient theorem gives
$$ \nabla J(\pi)\propto \sum_s\mu(s)\sum_a q_\pi(s,a)\nabla\pi(a|s) $$
where $\mu$ is the on-policy state distribution,
$$ \mu(s)=\frac{\eta(s)}{\sum_{s'}\eta(s')}, $$
and $\eta(s)$ is the expected number of visits to state $s$ in an episode.
The complete proof, for both the episodic and continuing cases, can be found in Sutton and Barto's book.
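Continuing the toy setup above, the sketch below evaluates this sum directly for the linear soft-max policy, using the analytic gradient $\nabla\ln\pi(a|s)=x(s,a)-\sum_b\pi(b|s)x(s,b)$; the values of $\mu$ and $q_\pi$ are made-up placeholders standing in for quantities computed elsewhere.

```python
def grad_log_softmax(w, s, a):
    """grad_w ln pi(a|s) = x(s, a) - sum_b pi(b|s) x(s, b) for the linear soft-max policy."""
    pi_s = softmax_policy(w, s)
    expected_x = sum(pi_s[b] * features(s, b) for b in range(N_ACTIONS))
    return features(s, a) - expected_x

def policy_gradient(w, mu, q):
    """sum_s mu(s) sum_a q(s, a) grad pi(a|s), using grad pi = pi * grad ln pi."""
    g = np.zeros_like(w)
    for s in range(N_STATES):
        pi_s = softmax_policy(w, s)
        for a in range(N_ACTIONS):
            g += mu[s] * q[s, a] * pi_s[a] * grad_log_softmax(w, s, a)
    return g

# Placeholders standing in for quantities computed elsewhere (e.g. by policy evaluation):
mu = np.array([0.5, 0.3, 0.2])                # on-policy state distribution
q = np.random.rand(N_STATES, N_ACTIONS)       # q_pi(s, a)
w += 0.1 * policy_gradient(w, mu, q)          # one gradient-ascent step
```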
Now we introduce REINFORCE, a Monte Carlo policy-gradient method built on this theorem:
$$ \nabla J(\pi)\propto\mathbb{E}_\pi[(G_t-b(s_t))\nabla\ln\pi(a_t|s_t)] $$
where $b$ is a baseline that may depend on the state but not on the action; it is added to reduce the variance of the gradient estimate.
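A sample-based version of this estimator, again on the toy setup: the episode data below is made up, and the zero baseline recovers plain REINFORCE.

```python
def reinforce_grad(w, states, actions, returns, baseline):
    """Monte Carlo gradient estimate: sum_t (G_t - b(s_t)) grad ln pi(a_t|s_t)."""
    g = np.zeros_like(w)
    for s, a, G in zip(states, actions, returns):
        g += (G - baseline[s]) * grad_log_softmax(w, s, a)
    return g

# One episode's worth of made-up data: visited states, chosen actions,
# and the return G_t observed from each time step onward.
states  = [0, 1, 2]
actions = [1, 0, 1]
returns = [2.0, 1.5, 1.0]
baseline = np.zeros(N_STATES)                 # b(s) = 0 recovers plain REINFORCE

w += 0.1 * reinforce_grad(w, states, actions, returns, baseline)
```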
From this, we can introduce REINFORCE with baseline, which uses the advantage in its update:
$$ A(s_t,a_t)=G_t-v(s_t) $$
$$ \bold{w}_{t+1}\leftarrow\bold{w}_t+\alpha A(s_t,a_t)\nabla\ln\pi(a_t|s_t) $$
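A minimal sketch of this variant on the same toy data, using a tabular state-value estimate that is itself updated toward the Monte Carlo return; the step sizes are arbitrary choices.

```python
v = np.zeros(N_STATES)                        # learned state-value estimates v(s)
alpha_w, alpha_v = 0.1, 0.1                   # step sizes (arbitrary choices)

for s, a, G in zip(states, actions, returns):
    advantage = G - v[s]                      # A(s_t, a_t) = G_t - v(s_t)
    w += alpha_w * advantage * grad_log_softmax(w, s, a)
    v[s] += alpha_v * (G - v[s])              # move v(s_t) toward the observed return
```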
This variant relies on a learned approximation of the state value. Inspired by it, we can also bring TD methods into policy optimization by training the value function on a one-step bootstrapped target:
$$ L(\phi)=\bigl(r_t+\gamma V(s_{t+1};\phi)-V(s_t;\phi)\bigr)^2 $$
Methods that pair a learned policy (the actor) with a learned value function (the critic) in this way are called actor-critic methods; the TD error that trains the critic also replaces the Monte Carlo return in the actor's update.
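One possible realization in code, as a PyTorch sketch for a single transition; the network sizes, the `tanh` activations, and the detached TD target are implementation assumptions, not prescribed by the text.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99        # assumed problem sizes

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def actor_critic_losses(s, a, r, s_next):
    """One-step actor-critic losses for a single transition (s, a, r, s')."""
    v_s = critic(s)
    with torch.no_grad():                     # treat the TD target as fixed
        target = r + gamma * critic(s_next)
    td_error = target - v_s
    critic_loss = td_error.pow(2).mean()      # L(phi) = (r + gamma V(s'; phi) - V(s; phi))^2

    log_pi = torch.log_softmax(actor(s), dim=-1)[0, a]
    actor_loss = -(td_error.detach() * log_pi).mean()   # ascend along delta * grad ln pi
    return actor_loss, critic_loss

# Usage with made-up tensors for a single transition:
s, s_next = torch.randn(1, obs_dim), torch.randn(1, obs_dim)
actor_loss, critic_loss = actor_critic_losses(s, a=1, r=0.5, s_next=s_next)
(actor_loss + critic_loss).backward()         # gradients for both networks
```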
Using the advantage in the actor's update gives Advantage Actor-Critic (A2C); running multiple parallel workers, each collecting experience with its own copy of the network and asynchronously pushing gradients to shared parameters, gives Asynchronous Advantage Actor-Critic (A3C).
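For completeness, here is a sketch of a batched A2C-style objective built from the same actor and critic as above; the entropy bonus and the loss weights are common additions in practical implementations rather than something derived in this section.

```python
def a2c_loss(states, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """Batched A2C loss: policy term + value term - entropy bonus.

    states: (B, obs_dim) float tensor, actions: (B,) long tensor, returns: (B,) float tensor.
    """
    values = critic(states).squeeze(-1)                   # V(s_t; phi)
    advantages = returns - values.detach()                # A_t, with the critic held fixed

    log_probs = torch.log_softmax(actor(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    policy_loss = -(advantages * chosen).mean()
    value_loss = (returns - values).pow(2).mean()
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```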