Advanced Reinforcement Learning
1.1 Actor-Critic Reinforcement Learning
Reinforcement Learning learns a policy from interaction with an environment: the agent receives reward signals from the environment and adjusts its model accordingly so as to maximize the expected return, thereby forming a policy. Value-based Reinforcement Learning defines value functions that evaluate the states and actions of a Markov decision process; the policy is then typically obtained by acting greedily with respect to these value functions. Policy-based Reinforcement Learning algorithms, in contrast, model the policy explicitly and optimize it directly, for example by gradient ascent on the expected return.
A third family of algorithms, the Actor-Critic architectures, combines the advantages of both Value-based and Policy-based methods. In Actor-Critic, we define a critic

$Q_w(s, a) \approx Q^{\pi_\theta}(s, a),$

where the critic, with parameters $w$, approximates the state-action value function of the current policy $\pi_\theta$.
The actor-critic algorithm follows an approximate policy gradient; the actor network can be updated by

$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)\right], \qquad \theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a).$
The critic network is trained to fit its value function; taking the Q-function as an example, the critic can be updated from the temporal-difference error

$\delta = r + \gamma\, Q_w(s', a') - Q_w(s, a), \qquad w \leftarrow w + \eta\, \delta\, \nabla_w Q_w(s, a).$
Instead of letting the critic estimate the state-action value function, we can alternatively let it estimate the advantage function

$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s),$

which measures how much better an action is than the policy's average behaviour in that state and yields lower-variance policy-gradient updates.
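As a concrete illustration, below is a minimal sketch of one advantage actor-critic update step in PyTorch. The network sizes, the single-transition update, and the use of a learned state-value function V(s) with the TD error as the advantage estimate are assumptions made for illustration, not a fixed prescription.

import torch
import torch.nn as nn

# Illustrative dimensions; replace with those of the actual environment.
obs_dim, n_actions, gamma = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ac_update(s, a, r, s_next, done):
    """One actor-critic step on a single transition (s, a, r, s')."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    v = critic(s)
    v_next = critic(s_next).detach()
    # TD target and advantage estimate: A(s, a) ~ r + gamma * V(s') - V(s)
    target = r + gamma * (1.0 - float(done)) * v_next
    advantage = (target - v).detach()
    # Critic update: regress V(s) toward the TD target.
    critic_loss = (target - v).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor update: policy-gradient step weighted by the advantage.
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

In practice the same update is applied over batches of transitions rather than a single one, but the structure of the two losses is unchanged.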
1.2 Proximal Policy Optimization with AC-based Advantage Function
PPO is the baseline policy-optimization algorithm used at OpenAI. It brings a form of off-policy sample reuse to policy gradient algorithms: data collected with an older policy can be used for several gradient steps by means of importance sampling. In plain policy gradient, we update the policy network by computing the gradient of the expected reward with respect to the policy parameters,

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right].$
In the PPO algorithm, instead of sampling trajectories from the current policy $\pi_\theta$, we sample them from an older policy $\pi_{\theta'}$ and correct for the distribution mismatch with an importance-sampling ratio:

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\!\left[ \frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right].$
The reward $R(\tau)$ is replaced by the advantage function $A^{\theta'}(s_t, a_t)$ estimated by the critic (the AC-based advantage function of the section title), which lowers the variance of the gradient estimate. Working per time step, this yields the surrogate objective

$J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t) \right].$
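As a sketch, this surrogate objective can be computed as follows in PyTorch, assuming the log-probabilities of the sampled actions under the old and current policies and the advantage estimates are already available as tensors (all names here are illustrative):

import torch

def surrogate_objective(logp_new, logp_old, advantages):
    """Importance-sampled surrogate objective J^{theta'}(theta)."""
    # Probability ratio pi_theta(a|s) / pi_theta'(a|s), computed in log space.
    ratio = torch.exp(logp_new - logp_old.detach())
    return (ratio * advantages.detach()).mean()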
1.2.1 PPO KL Regularization
In addition, we need to add a regularization term (or, in TRPO, a constraint) to the objective in order to keep the new policy distribution $\pi_\theta$ close to the old one $\pi_{\theta'}$. The objective therefore becomes

$J_{\mathrm{PPO}}(\theta) = J^{\theta'}(\theta) - \beta\, \mathrm{KL}\!\left(\pi_{\theta'},\, \pi_\theta\right),$

where $\beta$ is a penalty coefficient that can be adapted during training.
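Under the same assumptions as the sketch above, the KL-regularized objective can be written as a loss to be minimized; the sample-based estimate E[log pi_theta'(a|s) - log pi_theta(a|s)] is used here as a stand-in for the KL term, and beta is the penalty coefficient:

import torch

def ppo_kl_loss(logp_new, logp_old, advantages, beta=1.0):
    """KL-regularized PPO objective, negated so it can be minimized."""
    ratio = torch.exp(logp_new - logp_old.detach())
    surrogate = (ratio * advantages.detach()).mean()
    # Monte-Carlo estimate of KL(pi_theta' || pi_theta) from the sampled actions.
    approx_kl = (logp_old.detach() - logp_new).mean()
    return -(surrogate - beta * approx_kl)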
1.2.2 PPO Clip Method
The clip method drops the KL term and instead clips the probability ratio directly inside the objective:

$J^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[ \min\!\Big( r_t(\theta)\, A^{\theta'}(s_t, a_t),\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A^{\theta'}(s_t, a_t) \Big) \right],$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta'}(a_t \mid s_t)$ is the probability ratio, $\epsilon$ is a small clipping hyperparameter (e.g. 0.1 or 0.2), and $\mathrm{clip}$ truncates the ratio to the interval $[1-\epsilon,\, 1+\epsilon]$, so the objective gains nothing from moving the new policy far from the old one.
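A matching sketch of the clipped objective as a loss, under the same assumed tensor inputs as before:

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized."""
    ratio = torch.exp(logp_new - logp_old.detach())
    adv = advantages.detach()
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic (elementwise minimum) of the unclipped and clipped surrogates.
    return -torch.min(ratio * adv, clipped_ratio * adv).mean()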