Advanced Reinforcement Learning

1.1 Actor-Critic Reinforcement Learning

Reinforcement Learning learns a policy from experience gathered in an environment: the agent receives reward signals from the environment and adjusts its model accordingly so as to maximize the expected reward, thereby forming a policy. Value-based Reinforcement Learning defines value functions that evaluate the states and actions of a Markov decision process; the policy is then typically obtained by acting greedily with respect to the learned value function. Policy-based Reinforcement Learning algorithms, in contrast, model the policy explicitly and optimize it directly with methods such as gradient ascent.

A third group of algorithms, the Actor-Critic architectures, integrates the advantages of both value-based and policy-based methods. In Actor-Critic, we define a critic:

\[Q_w(s,a) \approx Q^{\pi_\theta}(s,a)\]

Here the critic approximates the state-action value function \(Q(s,a)\) and the actor approximates the policy \(\pi\); they are parameterized by \(w\) and \(\theta\) respectively.

Actor-Critic algorithms follow an approximate policy gradient; the actor network can be updated by:

\[\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \, Q_w(s,a)] \\ \Delta\theta = \alpha \, \nabla_\theta \log \pi_\theta(s,a) \, Q_w(s,a) \\ \theta_{t+1} \leftarrow \theta_t + \Delta\theta_t\]
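As a concrete illustration, here is a minimal PyTorch sketch of this actor update for discrete actions. The networks `actor` and `critic_q` and the optimizer `actor_opt` are assumed to exist; their names are illustrative, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def actor_update(actor, actor_opt, critic_q, states, actions):
    """One gradient-ascent step on E[log pi_theta(a|s) * Q_w(s,a)].

    actor    : network mapping states -> action logits (discrete actions)
    critic_q : network mapping states -> Q-values per action, parameters w
    """
    logits = actor(states)                                         # (B, num_actions)
    log_probs = F.log_softmax(logits, dim=-1)                      # log pi_theta(.|s)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a|s)

    with torch.no_grad():                                          # critic held fixed here
        q = critic_q(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q_w(s,a)

    loss = -(log_pi * q).mean()   # minimizing -J performs gradient ascent on J
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```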

The critic update depends on the chosen critic function; taking the Q-function as an example, the critic network can be updated as:

\[J(w) = \mathbb{E}_{\pi_\theta}\!\left[\mathrm{MSE}\!\left(Q_t,\; R_{t+1} + \gamma \max_a Q_{t+1}\right)\right] \\ \Delta w = \lambda \, \nabla_w J(w) \\ w_{t+1} \leftarrow w_t - \Delta w_t\]
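A matching sketch of this critic step, again with hypothetical `critic_q` and `critic_opt`; the discount factor \(\gamma\) and the done mask are standard details assumed here rather than spelled out in the formula above.

```python
import torch
import torch.nn.functional as F

def critic_update(critic_q, critic_opt, states, actions, rewards,
                  next_states, dones, gamma=0.99):
    """One step of minimizing MSE(Q_t, R_{t+1} + gamma * max_a Q_{t+1})."""
    q_t = critic_q(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q_w(s_t, a_t)

    with torch.no_grad():                                  # bootstrapped target, no gradient
        q_next = critic_q(next_states).max(dim=1).values   # max_a Q_w(s_{t+1}, a)
        target = rewards + gamma * (1.0 - dones) * q_next  # R_{t+1} + gamma * max_a Q_{t+1}

    loss = F.mse_loss(q_t, target)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
```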

Instead of letting the critic estimate the action-value function, we can let it estimate the advantage function \(A(s,a) = Q_w(s,a) - V_v(s)\) to reduce variance. Many other critic function choices exist.
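A short sketch of this advantage estimate, assuming separate hypothetical networks `critic_q` for \(Q_w\) and `critic_v` for \(V_v\):

```python
import torch

def advantage(critic_q, critic_v, states, actions):
    """Estimate A(s,a) = Q_w(s,a) - V_v(s) as a lower-variance critic signal."""
    with torch.no_grad():
        q = critic_q(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q_w(s,a)
        v = critic_v(states).squeeze(-1)                                 # V_v(s)
    return q - v                                                         # A(s,a)
```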

1.2 Proximal Policy Optimization with AC-based Advantage Function

PPO is one of OpenAI's baseline algorithms. It brings off-policy sample reuse to the policy gradient algorithm. In plain policy gradient, we update the policy network by computing the gradient of the expected reward with respect to the policy parameters \(\theta\) and performing gradient ascent to maximize it:

\[\nabla_\theta \bar{R}_\theta = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \approx \frac{1}{N} \sum_{n=1}^N \sum_{t=1}^{T_n} R(\tau^n) \, \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n) \\ \theta \leftarrow \theta + \alpha \nabla_\theta \bar{R}_\theta\]
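For reference, a minimal sketch of this Monte-Carlo policy gradient loss in PyTorch; the `trajectories` structure is illustrative, and minimizing the returned value with any optimizer performs the gradient-ascent update above.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(actor, trajectories):
    """Monte-Carlo policy gradient: (1/N) sum_n sum_t R(tau^n) * log pi_theta(a_t^n | s_t^n).

    trajectories: list of (states, actions, total_return) tuples, one per sampled tau^n.
    """
    losses = []
    for states, actions, total_return in trajectories:
        log_probs = F.log_softmax(actor(states), dim=-1)
        log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        losses.append(-(total_return * log_pi).sum())   # -R(tau) * sum_t log pi_theta(a_t|s_t)
    return torch.stack(losses).mean()                   # minimize this to ascend the objective
```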

In the PPO algorithm, instead of sampling trajectories from the policy \(\pi_\theta\), we sample them from another policy \(\pi_{\theta'}\) in order to increase sample efficiency (reuse experience), and apply importance sampling to correct for the difference between the two distributions.

\[J_{\theta'}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} R(\tau)\right] \\ \nabla_\theta J_{\theta'}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} R(\tau) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\]
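To see why the importance weight corrects the mismatch between the two distributions, here is a self-contained NumPy example that estimates an expectation under a target distribution using only samples from a different behaviour distribution; the Gaussians and the function \(f(x)=x^2\) are illustrative stand-ins, not part of PPO itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution p = N(1, 1), behaviour distribution q = N(0, 1).
def p_pdf(x): return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

x = rng.normal(0.0, 1.0, size=200_000)   # samples from q (the "old" policy)
weights = p_pdf(x) / q_pdf(x)            # importance ratios p(x) / q(x)
estimate = np.mean(weights * x ** 2)     # reweighted estimate of E_p[x^2]
print(estimate)                          # close to the true value 2.0 (variance + mean^2)
```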

The reward \(R(\tau)\) can be replaced by the advantage function \(A^{\theta'}(s_t,a_t)\), where we leverage the Actor-Critic architecture to obtain a more suitable signal to optimize.

\[J_{\theta'}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t,a_t)\right] \\ \nabla_\theta J_{\theta'}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t,a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\]
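A minimal sketch of this surrogate objective, assuming `log_pi_new` and `log_pi_old` are the log-probabilities of the taken actions under \(\pi_\theta\) and \(\pi_{\theta'}\), and `advantages` comes from the critic:

```python
import torch

def surrogate_objective(log_pi_new, log_pi_old, advantages):
    """J_{theta'}(theta) = E[ pi_theta(a|s) / pi_theta'(a|s) * A^{theta'}(s,a) ].

    log_pi_new : log pi_theta(a_t|s_t) under the current policy being trained
    log_pi_old : log pi_theta'(a_t|s_t) recorded when the data was collected
    advantages : A^{theta'}(s_t, a_t) from the critic, treated as constants
    """
    ratio = torch.exp(log_pi_new - log_pi_old.detach())   # pi_theta / pi_theta'
    return (ratio * advantages.detach()).mean()           # maximize this surrogate
```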

1.2.1 PPO KL Regularization

In addition, we need to add a regularization term (or, in TRPO, a constraint) to the objective function to limit the difference between the two distributions, so the objective becomes \(J(\theta) = J_{\theta'}(\theta) - \beta \, KL(\theta,\theta')\). The value of \(\beta\) is set adaptively: when \(KL(\theta,\theta') > KL_{max}\), increase \(\beta\); when \(KL(\theta,\theta') < KL_{min}\), decrease \(\beta\).
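A sketch of this adaptive schedule; the doubling/halving factor and the KL thresholds are illustrative choices, not prescribed by the text.

```python
def update_kl_coef(beta, kl, kl_min, kl_max, factor=2.0):
    """Adapt the KL penalty weight: grow beta when the policies drift too far apart,
    shrink it when they stay too close."""
    if kl > kl_max:
        beta *= factor
    elif kl < kl_min:
        beta /= factor
    return beta

# The penalized objective is then J(theta) = surrogate - beta * kl.
```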

1.2.2 PPO Clip Method

An alternative to the KL penalty is to clip the importance ratio directly inside the objective:

\[J_{clip}(\theta) \approx \sum_{s_t,a_t} \min\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t,a_t), \ \mathrm{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right) A^{\theta'}(s_t,a_t)\right)\]

Here \(\mathrm{clip}\) evaluates to \(1-\epsilon\) if \(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} < 1-\epsilon\), to \(1+\epsilon\) if \(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} > 1+\epsilon\), and to the ratio \(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}\) itself for values in between. This method therefore constrains the difference between the two policy distributions to move only within a range controlled by \(\epsilon\).
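Putting it together, a minimal PyTorch sketch of the clipped objective; \(\epsilon = 0.2\) is a common but illustrative default.

```python
import torch

def ppo_clip_objective(log_pi_new, log_pi_old, advantages, eps=0.2):
    """J_clip: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), averaged over samples."""
    ratio = torch.exp(log_pi_new - log_pi_old.detach())   # pi_theta / pi_theta'
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)    # clip(ratio, 1 - eps, 1 + eps)
    adv = advantages.detach()
    return torch.min(ratio * adv, clipped * adv).mean()   # maximize this objective
```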