Download full version of this notes with more details and images here.

Reinforcement Learning

Model-based RL: Dynamic Programming

In MDP, we want to model the state value function and state-action value functions, besed on which, we can form a strategy that greedly select actions with max state-action value in for each individual state.

We define state value function \(v_\pi\) and stage-action value function \(q_\pi\) for a certain policy \(\pi\) as following:

\[q_\pi(s,a) = \mathbb{E}_{\tau \backsim \pi}[R(\tau) | S_0=s, A_0=a] \\ v_\pi(s) = \mathbb{E}_{a \backsim \tau}[q_\pi(s,a)]\]

The Bellman Expectation Functions provids us a way to iteratively calculate value functions by decompose them into immediate reward plus discounted value of successor state.

\[v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] \\ q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1} | S_t=s,A_t=a)]\]

The Bellman Equations can be solved directly if we have full information of the environment(i.e. we know the state transformation function), in discrete finite state environment: \(v = r + \gamma Pv\)

From which \(v\) and \(r\) are scalars, \(P\) is state transform probability matrix. Solve it directly we get:

\[v = (I-\gamma P)^{-1}r\]

The Bellman Optimally Equation can be then written as:

\[v_*(s) = max_{a}\mathbb{E}[R_s^a + \gamma v_*(s')] = max_{a} R_s^a + \gamma \sum_{s' \in S} p_{ss'}^a v_*(s') \\ q_*(s,a) = R_s^a + \gamma \mathbb{E}[max_{a'} q_*(s',a')] = R_s^a + \gamma \sum_{s' \in S} p_{ss'}^a max_{a'} q_*(s',a')\]

The complexity of solving Bellman Expectation equation is \(O(n^3)\), where \(n\) is the number of states, that means it is hard to solve when having large state space. In such case, we need to use methods like Dynamic Programming, Monte-Carlo Estimation, or Temporal Difference. In other hand, Bellman Optimalityt Equations are non-linear, and has no closed form solution(in general), therefore cannot be directly solved, we need to use other methods.

Dynamic Programming iteratively solves large scale questions by decomposite them into smaller ones, those questions have to be:

  • With Optimal Substructure
  • Overlapping Subproblems

MDP satisfy both propeerties. We can therefore use DP to solve MDP questions, note that DP solutions requires Full Knowledge of the MDP, and are hence Model-Based RL methods.

Policy Iteration

Policy Iteration method evaluate a given policy \(\pi\) by dynamic programming, it iteratively use Bellman Expectations to evaluate the state function of given policy \(\pi\). Specifically for each iteration \(k\):

\[v_{k+1}(s) = \sum_{a \in A} \pi(a|s) (R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s'))\]

To improve the policy, acting greedily with respect to \(v_\pi\): \(\pi' = greedy(v_\pi) = argmax_{a \in A} \ q_\pi(s,a)\)

The algorithm converges to \(v_*(s)\) with greedy policy imrovement, otherwise converges to real \(v_\pi(s)\).

Value Iteration

Based on Principle of Optimality, which states a policy \(\pi(s)\) is an optimal policy on state \(s\) if and only if \(\pi(s)\) achives \(v_\pi(s') = v_*(s')\) for any state \(s'\) that is reachable from \(s\). From which, it implies if we know the solution of \(v_*(s')\), we can figure out the optimal solution to any state \(s\) by One-Step Full Backup.

Formally, if we know the solution to subproblems \(v_*(s')\), the solution \(v_*(s)\) can be found be one-step lookahead:

\[v_*(s) = max_{a \in A} R_s^a + \gamma \sum_{s' \in S} p_{ss'}^a v_*(s')\]

The algorithm converges to \(v_*(s)\).

In summery:

Problem Bellman Equation Method (Algorithm)
Value Function Prediction Bellman Expectation Equation Policy Iteration
Control Bellman Expectation Equation Policy Iteration + Greedy Policy Improvement
Control Bellman Optimality Equation Value Iteration

Asynchronous Dynamic Programming

DP methods described above used synchronous backups, where all states are backed up in parallel. Asynchronous DP backs up states individually, in any order, can significantly reduce computation. It is guaranteed to converge if all states continue to be selected.

Three simple ideas for asynchronous dynamic programming:

  • In-place dynamic programming
  • Prioritised sweeping
  • Real-time dynamic programming

Model-free Value-based Methods

Dynamic Programming RL methods are all model based methods, in which we need specific environment model to excute them, it is common in real-world RL environment that we dont know environment model, Monte-Carlo / Temporal Difference methods provided algorithms that are model-free to predict value functions.


Monte-Carlo(MC) methods learn directly from episodes of experience, instead of evaluate policy by expected return(based on environment knowledge), it uses mean return(based on empirical knowledge) to estimate the value function.

The basic idea is, to evaluate state value for \(s\), the firstt/every time \(t\) when state \(s\) is visited in an episode, increment counter \(N(s) \leftarrow N(s) +1\), increment total return \(S(s) \leftarrow S(s) + G_t\), the estimated state value can be calculated as:

\[V(s) = \frac{S(s)}{N(s)}\]

By law of large numbers, \(V(s) \rightarrow v_\pi(s)\) as \(N(s) \rightarrow \infty\).

By incremental MC updates, the final MC evaluation equation can be written as:

\[V(s_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))\]

Temporal Difference

Temporal Difference(TD) method learns from incomplete eepisodes, in which the agent do not have to wait until finish whole episode to update value function, like did in MC. TD updates value function by leverage the differences between target and estimation in different time step. It uses idea of Bootstrapping with a biased estimation. Less precise than MC, but more convinent and with lower variance.

In Monte-Carlo methods, the value function is updated by:

\[V(s_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))\]

Alternatively in TD method, we update value of \(V(S_t)\) towards estimated return \(R_{t+1} + \gamma V(S_{t+1})\):

\[V(s_t) \leftarrow V(S_t) + \alpha ([R_{t+1} + \gamma V(S_{t+1})] - V(S_t))\]

Where \(R_{t+1} + \gamma V(S_{t+1})\) called TD target, and \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) called TD error.

TD can learn before(or without) knowing the final outcome, whereas in order for MC to learn, we need to wait until the termination of the episode which only works in episodic environments.

On-Policy Value-based Controls

MC based: Greedy Policy Improvements

Evaluate state-action value functions \(q_\pi(s,a)\) instead of state value function \(v_\pi(s)\):

\[q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha (G_t - q(s_t,a_t))\]

Improve policy by \(\epsilon\)-greedily selecting \(q(s,a)\).

TD based: Sarsa

Sarsa algorithm, replace MC by TD in control loop:

\[q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha (R + \gamma q(s_{t+1},a_{t+1}) - q(s_t,a_t))\]

Off-Policy Value-based Controls


  • Learn from observing humans or other agents.
  • Re-use experience generated from old policies \(\pi_1;\pi_2,....,\pi_{t-1}\).
  • Learn about optimal policy while following exploratory policy.
  • Learn about multiple policies while following one policy.

MC based: Importance Sampling

Gater trajectories from another policy distribution to update current distribution using a trick namely:

[Importance Sampling]

Estimate the expectation of distribution Q from P:

\[\mathbb{E}_{X \backsim P}[f(x)] = \sum P(x) f(x) = \sum Q(x) \frac{P(x)}{Q(x)} f(x) = \mathbb{E}_{X \backsim Q}[\frac{P(x)}{Q(x)}f(x)]\]

Define \(G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} ... \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t\), update value towards corrected return: \(V(S_t) \leftarrow V(S_t) + \alpha(G_t^{\pi/\mu} - V(S_t))\)

Note that importance sampling can dramatically increase variance. This mechanism can also be applied to TD:

\[V(S_t) \leftarrow V(S_t) + \alpha( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} (R_{t+1} + \gamma V(S_{t+1})) - V(S_t))\]

TD based: Q-Learning

Instead of using target value based on current policy \(\pi\), the target value in Q-Learning beased on greedy policy over state-action value function:

\[q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha (R + \gamma \ max_a q(s_{t+1},a) - q(s_t,a_t))\]

Model-free Policy Based methods


  • Better convergence properties.
  • Effective in high-dimensional or continuous action spaces.
  • Can learn stochastic policies.


  • Typically converge to a local rather than global optimum.
  • Evaluating a policy is typically inefficient and high variance.

Gradient Based: Policy Gradient

Define that in a MDP environment, total reward gain from a certain policy \(\pi\) can be shown as:

\[\bar{R}_\theta = \sum_\tau R(\tau)p_\theta(\tau) \ \ \ _{(1)} \\ where \ \ p_\theta = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \ given \ s_t) p(s_{t+1}|s_t,a_t) \ \ \ _{(2)}\]

In which policy \(\pi\) is parameterised by \(\theta\), and \(\tau\) represents a single trajectory, \(p_\theta\) is the probability of the trajectory.

Since equation (1) involves reward of trajectory \(R(\tau)\) times the probability of trajectory \(\tau\) following policy \(\theta\), noted as \(p_\theta\), as well as the summing operation \(\sum_\tau\), it can be seen as the expectation of reward for a certain policy:

\[\mathbb{E}_{\tau \backsim p_\theta(\tau)}[R(\tau)]\]

Hence we want to maximize the expectation of reward for a certain policy, to achive this, calculate the gradient of the function and perform gredient acsent:

\[\begin{aligned} \nabla_\theta \mathbb{E}_{\tau \backsim p_\theta(\tau)}[R(\tau)] = \nabla_\theta \int_{\tau_t} R_t \ p_\theta(\tau_t) \ \ d\tau_t \\ = \int_{\tau_t} R_t \ \nabla_\theta \ p_\theta (\tau_t) \ \ d\tau_t \\ = \int_{\tau_t} R_t \ p_\theta(\tau_t) \ \nabla_\theta \ log p_\theta (\tau_t) \ \ d\tau_t \\ = \mathbb{E}_{\tau \backsim p_\theta(\tau)}[R_t \ \nabla_\theta \ log p_\theta (\tau_t)] \end{aligned}\]

\(\mathbb{E}_{\tau \backsim p_\theta(\tau)}[R_t \ \nabla_\theta \ log p_\theta (\tau_t)]\) can be approximated by collecting experience as much as possible and compute the average:

\[\mathbb{E}_{\tau \backsim p_\theta(\tau)}[R(\tau) \ \nabla_\theta \ log p_\theta (\tau)] \approx \frac{1}{N} \sum_{n=1}^N R(\tau^n) \ \nabla_\theta \ log p_\theta (\tau^n)\]

Since we dont have full model for \(p_\theta\), it is not possible to compute equation (2), that is, we dont know \(p(s_{t+1} \ given \ s_t,a_t)\), because this term depends on the environment. Here for gradient acsent with respect to \(\theta\), we only need \(\nabla log \ p_\theta(\tau_t)\) instead of the value of \(log \ p_\theta(\tau_t)\) itself, therefore simply replace \(p_\theta(\tau_t)\) by \(\pi_\theta(a_t \ given \ s_t)\) we get:

\[\mathbb{E}_{\tau \backsim p_\theta(\tau)}[R(\tau) \ \nabla_\theta \ \sum_{t=1}^{T_n} log \ \pi_\theta(a_t | s_t)] \approx \frac{1}{N} \sum_{n=1}^N \sum_{t=1}^{T_n} R(\tau^n) \ \nabla_\theta \ log \ \pi_\theta(a_t^n \ given \ s_t^n)\]

After obtaining the gradient of objective function, policy parameters \(\theta\) are updated by gradient acsent:

\[\theta \leftarrow \theta + \alpha \nabla \bar{R}_\theta \\ where \ \ \nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^N \sum_{t=1}^{T_n} R(\tau^n) \ \nabla_\theta \ log \ \pi_\theta(a_t^n \ given \ s_t^n)\]

Tips1-Add a Baseline

It is possible that in a specific reinforcement learning environment, that \(R(\tau^n)\) is always positive. In this case, we might monotoniclly increase the probability of a certain action, this can be solved by adding a baseline to our equation, so that instead of naively taking rewards feedback from environment, we compare it to the average rewards we have and make the reward be relative to all previous rewards:

\[\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^N \sum_{t=1}^{T_n} (R(\tau^n)-b) \ \nabla_\theta \ log \ \pi_\theta(a_t^n | s_t^n) \\ where \ \ b \approx \mathbb{E}[R(\tau)]\]

In other words, instead of rewarding trajectory by only the environment rewards, we reward a trajectory by how looking at how good this trajectory is, comparing with all other collected trajectories, since all actions in a same trajectory are being weighted by same reward, yet those actions might benefits for different amount.

To address this, we weight the \(a_t\) by the reward obtained from time \(t\), add a discount factor to rewards obtained in later stages.

\[\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^N \sum_{t=1}^{T_n} ([\sum_{t^` = t}^{T_n} \gamma^{(t^`-t)} \cdot r_{t^`}^n ] -b) \ \nabla_\theta \ log \ \pi_\theta(a_t^n | s_t^n) \\ where \ \ b \approx \mathbb{E}[R(\tau)]\]

Tips2-Assign Suitable Credit

The current version of objective function evaluates the whole trajectory by the total rewards obtained from the environment, it is reasonable, but assumes in-precise correlations between each actions in the trajectory.

Actor-Critic: Integrating Value-based & Policy-based

The above PG(Policy Gradient) algorithm is evaluating the policy by MC-style critic(i.e. mean expected reward returned by the environment), in Actor-Critic, we define a critic:

\[Q_w(s,a) \approx Q^{\pi_\theta}(s,a)\]

Where the critic approximates state-action value function \(Q(s,a)\), the actor approximates the policy \(\pi\), there are parameterized by \(w\) and \(\theta\) respectively.

Actor-critic algorithm follow an approximate policy gradient, the actor network can be updated by:

\[\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}[\nabla_\theta log \pi_\theta(s,a) Q_w(s,a)] \\ \nabla \theta = \alpha \nabla_\theta log \pi_\theta(s,a) Q_w(s,a) \\ \theta_{t+1} \leftarrow \theta_t + \lambda \nabla\theta_t\]

The critic network is based on critic functions, here use Q-function as an example, the critic network can be updated as:

\[\nabla_w J(w) \approx \mathbb{E}_{\pi_\theta}[MSE(Q_{t},R_{t+1}+max_a Q_{t+1})] \\ \nabla w = MSE(Q_{t},R_{t+1}+max_a Q_{t+1}) \\ w_{t+1} \leftarrow w_t + \lambda \nabla w_t\]

Instead of leting critic to estimate state-value function, we can allow it to alternatively estimates Advantage function \(A(s,a) = Q_w(s,a) - V_v(s)\) to reduce the variance. There are many alternative critic function choices.

Continuous Action Space

Methods proposed so far only solves for environments that are with discrete action space, so that value functions for each actions or the probability distribution of selecting actions could be computed. However, in real world, most of the problem are with continous action space.

Deep Derministic Policy Gradient

Deep Derministic Policy Gradient is then proposed, it is able to solve RL environments with continuous action space by integrating ideas of DQN and PG. It can be viewed as extended version of DQN that is able to solve problems with continous action space. DDPG algorithm uses Actor-Critic architecture.

In DQN, in order to evaluate value function, we need \(max_aQ_{t+1}\), it is not possible to compute such value in continous action space. Instead of inputing only the state into critic network and obtain the Q-values for all actions, DDPG critic network takes next action computed from actor network as well, and evaluate Q value for this certain action. Updating of DDPG critic network is the same to DQN:

\[\nabla_w J(w) \approx \mathbb{E}_{\pi_\theta}[MSE(Q_{t},R_{t+1}+ Q_{t+1}^{a \backsim Actor})] \\ \nabla w = MSE(Q_{t},R_{t+1}+Q_{t+1}^{a \backsim Actor}) \\ w_{t+1} \leftarrow w_t - \lambda \nabla w_t\]

Intuitively, in DDPG, the actor network performs differently as the one in PG, it is not possible for it to compute probalibity distributions for all actions in continous action space, therefore we alter the network to output a certain action that could be with max Q value. The target of the actor network in DDPG is to maximize the value of \(Q_t(s,a)\) evaluated by critic network, therefore to update actor network, we use gradient acsent:

\[pg = \frac{\partial Q(s,\pi(s;\theta),w)}{\partial \theta} = \frac{\partial Q(s,a,w)}{\partial a} \cdot \frac{\partial a}{\partial \theta} \\ \theta \leftarrow \theta + \bar{\lambda} \theta\]

Note that in DDPG, tricks like target networks for both AC networks; memory buffer are being used. We also add a environmental noise \(N\) when performing actions to allow exploration, as well as off-policy learning.

Proximal Policy Optimization

Baseline algorithm of OpenAI. PPO allows off-policy learning to policy gradient algorithm. In policy gradient, we update our policy network by compute gradient of the expected reward function with respect to policy parameters \(\theta\), and perform gradient acsent to maximize it:

\[\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^N \sum_{t=1}^{T_n} R(\tau^n) \ \nabla_\theta \ log \ \pi_\theta(a_t^n | s_t^n) \\ = \mathbb{E}_{\tau \backsim \pi_\theta}[R(\tau^n) \ \nabla_\theta \ log \ \pi_\theta(a_t^n | s_t^n)] \\ \theta \leftarrow \theta + \alpha \nabla \bar{R}_\theta\]

In PPO algorithm, instead of sampling trajectories from policy \(\pi_\theta\), in order to increase sample efficiency(reuse expirence), we sample trajectories from another policy \(\pi_{\theta'}\) and apply a importance sampling method to correct the difference.

\[\nabla \bar{R_\theta} = \mathbb{E}_{\tau \backsim \pi_{\theta'}}[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)} R(\tau^n) \ \nabla_\theta \ log \ \pi_\theta(a_t^n given s_t^n)]\]

In addition, we need to add a regularzation term (or in TRPO, add a constrain) to the objective function to constrain the difference between two distributions, therefore the objective function becomes:

\(J(\theta) = J_{\theta'}(\theta) - \beta KL(\theta,\theta')\) Where adaptively set the value of \(\beta\), specificlly when \(KL(\theta,\theta') > KL_{max}\), increase \(\beta\); when \(KL(\theta,\theta') < KL_{min}\), decrease \(\beta\).