1. Theory
2. Implementation
3. Debugging
4. Full DQN
5. Double DQN and Prioritized experience replay
Introduction
Last time we implemented a Full DQN based agent with target network and reward clipping. In this article we will explore two techniques, which will help our agent to perform better, learn faster and be more stable – Double Learning and Prioritized Experience Replay.
Double Learning
One problem in the DQN algorithm is that the agent tends to overestimate the Q function value, due to the max in the formula used to set targets:
To demonstrate this problem, let’s imagine a following situation. For one particular state there is a set of actions, all of which have the same true Q value. But the estimate is inherently noisy and differs from the true value. Because of the max in the formula, the action with the highest positive error is selected and this value is subsequently propagated further to other states. This leads to positive bias – value overestimation. This severe impact on stability of our learning algorithm^{1}.
A solution to this problem was proposed by Hado van Hasselt (2010)^{2} and called Double Learning. In this new algorithm, two Q functions – and – are independently learned. One function is then used to determine the maximizing action and second to estimate its value. Either or is updated randomly with a formula:
or
It was proven that by decoupling the maximizing action from its value in this way, one can indeed eliminate the maximization bias^{2}.
When thinking about implementation into the DQN algorithm, we can leverage the fact that we already have two different networks giving us two different estimates and (target network). Although not really independent, it allows us to change our algorithm in a really simple way.
The original target formula would change to:
Translated to code, we only need to change one line to get the desired improvements:
t[a] = r + GAMMA * pTarget_[i][ numpy.argmax(p_[i]) ]
The Deep Reinforcement Learning with Double Q-learning^{1} paper reports that although Double DQN (DDQN) does not always improve performance, it substantially benefits the stability of learning. This improved stability directly translates to ability to learn much complicated tasks.
When testing DDQN on 49 Atari games, it achieved about twice the average score of DQN with the same hyperparameters. With tuned hyperparameters, DDQN achieved almost four time the average score of DQN. Summary of the results is shown in a table in the next section.
Prioritized Experience Replay
One of the possible improvements already acknowledged in the original research^{2} lays in the way experience is used. When treating all samples the same, we are not using the fact that we can learn more from some transitions than from others. Prioritized Experience Replay^{3} (PER) is one strategy that tries to leverage this fact by changing the sampling distribution.
The main idea is that we prefer transitions that does not fit well to our current estimate of the Q function, because these are the transitions that we can learn most from. This reflects a simple intuition from our real world – if we encounter a situation that really differs from our expectation, we think about it over and over and change our model until it fits.
We can define an error of a sample S = (s, a, r, s’) as a distance between the Q(s, a) and its target T(S):
For DDQN described above, T it would be:
We will store this error in the agent’s memory along with every sample and update it with each learning step.
One of the possible approaches to PER is proportional prioritization. The error is first converted to priority using this formula:
Epsilon is a small positive constant that ensures that no transition has zero priority.
Alpha, , controls the difference between high and low error. It determines how much prioritization is used. With we would get the uniform case.
Priority is translated to probability of being chosen for replay. A sample i has a probability of being picked during the experience replay determined by a formula:
The algorithm is simple – during each learning step we will get a batch of samples with this probability distribution and train our network on it. We only need an effective way of storing these priorities and sampling from them.
Initialization and new transitions
The original paper says that new transitions come without a known error^{3}, but I argue that with definition given above, we can compute it with a simple forward pass as it arrives. It’s also effective, because high value transitions are discovered immediately.
Another thing is the initialization. Remember that before the learning itself, we fill the memory using random agent. But this agent does not use any neural network, so how could we estimate any error? We can use a fact that untrained neural network is likely to return a value around zero for every input. In this case the error formula becomes very simple:
The error in this case is simply the reward experienced in a given sample. Indeed, the transitions where the agent experienced any reward intuitively seem to be very promising.
Efficient implementation
So how do we store the experience and effectively sample from it?
A naive implementation would be to have all samples in an array sorted according to their priorities. A random number s, , would be picked and the array would be walked left to right, summing up a priority of the current element until the sum is exceeded and that element is chosen. This will select a sample with the desired probability distribution.
But this would have a terrible efficiency: O(n log n) for insertion and update and O(n) for sampling.
A first important observation is that we don’t have to actually store the array sorted. Unsorted array would do just as well. Elements with higher priorities are still picked with higher probability.
This releases the need for sorting, improving the algorithm to O(1) for insertion and update.
But the O(n) for sampling is still too high. We can further improve our algorithm by using a different data structure. We can store our samples in unsorted sum tree – a binary tree data structure where the parent’s value is the sum of its children. The samples themselves are stored in the leaf nodes.
Update of a leaf node involves propagating a value difference up the tree, obtaining O(log n). Sampling follows the thought process of the array case, but achieves O(log n). For a value s, , we use the following algorithm (pseudo code):
def retrieve(n, s): if n is leaf_node: return n if n.left.val >= s: return retrieve(n.left, s) else: return retrieve(n.right, s - n.left.val)
Following picture illustrates sampling from a tree with s = 24:
With this effective implementation we can use large memory sizes, with hundreds of thousands or millions of samples.
For known capacity, this sum tree data structure can be backed by an array. Its reference implementation containing 50 lines of code is available on GitHub.
Results
Tests performed on 49 Atari games showed that PER really translates into faster learning and higher performance^{3}. What’s more, it’s complementary to DDQN.
Following table summarizes the results for 49 Atari games benchmark. Values are taken from Schaul et al. (2015)^{3} and rescaled. Because the DQN+PER value for proportional PER was not available, the provided value is for similar, but different rank-based PER. For specifics of how these tests were performed, look into the paper^{3}.
DQN | DQN+PER | DDQN | DDQN+PER |
---|---|---|---|
100% | 291% | 343% | 451% |
Implementation
An implementation of DDQN+PER for an Atari game Seaquest is available on GitHub. It’s an improvement over the DQN code presented in last chapter and should be easy to understand.
The DQN architecture from the original paper^{4} is implemented, although with some differences. In short, the algorithm first rescales the screen to 84×84 pixels and extracts luminance. Then it feeds last two screens as an input to the neural network. This ensures that the algorithm is also aware of a direction of movement of the game elements, something which would not be possible with only a current screen as input. Experience is stored in a sum tree with capacity of 200 000 samples. Neural network uses three convolutional layers and one dense hidden layer with following parameters:
Convolution2D(32, 8, 8, subsample=(4,4), activation='relu') Convolution2D(64, 4, 4, subsample=(2,2), activation='relu') Convolution2D(64, 3, 3, activation='relu') Dense(output_dim=512, activation='relu') Dense(output_dim=actionCount, activation='linear')
These hyperparameters were used:
Parameter | Value |
---|---|
memory capacity | 200000 |
batch size | 32 |
γ | 0.99 |
exploration ε_{max} | 1.00 |
exploration ε_{min} | 0.10 |
final exploration frame | 500000 |
PER α | 0.6 |
PER ε | 0.01 |
RMSprop learning rate | 0.00025 |
target network update frequency | 10000 |
It is possible to run this program on a regular computer, however it is very resource demanding. It takes about 12 GB of RAM and fully utilizes one core of CPU and whole GPU to slowly improve. In my computer it runs around 20 FPS. After about 12 hours, 750 000 steps and 700 episodes, it reached an average reward of 263 (mean reward of a random agent is 87). You can see it in action here:
To get better results, you have to run it for at least tens of millions of steps. The following graph shows that the improvement is indeed very slow:
Conclusion
We addressed Double Learning and Prioritized Experience Replay techniques that both substantially improve the DQN algorithm and can be used together to make a state-of-the-art algorithm on the Atari benchmark (at least as of 18 Nov 2015 – the day Prioritized Experience Replay^{3} article was published).
This articles finishes the Let’s make a DQN series. It was meant as a simplified tutorial for those who don’t want to read whole research articles but still want to understand what DQN is and what it does. I also hope that it sparkled your interest in this interesting direction of AI and that you will want to learn even more now.
I hope you enjoyed these articles at least as me writing them and that you learned something. If you have any questions or have anything to add, please feel free to leave a comment or contact me at author@jaromiru.com.
- Hado van Hasselt, Arthur Guez, David Silver – Deep Reinforcement Learning with Double Q-learning, arXiv:1509.06461, 2016 ↩ ↩
- Hado van Hasselt – Double Q-learning, Advances in Neural Information Processing Systems, 2010 ↩ ↩ ↩
- Schaul et al. – Prioritized Experience Replay, arXiv:1511.05952, 2015 ↩ ↩ ↩ ↩ ↩ ↩
- Mnih et al. – Human-level control through deep reinforcement learning, Nature 518, 2015 ↩
ヤロミル Thank you for the article. It’s very helpful. What I’m struggling with is why, in the Seaquest DDQN code, you sample from the SumTree using segments the way you do. Specifically lines 102 – 114 (https://github.com/jaara/AI-blog/blob/master/Seaquest-DDQN-PER.py#L102). Can you explain why the segments are necessary? Couldn’t one simply choose n random samples, each time using a random number between 0 and self.tree.total()?
Hi Jaromiru, absolutely love the ddqn series, definitely some of the best material on the subject that I have found. I was wondering if there would be a noticeable difference if you use 4 stacked images in the CNN instead of 2? I tried this on my laptop but it seemed to slow everything own a lot. This makes sense as the CNN has to do twice the work. I was wondering if it would be worth the extra training time or if the extra stacks would make a noticeable difference? Thanks
Yes. In fact, four stacked images were used in the original DQN paper and this was definitely done for a reason. Therefore we can assume, that four stacked images would work better.
I have some problems. In cartpole environment, what is the transition model or transition probability, should it be 1, that is P(s’|s, a) = 1? Can you give me a MDP define for cartpole and breakout? Really thanks to it.
In the OpenAI gym cartpole environment the state transition probability matrix is going to be 1 because the physics model (i.e. the set of equations in the cartpole environment) are deterministic. But you don’t typically care to know the state transition probability matrix explicitly when doing Q learning since the network learns those through experience that it’s sampled.
Awesome post, it really helps me, thanks very much!
It’s not a thing about difficulties. You are using Q learning, which is depended on the assumption about MDP. If you don’t use history stack, the agent can not know the moving direction of a particle on the screen, so MDP is not satisfied. Anyway, you use two images in one state in your code. But this will consume about (84*84*2*2+2)*4*200000 Bytes = 21GB memory, not 12GB memory.
Samples in the memory are duplicated. Say a sample (s0, r, a, s1) and (s1, r, a, s2) both share s1, which occupies only one part of the *physical* memory.
So the formula is 84×84 x 4 (float32) x 2 (per state) x 200 000 ~ 10.5 GB
Thinking about it, you can actually decouple the images in the state itself. Changing line 237:
s_ = numpy.array([s[1], processImage(img)]) #last two screens
from numpy array to python list would half the memory requirements (some code change needed later, possible performance degradation).
But in your code L237-L242, you actually store four images, two for s and two for s_. For memory requirement, this problem is also troubling me. I find an interesting implement here https://github.com/devsisters/DQN-tensorflow/blob/master/dqn/replay_memory.py.
About the MDP property – you’re right. I have limited resources and so I decided to reduce the frame count in a state from 4 to 2, halving the memory requirements, while still retaining the agent’s ability to understand motion.
I’m not trying to replicate the papers here, rather explain the concept.
Hi. I notice that your implement is not using history stack. Because using history stack will use much more memory. That’s different from the original article.
What do you mean by history stack? If you mean the last 4 frames stacked together, you are right. But that’s easy to implement and I didn’t want to introduce more difficulties here.
Most likely I’m getting something wrong but may I ask:
In your mountain car DQN implementationt this is line 170:
[a] = r + GAMMA * p_[i][ numpy.argmax(p[i]) ]
Shouldn’t it be numpy.argmax(p_[i])? The max Q-Value predicted from the Online Network for the newstate instead of the current state?
Nice catch! You are referring to a file open_gym/MountainCar-v0.py in my other repository, I assume.
I actually already corrected this mistake in other files. But forgot about this one.
It should be:
=====
p = agent.brain.predict(states)
p_ = agent.brain.predict(states_, target=False)
pTarget_ = agent.brain.predict(states_, target=True)
====
t[a] = r + GAMMA * pTarget_[i][ numpy.argmax(p_[i]) ] # double DQN
===
The solution you proposed leads to a regular DQN.
Very interesting read! This makes it very easy to understand. Are you planning. Are you planning on writing posts about more RL subjects? I would love to read more intuitive articles about policy gradients methods. Good luck with your PhD!
Thank you! More articles are definitely possible, although not in the very near future.