Let’s make a DQN: Implementation

This article is part of series Let’s make a DQN.

1. Theory
2. Implementation
3. Debugging
4. Full DQN
5. Double DQN and Prioritized experience replay


Last time we tried to get a grasp of the necessary knowledge, and today we will use it to build a Q-network based agent that will solve a cart pole balancing problem in less than 200 lines of code.

The complete code is available on GitHub.

The problem looks like this:


There’s a cart (represented by the black box) with an attached pole that can rotate around one axis. The cart can only go left or right, and the goal is to balance the pole for as long as possible.

The state is described by four values – cart position, its velocity, pole angle and its angular velocity. At each time step, the agent has to decide whether to push the cart left or right.

For every time step the pole stays above a certain critical angle, the agent receives a reward of one. If it falls below this angle, the episode ends.

The important fact here is that the agent does not know any of this and has to use its experience to find a viable way to solve this problem. The algorithm is just given four numbers and has to correlate them with the two available actions to acquire as much reward as possible.


Our algorithm won’t be resource-demanding; you can use any recent computer with Python 3 installed to reproduce the results in this article. There are a few libraries you have to get before we start. You can install them easily with the pip command:

    pip install theano keras gym

This will fetch and install Theano, a low-level library for building artificial neural networks. But we won’t use it directly; instead we will use Keras as an abstraction layer. It will allow us to define our ANN in a compact way. The command also installs the OpenAI Gym toolkit, which will provide us with easy-to-use environments for our agent.

By default, the Keras library uses TensorFlow, which is a different backend than Theano. To switch to Theano, edit the ~/.keras/keras.json file so it contains this line:

    "backend": "theano",


Before we jump to the code itself, let’s think about our problem in an abstract way. There’s an agent, which interacts with the environment through actions and observations. The environment reacts to the agent’s actions and supplies information about itself. The agent will store the encountered experience in its memory and use its (artificial) intelligence to decide what actions to take. The next diagram shows an abstraction of the problem:


Following our intuition, we will implement four classes – Environment, Agent, Brain and Memory – with these methods:

    Environment:
        run()           # runs one episode

    Agent:
        act(s)          # decides what action to take in state s
        observe(sample) # adds sample (s, a, r, s_) to memory
        replay()        # replays memories and improves the network

    Brain:
        predict(s)      # predicts the Q function values in state s
        train(batch)    # performs a supervised training step with a batch

    Memory:
        add(sample)     # adds a sample to memory
        sample(n)       # returns a random batch of n samples

With this picture in mind, let’s look at each of these in more detail.


The Environment class is our abstraction for OpenAI Gym. Its only method run() handles one episode of the problem. Look at an excerpt from the run() function:

s = env.reset()
while True:
    a = agent.act(s)
    s_, r, done, info = env.step(a)

    if done:            # mark the terminal state
        s_ = None

    agent.observe( (s, a, r, s_) )
    agent.replay()      # learning step

    s = s_

    if done:
        break

In the main loop, the agent decides what action to take, based on the given state. The environment then performs this action and returns a new state and a reward.

We will mark the end of the episode by setting the new state s_ to None. The actual terminal state is not needed – once the episode ends there are no future rewards, so the target for such a transition reduces to r alone, as we will see later.

Then the agent observes the new sample (s, a, r, s_) and performs a learning step.


The Brain class encapsulates the neural network. Our problem is simple enough, so we will use only one hidden layer of 64 neurons with the ReLU activation function. The final layer will consist of only two neurons, one for each available action, with a linear activation function. Remember that we are trying to approximate the Q function, which in essence can be of any real value. Therefore we can’t restrict the output of the network, and the linear activation works well.

Instead of simple gradient descent, we will use the more sophisticated RMSprop algorithm, together with the mean squared error (mse) loss function.

The Keras library allows us to define such a network surprisingly easily:

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import RMSprop

    model = Sequential()

    model.add(Dense(output_dim=64, activation='relu', input_dim=stateCnt))
    model.add(Dense(output_dim=actionCnt, activation='linear'))

    opt = RMSprop(lr=0.00025)
    model.compile(loss='mse', optimizer=opt)

The train(x, y) function will perform a gradient descent step with the given batch.

    def train(self, x, y):
        self.model.fit(x, y, batch_size=64)

Finally, the predict(s) method returns an array of predictions of the Q function in the given states. It’s easy enough again with Keras:

    def predict(self, s):
        return self.model.predict(s)

The purpose of the Memory class is to store experience. It almost feels superfluous in the current problem, but we will implement it anyway. It is a good abstraction for the experience replay part and will allow us to easily upgrade it to more sophisticated algorithms later on.

The add(sample) method stores the experience in the internal array, making sure that it does not exceed its capacity. The other method, sample(n), returns n random samples from the memory.
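
A minimal sketch of such a Memory class, assuming a plain Python list as the internal storage, might look like this:

    import random

    class Memory:   # stores tuples (s, a, r, s_)
        def __init__(self, capacity):
            self.capacity = capacity
            self.samples = []

        def add(self, sample):
            # append the new sample and drop the oldest one when over capacity
            self.samples.append(sample)
            if len(self.samples) > self.capacity:
                self.samples.pop(0)

        def sample(self, n):
            # return a random batch of at most n samples
            n = min(n, len(self.samples))
            return random.sample(self.samples, n)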


Finally, the Agent class acts as a container for the agent-related properties and methods. The act(s) method implements the ε-greedy policy. With probability epsilon, it chooses a random action; otherwise it selects the best action the current ANN returns.

    def act(self, s):
        if random.random() < self.epsilon:
            return random.randint(0, self.actionCnt-1)
        else:
            return numpy.argmax(self.brain.predictOne(s))
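
Note that act() calls brain.predictOne(s) rather than predict(s). This is a small convenience wrapper in the Brain class for a single state; a minimal sketch, assuming the Brain stores the state dimension in self.stateCnt, might look like this:

    def predictOne(self, s):
        # wrap the single state into a batch of size 1, then flatten the single prediction
        return self.predict(s.reshape(1, self.stateCnt)).flatten()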

We decrease the epsilon parameter over time, according to the formula:

\varepsilon = \varepsilon_{min} + (\varepsilon_{max} - \varepsilon_{min}) e^{-\lambda t}

The λ parameter controls the speed of decay. This way we start with a policy that explores greatly and behaves more and more greedily over time.
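
As a quick illustration of this decay (epsilon_at is a hypothetical helper; the constants match the parameters table at the end of the article), the value of ε after a given number of steps can be computed like this:

    import math

    MAX_EPSILON = 1.00
    MIN_EPSILON = 0.01
    LAMBDA = 0.001      # speed of decay

    def epsilon_at(step):
        # exponential decay from MAX_EPSILON towards MIN_EPSILON
        return MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-LAMBDA * step)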

The observe(sample) method simply adds a sample to the agent’s memory.

    def observe(self, sample):  # in (s, a, r, s_) format
        self.memory.add(sample)

The last method, replay(), is the most complicated part. Let’s recall how the update formula looks:

Q(s, a) \xrightarrow{} r + \gamma \max_a Q(s', a)

This formula means that for a sample (s, a, r, s’) we will update the network’s weights so that its output is closer to the target.

But when we recall our network architecture, we see that it has multiple outputs, one for each action.

[Figure: Deep Q-network – one output neuron per action]

We therefore have to supply a target for each of the outputs. But we want to adjust the output of the network only for the one action which is part of the sample. For the other actions, we want the output to stay the same. So the solution is simply to pass the current values as targets, which we can get with a single forward propagation.

Also, there is the special case of the episode’s terminal state. Remember that we set the state to None in the Environment class when the episode ended. Because of that, we can now identify such a state and act accordingly.

When the episode ends, there are no more states after it, so our update formula reduces to

Q(s, a) \xrightarrow{} r

In this case, we will set our target only to r.

Let’s look at the code now. First we fetch a batch of samples from the memory.

    batch = self.memory.sample(BATCH_SIZE)

We can now make the predictions for all starting and ending states in our batch in one step. The underlying Theano library will parallelize our code seamlessly if we supply an array of states to predict. This results in a great speedup.

    no_state = numpy.zeros(self.stateCnt)

    states = numpy.array([ o[0] for o in batch ])
    states_ = numpy.array([ (no_state if o[3] is None else o[3]) for o in batch ])

    p = self.brain.predict(states)
    p_ = self.brain.predict(states_)

Notice the no_state variable. We use it as a dummy replacement state for the occasions where the final state is None. The Theano library cannot make a prediction for a None state, therefore we just supply an array full of zeroes.

The p variable now holds the predictions for the starting state of each sample and will be used as the default target in learning. Only the one action passed in the sample will have the actual target of r + \gamma \max_a Q(s', a).

The other variable, p_, is filled with the predictions for the final states and is used in the \max_a Q(s', a) part of the formula.

We now iterate over all the samples and set proper targets for each sample:

    batchLen = len(batch)

    x = numpy.zeros((batchLen, self.stateCnt))
    y = numpy.zeros((batchLen, self.actionCnt))

    for i in range(batchLen):
        o = batch[i]
        s = o[0]; a = o[1]; r = o[2]; s_ = o[3]

        t = p[i]
        if s_ is None:
            t[a] = r
        else:
            t[a] = r + GAMMA * numpy.amax(p_[i])

        x[i] = s
        y[i] = t

Finally, we call our brain.train() method:

    self.brain.train(x, y)

To make our program more general, we won’t hardcode the state space dimensions or the number of possible actions. Instead, we retrieve them from the environment. In the main loop, we just run episodes forever, until the user terminates the program with CTRL+C.

PROBLEM = 'CartPole-v0'
env = Environment(PROBLEM)

stateCnt  = env.env.observation_space.shape[0]
actionCnt = env.env.action_space.n

agent = Agent(stateCnt, actionCnt)

while True:
    env.run(agent)

Although a neural network can in essence adapt to any input, normalizing it will typically help. We can sample a few examples of different states to see what they look like:

[-0.0048  0.0332 -0.0312  0.0339]
[-0.01   -0.7328  0.0186  1.1228]
[-0.166  -0.4119  0.1725  0.8208]
[-0.0321  0.0445  0.0284 -0.0034]

We see that they are essentially similar in their range, and their magnitudes do not differ substantially. Because of that, we won’t normalize the input in this problem.


So far we have talked about the components in a very generic way. We didn’t give any concrete values for the parameters mentioned in the previous sections, except for the neural network architecture. The following parameters were chosen to work well for this problem:

Parameter                 Value
memory capacity           100000
batch size                64
γ                         0.99
ε_max                     1.00
ε_min                     0.01
λ                         0.001
RMSprop learning rate     0.00025
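
For reference, these values can be written down as module-level constants. The names below are illustrative – GAMMA and BATCH_SIZE match the code excerpts above, the others are assumed:

    MEMORY_CAPACITY = 100000
    BATCH_SIZE = 64

    GAMMA = 0.99

    MAX_EPSILON = 1.00
    MIN_EPSILON = 0.01
    LAMBDA = 0.001            # speed of epsilon decay

    LEARNING_RATE = 0.00025   # passed to RMSprop(lr=...)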


You can now download the full code from GitHub and try it for yourself.

It’s remarkable that with less than 200 lines of code we can implement a Q-network based agent. Although we used the CartPole-v0 environment as an example, we can use the same code for many different environments, with only small adjustments. Also, our agent runs surprisingly fast: with rendering disabled, it achieves about 720 FPS on my notebook. Now, let’s look at how well it actually performs.

The agent usually learns to hold the pole in a straight position after some time, with varying accuracy. To be able to compare results, I ran several thousand episodes with a random policy, which achieved an average reward of 22.6. Then I averaged the reward over 10 runs of our agent, each consisting of 1000 episodes with a limit of 2000 time steps:


We see that it is well above the average reward of a random policy, so we can conclude that our agent actually learned something. But these results are averaged over multiple runs, so celebration would be premature. Let’s look at a single run and its reward:


We see that there are several spots where the performance drops suddenly to zero. Clearly, our learning algorithm is unstable. What could be causing these drops and why? Does the network overfit or diverge? Or is it something else? And more importantly, can we mitigate these issues?

As a matter of fact, we can. There are several points in our agent which we can improve. But before we do that, we have to understand what the underlying neural network is doing. In the next section, we will acquire the means to gain meaningful insight into the neural network itself, learn how to display the approximated Q function and use this knowledge to understand what is happening.

20 thoughts on “Let’s make a DQN: Implementation”

  1. Great article… I suggest readers run through the full code from GitHub before the blog post if they have a basic idea of the Q-function and Python…
    The code itself reads like a blog – it’s beginner friendly and neatly woven

  2. Hi Jaromír,
    Thank you for the article series. I’ve really enjoyed reading them and have learned a lot. I’ve been directing my friends learning RL here as well – it’s really a great resource.

    One quick question for you. The graphs you show – the axis is a little confusing. The x-axis looks to be episode number, but the y-axis is a bit strange. I thought it was reward initially. I see times though where the graph spikes over the maximum reward you can obtain in an episode, so I guess it’s not that.

    Any clarity would be appreciated! Thanks again for your work


    1. Hi, the environment gives a reward of 1 at every step and there is a limit of 2000 steps per episode in my settings. Does that explain your question?

  3. Hello! That is a great article! Do you know what we should change to adapt this program to another environment? What part in this program makes it specific to Cartpole? I tried to make it work on a different game (Continuous car) but i can’t find the issue :/
    Thank you very much !

  4. thanks… i am not intuitively clear on “p variable predictions for starting and final states” … what is meant by prediction for final state?

    1. I mean, I understand that it is done so to identify terminal state and set target value accordingly. Is there any other explanation other than this?

  5. Thanks for your introduction. I have reimplemented the DQN algorithm based on the newest DeepMind paper (2015, Nature), the one with the target network structure. But compared with yours, the learning performance is not as good. Actually, it seems not to be learning anything at all. I wonder whether there is something wrong?

    Here is my code: https://github.com/JieMEI1994/HelloWorld
    Could you please help me to find out the reason?

  6. could you please clarify, what do you mean by an episode and time steps, what’s the difference?

    Is this 1 episode with random `X` time steps?
    while True:
    “I ran several thousand episodes with a random policy, which achieved an average reward of 22.6. Then I averaged a reward over 10 runs with our agent, each with 1000 episodes with a limit of 2000 time steps”

    1. Yes. I agree that it is not the best metric, especially for comparing. The point was to show that the agent indeed learns, and it worked for that.

    2. Thank you for these great posts!
      I didn’t get it too.
      Can you clarify what are the steps, episodes and time steps?
      Is episode = one loop in the main, and step= the number of times you did the loop?
      [Generally episode is one game (until step=200 or the pole fall down) and step is one step in the game (max step=200)]

  7. Awesome !
    Very Interesting and illuminating
    I tried the ‘adam’ optimiser based upon the above comment by Sridharthi, and I found I could get up to the first batch of 200.0 scores, over around thirty episodes, in around 150 episodes. Impressive

  8. Hi,
    Great post. You might want to change your optimizer to ‘adam’ rather than setting alpha manually in RMSprop. This gave much better results.
    Cheers 🙂

  9. Very great article for DQN!
    I’m interested in the way of modeling the reward, would you help to describe it in a later article?
    Thank you so much 🙂
