I’ve recently been working on a problem where I put a plain DQN to use. The problem is very simple: deterministic, partially observable, with quite low-dimensional states. The agent, however, can’t tell some states apart, so the environment is effectively stochastic in the eyes of the agent.
Because the problem was so simple, I expected the network to learn a very good representation of the Q-function over the whole state space.
And I was surprised that this vanilla DQN failed completely on this problem. Not in the sense that the problem was too difficult; on the contrary, the algorithm converged and was highly certain about all the Q-values it found. But these Q-values were completely wrong. I couldn’t get my head around it, until I tracked it down to a simple cause: the pseudo-Huber loss.
Edit: Based on the discussion, the original Huber loss with an appropriate δ parameter is correct to use. The article below, however, still holds for the L1 and pseudo-Huber losses.
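As a quick numerical sketch of the underlying issue (my own illustrative numbers, not from the article): with stochastic targets, the squared error is minimized by their mean, which is exactly what Q-learning needs, while an L1-like loss such as pseudo-Huber with δ = 1 is pulled toward the median instead:

```python
import numpy as np

def pseudo_huber(e, delta=1.0):
    # Pseudo-Huber loss: quadratic near zero, linear (L1-like) for large errors.
    return delta**2 * (np.sqrt(1.0 + (e / delta)**2) - 1.0)

# Hypothetical noisy returns for a single state-action pair.
# Q-learning needs their *mean* (10/3 ~ 3.33); an L1-like loss is
# instead minimized near their *median* (0).
targets = np.array([0.0, 0.0, 10.0])
qs = np.linspace(-1.0, 11.0, 2401)

mse_opt = qs[np.argmin([np.mean((targets - q)**2) for q in qs])]
ph_opt = qs[np.argmin([np.mean(pseudo_huber(targets - q)) for q in qs])]

print(mse_opt)  # close to the mean
print(ph_opt)   # pulled toward the median, well below the mean
```

The converged Q-values can therefore look confident while being systematically wrong, which matches the behavior described above.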
Continue reading On using Huber loss in (Deep) Q-learning
In the previous article we built up the necessary knowledge about Policy Gradient Methods and the A3C algorithm. This time we implement a simple agent with our familiar tools – Python, Keras and OpenAI Gym. However, a more low-level implementation is needed, and that’s where TensorFlow comes into play.
The environment is the same as in the DQN implementation – CartPole. The final code fits inside 300 lines and is easily converted to any other problem. The A3C algorithm is very effective, and learning takes only 30 seconds on a regular notebook.
Continue reading Let’s make an A3C: Implementation
Policy Gradient Methods are an interesting family of Reinforcement Learning algorithms. They have a long history, but only recently have they been backed by neural networks and found success in high-dimensional cases. The A3C algorithm was published in 2016 and can do better than DQN with a fraction of the time and resources.
In this series of articles we will explain the theory behind Policy Gradient Methods and the A3C algorithm, and develop a simple agent in Python.
Continue reading Let’s make an A3C: Theory
Last time we implemented a full DQN-based agent with a target network and reward clipping. In this article we will explore two techniques that will help our agent perform better, learn faster and be more stable – Double Learning and Prioritized Experience Replay.
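The core of the Double Learning idea can be sketched in a few lines (illustrative numbers, not the article's code): the online network selects the next action and the target network evaluates it, instead of one network doing both:

```python
import numpy as np

def dqn_target(r, gamma, q_target_next):
    # Vanilla target: one network both selects and evaluates the action,
    # which is known to overestimate Q-values.
    return r + gamma * np.max(q_target_next)

def double_dqn_target(r, gamma, q_online_next, q_target_next):
    # Double DQN target: the online network selects the action,
    # the target network evaluates it.
    best_action = np.argmax(q_online_next)
    return r + gamma * q_target_next[best_action]

# Illustrative Q-value estimates for the next state:
q_online_next = np.array([1.0, 2.0])
q_target_next = np.array([3.0, 0.5])
print(dqn_target(1.0, 0.9, q_target_next))                        # larger: overestimates
print(double_dqn_target(1.0, 0.9, q_online_next, q_target_next))  # more conservative
```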
Continue reading Let’s make a DQN: Double Learning and Prioritized Experience Replay
Up until now we have implemented a simple Q-network-based agent, which suffered from instability issues. In this article we will address these problems with two techniques – a target network and error clipping. After implementing these, we will have a fully fledged DQN, as specified in the original paper.
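A rough sketch of the two techniques in plain NumPy (the names and the sync frequency are my own choices, not the article's code): the target network is a periodically synced frozen copy of the online network, and error clipping bounds the TD error:

```python
import numpy as np

SYNC_EVERY = 1000  # hypothetical: steps between target-network syncs

def clipped_td_error(q, target):
    # Error clipping: limit the TD error to [-1, 1] so one outlier
    # transition cannot produce a huge, destabilizing gradient step.
    return float(np.clip(target - q, -1.0, 1.0))

class TargetNetwork:
    """A frozen copy of the online network's weights, synced periodically.

    Computing targets from stale weights keeps them stable between syncs.
    """
    def __init__(self, weights):
        self.weights = [w.copy() for w in weights]

    def maybe_sync(self, step, online_weights):
        if step % SYNC_EVERY == 0:
            self.weights = [w.copy() for w in online_weights]

online = [np.zeros((4, 2))]
target_net = TargetNetwork(online)
online[0] += 1.0                   # the online network keeps learning...
target_net.maybe_sync(1, online)   # ...but the target copy stays frozen
print(target_net.weights[0].max()) # still 0.0 until the next sync
```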
Continue reading Let’s make a DQN: Full DQN
Last time we saw that our Q-learning can be unstable. In this article we will cover some methods that will help us understand what is going on inside the network.
The code for this article can be found on GitHub.
Continue reading Let’s make a DQN: Debugging
Last time we tried to get a grasp of the necessary knowledge, and today we will use it to build a Q-network-based agent that solves the cart pole balancing problem in less than 200 lines of code.
The complete code is available on GitHub.
Continue reading Let’s make a DQN: Implementation