## On using Huber loss in (Deep) Q-learning

I’ve recently been working on a problem where I put a plain DQN to use. The problem is very simple: deterministic, partially observable, with quite low-dimensional states. However, the agent can’t tell some states apart, so from the agent’s point of view the environment is effectively stochastic.

Because the problem was quite simple, I expected the network to learn a very good representation of the Q-function over the whole state space.

I was surprised that this vanilla DQN totally failed on this problem. Not in the sense that the problem was too difficult; on the contrary, the algorithm converged and was highly certain about all the Q-values it found. But these Q-values were totally wrong. I couldn’t get my head around it, until I tracked it down to a simple cause: the Pseudo-Huber loss.

Edit: Based on the discussion, the original Huber loss with an appropriate δ parameter is correct to use. The following article, however, still holds for the L1 and Pseudo-Huber losses.
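One way to see how the choice of loss can produce confident but wrong Q-values is a toy example. The sketch below is not the code from the article: the target values, the grid search and the helper `best_estimate` are all assumptions made up for illustration. It treats four sampled targets for a single state-action pair as the "stochastic" returns the agent sees and finds the estimate that minimizes each loss. A Q-value should be the expected return (the mean of the targets), which is what a squared-error loss recovers.

```python
import math


def l1(err):
    """Absolute error."""
    return abs(err)


def huber(err, delta=1.0):
    """Huber loss: quadratic for |err| <= delta, linear beyond it."""
    a = abs(err)
    if a <= delta:
        return 0.5 * err * err
    return delta * (a - 0.5 * delta)


def pseudo_huber(err, delta=1.0):
    """Pseudo-Huber loss: a smooth approximation of the Huber loss."""
    return delta * delta * (math.sqrt(1.0 + (err / delta) ** 2) - 1.0)


def best_estimate(loss, targets, lo=0.0, hi=10.0, steps=1001):
    """Hypothetical helper: grid-search the estimate q that minimizes the summed loss."""
    best_q, best_total = lo, float("inf")
    for i in range(steps):
        q = lo + (hi - lo) * i / (steps - 1)
        total = sum(loss(t - q) for t in targets)
        if total < best_total:
            best_q, best_total = q, total
    return best_q


# Assumed toy targets for one state-action pair: mean 2.5, median 0.
targets = [0.0, 0.0, 0.0, 10.0]

q_l1 = best_estimate(l1, targets)                  # lands on the median, 0.0
q_pseudo = best_estimate(pseudo_huber, targets)    # stays well below the mean
q_huber_wide = best_estimate(lambda e: huber(e, delta=10.0), targets)  # the mean, 2.5

print(q_l1, q_pseudo, q_huber_wide)
```

In this toy setting the L1 loss settles on the median and the Pseudo-Huber loss with δ = 1 stays close to it, while the Huber loss with a δ wide enough to cover every error is purely quadratic and returns the mean. That is consistent with the note above that the original Huber loss is fine when δ is chosen appropriately.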

## Let’s make an A3C: Implementation

### Introduction

In the previous article we built the necessary knowledge about Policy Gradient Methods and the A3C algorithm. This time we implement a simple agent with our familiar tools – Python, Keras and OpenAI Gym. However, a more low-level implementation is needed, and that’s where TensorFlow comes into play.

The environment is the same as in the DQN implementation – CartPole. The final code fits inside 300 lines and is easily converted to any other problem. The A3C algorithm is very effective, and learning takes only 30 seconds on a regular notebook.

## Let’s make an A3C: Theory

### Introduction

Policy Gradient Methods are an interesting family of Reinforcement Learning algorithms. They have a long history [1], but only recently were they backed by neural networks and successful in high-dimensional cases. The A3C algorithm was published in 2016 and can do better than DQN with a fraction of the time and resources [2].

In this series of articles we will explain the theory behind Policy Gradient Methods and the A3C algorithm, and develop a simple agent in Python.

## Let’s make a DQN: Double Learning and Prioritized Experience Replay

### Introduction

Last time we implemented a Full DQN-based agent with a target network and reward clipping. In this article we will explore two techniques that will help our agent perform better, learn faster and be more stable – Double Learning and Prioritized Experience Replay.

## Let’s make a DQN: Full DQN

### Introduction

Up until now we have implemented a simple Q-network-based agent, which suffered from instability issues. In this article we will address these problems with two techniques – a target network and error clipping. After implementing these, we will have a fully fledged DQN, as specified by the original paper [1].

## Let’s make a DQN: Debugging

### Introduction

Last time we saw that our Q-learning can be unstable. In this article we will cover some methods that will help us understand what is going on inside the network.