I’ve recently been working on a problem where I put a plain DQN to use. The problem is very simple, deterministic, and partially observable, with quite low-dimensional states. However, the agent can’t tell some states apart, so the environment is effectively stochastic in the eyes of the agent.

Because the problem was quite simple, I expected the network to learn a very good representation of the Q-function over the whole state space.

And I was surprised that this vanilla **DQN totally failed on this problem**. Not in the sense that it was too difficult – on the contrary, the **algorithm converged** and was highly certain about all the Q-values it found. But these Q-values were totally wrong. I couldn’t get my head around it, but then I tracked it down to a simple cause: the **pseudo-Huber loss**.

__Edit: Based on the discussion, the original Huber loss with an appropriate δ parameter is correct to use. The following article however stays true for the L1 and pseudo-Huber losses.__
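For reference, the two losses mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the code from my experiment; `delta` plays the role of the δ parameter from the edit note. The Huber loss is exactly quadratic below δ and exactly linear above it, while the pseudo-Huber loss is a smooth approximation that only approaches linear behavior asymptotically.

```python
import numpy as np

def huber(err, delta=1.0):
    # Classic Huber loss: quadratic for |err| <= delta,
    # linear (with slope delta) beyond that.
    abs_err = np.abs(err)
    quad = np.minimum(abs_err, delta)
    return 0.5 * quad ** 2 + delta * (abs_err - quad)

def pseudo_huber(err, delta=1.0):
    # Smooth approximation of the Huber loss; differentiable
    # everywhere, but never exactly quadratic or exactly linear.
    return delta ** 2 * (np.sqrt(1.0 + (err / delta) ** 2) - 1.0)
```

Both behave like the L2 loss near zero error and like the L1 loss for large errors, which is why they clip the effect of large TD errors in DQN training.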