In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values, which result from a positive bias introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The resulting double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show that the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.
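The core idea of the abstract can be illustrated with a minimal tabular sketch: maintain two independent value tables, select the greedy action with one table, and evaluate that action with the other, so that the same (noisy) estimate is never used for both selection and evaluation. This sketch is not from the paper itself; the function name `double_q_update`, the table names `QA`/`QB`, and the hyperparameters are illustrative choices.

```python
import random
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Double Q-learning update (illustrative sketch)."""
    # Flip a fair coin to decide which of the two estimators to update.
    if random.random() < 0.5:
        update, evaluate = QA, QB
    else:
        update, evaluate = QB, QA
    # Select the greedy next action using the table being updated...
    a_star = max(actions, key=lambda ap: update[(s_next, ap)])
    # ...but take its value from the OTHER table. Decoupling selection
    # from evaluation removes the positive bias of plain Q-learning's
    # max over a single set of noisy estimates.
    td_target = r + gamma * evaluate[(s_next, a_star)]
    update[(s, a)] += alpha * (td_target - update[(s, a)])

# Toy usage: a single-state MDP with constant reward 1, so values should
# approach 1 / (1 - gamma) = 20 from below.
random.seed(0)
QA, QB = defaultdict(float), defaultdict(float)
actions = [0, 1]
for _ in range(2000):
    a = random.choice(actions)
    double_q_update(QA, QB, s=0, a=a, r=1.0, s_next=0, actions=actions)
print(QA[(0, 0)], QB[(0, 0)])
```

Because the target is always bounded by the discounted return of the constant-reward loop, both tables stay below 20 while rising toward it.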
Author Information
Hado P van Hasselt (Centrum Wiskunde & Informatica (CWI))