Many practitioners of reinforcement learning have observed that the agent's performance often comes very close to optimal even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy $\hat{\pi}$ with respect to an action-value function $\hat{Q}$, the performance loss $E[V^*(X) - V^{\hat{\pi}}(X)]$ is upper bounded by $O(\|\hat{Q} - Q^*\|_\infty^{1+\zeta})$, in which $\zeta \geq 0$ is the parameter quantifying the action-gap regularity. For $\zeta > 0$, our results indicate a smaller performance loss than previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.
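The following is a minimal numerical sketch of the phenomenon the abstract describes, not the paper's construction or definitions. It assumes a one-step, two-action problem and a margin-type reading of the action-gap regularity, namely $P(g(X) \le t) \le c\, t^{\zeta}$ for the gap $g(X)$ between the two action values; the function name `simulate_loss` and the sampling scheme are hypothetical choices for illustration only.

```python
import numpy as np

# Toy illustration (not the paper's construction): a one-step, two-action
# problem where the action gap g(X) = |Q*(X, a1) - Q*(X, a2)| satisfies a
# margin-type condition P(g(X) <= t) <= c * t^zeta, one plausible reading
# of the "action-gap regularity" with parameter zeta.

rng = np.random.default_rng(0)

def simulate_loss(zeta, eps, n=1_000_000):
    """Worst-case greedy-policy loss when ||Q_hat - Q*||_inf <= eps.

    Gaps are drawn so that P(g <= t) = t^zeta on [0, 1]; the greedy policy
    w.r.t. Q_hat can only pick the wrong action when g <= 2*eps, and in
    that case it loses exactly g. Returns a Monte Carlo estimate of E[loss].
    """
    u = rng.random(n)
    gap = u ** (1.0 / zeta)            # inverse-CDF sampling: P(gap <= t) = t^zeta
    loss = np.where(gap <= 2 * eps, gap, 0.0)
    return loss.mean()

for zeta in (0.5, 1.0, 2.0):
    print(f"zeta = {zeta}")
    for eps in (0.1, 0.05, 0.025):
        est = simulate_loss(zeta, eps)
        print(f"  eps = {eps:<6} E[loss] ~ {est:.2e}   eps^(1+zeta) = {eps ** (1 + zeta):.2e}")
```

Under these assumptions, the estimated loss shrinks at the rate $\varepsilon^{1+\zeta}$ rather than $\varepsilon$, mirroring the abstract's claim that a positive $\zeta$ yields a smaller performance loss than a plain sup-norm argument would suggest.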