For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s1, s2 there is a policy which moves from s1 to s2 in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as a structural parameter of the MDP.
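A compact formal reading of these quantities may help; the notation below is our own gloss on the abstract (Õ suppresses logarithmic factors, ρ* is the optimal average reward, r_t the reward received at step t, and T(s2 | π, s1) the random number of steps to first reach s2 from s1 when following policy π):

    \[
      D \;:=\; \max_{s_1 \neq s_2} \; \min_{\pi} \; \mathbb{E}\big[\, T(s_2 \mid \pi, s_1) \,\big],
      \qquad
      \Delta(T) \;:=\; T\rho^{*} \;-\; \sum_{t=1}^{T} r_t .
    \]

In this notation the algorithm's regret satisfies Δ(T) = Õ(DS√(AT)) with high probability, while the lower bound says that any learning algorithm must suffer Δ(T) = Ω(√(DSAT)) on some MDP. Roughly, D caps how long it takes to return to any desired state after an exploratory detour, which is why it appears as the structural parameter in both bounds.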
Author Information
Peter Auer (Montanuniversitaet Leoben)
Thomas Jaksch (University of Leoben)
Ronald Ortner (Montanuniversitaet Leoben)
Related Events (a corresponding poster, oral, or spotlight)
- 2008 Poster: Near-optimal Regret Bounds for Reinforcement Learning
More from the Same Authors
- 2014 Workshop: Autonomously Learning Robots
  Gerhard Neumann · Joelle Pineau · Peter Auer · Marc Toussaint
- 2011 Poster: PAC-Bayesian Analysis of Contextual Bandits
  Yevgeny Seldin · Peter Auer · Francois Laviolette · John Shawe-Taylor · Ronald Ortner
- 2006 Workshop: On-line Trading of Exploration and Exploitation
  Peter Auer
- 2006 Poster: Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
  Peter Auer · Ronald Ortner