Timezone: »

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Hamid R Maei · Csaba Szepesvari · Shalabh Batnaghar · Doina Precup · David Silver · Richard Sutton

Mon Dec 07 06:45 PM -- 06:46 PM (PST) @ None
We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD($\lambda$), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al (2009a,b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.

Author Information

Hamid R Maei (Stanford University)
Csaba Szepesvari (DeepMind / University of Alberta)
Shalabh Batnaghar
Doina Precup (McGill University / Mila / DeepMind Montreal)
David Silver (DeepMind)
Rich Sutton (DeepMind, U Alberta)

Richard S. Sutton is a professor and iCORE chair in the department of computing science at the University of Alberta. He is a fellow of the Association for the Advancement of Artificial Intelligence and co-author of the textbook "Reinforcement Learning: An Introduction" from MIT Press. Before joining the University of Alberta in 2003, he worked in industry at AT&T and GTE Labs, and in academia at the University of Massachusetts. He received a PhD in computer science from the University of Massachusetts in 1984 and a BA in psychology from Stanford University in 1978. Rich's research interests center on the learning problems facing a decision-maker interacting with its environment, which he sees as central to artificial intelligence. He is also interested in animal learning psychology, in connectionist networks, and generally in systems that continually improve their representations and models of the world.

Related Events (a corresponding poster, oral, or spotlight)

More from the Same Authors