We derive an equation for temporal difference learning from statistical first principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so-called TD(λ); however, it lacks the parameter α that specifies the learning rate. In place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins's Q(λ) and Sarsa(λ) and find that it again offers superior performance with fewer parameters.
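For context, here is a minimal sketch of standard tabular TD(λ) with accumulating eligibility traces, the baseline the abstract refers to. The fixed step size `alpha` below is the free parameter the paper removes; its derived per-transition learning rate is not reproduced here. The environment interface (`env.reset`, `env.step`) and function names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def td_lambda_episode(V, env, policy, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode of tabular TD(lambda), updating state values V in place."""
    e = np.zeros_like(V)              # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        # TD error for the observed transition
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        # accumulate the trace for the visited state
        e[s] += 1.0
        # update every state in proportion to its trace; the paper replaces
        # this fixed alpha with a learning rate specific to each transition
        V += alpha * delta * e
        # decay all traces
        e *= gamma * lam
        s = s_next
    return V
```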
Author Information
Marcus Hutter (Australian National University)
Shane Legg (DeepMind)
More from the Same Authors
- 2020 Poster: Meta-trained agents implement Bayes-optimal agents
  Vladimir Mikulik · Grégoire Delétang · Tom McGrath · Tim Genewein · Miljan Martic · Shane Legg · Pedro Ortega
- 2020 Poster: Avoiding Side Effects By Considering Future Tasks
  Victoria Krakovna · Laurent Orseau · Richard Ngo · Miljan Martic · Shane Legg
- 2020 Spotlight: Meta-trained agents implement Bayes-optimal agents
  Vladimir Mikulik · Grégoire Delétang · Tom McGrath · Tim Genewein · Miljan Martic · Shane Legg · Pedro Ortega
- 2018 Poster: Reward learning from human preferences and demonstrations in Atari
  Borja Ibarz · Jan Leike · Tobias Pohlen · Geoffrey Irving · Shane Legg · Dario Amodei
- 2017 Poster: Deep Reinforcement Learning from Human Preferences
  Paul Christiano · Jan Leike · Tom Brown · Miljan Martic · Shane Legg · Dario Amodei
- 2009 Mini Symposium: Partially Observable Reinforcement Learning
  Marcus Hutter · Will Uther · Pascal Poupart