
Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates
Litian Liang · Yaosheng Xu · Stephen McAleer · Dailin Hu · Alexander Ihler · Pieter Abbeel · Roy Fox

Temporal-Difference (TD) learning methods, such as Q-Learning, have proven effective at learning a policy to perform control tasks. One issue with methods like Q-Learning is that the value update introduces bias when predicting the TD target of an unfamiliar state. Estimation noise becomes a bias after the max operator in the policy improvement step and carries over to the value estimates of other states, causing Q-Learning to overestimate the Q value. Algorithms like Soft Q-Learning (SQL) introduce the notion of a soft-greedy policy, which reduces the estimation bias via soft updates in the early stages of training. However, the inverse temperature $\beta$ that controls the softness of an update is usually set by a hand-designed heuristic, which can fail to capture the uncertainty in the target estimate accurately. Under the belief that $\beta$ is closely related to the (state-dependent) model uncertainty, Entropy Regularized Q-Learning (EQL) further introduces a principled scheduling of $\beta$ by maintaining a collection of model parameters that characterizes model uncertainty. In this paper, we present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state-space Markov Decision Processes. We also provide a principled numerical scheduling of $\beta$, extended from SQL and using model uncertainty, during the optimization process. We provide theoretical guarantees for this update method and demonstrate its effectiveness in experiments on several discrete control environments.
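The overestimation effect and the role of $\beta$ described in the abstract can be illustrated with a small numerical sketch. This is not the paper's UQL update: it assumes a common form of the soft backup with a uniform prior over actions, $V_\beta(s) = \frac{1}{\beta} \log \frac{1}{|A|} \sum_a \exp(\beta Q(s,a))$, and the function name `soft_value`, the noise scale, and the choice $\beta = 0.5$ are all hypothetical choices made for illustration.

```python
import numpy as np

def soft_value(q, beta):
    """Soft-greedy value under a uniform action prior (assumed form, not the paper's):
    (1/beta) * log(mean(exp(beta * q))). beta -> 0 gives the mean, beta -> inf the max."""
    q = np.asarray(q, dtype=float)
    if beta == 0.0:
        return q.mean(axis=-1)
    z = beta * q
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return (m.squeeze(-1) + np.log(np.exp(z - m).mean(axis=-1))) / beta

rng = np.random.default_rng(0)

# Hypothetical setup: four actions with identical true Q value 0,
# observed through zero-mean Gaussian estimation noise.
true_q = np.zeros(4)
noisy_q = true_q + rng.normal(scale=1.0, size=(10_000, 4))

# The hard max turns zero-mean noise into a positive bias.
hard = noisy_q.max(axis=-1).mean()

# A soft backup with a small beta stays closer to the true value.
soft = soft_value(noisy_q, beta=0.5).mean()

print(f"mean hard-max target: {hard:.3f}")
print(f"mean soft target (beta=0.5): {soft:.3f}")
```

In this sketch a small $\beta$ averages over actions and suppresses the noise-induced bias, while a large $\beta$ recovers the hard max. That trade-off is why the abstract ties the schedule of $\beta$ to an estimate of model uncertainty rather than to a fixed heuristic.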

#### Author Information

##### Pieter Abbeel (UC Berkeley & Covariant)

Pieter Abbeel is Professor and Director of the Robot Learning Lab at UC Berkeley [2008- ], Co-Director of the Berkeley AI Research (BAIR) Lab, Co-Founder of covariant.ai [2017- ], Co-Founder of Gradescope [2014- ], Advisor to OpenAI, Founding Faculty Partner at the AI@TheHouse venture fund, and Advisor to many AI/Robotics start-ups. He works in machine learning and robotics. In particular, his research focuses on how to make robots learn from people (apprenticeship learning), how to make robots learn through their own trial and error (reinforcement learning), and how to speed up skill acquisition through learning-to-learn (meta-learning). His robots have learned advanced helicopter aerobatics, knot-tying, basic assembly, organizing laundry, locomotion, and vision-based robotic manipulation. He has won numerous awards, including best paper awards at ICML, NIPS and ICRA, early career awards from NSF, DARPA, ONR, AFOSR, Sloan, TR35, IEEE, and the Presidential Early Career Award for Scientists and Engineers (PECASE). Pieter's work is frequently featured in the popular press, including the New York Times, BBC, Bloomberg, Wall Street Journal, Wired, Forbes, Tech Review, and NPR.

##### Roy Fox (UC Irvine)

[Roy Fox](http://roydfox.com/) is a postdoc at UC Berkeley working with [Ion Stoica](http://people.eecs.berkeley.edu/~istoica/) in the Real-Time Intelligent Secure Explainable lab ([RISELab](https://rise.cs.berkeley.edu/)), and with [Ken Goldberg](http://goldberg.berkeley.edu/) in the Laboratory for Automation Science and Engineering ([AUTOLAB](http://autolab.berkeley.edu/)). His research interests include reinforcement learning, dynamical systems, information theory, automation, and the connections between these fields. His current research focuses on automatic discovery of hierarchical control structures in deep reinforcement learning and in imitation learning of robotic tasks. Roy holds an MSc in Computer Science from the [Technion](http://www.cs.technion.ac.il/), under the supervision of [Moshe Tennenholtz](http://iew3.technion.ac.il/Home/Users/Moshet.phtml), and a PhD in Computer Science from the [Hebrew University](http://www.cs.huji.ac.il/), under the supervision of [Naftali Tishby](http://www.cs.huji.ac.il/~tishby/). He was an exchange PhD student with [Larry Abbott](http://www.cs.huji.ac.il/~tishby/) and [Liam Paninski](http://www.stat.columbia.edu/~liam/) at the [Center for Theoretical Neuroscience](http://www.neurotheory.columbia.edu/) at Columbia University, and a research intern at Microsoft Research.