This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e., where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, where the feature dynamics are approximately linear and the rewards are arbitrary, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
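The maximum-entropy formulation described above can be sketched as follows. This is only an illustrative reading of the abstract, not the paper's exact statement: the notation ($d$ for the state-action distribution, $\phi$ for the features, $\hat{M}$ for an estimated linear feature-dynamics matrix under the target policy, $\theta$ and $\eta$ for multipliers) is assumed here and does not appear in the original text.

```latex
% Assumed notation (hypothetical, for illustration only):
%   d(s,a)      candidate stationary state-action distribution
%   \phi(s,a)   known feature vector in R^k
%   \hat{M}     k x k empirical estimate of the expected next-feature map
\begin{align*}
\max_{d \ge 0} \quad & -\sum_{s,a} d(s,a) \log d(s,a) \\
\text{s.t.} \quad & \sum_{s,a} d(s,a) = 1, \qquad
  \sum_{s,a} d(s,a)\,\bigl(\phi(s,a) - \hat{M}\phi(s,a)\bigr) = 0 .
\end{align*}
% Setting the derivative of the Lagrangian with respect to d(s,a) to zero,
% with multiplier \theta on the feature-matching constraint:
\begin{align*}
d_\theta(s,a)
  \;\propto\; \exp\!\bigl(\theta^\top (I - \hat{M})\,\phi(s,a)\bigr)
  \;=\; \exp\!\bigl(\eta^\top \phi(s,a)\bigr),
\qquad \eta = (I - \hat{M})^\top \theta .
\end{align*}
```

Under these assumptions the solution is linear in $\phi$ inside the exponent, which is one way to see why the features act as sufficient statistics of an exponential-family distribution.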
Author Information
Nevena Lazic (DeepMind)
Dong Yin (DeepMind)
Mehrdad Farajtabar (DeepMind)
Nir Levine (DeepMind)
Dilan Gorur (DeepMind)
Chris Harris (Google)
Dale Schuurmans (Google Brain & University of Alberta)
More from the Same Authors
- 2020 Poster: Learning to Incentivize Other Learning Agents
  Jiachen Yang · Ang Li · Mehrdad Farajtabar · Peter Sunehag · Edward Hughes · Hongyuan Zha
- 2020 Poster: Understanding the Role of Training Regimes in Continual Learning
  Seyed Iman Mirzadeh · Mehrdad Farajtabar · Razvan Pascanu · Hassan Ghasemzadeh
- 2020 Poster: Self-Distillation Amplifies Regularization in Hilbert Space
  Hossein Mobahi · Mehrdad Farajtabar · Peter Bartlett
- 2020 Poster: Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration
  Hanjun Dai · Rishabh Singh · Bo Dai · Charles Sutton · Dale Schuurmans
- 2020 Poster: An Efficient Framework for Clustered Federated Learning
  Avishek Ghosh · Jichan Chung · Dong Yin · Kannan Ramchandran
- 2020 Poster: Escaping the Gravitational Pull of Softmax
  Jincheng Mei · Chenjun Xiao · Bo Dai · Lihong Li · Csaba Szepesvari · Dale Schuurmans
- 2020 Oral: Escaping the Gravitational Pull of Softmax
  Jincheng Mei · Chenjun Xiao · Bo Dai · Lihong Li · Csaba Szepesvari · Dale Schuurmans
- 2020 Poster: CoinDICE: Off-Policy Confidence Interval Estimation
  Bo Dai · Ofir Nachum · Yinlam Chow · Lihong Li · Csaba Szepesvari · Dale Schuurmans
- 2020 Poster: Off-Policy Evaluation via the Regularized Lagrangian
  Mengjiao Yang · Ofir Nachum · Bo Dai · Lihong Li · Dale Schuurmans
- 2020 Spotlight: CoinDICE: Off-Policy Confidence Interval Estimation
  Bo Dai · Ofir Nachum · Yinlam Chow · Lihong Li · Csaba Szepesvari · Dale Schuurmans
- 2019 Workshop: The Optimization Foundations of Reinforcement Learning
  Bo Dai · Niao He · Nicolas Le Roux · Lihong Li · Dale Schuurmans · Martha White
- 2019 Poster: Surrogate Objectives for Batch Policy Optimization in One-step Decision Making
  Minmin Chen · Ramki Gummadi · Chris Harris · Dale Schuurmans
- 2019 Poster: Exponential Family Estimation via Adversarial Dynamics Embedding
  Bo Dai · Zhen Liu · Hanjun Dai · Niao He · Arthur Gretton · Le Song · Dale Schuurmans
- 2019 Poster: A Geometric Perspective on Optimal Representations for Reinforcement Learning
  Marc Bellemare · Will Dabney · Robert Dadashi · Adrien Ali Taiga · Pablo Samuel Castro · Nicolas Le Roux · Dale Schuurmans · Tor Lattimore · Clare Lyle
- 2019 Poster: Off-Policy Evaluation via Off-Policy Classification
  Alexander Irpan · Kanishka Rao · Konstantinos Bousmalis · Chris Harris · Julian Ibarz · Sergey Levine
- 2018 Poster: Non-delusional Q-learning and value-iteration
  Tyler Lu · Dale Schuurmans · Craig Boutilier
- 2018 Oral: Non-delusional Q-learning and value-iteration
  Tyler Lu · Dale Schuurmans · Craig Boutilier
- 2018 Poster: Data center cooling using model-predictive control
  Nevena Lazic · Craig Boutilier · Tyler Lu · Eehern Wong · Binz Roy · Moonkyung Ryu · Greg Imwalle
- 2014 Workshop: Personalization: Methods and Applications
  Yisong Yue · Khalid El-Arini · Dilan Gorur
- 2013 Workshop: What Difference Does Personalization Make?
  Dilan Gorur · Romer Rosales · Olivier Chapelle · Dorota Glowacka
- 2009 Workshop: Nonparametric Bayes
  Dilan Gorur · Francois Caron · Yee Whye Teh · David B Dunson · Zoubin Ghahramani · Michael Jordan
- 2009 Poster: Indian Buffet Processes with Power-law Behavior
  Yee Whye Teh · Dilan Gorur
- 2009 Spotlight: Indian Buffet Processes with Power-law Behavior
  Yee Whye Teh · Dilan Gorur
- 2008 Poster: Dependent Dirichlet Process Spike Sorting
  Jan Gasthaus · Frank Wood · Dilan Gorur · Yee Whye Teh
- 2008 Poster: An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering
  Dilan Gorur · Yee Whye Teh