Timezone: »
Recently, a novel class of Approximate Policy Iteration (API) algorithms have demonstrated impressive practical performance (e.g., ExIt from [1], AlphaGo-Zero from [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search), that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.
Author Information
Wen Sun (Carnegie Mellon University)
Geoffrey Gordon (MSR Montréal & CMU)
Dr. Gordon is an Associate Research Professor in the Department of Machine Learning at Carnegie Mellon University, and co-director of the Department's Ph. D. program. He works on multi-robot systems, statistical machine learning, game theory, and planning in probabilistic, adversarial, and general-sum domains. His previous appointments include Visiting Professor at the Stanford Computer Science Department and Principal Scientist at Burning Glass Technologies in San Diego. Dr. Gordon received his B.A. in Computer Science from Cornell University in 1991, and his Ph.D. in Computer Science from Carnegie Mellon University in 1999.
Byron Boots (Georgia Tech / Google Brain)
J. Bagnell (Carnegie Mellon University)
More from the Same Authors
-
2020 Poster: Trade-offs and Guarantees of Adversarial Representation Learning for Information Obfuscation »
Han Zhao · Jianfeng Chi · Yuan Tian · Geoffrey Gordon -
2020 Poster: Domain Adaptation with Conditional Distribution Matching and Generalized Label Shift »
Remi Tachet des Combes · Han Zhao · Yu-Xiang Wang · Geoffrey Gordon -
2020 Poster: FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs »
Alekh Agarwal · Sham Kakade · Akshay Krishnamurthy · Wen Sun -
2020 Poster: PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning »
Alekh Agarwal · Mikael Henaff · Sham Kakade · Wen Sun -
2020 Poster: Multi-Robot Collision Avoidance under Uncertainty with Probabilistic Safety Barrier Certificates »
Wenhao Luo · Wen Sun · Ashish Kapoor -
2020 Spotlight: Multi-Robot Collision Avoidance under Uncertainty with Probabilistic Safety Barrier Certificates »
Wenhao Luo · Wen Sun · Ashish Kapoor -
2020 Oral: FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs »
Alekh Agarwal · Sham Kakade · Akshay Krishnamurthy · Wen Sun -
2020 Poster: Information Theoretic Regret Bounds for Online Nonlinear Control »
Sham Kakade · Akshay Krishnamurthy · Kendall Lowrey · Motoya Ohnishi · Wen Sun -
2019 Poster: Learning Neural Networks with Adaptive Regularization »
Han Zhao · Yao-Hung Hubert Tsai · Russ Salakhutdinov · Geoffrey Gordon -
2019 Poster: Towards modular and programmable architecture search »
Renato Negrinho · Matthew Gormley · Geoffrey Gordon · Darshan Patil · Nghia Le · Daniel Ferreira -
2018 Poster: Differentiable MPC for End-to-end Planning and Control »
Brandon Amos · Ivan Jimenez · Jacob I Sacks · Byron Boots · J. Zico Kolter -
2018 Poster: Learning Beam Search Policies via Imitation Learning »
Renato Negrinho · Matthew Gormley · Geoffrey Gordon -
2018 Poster: Learning and Inference in Hilbert Space with Quantum Graphical Models »
Siddarth Srinivasan · Carlton Downey · Byron Boots -
2018 Poster: Orthogonally Decoupled Variational Gaussian Processes »
Hugh Salimbeni · Ching-An Cheng · Byron Boots · Marc Deisenroth (he/him) -
2018 Poster: Adversarial Multiple Source Domain Adaptation »
Han Zhao · Shanghang Zhang · Guanhang Wu · José M. F. Moura · Joao P Costeira · Geoffrey Gordon -
2017 Poster: Linear Time Computation of Moments in Sum-Product Networks »
Han Zhao · Geoffrey Gordon -
2017 Poster: Predictive-State Decoders: Encoding the Future into Recurrent Networks »
Arun Venkatraman · Nicholas Rhinehart · Wen Sun · Lerrel Pinto · Martial Hebert · Byron Boots · Kris Kitani · J. Bagnell -
2017 Poster: Variational Inference for Gaussian Process Models with Linear Complexity »
Ching-An Cheng · Byron Boots -
2017 Poster: Predictive State Recurrent Neural Networks »
Carlton Downey · Ahmed Hefny · Byron Boots · Geoffrey Gordon · Boyue Li -
2016 Poster: A Unified Approach for Learning the Parameters of Sum-Product Networks »
Han Zhao · Pascal Poupart · Geoffrey Gordon -
2016 Poster: Incremental Variational Sparse Gaussian Process Regression »
Ching-An Cheng · Byron Boots -
2015 Poster: Supervised Learning for Dynamical System Learning »
Ahmed Hefny · Carlton Downey · Geoffrey Gordon -
2014 Session: Oral Session 7 »
Geoffrey Gordon -
2012 Tutorial: Machine Learning for Student Learning »
Emma Brunskill · Geoffrey Gordon -
2010 Poster: Predictive State Temporal Difference Learning »
Byron Boots · Geoffrey Gordon -
2007 Oral: A Constraint Generation Approach to Learning Stable Linear Dynamical Systems »
Sajid M Siddiqi · Byron Boots · Geoffrey Gordon -
2007 Poster: A Constraint Generation Approach to Learning Stable Linear Dynamical Systems »
Sajid M Siddiqi · Byron Boots · Geoffrey Gordon -
2006 Poster: No-regret algorithms for Online Convex Programs »
Geoffrey Gordon -
2006 Talk: No-regret algorithms for Online Convex Programs »
Geoffrey Gordon -
2006 Poster: Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General Sum Stochastic Games »
Chris D Murray · Geoffrey Gordon