Timezone: »

The Optimization Foundations of Reinforcement Learning
Bo Dai · Niao He · Nicolas Le Roux · Lihong Li · Dale Schuurmans · Martha White

Sat Dec 14 08:00 AM -- 06:00 PM (PST) @ West Ballroom A
Event URL: https://optrl2019.github.io/ »

Interest in reinforcement learning (RL) has boomed with recent improvements in benchmark tasks that suggest the potential for a revolutionary advance in practical applications. Unfortunately, research in RL remains hampered by limited theoretical understanding, making the field overly reliant on empirical exploration with insufficient principles to guide future development. It is imperative to develop a stronger fundamental understanding of the success of recent RL methods, both to expand the useability of the methods and accelerate future deployment. Recently, fundamental concepts from optimization and control theory have provided a fresh perspective that has led to the development of sound RL algorithms with provable efficiency. The goal of this workshop is to catalyze the growing synergy between RL and optimization research, promoting a rational reconsideration of the foundational principles for reinforcement learning, and bridging the gap between theory and practice.

Sat 8:00 a.m. - 8:10 a.m.
Opening Remarks
Bo Dai, Niao He, Nicolas Le Roux, Lihong Li, Dale Schuurmans, Martha White
Sat 8:10 a.m. - 8:50 a.m.


Mengdi Wang
Sat 8:50 a.m. - 9:10 a.m.
Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be `close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural ``RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of $\tilde O(1/N)$, much like results in convex optimization. This is the first result in RL of better rates when regularizing the instantaneous cost or reward.
Lior Shani, Yonathan Efroni, Shie Mannor
Sat 9:10 a.m. - 9:30 a.m.
Poster Spotlight 1 (Poster Spotlight)
David Brandfonbrener, Joan Bruna, Tom Zahavy, Haim Kaplan, Yishay Mansour, Nikos Karampatziakis, John Langford, Paul Mineiro, Donghwan Lee, Niao He
Sat 9:30 a.m. - 10:30 a.m.
Poster and Coffee Break 1 (Poster Session)
Aaron Sidford, Aditya Mahajan, Alejandro Ribeiro, Alex Lewandowski, Ali H Sayed, Ambuj Tewari, Angelika Steger, Anima Anandkumar, Asier Mujika, Hilbert J Kappen, Bolei Zhou, Byron Boots, Chelsea Finn, Chen-Yu Wei, Chi Jin, Ching-An Cheng, Christina Yu, Clement Gehring, Craig Boutilier, Dahua Lin, Daniel McNamee, Daniel Russo, David Brandfonbrener, Denny Zhou, Devesh Jha, Diego Romeres, Doina Precup, Dominik Thalmeier, Eduard Gorbunov, Elad Hazan, Elena Smirnova, Elvis Dohmatob, Emma Brunskill, Enrique Munoz de Cote, Ethan Waldie, Florian Meier, Florian Schaefer, Ge Liu, Gergely Neu, Haim Kaplan, Hao Sun, Hengshuai Yao, Jalaj Bhandari, James A Preiss, Jayakumar Subramanian, Jiajin Li, Jieping Ye, Jimmy Smith, Joan Bas Serrano, Joan Bruna, John Langford, Jonathan Lee, Jose A. Arjona-Medina, Kaiqing Zhang, Karan Singh, Yuping Luo, Zafarali Ahmed, Zaiwei Chen, Zhaoran Wang, zz Li, Zhuoran Yang, Ziping Xu, Ziyang Tang, Yi Mao, David Brandfonbrener, Shirli Di-Castro Shashua, Riashat Islam, Zuyue Fu, Abhishek Naik, Saurabh Kumar, Benjamin Petit, Angeliki Kamoutsi, Simone Totaro, Arvind Raghunathan, Rui Wu, Donghwan Lee, Dongsheng Ding, Alec Koppel, Hao Sun, Christian Tjandraatmadja, Mahdi Karami, Jincheng Mei, Chenjun Xiao, Junfeng Wen, Vincent Zhang, Ross Goroshin, Mohammad Pezeshki, Jiaqi Zhai, Philip Amortila, Shuo Huang, Mariya Vasileva, El houcine Bergou, Adel Ahmadyan, Haoran Sun, Sheng Zhang, Lukas Gruber, Yuanhao Wang, Tetiana Parshakova
Sat 10:30 a.m. - 11:10 a.m.

Reinforcement learning is now the dominant paradigm for how an agent learns to interact with the world in order to achieve some long term objectives. Here, policy gradient methods are among the most effective methods in challenging reinforcement learning problems, due to that they: are applicable to any differentiable policy parameterization; admit easy extensions to function approximation; easily incorporate structured state and action spaces; are easy to implement in a simulation based, model-free manner.

However, little is known about even their most basic theoretical convergence properties, including: - do they converge to a globally optimal solution, say with a sufficiently rich policy class? - how well do they cope with approximation error, say due to using a class of neural policies? - what is their finite sample complexity? This talk will survey a number of results on these basic questions. We will highlight the interplay of theory, algorithm design, and practice.

Joint work with: Alekh Agarwal, Jason Lee, Gaurav Mahajan

Sham Kakade
Sat 11:10 a.m. - 11:40 a.m.


Richard Sutton, Doina Precup
Sat 11:40 a.m. - 12:20 p.m.
Poster Spotlight 2 (Poster Spotlight)
Aaron Sidford, Mengdi Wang, Lin Yang, Yinyu Ye, Zuyue Fu, Zhuoran Yang, Yongxin Chen, Zhaoran Wang, Ofir Nachum, Bo Dai, Ilya Kostrikov, Dale Schuurmans, Ziyang Tang, Yihao Feng, Lihong Li, Denny Zhou, Qiang Liu, Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Simon Du, Sham Kakade, Ruosong Wang, Minshuo Chen, Tianyi Liu, Xingguo Li, Zhaoran Wang, Tuo Zhao, Philip Amortila, Doina Precup, Prakash Panangaden, Marc Bellemare
Sat 2:00 p.m. - 2:40 p.m.

The reinforcement learning problem is often framed as one of quickly optimizing an uncertain Markov decision process. This formulation has led to substantial insight and progress in algorithms and theory. However, this perspective is limiting and can also give rise to poor algorithm designs. I will discuss this issue and how it is addressed by popular reinforcement learning algorithms.

Benjamin Van Roy
Sat 2:40 p.m. - 3:20 p.m.

We present a learning algorithm for the stochastic inventory control problem under lost sales penalty and positive lead times, when the demand distribution is a priori unknown. Our main result is a regret bound of O(L\sqrt{T}+D) for the algorithm, where T is the time horizon, L is the fixed and known lead time, and D is an unknown parameter of the demand distribution described roughly as the number of time steps needed to generate enough demand for depleting one unit of inventory. Our results significantly improve the existing regret bounds for this problem. Notably, even though the state space of the underlying Markov Decision Process (MDP) in this problem is continuous and L-dimensional, our regret bounds depend linearly on L. Our techniques utilize convexity of the long run average cost and a newly derived bound on the `bias' of base-stock policies, to establish an almost blackbox connection between the problem of learning and optimization in such MDPs and stochastic convex bandit optimization. The techniques presented here may be of independent interest for other settings that involve learning large structured MDPs but with convex cost functions.

Shipra Agrawal
Sat 3:20 p.m. - 4:20 p.m.
Poster and Coffee Break 2 (Poster Session)
Karol Hausman, Kefan Dong, Ken Goldberg, Lihong Li, Lin Yang, Lingxiao Wang, Lior Shani, Liwei Wang, Loren Amdahl-Culleton, Lucas Cassano, Marc Dymetman, Marc Bellemare, Marcin Tomczak, Margarita Castro, Marius Kloft, Marius-Constantin Dinu, Markus Holzleitner, Martha White, Mengdi Wang, Michael Jordan, Mihailo Jovanovic, Ming Yu, Minshuo Chen, Moonkyung Ryu, Muhammad Zaheer, Naman Agarwal, Nan Jiang, Niao He, Niko Yasui, Nikos Karampatziakis, Nino Vieillard, Ofir Nachum, Olivier Pietquin, Ozan Sener, Pan Xu, Parameswaran Kamalaruban, Paul Mineiro, Paul Rolland, Philip Amortila, Pierre-Luc Bacon, Prakash Panangaden, Qi Cai, Qiang Liu, Quanquan Gu, Raihan Seraj, Richard Sutton, Rick Valenzano, Robert Dadashi, Rodrigo Toro Icarte, Roshan Shariff, Roy Fox, Ruosong Wang, Saeed Ghadimi, Samuel Sokota, Sean Sinclair, Sepp Hochreiter, Sergey Levine, Sergio Valcarcel Macua, Sham Kakade, Shangtong Zhang, Sheila McIlraith, Shie Mannor, Shimon Whiteson, Shuai Li, Shuang Qiu, Sibon Li, Sid Banerjee, Sitao Luan, Tamer Basar, Thinh Doan, Tianhe (Kevin) Yu, Tianyi Liu, Tom Zahavy, Toryn Klassen, Tuo Zhao, Vicenç Gómez, Vincent Liu, Volkan Cevher, Wesley Suttle, Xiao-Wen Chang, Xiaohan Wei, Xiaotong Liu, Xingguo Li, Xinyi Chen, Xingyou Song, Yao Liu, YiDing Jiang, Yihao Feng, Yilun Du, Yinlam Chow, Yinyu Ye, Yishay Mansour, yixuan.lin.1, Yonathan Efroni, Yongxin Chen, Yuanhao Wang, Bo Dai, Chen-Yu Wei, Harsh Shrivastava, Hongyang Zhang, Qinqing Zheng, Satpathi SATPATHI, Xueqing Liu, Andreu Vall
Sat 4:20 p.m. - 5:00 p.m.
Gradient temporal-difference (GTD) algorithms are an important family of extensions of the standard TD algorithm, overcoming the stability issues in off-policy learning with function approximation by minimizing the projected Bellman error using stochastic gradient descent (SGD). In this talk, we consider GTD($\lambda$) for policy evaluation in the case of linear function approximation and finite-space Markov decision processes (MDPs) with discounted rewards. GTD($\lambda$) differs from the typical SGD algorithm for minimizing an objective function, in the following way. In an MDP, each stationary policy is associated with a family of Bellman equations. By selecting the values of the eligibility trace parameters, $\lambda$, we determine the Bellman operator that defines the objective function. This choice influences both the approximation error and learning behavior of the resulting algorithm. We first describe a dynamic scheme to judiciously set $\lambda$ in a history-dependent way. This scheme is based on the idea of randomized stopping times and generalized Bellman equations, and it is useful for balancing the bias-variance tradeoff in off-policy learning. We then present asymptotic convergence results for several variations of GTD($\lambda$), including saddle-point and mirror-descent TD variants, for different choices of $\lambda$: constant, state-dependent, and history-dependent. These convergence results are obtained by combining special properties of the joint state and eligibility trace process in TD learning with ergodic theory for weak Feller Markov chains, mean-ODE-based proof methods, and convex optimization theory. Our results not only resolve long-standing open questions regarding the convergence of GTD($\lambda$) but also provide the basis for using a broader class of generalized Bellman operators for approximate policy evaluation with linear TD methods.
Huizhen Yu
Sat 5:00 p.m. - 5:20 p.m.

Online learning is a powerful tool for analyzing iterative algorithms. However, the classic adversarial setup sometimes fails to capture certain regularity in online problems in practice. Motivated by this, we establish a new setup, called Continuous Online Learning (COL), where the gradient of online loss function changes continuously across rounds with respect to the learner’s decisions. We show that COL covers and more appropriately describes many interesting applications, from general equilibrium problems (EPs) to optimization in episodic MDPs. Using this new setup, we revisit the difficulty of achieving sublinear dynamic regret. We prove that there is a fundamental equivalence between achieving sublinear dynamic regret in COL and solving certain EPs, and we present a reduction from dynamic regret to both static regret and convergence rate of the associated EP. At the end, we specialize these new insights into online imitation learning and show improved understanding of its learning stability.

Jonathan Lee, Ching-An Cheng, Ken Goldberg, Byron Boots
Sat 5:20 p.m. - 5:40 p.m.

We study optimal regret bounds for control in linear dynamical systems under adversarially changing strongly convex cost functions, given the knowledge of transition dynamics. This includes several well studied and fundamental frameworks such as the Kalman filter and the linear quadratic regulator. State of the art methods achieve regret which scales as T^0.5, where T is the time horizon.

We show that the optimal regret in this setting can be significantly smaller, scaling as polylog T. This regret bound is achieved by two different efficient iterative methods, online gradient descent and online natural gradient.

Naman Agarwal, Elad Hazan, Karan Singh
Sat 5:40 p.m. - 6:00 p.m.

Awards Announcement

Bo Dai, Niao He, Nicolas Le Roux, Lihong Li, Dale Schuurmans, Martha White

Author Information

Bo Dai (Google Brain)
Niao He (UIUC)
Nicolas Le Roux (Google)
Lihong Li (Google Research)
Dale Schuurmans (Google Inc.)
Martha White (University of Alberta)

More from the Same Authors