## Orals & Spotlights Track 04: Reinforcement Learning

Each Oral includes Q&ASpotlights have joint Q&As

### Time

2020-12-07T18:00:00-08:00 - 2020-12-07T21:00:00-08:00### Session chairs

David Ha, Aleksandra Faust### Video

### Chat

##### To ask questions please use rocketchat, available only upon registration and login.

### Schedule

2020-12-07T18:00:00-08:00 - 2020-12-07T18:15:00-08:00

##### 1 - Oral: An Efficient Asynchronous Method for Integrating Evolutionary and Gradient-based Policy Search

Kyunghyun Lee, Byeong-Uk Lee, Ukcheol Shin, In So Kweon

Deep reinforcement learning (DRL) algorithms and evolution strategies (ES) have been applied to various tasks, showing excellent performances. These have the opposite properties, with DRL having good sample efficiency and poor stability, while ES being vice versa. Recently, there have been attempts to combine these algorithms, but these methods fully rely on synchronous update scheme, making it not ideal to maximize the benefits of the parallelism in ES. To solve this challenge, asynchronous update scheme was introduced, which is capable of good time-efficiency and diverse policy exploration. In this paper, we introduce an Asynchronous Evolution Strategy-Reinforcement Learning (AES-RL) that maximizes the parallel efficiency of ES and integrates it with policy gradient methods. Specifically, we propose 1) a novel framework to merge ES and DRL asynchronously and 2) various asynchronous update methods that can take all advantages of asynchronism, ES, and DRL, which are exploration and time efficiency, stability, and sample efficiency, respectively. The proposed framework and update methods are evaluated in continuous control benchmark work, showing superior performance as well as time efficiency compared to the previous methods.

2020-12-07T18:15:00-08:00 - 2020-12-07T18:30:00-08:00

##### 2 - Oral: Novelty Search in Representational Space for Sample Efficient Exploration

David Tao, Vincent Francois-Lavet, Joelle Pineau

We present a new approach for efficient exploration which leverages a low-dimensional encoding of the environment learned with a combination of model-based and model-free objectives. Our approach uses intrinsic rewards that are based on the distance of nearest neighbors in the low dimensional representational space to gauge novelty. We then leverage these intrinsic rewards for sample-efficient exploration with planning routines in representational space for hard exploration tasks with sparse rewards. One key element of our approach is the use of information theoretic principles to shape our representations in a way so that our novelty reward goes beyond pixel similarity. We test our approach on a number of maze tasks, as well as a control problem and show that our exploration approach is more sample-efficient compared to strong baselines.

2020-12-07T18:30:00-08:00 - 2020-12-07T18:45:00-08:00

##### 3 - Oral: Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, Sergey Levine

A wide range of reinforcement learning (RL) problems --- including robustness, transfer learning, unsupervised RL, and emergent complexity --- require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error prone, and takes a significant amount of developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent's learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments which maximize regret, defined as the difference between the protagonist and antagonist agent's return. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.

2020-12-07T18:45:00-08:00 - 2020-12-07T19:00:00-08:00

##### Break

2020-12-07T19:00:00-08:00 - 2020-12-07T19:10:00-08:00

##### 5 - Spotlight: First Order Constrained Optimization in Policy Space

Yiming Zhang, Quan Vuong, Keith Ross

In reinforcement learning, an agent attempts to learn high-performing behaviors through interacting with the environment, such behaviors are often quantified in the form of a reward function. However some aspects of behavior—such as ones which are deemed unsafe and to be avoided—are best captured through constraints. We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS) which maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints. Using data generated from the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. FOCOPS then projects the update policy back into the parametric policy space. Our approach has an approximate upper bound for worst-case constraint violation throughout training and is first-order in nature therefore simple to implement. We provide empirical evidence that our simple approach achieves better performance on a set of constrained robotics locomotive tasks.

2020-12-07T19:10:00-08:00 - 2020-12-07T19:20:00-08:00

##### 6 - Spotlight: CoinDICE: Off-Policy Confidence Interval Estimation

Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvari, Dale Schuurmans

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.

2020-12-07T19:20:00-08:00 - 2020-12-07T19:30:00-08:00

##### 7 - Spotlight: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction

Aviral Kumar, Abhishek Gupta, Sergey Levine

Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. In this paper, we study how RL methods based on bootstrapping-based Q-learning can suffer from a pathological interaction between function approximation and the data distribution used to train the Q-function: with standard supervised learning, online data collection should induce corrective feedback, where new data corrects mistakes in old predictions. With dynamic programming methods like Q-learning, such feedback may be absent. This can lead to potential instability, sub-optimal convergence, and poor results when learning from noisy, sparse or delayed rewards. Based on these observations, we propose a new algorithm, DisCor, which explicitly optimizes for data distributions that can correct for accumulated errors in the value function. DisCor computes a tractable approximation to the distribution that optimally induces corrective feedback, which we show results in reweighting samples based on the estimated accuracy of their target values. Using this distribution for training, DisCor results in substantial improvements in a range of challenging RL settings, such as multi-task learning and learning from noisy reward signals.

2020-12-07T19:30:00-08:00 - 2020-12-07T19:40:00-08:00

##### 8 - Spotlight: Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret

Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, Qiaomin Xie

We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels, where the goal is to optimize the total reward under the risk measure of exponential utility. We propose two provably efficient model-free algorithms, Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ). These algorithms implement a form of risk-sensitive optimism in the face of uncertainty, which adapts to both risk-seeking and risk-averse modes of exploration. We prove that RSVI attains an \ensuremath{\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^{3} S^{2}AT} \big)}$ regret, while RSQ attains an $\ensuremath{\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^{4} SAT} \big)}$ regret, where $\lambda(u) = (e^{3u}-1)/u$ for $u>0$. In the above, $\beta$ is the risk parameter of the exponential utility function, $S$ the number of states, $A$ the number of actions, $T$ the total number of timesteps, and $H$ the episode length. On the flip side, we establish a regret lower bound showing that the exponential dependence on $|\beta|$ and $H$ is unavoidable for any algorithm with an $\tilde{O}(\sqrt{T})$ regret (even when the risk objective is on the same scale as the original reward), thus certifying the near-optimality of the proposed algorithms. Our results demonstrate that incorporating risk awareness into reinforcement learning necessitates an exponential cost in $|\beta|$ and $H$, which quantifies the fundamental tradeoff between risk sensitivity (related to aleatoric uncertainty) and sample efficiency (related to epistemic uncertainty). To the best of our knowledge, this is the first regret analysis of risk-sensitive reinforcement learning with the exponential utility.

2020-12-07T19:40:00-08:00 - 2020-12-07T19:50:00-08:00

##### Q&A: Joint Q&A for Preceeding Spotlights

2020-12-07T19:50:00-08:00 - 2020-12-07T20:00:00-08:00

##### 10 - Spotlight: Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning

Fei Feng, Ruosong Wang, Wotao Yin, Simon Du, Lin Yang

Motivated by the prevailing paradigm of using unsupervised learning for efficient exploration in reinforcement learning (RL) problems [tang2017exploration,bellemare2016unifying], we investigate when this paradigm is provably efficient. We study episodic Markov decision processes with rich observations generated from a small number of latent states. We present a general algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret tabular RL algorithm. Theoretically, we prove that as long as the unsupervised learning algorithm enjoys a polynomial sample complexity guarantee, we can find a near-optimal policy with sample complexity polynomial in the number of latent states, which is significantly smaller than the number of observations. Empirically, we instantiate our framework on a class of hard exploration problems to demonstrate the practicality of our theory.

2020-12-07T20:00:00-08:00 - 2020-12-07T20:10:00-08:00

##### 11 - Spotlight: Bayesian Multi-type Mean Field Multi-agent Imitation Learning

Fan Yang, Alina Vereshchaka, Changyou Chen, Wen Dong

Multi-agent Imitation learning (MAIL) refers to the problem that agents learn to perform a task interactively in a multi-agent system through observing and mimicking expert demonstrations, without any knowledge of a reward function from the environment. MAIL has received a lot of attention due to promising results achieved on synthesized tasks, with the potential to be applied to complex real-world multi-agent tasks. Key challenges for MAIL include sample efficiency and scalability. In this paper, we proposed Bayesian multi-type mean field multi-agent imitation learning (BM3IL). Our method improves sample efficiency through establishing a Bayesian formulation for MAIL, and enhances scalability through introducing a new multi-type mean field approximation. We demonstrate the performance of our algorithm through benchmarking with three state-of-the-art multi-agent imitation learning algorithms on several tasks, including solving a multi-agent traffic optimization problem in a real-world transportation network. Experimental results indicate that our algorithm significantly outperforms all other algorithms in all scenarios.

2020-12-07T20:10:00-08:00 - 2020-12-07T20:20:00-08:00

##### 12 - Spotlight: Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Kaiqing Zhang, Sham Kakade, Tamer Basar, Lin Yang

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely-used, the sample complexity of model-based MARL algorithms has been investigated relatively much less often. In this paper, we aim to address the fundamental open question about the sample complexity of model-based MARL. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model of state transition. We show that model-based MARL achieves a sample complexity of $\tilde \cO(|\cS||\cA||\cB|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) \emph{value} up to some $\epsilon$ error, and the $\epsilon$-NE \emph{policies}, where $\gamma$ is the discount factor, and $\cS,\cA,\cB$ denote the state space, and the action spaces for the two agents. We also show that this method is near-minimax optimal with a tight dependence on $1-\gamma$ and $|\cS|$ by providing a lower bound of $\Omega(|\cS|(|\cA|+|\cB|)(1-\gamma)^{-3}\epsilon^{-2})$. Our results justify the efficiency of this simple model-based approach in the multi-agent RL setting.

2020-12-07T20:20:00-08:00 - 2020-12-07T20:30:00-08:00

##### 13 - Spotlight: Safe Reinforcement Learning via Curriculum Induction

Matteo Turchetta, Andrey Kolobov, Shital Shah, Andreas Krause, Alekh Agarwal

In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly. In such settings, the agent needs to behave safely not only after but also while learning. To achieve this, existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations during exploration with high probability, but both the probabilistic guarantees and the smoothness assumptions inherent in the priors are not viable in many scenarios of interest such as autonomous driving. This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor that saves the agent from violating constraints during learning. In this model, we introduce the monitor that neither needs to know how to do well at the task the agent is learning nor needs to know how the environment works. Instead, it has a library of reset controllers that it activates when the agent starts behaving dangerously, preventing it from doing damage. Crucially, the choices of which reset controller to apply in which situation affect the speed of agent learning. Based on observing agents' progress the teacher itself learns a policy for choosing the reset controllers, a curriculum, to optimize the agent's final policy reward. Our experiments use this framework in two environments to induce curricula for safe and efficient learning.

2020-12-07T20:30:00-08:00 - 2020-12-07T20:40:00-08:00

##### Q&A: Joint Q&A for Preceeding Spotlights

2020-12-07T20:40:00-08:00 - 2020-12-07T21:00:00-08:00