Workshop
Deep Reinforcement Learning
Pieter Abbeel · Chelsea Finn · Joelle Pineau · David Silver · Satinder Singh · Coline Devin · Misha Laskin · Kimin Lee · Janarthanan Rajendran · Vivek Veeriah

Fri Dec 11 08:30 AM -- 07:00 PM (PST)
Event URL: https://sites.google.com/view/deep-rl-workshop-neurips2020/home

In recent years, the use of deep neural networks as function approximators has enabled researchers to extend reinforcement learning techniques to solve increasingly complex control tasks. The emerging field of deep reinforcement learning has led to remarkable empirical results in rich and varied domains like robotics, strategy games, and multiagent interactions. This workshop will bring together researchers working at the intersection of deep learning and reinforcement learning, and it will help interested researchers outside of the field gain a high-level view about the current state of the art and potential directions for future contributions.

Fri 8:30 a.m. - 9:00 a.m. [iCal]
Invited talk: Pierre-Yves Oudeyer (Talk)
Pierre-Yves Oudeyer
Fri 9:00 a.m. - 9:15 a.m. [iCal]
Contributed Talk: Learning Functionally Decomposed Hierarchies for Continuous Control Tasks with Path Planning (Talk)
Emre Aksan, Otmar Hilliges
Fri 9:15 a.m. - 9:30 a.m. [iCal]
Contributed Talk: Maximum Reward Formulation In Reinforcement Learning (Talk)
Sai Krishna Gottipati, Sahir
Fri 9:30 a.m. - 9:45 a.m. [iCal]
Contributed Talk: Accelerating Reinforcement Learning with Learned Skill Priors (Talk)
Karl Pertsch, Youngwoon Lee, Joseph Lim
Fri 9:45 a.m. - 10:00 a.m. [iCal]
Contributed Talk: Asymmetric self-play for automatic goal discovery in robotic manipulation (Talk)
Lilian Weng, Arthur Petron, Wojciech Zaremba, Peter Welinder
Fri 10:00 a.m. - 10:30 a.m. [iCal]
Invited talk: Marc Bellemare (Talk)
Marc Bellemare
Fri 10:30 a.m. - 11:00 a.m. [iCal]
Break
Fri 11:00 a.m. - 11:30 a.m. [iCal]
Invited talk: Peter Stone (Talk)

For autonomous robots to operate in the open, dynamically changing world, they will need to be able to learn a robust set of skills from relatively little experience. This talk introduces Grounded Simulation Learning as a way to bridge the so-called reality gap between simulators and the real world in order to enable transfer learning from simulation to a real robot. Grounded Simulation Learning has led to the fastest known stable walk on a widely used humanoid robot. Connections to theoretical advances in off-policy reinforcement learning will be highlighted.

Peter Stone
Fri 11:30 a.m. - 11:45 a.m. [iCal]
Contributed Talk: Mirror Descent Policy Optimization (Talk)
Manan Tomar, Lior Shani, Yonathan Efroni
Fri 11:45 a.m. - 12:00 p.m. [iCal]
Contributed Talk: Planning from Pixels using Inverse Dynamics Models (Talk)
Sheila McIlraith, Jimmy Ba
Fri 12:00 p.m. - 12:30 p.m. [iCal]
Invited talk: Matt Botvinick (Talk)
Matt Botvinick
Fri 12:30 p.m. - 1:30 p.m. [iCal]
Poster session 1 (Poster session)
Fri 1:30 p.m. - 2:00 p.m. [iCal]
Invited talk: Susan Murphy (Talk)

Digital healthcare is a growing area of importance in modern healthcare due to its potential to help individuals improve their behaviors and better manage chronic health challenges such as hypertension, mental health, and cancer. Digital apps and wearables observe the user's state via sensors/self-report, deliver treatment actions (reminders, motivational messages, suggestions, social outreach, ...), and observe rewards on the user repeatedly across time. This area is seeing increasing interest from RL researchers, with the goal of including in the digital app/wearable an RL algorithm that "personalizes" the treatments to the user. But after RL has been run on a number of users, how do we know whether the RL algorithm actually personalized the sequential treatments to each user? In this talk we report on our first efforts to address this question after our RL algorithm was deployed on each of 111 individuals with hypertension.

Susan Murphy
Fri 2:00 p.m. - 2:15 p.m. [iCal]
Contributed Talk: MaxEnt RL and Robust Control (Talk)
Benjamin Eysenbach, Sergey Levine
Fri 2:15 p.m. - 2:30 p.m. [iCal]
Contributed Talk: Reset-Free Lifelong Learning with Skill-Space Planning (Talk)
Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch
Fri 2:30 p.m. - 3:00 p.m. [iCal]
Invited talk: Anusha Nagabandi (Talk)

Deep learning has shown promising results in robotics, but we are still far from having intelligent systems that can operate in the unstructured settings of the real world, where disturbances, variations, and unobserved factors lead to a dynamic environment. In this talk, we'll see that model-based deep RL can indeed allow for efficient skill acquisition, as well as the ability to repurpose models to solve a variety of tasks. We'll scale up these approaches to enable locomotion with a 6-DoF legged robot on varying terrains in the real world, as well as dexterous manipulation with a 24-DoF anthropomorphic hand in the real world. We then focus on the inevitable mismatch between an agent's training conditions and the test conditions in which it may actually be deployed, thus illuminating the need for adaptive systems. Inspired by the ability of humans and animals to adapt quickly in the face of unexpected changes, we present a meta-learning algorithm within this model-based RL framework to enable online adaptation of large, high-capacity models using only small amounts of data from the new task. These fast adaptation capabilities are seen in both simulation and the real-world, with experiments such as a 6-legged robot adapting online to an unexpected payload or suddenly losing a leg. We will then further extend the capabilities of our robotic systems by enabling the agents to reason directly from raw image observations. Bridging the benefits of representation learning techniques with the adaptation capabilities of meta-RL, we'll present a unified framework for effective meta-RL from images. With robotic arms in the real world that learn peg insertion and ethernet cable insertion to varying targets, we'll see the fast acquisition of new skills, directly from raw image observations in the real world. Finally, this talk will conclude that model-based deep RL provides a framework for making sense of the world, thus allowing for reasoning and adaptation capabilities that are necessary for successful operation in the dynamic settings of the real world.

Anusha Nagabandi
Fri 3:00 p.m. - 3:30 p.m. [iCal]
Break
Fri 3:30 p.m. - 4:00 p.m. [iCal]
Invited talk: Ashley Edwards (Talk)

A common trope in sci-fi is to have a robot that can quickly solve some problem after watching a person, studying a video, or reading a book. While these settings are (currently) fictional, the benefits are real. Agents that can solve tasks by observing others have the potential to greatly reduce the burden of their human teachers, removing some of the need to hand-specify rewards or goals. In this talk, I consider the question of how an agent can not only learn by observing others, but also how it can learn quickly by training offline before taking any steps in the environment. First, I will describe an approach that trains a latent policy directly from state observations, which can then be quickly mapped to real actions in the agent’s environment. Then I will describe how we can train a novel value function, Q(s,s’), to learn off-policy from observations. Unlike previous imitation from observation approaches, this formulation goes beyond simply imitating and rather enables learning from potentially suboptimal observations.

Ashley Edwards
Fri 4:00 p.m. - 4:07 p.m. [iCal]
NeurIPS RL Competitions: Flatland challenge (Talk)
Sharada Mohanty, Florian Laurent, Erik Nygren
Fri 4:07 p.m. - 4:15 p.m. [iCal]
NeurIPS RL Competitions: Learning to run a power network (Talk)
Antoine Marot
Fri 4:15 p.m. - 4:22 p.m. [iCal]
NeurIPS RL Competitions: Procgen challenge (Talk)
Karl Cobbe, Sharada Mohanty
Fri 4:22 p.m. - 4:30 p.m. [iCal]
NeurIPS RL Competitions: MineRL (Talk)
Stephanie Milani
Fri 4:30 p.m. - 5:00 p.m. [iCal]
Invited talk: Karen Liu (Talk)

Creating realistic virtual humans has traditionally been considered a research problem in Computer Animation primarily for entertainment applications. With the recent breakthrough in collaborative robots and deep reinforcement learning, accurately modeling human movements and behaviors has become a common challenge also faced by researchers in robotics and artificial intelligence. For example, mobile robots and autonomous vehicles can benefit from training in environments populated with ambulating humans and learning to avoid colliding with them. Healthcare robotics, on the other hand, need to embrace physical contacts and learn to utilize them for enabling human’s activities of daily living. An immediate concern in developing such an autonomous and powered robotic device is the safety of human users during the early development phase when the control policies are still largely suboptimal. Learning from physically simulated humans and environments presents a promising alternative which enables robots to safely make and learn from mistakes without putting real people at risk. However, deploying such policies to interact with people in the real world adds additional complexity to the already challenging sim-to-real transfer problem. In this talk, I will present our current progress on solving the problem of sim-to-real transfer with humans in the environment, actively interacting with the robots through physical contacts. We tackle the problem from two fronts: developing more relevant human models to facilitate robot learning and developing human-aware robot perception and control policies. As an example of contextualizing our research effort, we develop a mobile manipulator to put clothes on people with physical impairments, enabling them to carry out day-to-day tasks and maintain independence.

Karen Liu
Fri 5:00 p.m. - 6:00 p.m. [iCal]
Panel discussion
Fri 6:00 p.m. - 7:00 p.m. [iCal]
Poster session 2 (Poster session)
-
[ Video ]

Learning task-agnostic dynamics models in high-dimensional observation spaces can be challenging for model-based RL agents. We propose a novel way to learn latent world models by learning to predict sequences of future actions conditioned on task completion. These task-conditioned models adaptively focus modeling capacity on task-relevant dynamics, while simultaneously serving as an effective heuristic for planning with sparse rewards. We evaluate our method on challenging visual goal completion tasks and show a substantial increase in performance compared to prior model-free approaches.

Sheila McIlraith, Jimmy Ba
-
[ Video ]

Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon yielding better learning in theory, and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. Visualizations are available at https://sites.google.com/view/opal-iclr.

Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Ofir Nachum
-
Poster: Maximum Reward Formulation In Reinforcement Learning (Poster) [ Video ]
Sai Krishna Gottipati, Sahir, Ravi Chunduru, Ahmed Touati
-
Poster: Reset-Free Lifelong Learning with Skill-Space Planning (Poster) [ Video ]
Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch
-
Poster: Mirror Descent Policy Optimization (Poster) [ Video ]
Manan Tomar, Lior Shani, Yonathan Efroni
-
Poster: MaxEnt RL and Robust Control (Poster) [ Video ]
Benjamin Eysenbach, Sergey Levine
-
Poster: Learning Functionally Decomposed Hierarchies for Continuous Control Tasks with Path Planning (Poster) [ Video ]
Emre Aksan, Otmar Hilliges
-
[ Video ]
Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. Despite their popularity, it largely remains a challenge to design provably efficient policy optimization algorithms. Some recent works have incorporated an upper confidence bound (UCB)-style bonus to drive exploration in policy optimization algorithms. However, it remains elusive how to design a provably efficient policy optimization algorithm that uses a Thompson sampling based exploration strategy. This paper presents a provably efficient policy optimization algorithm that incorporates exploration using Thompson sampling. We prove that, in an episodic linear MDP setting, our algorithm, Thompson Sampling for Policy Optimization (TSPO), achieves $\tilde{\mathcal{O}}(d^{3/2} H^{3/2} \sqrt{T})$ worst-case (frequentist) regret, where $H$ is the length of each episode, $T$ is the total number of steps, and $d$ is the number of features. Finally, we empirically evaluate TSPO and show that it is competitive with state-of-the-art baselines.
Zhuoran Yang, Andrei Lupu, Viet Nguyen, Riashat Islam, Doina Precup, Zhaoran Wang
-
Poster: Weighted Bellman Backups for Improved Signal-to-Noise in Q-Updates (Poster) [ Video ]
Kimin Lee
-
[ Video ]

Reinforcement learning from self-play has recently achieved many successes. Self-play, where agents compete with themselves, is often used to generate training data for iterative policy improvement. In previous work, heuristic rules are designed to choose an opponent for the current learner. Typical rules include choosing the latest agent, the best agent, or a random historical agent. However, these rules may be inefficient in practice and sometimes do not guarantee convergence even in the simplest matrix games. This paper proposes a new algorithmic framework for competitive self-play reinforcement learning in two-player zero-sum games. We recognize that the Nash equilibrium coincides with the saddle point of the stochastic payoff function, which motivates us to borrow ideas from the classical saddle point optimization literature. Our method simultaneously trains several agents and intelligently pairs them against one another as opponents, based on a simple adversarial rule derived from a principled perturbation-based saddle optimization method. We prove theoretically that our algorithm converges to an approximate equilibrium with high probability in convex-concave games under standard assumptions. Beyond the theory, we further show the empirical superiority of our method over baseline methods that rely on the aforementioned opponent-selection heuristics in matrix games, grid-world soccer, Gomoku, and simulated robot sumo, with neural net policy function approximators.

Yuanyi Zhong, Yuan Zhou, Jian Peng
-
Poster: Asymmetric self-play for automatic goal discovery in robotic manipulation (Poster) [ Video ]
Lilian Weng, Peter Welinder, Arthur Petron, Wojciech Zaremba
-
Poster: Correcting Momentum in Temporal Difference Learning (Poster) [ Video ]
Emmanuel Bengio, Joelle Pineau, Doina Precup
-
Poster: Decoupling Exploration and Exploitation in Meta-Reinforcement Learning without Sacrifices (Poster) [ Video ]
Evan Liu, Percy Liang, Chelsea Finn
-
[ Video ]

In this paper, we study the problem of autonomously discovering temporally abstracted actions, or options, for exploration in reinforcement learning. For learning diverse options suitable for exploration, we introduce the infomax termination objective defined as the mutual information between options and their corresponding state transitions. We derive a scalable optimization scheme for maximizing this objective via the termination condition of options, yielding the InfoMax Option Critic (IMOC) algorithm. Through illustrative experiments, we empirically show that IMOC learns diverse options and utilizes them for exploration. Moreover, we show that IMOC scales well to continuous control tasks.

-
Poster: Model-Based Meta-Reinforcement Learning for Flight with Suspended Payloads (Poster) [ Video ]
Suneel Belkhale
-
Poster: Parrot: Data-driven Behavioral Priors for Reinforcement Learning (Poster) [ Video ]
Avi Singh, Nick Rhinehart, Sergey Levine
-
Poster: C-Learning: Horizon-Aware Cumulative Accessibility Estimation (Poster) [ Video ]
Gabriel Loaiza-Ganem, Animesh Garg
-
[ Video ]

We identify a fundamental implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a rank collapse of the learned value network features and show that it corresponds to a drop in performance. We demonstrate this phenomenon on popular domains including Atari and Gym benchmarks and in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. Finally, we show that mitigating implicit under-parameterization by controlling rank collapse improves performance.

Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, Sergey Levine
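
As a concrete illustration of the rank-collapse diagnostic described above, the following minimal sketch (not the authors' code) computes an effective-rank measure from the singular values of a feature matrix; the threshold value and the toy matrices are illustrative assumptions.

```python
import numpy as np

def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """Smallest k such that the top-k singular values capture a (1 - delta)
    fraction of the total singular-value mass of the feature matrix."""
    # features: (num_states, feature_dim) matrix of penultimate-layer activations
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Toy illustration: features drawn from a low-rank generator have low effective rank.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 256))
full_rank = rng.normal(size=(512, 256))
print(effective_rank(low_rank), effective_rank(full_rank))
```

Tracking such a quantity over the course of training is one simple way to observe the kind of expressivity loss the abstract describes.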
-
[ Video ]

While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Self-Predictive Representations (SPR), trains an agent to predict its own latent state representations multiple steps into the future. We compute target representations for future states using an encoder which is an exponential moving average of the agent's parameters and we make predictions using a learned transition model. On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels. We further improve performance by adding data augmentation to the future prediction loss, which forces the agent's representations to be consistent across multiple views of an observation. Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, which represents a 55% relative improvement over the previous state-of-the-art. Notably, even in this limited data regime, SPR exceeds expert human scores on 7 out of 26 games. We've made the code associated with this work available at https://anonymous.4open.science/r/b4b93ec6-6e5d-4f43-9b53-54bdf73bea95/

Max Schwarzer, Rishab Goel, R Devon Hjelm, Aaron Courville, Ankesh Anand
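
The core of the objective described above can be sketched compactly: an online encoder and a latent transition model predict future representations, and the prediction targets come from an exponential-moving-average copy of the encoder. The module sizes, the GRU-cell transition model, and the cosine-similarity loss below are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
    def forward(self, obs):
        return self.net(obs)

online = Encoder()
target = Encoder()
target.load_state_dict(online.state_dict())            # target starts as a copy
transition = nn.GRUCell(input_size=4, hidden_size=32)  # action-conditioned latent transition model
opt = torch.optim.Adam(list(online.parameters()) + list(transition.parameters()), lr=1e-3)

def spr_loss(obs_seq, act_seq):
    # obs_seq: (K+1, batch, obs_dim); act_seq: (K, batch, act_dim)
    z = online(obs_seq[0])
    losses = []
    for k in range(act_seq.shape[0]):
        z = transition(act_seq[k], z)                  # roll the latent forward
        with torch.no_grad():
            z_target = target(obs_seq[k + 1])          # EMA encoder gives prediction targets
        losses.append(-F.cosine_similarity(z, z_target, dim=-1).mean())
    return torch.stack(losses).mean()

def update_target(tau=0.01):
    # exponential moving average of the online encoder's parameters
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(1 - tau).add_(tau * p_o)

# Typical usage per batch:
# loss = spr_loss(obs_seq, act_seq); opt.zero_grad(); loss.backward(); opt.step(); update_target()
```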
-
Poster: Accelerating Reinforcement Learning with Learned Skill Priors (Poster) [ Video ]
Karl Pertsch, Youngwoon Lee, Joseph Lim
-
[ Video ]

We study the problem of predicting and controlling the future state distribution of an autonomous agent. This problem, which can be viewed as a reframing of goal-conditioned reinforcement learning (RL), is centered around learning a conditional probability density function over future states. Instead of directly estimating this density function, we indirectly estimate this density function by training a classifier to predict whether an observation comes from the future. Via Bayes' rule, predictions from our classifier can be transformed into predictions over future states. Importantly, an off-policy variant of our algorithm allows us to predict the future state distribution of a new policy, without collecting new experience. This variant allows us to optimize functionals of a policy's future state distribution, such as the density of reaching a particular goal state. While conceptually similar to Q-learning, our work lays a principled foundation for goal-conditioned RL as density estimation, providing justification for goal-conditioned methods used in prior work. This foundation makes hypotheses about Q-learning, including the optimal goal-sampling ratio, which we confirm experimentally. Moreover, our proposed method is competitive with prior goal-conditioned RL methods.

Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine
-
[ Video ]

We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we achieve this goal by compensating for the difference in dynamics by modifying the reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain. Formally, we prove that applying our method in the source domain is guaranteed to obtain a near-optimal policy for the target domain, provided that the source and target domains satisfy a lightweight assumption. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks.

Benjamin Eysenbach, Swapnil Asawa, Ruslan Salakhutdinov, Sergey Levine
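
To make the classifier-based reward correction concrete, here is a rough sketch of how such a correction can be computed from two domain classifiers; the architectures, input dimensions, and the exact form of the correction are assumptions based on a reading of the abstract, not the authors' released code.

```python
import torch
import torch.nn as nn

class DomainClassifier(nn.Module):
    """Outputs the logit of P(domain = target | input)."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

clf_sas = DomainClassifier(in_dim=2 * 8 + 2)  # takes (s, a, s'); dimensions are placeholders
clf_sa = DomainClassifier(in_dim=8 + 2)       # takes (s, a)

def reward_correction(s, a, s_next):
    """Delta r added to the source-domain reward: positive when (s, a, s') looks
    more like a target-domain transition than a source-domain one."""
    with torch.no_grad():
        logit_sas = clf_sas(torch.cat([s, a, s_next], dim=-1))
        logit_sa = clf_sa(torch.cat([s, a], dim=-1))
    # With balanced training classes, a sigmoid classifier's logit equals
    # log P(target|x) - log P(source|x); subtracting the (s, a) logit isolates
    # the part of the difference that comes from the dynamics.
    return logit_sas - logit_sa
```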
-
[ Video ]

Current reinforcement learning (RL) algorithms can be brittle and difficult to use, especially when learning goal-reaching behaviors from sparse rewards. Although supervised imitation learning provides a simple and stable alternative, it requires access to demonstrations from a human supervisor. In this paper, we study RL algorithms that use imitation learning to acquire goal reaching policies from scratch, without the need for expert demonstrations or a value function. In lieu of demonstrations, we leverage the property that any trajectory is a successful demonstration for reaching the final state in that same trajectory. We propose a simple algorithm in which an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. Each iteration, the agent collects new trajectories using the latest policy, and maximizes the likelihood of the actions along these trajectories under the goal that was actually reached, so as to improve the policy. We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds of the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks.

Dibya Ghosh, Abhishek Gupta, Justin Fu, Coline Devin, Benjamin Eysenbach, Sergey Levine
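
The relabel-and-imitate loop described above is simple enough to sketch in a few lines. The environment interface, network sizes, and discrete-action assumption below are placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(8 + 8, 64), nn.ReLU(), nn.Linear(64, 4))  # pi(a | s, g)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def collect_trajectory(env, goal, horizon=50):
    states, actions = [], []
    s = env.reset()
    for _ in range(horizon):
        logits = policy(torch.cat([torch.as_tensor(s, dtype=torch.float32),
                                   torch.as_tensor(goal, dtype=torch.float32)]))
        a = torch.distributions.Categorical(logits=logits).sample().item()
        states.append(s)
        actions.append(a)
        s, _, done, _ = env.step(a)
        if done:
            break
    return states, actions, s  # the final state becomes the relabeled goal

def imitate_relabeled(states, actions, reached_goal):
    # Treat the trajectory as an expert demonstration for reaching its own final state.
    g = torch.as_tensor(reached_goal, dtype=torch.float32)
    loss = 0.0
    for s, a in zip(states, actions):
        logits = policy(torch.cat([torch.as_tensor(s, dtype=torch.float32), g]))
        loss = loss + F.cross_entropy(logits.unsqueeze(0), torch.tensor([a]))
    opt.zero_grad()
    (loss / len(actions)).backward()
    opt.step()
```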
-
[ Video ]

Interpreting reinforcement learning as probabilistic inference gives rise to many algorithms, such as MPO, AWR, and SAC. While these algorithms have a general formulation and similar derivations, their empirical performance varies widely. In this paper, we organize a view of inference-based methods as unified policy iteration methods and evaluate these popular off-policy algorithms on the standard continuous control benchmarks. To reveal the source of the performance gap, we perform various ablation studies decoupling the effects of implementation tricks, such as clipped double Q and tanh action squashing, from those of algorithmic differences. Our empirical observations suggest that these combinations of algorithms and implementation tricks play a dominant role.

Tadashi Kozuno, Tatsuya Matsushima, Yutaka Matsuo, Shixiang (Shane) Gu
-
[ Video ]

Experience replay, which enables agents to remember and reuse past experience, has played a significant role in the success of off-policy reinforcement learning (RL). To utilize experience replay efficiently, existing sampling methods select more meaningful experiences by imposing priorities on them based on certain metrics (e.g., TD-error). However, they may end up sampling highly biased, redundant transitions, since they compute the sampling rate for each transition independently, without considering its importance relative to other transitions. In this paper, we address this issue by proposing a new learning-based sampling method that can compute the relative importance of transitions. To this end, we design a novel permutation-equivariant neural architecture that takes as input not only the features of each transition (local) but also those of the others (global). We validate our framework, which we refer to as Neural Experience Replay Sampler (NERS), on multiple benchmark tasks for both continuous and discrete control and show that it can significantly improve the performance of various off-policy RL methods. Further analysis confirms that the improvements in sample efficiency are indeed due to NERS sampling diverse and meaningful transitions by considering both local and global contexts.

Kimin Lee, Jinwoo Shin, Eunho Yang, Sung Ju Hwang
-
[ Video ]

Learning to autonomously navigate the web is a difficult sequential decision-making task. The state and action spaces are large and combinatorial in nature, and websites are dynamic environments consisting of several pages. One of the bottlenecks of training web navigation agents is providing a learnable curriculum of training environments that can cover the large variety of real-world websites. Therefore, we propose using Adversarial Environment Generation (AEG) to generate challenging web environments in which to train reinforcement learning (RL) agents. We provide a new benchmarking environment, gMiniWoB, which enables an RL adversary to use compositional primitives to learn to generate arbitrarily complex websites. To train the adversary, we propose a new technique for maximizing regret using the difference in the scores obtained by a pair of navigator agents. Our results show that our approach significantly outperforms prior methods for minimax regret AEG. The regret objective trains the adversary to design a curriculum of environments that are “just-the-right-challenge” for the navigator agents; our results show that over time, the adversary learns to generate increasingly complex web navigation tasks. The navigator agents trained with our technique learn to complete challenging, high-dimensional web navigation tasks, such as form filling and booking a flight. We show that the navigator agent trained with our proposed Flexible b-PAIRED technique significantly outperforms competitive automatic curriculum generation baselines, including a state-of-the-art RL web navigation approach, on a set of challenging unseen test environments, and achieves a success rate of more than 80% on some tasks.

Aleksandra Faust, Honglak Lee
-
[ Video ]

First-person object-interaction tasks in high-fidelity, 3D, simulated environments such as the AI2Thor virtual home-environment pose significant sample-efficiency challenges for reinforcement learning (RL) agents learning from sparse task rewards. To alleviate these challenges, prior work has provided extensive supervision via a combination of reward-shaping, ground-truth object-information, and expert demonstrations. In this work, we show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task during task learning with an object-centric relational RL agent. Our key insight is that learning an object-model that incorporates object-relationships into forward prediction provides a dense learning signal for unsupervised representation learning of both objects and their relationships. This, in turn, enables faster policy learning for an object-centric relational RL agent. We demonstrate our agent by introducing a set of challenging object-interaction tasks in the AI2Thor environment where learning with our attentive object-model is key to strong performance. Specifically, we compare our agent and relational RL agents with alternative auxiliary tasks to a relational RL agent equipped with ground-truth object-information, and show that learning with our object-model best closes the performance gap in terms of both learning speed and maximum success rate. Additionally, we find that incorporating object-attention into an object-model's forward predictions is key to learning representations which capture object-category and object-state.

Wilka Carvalho, Kimin Lee, Sungryull Sohn, Honglak Lee, Richard L Lewis, Satinder Singh
-
[ Video ]

Reinforcement learning is focused on the problem of learning a near-optimal policy for a given task. But can we use reinforcement learning to instead learn general-purpose policies that can perform a wide range of different tasks, resulting in flexible and reusable skills? Contextual policies provide this capability in principle, but the representation of the context determines the degree of generalization and expressivity. Categorical contexts preclude generalization to entirely new tasks. Goal-conditioned policies may enable some generalization, but cannot capture all tasks that might be desired. In this paper, we propose goal distributions as a general and broadly applicable task representation suitable for contextual policies. Goal distributions are general in the sense that they can represent any state-based reward function when equipped with an appropriate distribution class, while the particular choice of distribution class allows us to trade off expressivity and learnability. We develop an off-policy algorithm called distribution-conditioned reinforcement learning (DisCo RL) to efficiently learn these policies. We evaluate DisCo RL on a variety of robot manipulation tasks and find that it significantly outperforms prior methods on tasks that require generalization to new goal distributions.

Soroush Nasiriany, Vitchyr Pong, Ashvin Nair, Glen Berseth, Sergey Levine
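
One natural instantiation of the distribution-conditioned reward suggested by the abstract is the log-density of the achieved state under the goal distribution. The Gaussian goal class and the four-dimensional state in the sketch below are assumptions chosen only to show how the choice of distribution trades off expressivity (which state dimensions matter) against a simple point goal.

```python
import numpy as np
from scipy.stats import multivariate_normal

def disco_reward(state: np.ndarray, goal_mean: np.ndarray, goal_cov: np.ndarray) -> float:
    # Reward the agent with the log-density of the achieved state under the
    # goal distribution whose parameters also condition the policy.
    return float(multivariate_normal.logpdf(state, mean=goal_mean, cov=goal_cov))

# A tight covariance approximates a point goal, while a broad covariance on some
# dimensions expresses indifference to them.
state = np.array([0.1, 0.0, 0.5, 0.2])
tight = disco_reward(state, goal_mean=np.zeros(4), goal_cov=0.01 * np.eye(4))
loose = disco_reward(state, goal_mean=np.zeros(4), goal_cov=np.diag([0.01, 0.01, 10.0, 10.0]))
print(tight, loose)
```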
-
[ Video ]

Temporal abstractions in the form of options have been shown to help reinforcement learning (RL) agents learn faster. However, despite previous work on this topic, the problem of discovering options through interaction with an environment remains a challenge. In this paper, we introduce a novel approach for discovering options, via meta-gradients, by interacting with a multi-task reinforcement learning environment. Our approach is based on a manager-worker decomposition of the RL agent, in which a manager maximises rewards from the environment by learning a task-dependent policy over both a set of task-independent discovered options and primitive actions. The option-reward and termination functions defining each option are parameterised by neural networks and trained via meta-gradients to maximise their usefulness. Extensive empirical analysis on gridworld, Atari, and DeepMind Lab shows that: (1) our approach can discover meaningful temporally-extended options in multi-task RL domains, (2) the discovered options are frequently used by the agent while learning to solve the training tasks, and (3) the discovered options help a randomly initialised manager learn faster in completely new tasks.

Vivek Veeriah
-
[ Video ]

Deep reinforcement learning (DRL) algorithms have successfully been demonstrated on a range of challenging decision making and control tasks. One dominant component of recent deep reinforcement learning algorithms is the target network which mitigates the divergence when learning the Q function. However, target networks can slow down the learning process due to delayed function updates. Our main contribution in this work is a self-regularized TD-learning method to address divergence without requiring a target network. Additionally, we propose a self-guided policy improvement method by combining policy-gradient with zero-order optimization to search for actions associated with higher Q-values in a broad neighborhood. This makes learning more robust to local noise in the Q function approximation and guides the updates of our actor network. Taken together, these components define GRAC, a novel self-guided and self-regularized actor critic algorithm. We evaluate GRAC on the suite of OpenAI gym tasks, achieving or outperforming state of the art in every environment tested.

Lin Shao, Mengyuan Yan, Qingyun Sun, Christin Jeannette Bohg
-
[ Video ]
Quality-Diversity (QD) is a concept from Neuroevolution with some intriguing applications to Reinforcement Learning. It facilitates learning a population of agents where each member is optimized to simultaneously accumulate high task-returns and exhibit behavioral diversity compared to other members. In this paper, we build on a recent kernel-based method for training a QD policy ensemble with Stein variational gradient descent. With kernels based on $f$-divergence between the stationary distributions of policies, we convert the problem to that of efficient estimation of the ratio of these stationary distributions. We then study various distribution ratio estimators used previously for off-policy evaluation and imitation and re-purpose them to compute the gradients for policies in an ensemble such that the resultant population is diverse and of high-quality.
Tanmay Gangwani, Yuan Zhou
-
[ Video ]

We study the problem of obtaining accurate policy gradient estimates using a finite number of samples. Monte-Carlo methods have been the default choice for policy gradient estimation, despite suffering from high variance in the gradient estimates. On the other hand, more sample efficient alternatives like Bayesian quadrature methods are less scalable due to their high computational complexity. In this work, we propose deep Bayesian quadrature policy gradient (DBQPG), a computationally efficient high-dimensional generalization of Bayesian quadrature, for policy gradient estimation. We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks. In comparison to Monte-Carlo estimation, DBQPG provides (i) more accurate gradient estimates with a significantly lower variance, (ii) a consistent improvement in the sample complexity and average return for several deep policy gradient algorithms, and, (iii) the uncertainty in gradient estimation that can be incorporated to further improve the performance.

Ravi Tej Akella, Anima Anandkumar, Yisong Yue
-
Poster: PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards (Poster) [ Video ]
Prasoon Goyal, Scott Niekum, Ray Mooney
-
Poster: A Policy Gradient Method for Task-Agnostic Exploration (Poster) [ Video ]
Mirco Mutti, Marcello Restelli
-
Poster: Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning (Poster) [ Video ]
Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, Marc Bellemare
-
Poster: Skill Transfer via Partially Amortized Hierarchical Planning (Poster) [ Video ]
Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, Florian Shkurti
-
Poster: On Effective Parallelization of Monte Carlo Tree Search (Poster) [ Video ]
Anji Liu, Yitao Liang, Ji Liu, Guy Van den Broeck, Jianshu Chen
-
Poster: Mastering Atari with Discrete World Models (Poster) [ Video ]
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, Jimmy Ba
-
Poster: Average Reward Reinforcement Learning with Monotonic Policy Improvement (Poster) [ Video ]
Yiming Zhang, Keith Ross
-
Poster: Combating False Negatives in Adversarial Imitation Learning (Poster) [ Video ]
Konrad Żołna, Dzmitry Bahdanau, Yoshua Bengio
-
Poster: Evaluating Agents Without Rewards (Poster) [ Video ]
Jimmy Ba, Danijar Hafner
-
Poster: Learning Latent Landmarks for Generalizable Planning (Poster) [ Video ]
Ge Yang, Bradly Stadie
-
Poster: Conservative Safety Critics for Exploration (Poster) [ Video ]
Homanga Bharadhwaj, Aviral Kumar, Nick Rhinehart, Sergey Levine, Florian Shkurti, Animesh Garg
-
Poster: Solving Compositional Reinforcement Learning Problems via Task Reduction (Poster) [ Video ]
Yi Wu, Xiaolong Wang, Huazhe Xu
-
[ Video ]

We initiate the study of deep reinforcement learning problems that require low switching cost, i.e., a small number of policy switches during training. Such a requirement is ubiquitous in many applications, such as medical domains, recommendation systems, education, robotics, and dialogue agents, where the deployed policy that actually interacts with the environment cannot change frequently. Our paper investigates different policy switching criteria based on deep Q-networks and further proposes an adaptive approach based on the feature distance between the deployed Q-network and the underlying learning Q-network. Through extensive experiments on a medical treatment environment and a collection of Atari games, we find that our feature-switching criterion substantially decreases the switching cost while maintaining sample efficiency similar to the case without the low-switching-cost constraint. We also complement this empirical finding with a theoretical justification from a representation learning perspective.

Yi Wu
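
A minimal sketch of the feature-distance switching criterion mentioned above: keep training an online Q-network, but only redeploy it when its penultimate-layer features drift sufficiently far from those of the currently deployed network. The network shape, the distance measure, and the threshold are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.head = nn.Linear(64, n_actions)
    def forward(self, obs):
        return self.head(self.features(obs))

learning_q = QNet()
deployed_q = copy.deepcopy(learning_q)

def should_switch(obs_batch: torch.Tensor, threshold: float = 0.5) -> bool:
    """Switch the deployed policy only when the average feature distance on a
    batch of recent observations exceeds the threshold."""
    with torch.no_grad():
        dist = (learning_q.features(obs_batch) - deployed_q.features(obs_batch)).norm(dim=-1).mean()
    return dist.item() > threshold

# ...inside the training loop, after each gradient update on learning_q:
# if should_switch(recent_obs):
#     deployed_q.load_state_dict(learning_q.state_dict())   # counts as one policy switch
```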
-
[ Video ]

Action values are ubiquitous in reinforcement learning (RL) methods, with the sample complexity of such methods relying heavily on how fast a good estimator for action value can be learned. Viewing this problem through the lens of representation learning, naturally good representations of both state and action can facilitate action-value estimation. While advances in deep learning have seamlessly driven progress in learning state representations, given the specificity of the notion of agency to RL, little attention has been paid to learning action representations. We conjecture that leveraging the combinatorial structure of multidimensional action spaces is a key ingredient for learning good representations of action. In order to test this, we set forth the action hypergraph networks framework---a class of functions for learning action representations with a relational inductive bias. Using this framework we realise an agent class based on a combination with deep Q-networks, which we dub hypergraph Q-networks. We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and physical control benchmarks.

Arash Tavakoli, Mehdi Fatemi
-
Poster: Addressing Distribution Shift in Online Reinforcement Learning with Offline Datasets (Poster) [ Video ]
Younggyo Seo, Jinwoo Shin, Pieter Abbeel, Kimin Lee
-
[ Video ]

Simulators play an important role in prototyping, debugging, and benchmarking new advances in robotics and learning for control. Although many physics engines exist, some aspects of the real world are harder than others to simulate. One aspect that has so far eluded accurate simulation is touch sensing. To address this gap, we present TACTO -- a fast, flexible, and open-source simulator for vision-based tactile sensors. The simulator renders realistic high-resolution touch readings at hundreds of frames per second and can be easily configured to simulate different vision-based tactile sensors, including GelSight, DIGIT, and OmniTact. In this paper, we detail the principles that drove the implementation of TACTO and how they are reflected in its architecture. We demonstrate TACTO on a perceptual task, by learning to predict grasp stability using touch from 1 million grasps, and on a marble manipulation control task. We believe that TACTO is a step towards the widespread adoption of touch sensing in robotic applications and towards enabling machine learning practitioners interested in multi-modal learning and control. TACTO is open-sourced at https://github.com/[anonymized].

Roberto Calandra
-
[ Video ]

In this paper, we tackle the problem of learning control policies for tasks when provided with constraints in natural language. In contrast to instruction following, language here is used not to specify goals, but rather to describe situations that an agent must avoid during its exploration of the environment. Specifying constraints in natural language also differs from the predominant paradigm in safe reinforcement learning, where safety criteria are enforced by hand-defined cost functions. While natural language allows for easy and flexible specification of safety constraints and budget limitations, its ambiguous nature presents a challenge when mapping these specifications into representations that can be used by techniques for safe reinforcement learning. To address this, we develop a model that contains two components: (1) a constraint interpreter to encode natural language constraints into vector representations capturing spatial and temporal information on forbidden states, and (2) a policy network that uses these representations to output a policy with minimal constraint violations. Our model is end-to-end differentiable and we train it using a recently proposed algorithm for constrained policy optimization. To empirically demonstrate the effectiveness of our approach, we create a new benchmark task for autonomous navigation with crowd-sourced free-form text specifying three different types of constraints. Our method outperforms several baselines by achieving 6-7 times higher returns and 76% fewer constraint violations on average. Dataset and code to reproduce our experiments are available at https://sites.google.com/view/polco-hazard-world/.

Peter J Ramadge, Karthik Narasimhan, Yinlam Chow
-
[ Video ]

We propose the k-Shortest-Path (k-SP) constraint: a novel constraint on the agent’s trajectory that improves sample efficiency in sparse-reward MDPs. We show that any optimal policy necessarily satisfies the k-SP constraint. Notably, the k-SP constraint prevents the policy from exploring state-action pairs along non-k-SP trajectories (e.g., going back and forth). However, in practice, excluding state-action pairs may hinder the convergence of many RL algorithms. To overcome this, we propose a novel cost function that penalizes the policy for violating the SP constraint, instead of excluding those state-action pairs entirely. Our numerical experiments in a tabular RL setting demonstrate that the SP constraint can significantly reduce the trajectory space of the policy. As a result, our constraint enables more sample-efficient learning by suppressing redundant exploration and exploitation. Our experimental results on MiniGrid and DeepMind Lab show that the proposed method significantly improves proximal policy optimization (PPO) and outperforms existing novelty-seeking exploration methods, including count-based exploration, indicating that it improves sample efficiency by preventing the agent from taking redundant actions.

Sungryull Sohn, Sungtae Lee, Jongwook Choi
-
[ Video ]

Treatment recommendation is a complex multi-faceted problem with many conflicting objectives, e.g., optimizing the survival rate (or expected lifetime), mitigating negative impacts, reducing financial expenses and time costs, avoiding over-treatment, etc. While this complicates the hand-engineering of a reward function for learning treatment policies, fortunately, qualitative feedback from human experts is readily available and can be easily exploited. Since direct estimation of rewards via inverse reinforcement learning is a challenging task and requires the existence of an optimal human policy, the field of treatment recommendation has recently witnessed the development of the preference-based Reinforcement Learning (PRL) framework, which infers a reward function from only qualitative and imperfect human feedback to ensure that a human expert’s preferred policy has a higher expected return over a less preferred policy. In this paper, we first present an open simulation platform to model the progression of two diseases, namely Cancer and Sepsis, and the reactions of the affected individuals to the received treatment. Secondly, we investigate important problems in adopting preference-based RL approaches for treatment recommendation, such as advantages of learning from preference over hand-engineered reward, addressing incomparable policies, reward interpretability, and agent design via simulated experiments. The designed simulation platform and insights obtained for preference-based RL approaches are beneficial for achieving the right trade-off between various human objectives during treatment recommendation.

Nitin Kamra, Yan Liu
-
[ Video ]
Efficiently training agents with planning capabilities has long been one of the major challenges in decision-making. In this work, we focus on zero-shot navigation ability on a given abstract 2-D occupancy map, like human navigation by reading a paper map, by treating it as an image. To learn this ability, we need to efficiently train an agent on environments with a small proportion of training maps and share knowledge effectively across the environments. We hypothesize that model-based navigation can better adapt an agent's behaviors to a task, since it disentangles the variations in map layout and goal location and enables longer-term planning ability on novel locations compared to reactive policies. We propose to learn a hypermodel that can understand patterns from a limited number of abstract maps and goal locations, to maximize alignment between the hypermodel predictions and real trajectories to extract information from multi-task off-policy experiences, and to construct denser feedback for planners by $n$-step goal relabelling. We train our approach on DeepMind Lab environments with layouts from different maps, and demonstrate superior performance on zero-shot transfer to novel maps and goals.
Linfeng Zhao, Lawson Wong
-
[ Video ]

Recently, deep learning has been successfully applied to a variety of networking problems. A fundamental challenge is that when the operational environment for a learning-augmented system differs from its training environment, such systems often make badly informed decisions, leading to bad performance. We argue that safely deploying learning-driven systems requires being able to determine, in real time, whether system behavior is coherent, for the purpose of defaulting to a reasonable heuristic when this is not so. We term this the online safety assurance problem (OSAP). We present three approaches to quantifying decision uncertainty that differ in terms of the signal used to infer uncertainty. We illustrate the usefulness of online safety assurance in the context of the proposed deep reinforcement learning (RL) approach to video streaming. While deep RL for video streaming bests other approaches when the operational and training environments match, it is dominated by simple heuristics when the two differ. Our preliminary findings suggest that transitioning to a default policy when decision uncertainty is detected is key to enjoying the performance benefits afforded by leveraging ML without compromising on safety.

Noga H. Rotman, Aviv Tamar
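
The fallback logic described above can be sketched in a few lines; ensemble disagreement is used here as the uncertainty signal purely for illustration (the talk discusses several different signals), and the threshold and heuristic are placeholders.

```python
import numpy as np

def choose_action(obs, policy_ensemble, heuristic, disagreement_threshold=0.2):
    # Each ensemble member maps an observation to a vector of action probabilities.
    probs = np.stack([member(obs) for member in policy_ensemble])   # (n_members, n_actions)
    disagreement = probs.std(axis=0).mean()
    if disagreement > disagreement_threshold:
        return heuristic(obs)                # default to the safe heuristic when uncertain
    return int(probs.mean(axis=0).argmax())  # otherwise trust the learned policy
```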
-
[ Video ]

Deploying Reinforcement Learning (RL) agents in the real world requires that the agents satisfy safety constraints. Current RL agents explore the environment without considering these constraints, which can lead to damage to the hardware or even to other agents in the environment. We propose a new method, LBPO, that uses a Lyapunov-based barrier function to restrict the policy update to a safe set for each training iteration. Our method also allows the user to control the conservativeness of the agent with respect to the constraints in the environment. LBPO significantly outperforms state-of-the-art baselines in terms of the number of constraint violations during training while being competitive in terms of performance. Further, our analysis reveals that baselines like CPO and SDDPG rely mostly on backtracking to ensure safety rather than on safe projection, which provides insight into why previous methods might not have effectively limited the number of constraint violations.

David Held
-
Poster: Evolving Reinforcement Learning Algorithms (Poster) [ Video ]
JD Co-Reyes, Daiyi Peng, Quoc V Le, Sergey Levine, Honglak Lee, Aleksandra Faust
-
[ Video ]

Reinforcement learning has been applied to a wide variety of robotics problems, but most of such applications involve collecting data from scratch for each new task. Since the amount of robot data we can collect for any single task is limited by time and cost considerations, the learned behavior is typically narrow: the policy can only execute the task in a handful of scenarios that it was trained on. What if there was a way to incorporate a large amount of prior data, either from previously solved tasks or from unsupervised or undirected environment interaction, to extend and generalize learned behaviors? While most prior work on extending robotic skills using pre-collected data focuses on building explicit hierarchies or skill decompositions, we show in this paper that we can reuse prior data to extend new skills simply through model-free reinforcement learning via dynamic programming. We show that even when the prior data does not actually succeed at solving the new task, it can still be utilized for learning a better policy, by providing the agent with a broader understanding of the mechanics of its environment. We demonstrate the effectiveness of such an approach by chaining together several behaviors seen in prior datasets for solving a new task, with our hardest experimental setting involving composing four robotic skills in a row: picking, placing, drawer opening, and grasping, where a +1/0 sparse reward is provided only on task completion. We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands, and present results in both simulated and real world domains.

Avi Singh, Aviral Kumar, Jesse Zhang, Sergey Levine
-
[ Video ]
How much credit (or blame) should an action taken in a state get for a future reward? This is the fundamental temporal credit assignment problem in Reinforcement Learning (RL). One of the earliest and still most widely used heuristics is to assign this credit based on a scalar coefficient $\lambda$ (treated as a hyperparameter) raised to the power of the time interval between the state-action and the reward. In this empirical paper, we explore heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, and the time interval between the two. Of course, it isn't clear what these pairwise weight functions should be, and because they are too complex to be treated as hyperparameters, we develop a metagradient procedure for learning these weight functions during the usual RL training of a policy. Our empirical work shows that it is often possible to learn these pairwise weight functions during learning of the policy and achieve better performance than competing approaches.
Zeyu Zheng, Richard L Lewis, Satinder Singh
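
The generalization described above can be illustrated directly: the credit a future reward sends back to an earlier state-action is scaled by a pairwise weight, with the classical heuristic recovered as lambda raised to the time gap. The "learned" weight below is a toy stand-in; the paper learns such functions with metagradients, which is not shown here.

```python
import numpy as np

def credited_returns(rewards, states, weight_fn):
    """credit[t] = sum over k >= t of weight_fn(states[t], states[k], k - t) * rewards[k]"""
    T = len(rewards)
    credit = np.zeros(T)
    for t in range(T):
        for k in range(t, T):
            credit[t] += weight_fn(states[t], states[k], k - t) * rewards[k]
    return credit

lam = 0.9
classic = lambda s_t, s_k, gap: lam ** gap                      # scalar-lambda heuristic
pairwise = lambda s_t, s_k, gap: 1.0 / (1.0 + abs(s_k - s_t))   # toy state-dependent weight

rewards = np.array([0.0, 0.0, 1.0, 0.0, 2.0])
states = np.array([0, 1, 2, 3, 4])
print(credited_returns(rewards, states, classic))
print(credited_returns(rewards, states, pairwise))
```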
-
[ Video ]

Humans show an innate ability to learn the regularities of the world through interaction. By performing experiments in our environment, we are able to discern the causal factors of variation and infer how they affect the dynamics of our world. Analogously, here we attempt to equip reinforcement learning agents with the ability to perform experiments that facilitate a categorization of the rolled-out trajectories, and to subsequently infer the causal factors of the environment in a hierarchical manner. We introduce a novel intrinsic reward, called causal curiosity, and show that it allows our agents to learn optimal sequences of actions, and to discover causal factors in the dynamics. The learned behavior allows the agent to infer a binary quantized representation for the ground-truth causal factors in every environment. Additionally, we find that these experimental behaviors are semantically meaningful (e.g., to differentiate between heavy and light blocks, our agents learn to lift them), and are learnt in a self-supervised manner with approximately 2.5 times less data than conventional supervised planners. We show that these behaviors can be re-purposed and fine-tuned (e.g., from lifting to pushing or other downstream tasks). Finally, we show that the knowledge of causal factor representations aids zero-shot learning for more complex tasks.

Sumedh A Sontakke, Arash Mehrjou, Laurent Itti, Bernhard Schölkopf
-
[ Video ]

In many real-world tasks, it is not possible to procedurally specify an RL agent's reward function. In such cases, a reward function must instead be learned from interacting with and observing humans. However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences. Absent significant advances in reward learning, it is thus important to be able to audit learned reward functions to verify whether they truly capture user preferences. In this paper, we investigate techniques for interpreting learned reward functions. In particular, we apply saliency methods to identify failure modes and predict the robustness of reward functions. We find that learned reward functions often implement surprising algorithms that rely on contingent aspects of the environment. We also discover that existing interpretability techniques often attend to irrelevant changes in reward output, suggesting that reward interpretability may need significantly different methods from policy interpretability.

Eric Michaud, Stuart Russell
-
[ Video ]

Generative Adversarial Imitation Learning suffers from the fundamental problem of reward bias stemming from the choice of reward functions used in the algorithm. Different types of biases also affect different types of environments - which are broadly divided into survival and task-based environments. We provide a theoretical sketch of why existing reward functions would fail in imitation learning scenarios in task based environments with multiple terminal states. We also propose a new reward function for GAIL which outperforms existing GAIL methods on task based environments with single and multiple terminal states and effectively overcomes both survival and termination bias.

-
[ Video ]

Exploration in reinforcement learning is, in general, a challenging problem. In this work, we study a more tractable class of reinforcement learning problems defined by data that provides examples of successful outcome states. In this case, the reward function can be obtained automatically by training a classifier to classify states as successful or not. We argue that, with appropriate representation and regularization, such a classifier can guide a reinforcement learning algorithm to an effective solution. However, as we will show, this requires the classifier to make uncertainty-aware predictions that are very difficult with standard deep networks. To address this, we propose a novel mechanism for obtaining calibrated uncertainty based on an amortized technique for computing the normalized maximum likelihood distribution. We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions from data, while being able to guide algorithms towards the specified goal more effectively. We show how using amortized normalized maximum likelihood for reward inference is able to provide effective reward guidance for solving a number of challenging navigation and robotic manipulation tasks which prove difficult for other algorithms.

Abhishek Gupta, Vitchyr Pong, Aurick Zhou, Sergey Levine
-
[ Video ]

In an effort to overcome limitations of reward-driven feature learning in deep reinforcement learning (RL) from images, we propose decoupling representation learning from policy learning. To this end, we introduce a new unsupervised learning (UL) task, called Augmented Temporal Contrast (ATC), which trains a convolutional encoder to associate pairs of observations separated by a short time difference, under image augmentations and using a contrastive loss. In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL in most environments. Additionally, we benchmark several leading UL algorithms by pre-training encoders on expert demonstrations and using them, with weights frozen, in RL agents; we find that agents using ATC-trained encoders outperform all others. We also train multi-task encoders on data from multiple environments and show generalization to different downstream RL tasks. Finally, we ablate components of ATC, and introduce a new data augmentation to enable replay of (compressed) latent images from pre-trained encoders when RL requires augmentation. Our experiments span visually diverse RL benchmarks in DeepMind Control, DeepMind Lab, and Atari, and our complete code is available at \url{hidden url}.
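
To make the training objective concrete, below is a minimal sketch of an InfoNCE-style contrastive loss that associates temporally separated, augmented observations, in the spirit of ATC; the encoder architecture, the use of a single encoder instead of a momentum target, and all hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of a temporal contrastive objective over augmented observation pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ATCEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.proj = nn.LazyLinear(latent_dim)                               # projection to latent space
        self.W = nn.Parameter(0.01 * torch.randn(latent_dim, latent_dim))   # bilinear contrast head

    def forward(self, obs):
        return self.proj(self.conv(obs))

def atc_loss(encoder, obs_t, obs_tk):
    """Pairs each augmented obs_t with obs_{t+k}; other items in the batch act as negatives."""
    anchors = encoder(obs_t)                    # (B, D)
    with torch.no_grad():
        positives = encoder(obs_tk)             # the paper uses a momentum target encoder here
    logits = anchors @ encoder.W @ positives.t()            # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                  # InfoNCE over in-batch negatives
```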

Adam Stooke, Kimin Lee, Misha Laskin
-
[ Video ]

The ability to plan into the future while utilizing only raw high-dimensional observations, such as images, can provide autonomous agents with broad and general capabilities. However, realistic tasks may require handling sparse rewards and performing temporally extended reasoning, and cannot be solved with only myopic, short-sighted planning. Recent work in model-based reinforcement learning (RL) has shown impressive results in efficient and general skill acquisition directly from image inputs, but typically with rewards that are heavily shaped and require only short-horizon reasoning. Many of the improvements in this area focus on better modeling. In this work, we instead study how better trajectory optimization and planning can enable more effective long-horizon reasoning with visual inputs. We draw on the idea of collocation-based planning and adapt it to the visual planning domain by leveraging probabilistic latent variable models, resulting in an algorithm that optimizes trajectories over latent variables to solve temporally extended tasks. Our latent collocation method (LatCo) provides a general and effective approach to longer-horizon reasoning for image-based control. Empirically, we demonstrate that our approach significantly outperforms prior model-based approaches on challenging visual control tasks with sparse rewards and long-term goals. See the videos on the supplementary website \url{https://sites.google.com/view/latco-mbrl/}.

Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch
-
[ Video ]

Reinforcement learning algorithms require considerable manual engineering, especially in designing reward functions that enable effective learning and accurately reflect the desired behavior. In this paper, we offer a new perspective on reinforcement learning that provides a framework for dispensing with hand-designed reward functions altogether. This framework recasts reinforcement learning as a problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards. To solve the resulting goal-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator reminiscent of the standard Bellman backup operator and use it to develop an off-policy algorithm to solve goal-directed tasks. We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.

Tim G. J. Rudner, Vitchyr Pong, Rowan McAllister, Yarin Gal, Sergey Levine
-
[ Video ]

AlphaStar, the AI that reaches GrandMaster level in StarCraft II, is a remarkable milestone demonstrating what deep reinforcement learning can achieve in complex Real-Time Strategy (RTS) games. However, the complexities of the game, algorithms and systems, and especially the tremendous amount of computation needed, are big obstacles for the community to conduct further research in this direction. We propose a deep reinforcement learning agent, StarCraft Commander (SCC). With an order of magnitude less computation, it demonstrates top human performance, defeating GrandMaster players in test matches and top professional players in a live event. Moreover, it shows strong robustness to various human strategies and discovers novel strategies unseen in human play. In this paper, we share the key insights and optimizations on efficient imitation learning and reinforcement learning for the StarCraft II full game.

-
[ Video ]

Prioritized experience replay (PER) samples important transitions, rather than sampling uniformly, to improve the performance of a deep reinforcement learning agent. We claim that such prioritization must be balanced with sample diversity to stabilize the DQN and prevent forgetting. Our proposed improvement over PER, called Predictive PER (PPER), takes three measures (TDInit, TDClip, TDPred) to (i) eliminate priority outliers and explosions and (ii) improve the sample diversity and distributions, weighted by priorities, both of which stabilize the DQN. The most notable of the three is the introduction of a second DNN, called TDPred, to generalize the in-distribution priorities. An ablation study and full experiments with Atari games show that each measure contributes in its own way, and that PPER successfully enhances both stability and performance over PER.

-
[ Video ]

Robots operating in open-world environments must be able to adapt to changing conditions and acquire new skills rapidly. Meta-reinforcement learning (meta-RL) is a promising approach for acquiring this ability that leverages a set of related training tasks to learn a task inference procedure that can learn a new skill given a small amount of experience. However, meta-RL has proven challenging to apply to robots in the real world, largely due to onerous data requirements during meta-training compounded with the challenge of learning from high-dimensional sensory inputs such as images. For single-task RL, latent state models have been shown to improve sample efficiency by accelerating the representation learning process. In this work, we posit that task inference in meta-RL and state inference in latent state models can both be viewed as instances of a more general procedure for estimating hidden variables from experience. Leveraging this insight, we present MELD: a practical algorithm for meta-RL from image observations that quickly acquires new skills via posterior inference in a learned latent state model over joint state and task variables. We show that MELD outperforms prior meta-RL methods on a range of simulated robotic locomotion and manipulation problems including peg insertion and object placing. Further, we demonstrate MELD on two real robots, learning to perform peg insertion into varying target boxes with a Sawyer robot, and learning to insert an ethernet cable into new locations after only 4 hours of meta-training on a WidowX robot.

Anusha Nagabandi, Kate Rakelly, Chelsea Finn, Sergey Levine
-
Poster: Dream and Search to Control: Latent Space Planning for Continuous Control (Poster) [ Video ]
, Somdeb Majumdar
-
Poster: Explanation Augmented Feedback in Human-in-the-Loop Reinforcement Learning (Poster) [ Video ]
Ruohan Zhang, Subbarao Kambhampati
-
Poster: Goal-Conditioned Reinforcement Learning in the Presence of an Adversary (Poster) [ Video ]
Carlos Purves, Pietro Liò, Catalina Cangea
-
Poster: Regularized Inverse Reinforcement Learning (Poster)
Wonseok Jeon, Chen-Yang Su, Thang DOAN, Derek Nowrouzezahrai, Joelle Pineau
-
Poster: Domain Adversarial Reinforcement Learning (Poster) [ Video ]
Bonnie Li, Vincent Francois-Lavet, Thang DOAN
-
Poster: Safety Aware Reinforcement Learning (Poster) [ Video ]
Santiago Miret, Somdeb Majumdar, Carroll Wainwright
-
Poster: Sample Efficient Training in Multi-Agent Adversarial Games with Limited Teammate Communication (Poster) [ Video ]
Hardik Meisheri, Harshad Khadilkar
-
Poster: Amortized Variational Deep Q Network (Poster) [ Video ]
Haotian Zhang, Yuhao Wang, Jianyong Sun, Zongben Xu
-
Poster: Disentangled Planning and Control in Vision Based Robotics via Reward Machines (Poster) [ Video ]
Jacob Varley, Deepali Jain
-
Poster: Maximum Mutation Reinforcement Learning for Scalable Control (Poster) [ Video ]
Karush Suri, XIAO QI SHI
-
Poster: Unsupervised Task Clustering for Multi-Task Reinforcement Learning (Poster) [ Video ]
Johannes Ackermann, Oliver Richter, Roger Wattenhofer
-
Poster: Learning Intrinsic Symbolic Rewards in Reinforcement Learning (Poster) [ Video ]
Santiago Miret, Somdeb Majumdar
-
Poster: Preventing Value Function Collapse in Ensemble Q-Learning by Maximizing Representation Diversity (Poster) [ Video ]
Ladislau Boloni
-
Poster: Action and Perception as Divergence Minimization (Poster) [ Video ]
Danijar Hafner, Jimmy Ba, Karl Friston, Nicolas Heess
-
Poster: Randomized Ensembled Double Q-Learning: Learning Fast Without a Model (Poster) [ Video ]
Xinyue Chen, Che Wang, Zijian Zhou, Keith Ross
-
Poster: D2RL: Deep Dense Architectures in Reinforcement Learning (Poster) [ Video ]
Samarth Sinha, Homanga Bharadhwaj, Aravind Srinivas, Animesh Garg
-
Poster: Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms (Poster) [ Video ]
Eugene Vinitsky, Yu Wang, YI WU
-
Poster: Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization (Poster) [ Video ]
YI WU, Huazhe Xu, Xiaolong Wang, Fei Fang, Yu Wang
-
Poster: What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study (Poster) [ Video ]
Marcin Andrychowicz, Piotr Stanczyk, Raphaël Marinier, Leonard Hussenot, Matthieu Geist, Olivier Pietquin, Sylvain Gelly, Olivier Bachem
-
Poster: Semantic State Representation for Reinforcement Learning (Poster) [ Video ]
Erez Schwartz, Guy Tennenholtz, Chen Tessler, Shie Mannor
-
Poster: Hyperparameter Auto-tuning in Self-Supervised Robotic Learning (Poster) [ Video ]
Jiancong Huang, Paul Weng
-
Poster: Targeted Query-based Action-Space Adversarial Policies on Deep Reinforcement Learning Agents (Poster) [ Video ]
Xian Yeow Lee, Yasaman Esfandiari, Russell Tan, Soumik Sarkar
-
[ Video ]

We propose a novel hierarchical reinforcement learning framework: given user-provided subgoal regions which are subsets of states, it (i) constructs options that serve as transitions between subgoal regions, and (ii) constructs a high-level plan in the resulting abstract MDP. The key challenge is that the abstract MDP may not be Markov. We propose two algorithms for planning in the abstract MDP that address this issue. Our first algorithm is conservative, allowing us to prove theoretical guarantees on its performance; these results can help guide the design of subgoal regions. Our second algorithm is a practical variant that interweaves planning at the abstract level using value iteration and at the concrete level using model-free reinforcement learning. We demonstrate the benefits of our approach on several benchmarks that are challenging for state-of-the-art hierarchical reinforcement learning algorithms.

Osbert Bastani, Rajeev Alur, Kishor Jothimurugan
-
[ Video ]

Recent advances in off-policy deep reinforcement learning (RL) have led to impressive success in complex tasks from visual observations. Experience replay improves sample-efficiency by reusing experiences from the past, and convolutional neural networks (CNNs) process high-dimensional inputs effectively. However, such techniques demand high memory and computational bandwidth. In this paper, we present Latent Vector Experience Replay (LeVER), a simple modification of existing off-policy RL methods, to address these computational and memory requirements without sacrificing the performance of RL agents. To reduce the computational overhead of gradient updates in CNNs, we freeze the lower layers of CNN encoders early in training due to early convergence of their parameters. Additionally, we reduce memory requirements by storing the low-dimensional latent vectors for experience replay instead of high-dimensional images, enabling an adaptive increase in the replay buffer capacity, a useful technique in constrained-memory settings. In our experiments, we show that LeVER does not degrade the performance of RL agents while significantly saving computation and memory across a diverse set of DeepMind Control environments and Atari games. Finally, we show that LeVER is useful for computation-efficient transfer learning in RL because lower layers of CNNs extract generalizable features, which can be used for different tasks and domains.
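
A hedged sketch of the two mechanisms described above, freezing the lower convolutional layers once their parameters have converged and storing low-dimensional latents in the replay buffer; the attribute names (e.g., `conv_blocks`) and sizes are illustrative assumptions, not the authors' interface.

```python
# Sketch: freeze lower encoder layers and store latent vectors instead of raw images.
import torch

def freeze_lower_layers(encoder, num_frozen=2):
    """Freeze the first `num_frozen` conv blocks of the encoder early in training."""
    for i, layer in enumerate(encoder.conv_blocks):
        if i < num_frozen:
            for p in layer.parameters():
                p.requires_grad = False

class LatentReplayBuffer:
    """Stores encoder latents rather than raw pixel observations, allowing a larger capacity."""
    def __init__(self, capacity, latent_dim):
        self.latents = torch.zeros(capacity, latent_dim)
        self.idx, self.full, self.capacity = 0, False, capacity

    def add(self, frozen_encoder, obs):
        with torch.no_grad():                                   # encoder is frozen: no gradients needed
            self.latents[self.idx] = frozen_encoder(obs.unsqueeze(0)).squeeze(0)
        self.idx = (self.idx + 1) % self.capacity
        self.full = self.full or self.idx == 0
```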

Aravind Srinivas, Kimin Lee
-
[ Video ]

In order for autonomous vehicles to share the road safely with human drivers, autonomous vehicles must abide by certain "road rules" that human drivers have agreed all road users must follow. "Road rules" include rules that drivers are required to follow by law – such as the requirement that vehicles stop at red lights – as well as more subtle social rules – such as the implicit designation of fast lanes on the highway. In this paper, we provide empirical evidence that suggests that – instead of hard-coding these road rules into self-driving algorithms – a scalable alternative may be to design multi-agent environments such that agents within the environments discover for themselves that these road rules are mutually beneficial to follow. We analyze what components of our chosen multi-agent environment cause the emergence of such behavior and find that two crucial factors are noisy perception and the spatial density of agents. We provide qualitative and quantitative evidence of the emergence of seven social driving behaviors, ranging from stopping at a traffic signal to following lanes. Our results add empirical support for the social road rules that countries around the world have agreed on for safe driving.

Yuan-Hong Liao, Sanja Fidler
-
[ Video ]

We seek to theoretically formalize how credit assignment affects modularity in the decisions of a reinforcement learning system. To this end, we introduce an algorithmic causal model of reinforcement learning by expressing both the Markov decision process and credit assignment process under the same causal graph. These graphs show that under certain conditions, local credit assignment mechanisms, such as those employed in certain temporal difference learning algorithms and the recently proposed societal decision-making framework, induce modularity in the reinforcement learner through algorithmic independencies that are not maintained with more global credit assignment mechanisms, such as those obtained via Monte Carlo estimation in policy gradients. Empirically we observe that this independence translates to more efficient transfer learning when the Markov decision process changes in a way that should only affect a single decision.

Michael Chang, Tom Griffiths, Sergey Levine, Kimin Lee
-
[ Video ]

This paper investigates how to weight imperfect expert demonstrations for generative adversarial imitation learning. The agent is expected to perform the behaviors demonstrated by experts, but in many applications experts also make mistakes, and their demonstrations can mislead or slow the learning process of the agent. Existing methods for imitation learning from imperfect demonstrations mostly focus on using preference or confidence scores to distinguish imperfect demonstrations. However, this auxiliary information needs to be collected with the help of an oracle, which is usually hard and expensive to obtain in practice. In contrast, this paper proposes a method of learning to weight imperfect demonstrations in generative adversarial imitation learning (GAIL) without imposing much prior information. We provide a rigorous mathematical analysis showing that the weights of demonstrations can be exactly determined by combining the discriminator and the agent policy in GAIL. The theoretical analysis suggests that, with the estimated weights, the agent can learn a better policy than one obtained from the plain expert demonstrations. Experiments in the MuJoCo and Atari environments demonstrate that the proposed algorithm outperforms baseline methods in handling imperfect expert demonstrations.

Bo Du, Yunke Wang, Chang Xu
-
[ Video ]
Planning in large state spaces inevitably needs to balance the depth and breadth of the search. This balance has a crucial impact on a planner's performance, and most planners manage the interplay implicitly. We present a novel method, \textit{Shoot Tree Search (STS)}, which makes it possible to control this trade-off more explicitly. Our algorithm can be understood as an interpolation between two celebrated search mechanisms: MCTS and random shooting. It also lets the user control the bias-variance trade-off, akin to $TD(n)$, but in the tree search context. In experiments on challenging domains, we show that STS can get the best of both worlds, consistently achieving higher scores.
Piotr Januszewski
-
Poster: Parameter-based Value Functions (Poster) [ Video ]
Francesco Faccio, Jürgen Schmidhuber
-
Poster: Influence-aware Memory for Deep Reinforcement Learning in POMDPs (Poster) [ Video ]
Miguel Suau de Castro, Jinke He, Rolf Starre, Frans Oliehoek
-
Poster: Modular Training, Integrated Planning Deep Reinforcement Learning for Mobile Robot Navigation (Poster) [ Video ]
Greg Kahn
-
Poster: How to make Deep RL work in Practice (Poster) [ Video ]
-
Poster: Super-Human Performance in Gran Turismo Sport Using Deep Reinforcement Learning (Poster) [ Video ]
Davide Scaramuzza
-
Poster: Which Mutual-Information Representation Learning Objectives are Sufficient for Control? (Poster) [ Video ]
Kate Rakelly, Abhishek Gupta, Carlos Florensa, Sergey Levine
-
[ Video ]

Engineering reward functions to successfully train an agent and generalize to downstream tasks is challenging. Automated Curriculum Learning (ACL) proposes to generate tasks instead. This avoids spurious optima and reduces engineering effort. Recent works in ACL generate policies and tasks jointly such that the policies can be used as an initialization to quickly learn downstream tasks. However, these methods expose the agent to task distribution shifts, which hurt its generalization capability. We automatically design a curriculum that helps overcome this drawback. First, we show that retraining the agent from scratch on a stationary task distribution improves generalization. Second, given a generated task, we construct an easier task by distilling the reward function into a narrower network reducing reward sparsity. Using this distilled task to form a simple curriculum, we obtain better generalization to downstream tasks than our baseline retrained agent.

Jürgen Schmidhuber
-
Poster: Self-Supervised Policy Adaptation during Deployment (Poster) [ Video ]
Yu Sun, Alexei Efros, Lerrel Pinto, Xiaolong Wang
-
Poster: Trust, but verify: model-based exploration in sparse reward environments (Poster) [ Video ]
Lukasz Kucinski, Konrad Czechowski
-
[ Video ]

We study how effectively decentralized autonomous vehicles (AVs), operating at low penetration rates, can be used to optimize the throughput of a scaled model of the San Francisco-Oakland Bay Bridge. In Flow, a library for applying deep reinforcement learning to traffic micro-simulators, we consider the problem of improving the throughput of a traffic benchmark: a two-stage bottleneck where four lanes reduce to two and then to one. Although there is extensive work examining variants of this problem in a centralized setting, there is less study of the challenging multi-agent setting where the large number of interacting AVs leads to significant optimization challenges. We apply multi-agent reinforcement learning algorithms to this problem and demonstrate that significant improvements in bottleneck throughput, from 20% at a 5% penetration rate to 33% at a 40% penetration rate, can be achieved. We compare our results to a hand-designed feedback controller and demonstrate that our results sharply outperform the feedback controller despite extensive tuning of the latter. Finally, we demonstrate that macroscopic sensing is not needed: this control scheme can be implemented using information available from local sensors such as radar or LIDAR. This suggests the possibility of deploying these schemes in the near future to improve traffic efficiency.

Eugene Vinitsky, Kanaad V Parvate
-
[ Video ]

Deep reinforcement learning has been one of the fastest growing fields of machine learning over the past years, and numerous libraries have been open-sourced to support research. However, most code bases have a steep learning curve or limited flexibility that does not satisfy the need for fast prototyping in fundamental research. This paper introduces Tonic, a Python library allowing researchers to quickly implement new ideas and measure their importance by providing: 1) a collection of configurable modules such as exploration strategies, memories, neural networks, and updaters; 2) a collection of baseline agents (A2C, PPO, TRPO, MPO, DDPG, D4PG, TD3 and SAC) built with these modules; 3) support for the two most popular deep learning frameworks, TensorFlow 2 and PyTorch; 4) support for the three most popular sets of continuous-control environments, OpenAI Gym, DeepMind Control Suite and PyBullet; 5) a large-scale benchmark of the baseline agents on 70 continuous-control tasks; and 6) scripts to experiment in a reproducible way, plot results, and play with trained agents.

Fabio Pardo
-
[ Video ]

Since the introduction of DQN, a vast majority of reinforcement learning research has focused on reinforcement learning with deep neural networks as function approximators. New methods are typically evaluated on a set of environments that have now become standard, such as Atari 2600 games. While these benchmarks help standardize evaluation, their computational cost has the unfortunate side effect of widening the gap between those with ample access to computational resources and those without. In this work we argue that, despite the community's emphasis on large-scale environments, traditional small-scale environments can still yield valuable scientific insights and can help reduce the barriers to entry for underprivileged communities. To substantiate our claims, we empirically revisit the paper which introduced the Rainbow algorithm [Hessel et al., 2018] and present some new insights into the algorithms used by Rainbow.

Pablo Samuel Castro, Johan Obando Ceron
-
[ Video ]

Temporal information is essential to learning effective policies with Reinforcement Learning (RL). However, current state-of-the-art RL algorithms either assume that such information is given as part of the state space or, when learning from pixels, use the simple heuristic of frame-stacking to implicitly capture temporal information present in the image observations. This heuristic is in contrast to the current paradigm in video classification architectures, which utilize explicit encodings of temporal information through methods such as optical flow and two-stream architectures to achieve state-of-the-art performance. Inspired by leading video classification architectures, we introduce the Flow of Latents for Reinforcement Learning (Flare), a network architecture for RL that explicitly encodes temporal information through latent vector differences. We show that Flare (i) recovers optimal performance in state-based RL without explicit access to the state velocity, solely with positional state information, (ii) achieves state-of-the-art performance on pixel-based continuous control tasks within the DeepMind Control benchmark suite, and (iii) is the most sample efficient model-free pixel-based RL algorithm on challenging environments in the DeepMind Control Suite such as quadruped walk, hopper hop, finger turn hard, pendulum swing, and walker run, outperforming the prior model-free state-of-the-art by 1.9× and 1.5× on the 500k and 1M step benchmarks, respectively.
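
A minimal sketch of the latent-difference idea described above: rather than stacking raw frames, the policy input pairs each latent with the difference to its predecessor; shapes and names are illustrative assumptions.

```python
# Sketch: fuse latent vectors with their temporal differences as the policy input.
import torch

def flare_features(latents):
    """latents: tensor of shape (B, T, D) holding encoder outputs for T consecutive frames."""
    diffs = latents[:, 1:] - latents[:, :-1]              # (B, T-1, D) latent "velocities"
    fused = torch.cat([latents[:, 1:], diffs], dim=-1)    # pair each latent with its difference
    return fused.flatten(start_dim=1)                     # (B, (T-1) * 2D) fed to the policy/critic
```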

Aravind Srinivas, Aravind Rajeswaran, Wenling Shang, Yang Gao, Misha Laskin
-
[ Video ]

Standard dynamics models for continuous control make use of feedforward computation and assume that different dimensions of the next state and reward are conditionally independent given the current state and action. This modeling choice may be driven by the fact that fully observable physics-based simulation environments entail deterministic transition dynamics. In this paper, we question this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate different dimensions of the next state and reward sequentially conditioned on previous dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on heldout trajectories. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state-of-the-art. Finally, we show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer through data augmentation and improving performance using model-based planning.
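
A minimal sketch of an autoregressive dynamics model in the spirit of the abstract: each next-state dimension is sampled conditioned on the state, the action, and the previously generated dimensions. The per-dimension Gaussian heads are an illustrative modeling assumption.

```python
# Sketch: next-state dimensions generated sequentially instead of independently.
import torch
import torch.nn as nn

class AutoregressiveDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        # One small head per state dimension; head i conditions on (s, a, s'_{<i}).
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim + action_dim + i, hidden), nn.ReLU(),
                          nn.Linear(hidden, 2))           # mean and log-std of dimension i
            for i in range(state_dim)
        ])

    def sample_next_state(self, state, action):
        generated = []
        for head in self.heads:
            context = torch.cat([state, action] + generated, dim=-1)
            mean, log_std = head(context).chunk(2, dim=-1)
            generated.append(mean + log_std.exp() * torch.randn_like(mean))
        return torch.cat(generated, dim=-1)
```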

Michael Zhang, Thomas Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, Mohammad Norouzi
-
Poster: AWAC: Accelerating Online Reinforcement Learning With Offline Datasets (Poster) [ Video ]
Ashvin Nair, Murtaza Dalal, Abhishek Gupta, Sergey Levine
-
Poster: Inter-Level Cooperation in Hierarchical Reinforcement Learning (Poster) [ Video ]
Sergey Levine, Alexandre Bayen
-
[ Video ]

Context, the embedding of previously collected trajectories, is a powerful construct for Meta-Reinforcement Learning (Meta-RL) algorithms. By conditioning on an effective context, Meta-RL policies can easily generalize to new tasks within a few adaptation steps. We argue that improving the quality of context involves answering two questions: 1. How do we train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? 2. How do we collect informative trajectories whose corresponding context reflects the specification of the task? To this end, we propose a novel Meta-RL framework called CCM (Contrastive learning augmented Context-based Meta-RL). We first focus on the contrastive nature behind different tasks and leverage it to train a compact and sufficient context encoder. Further, we train a separate exploration policy and theoretically derive a new information-gain-based objective which aims to collect informative trajectories in a few steps. Empirically, we evaluate our approach on common benchmarks as well as several complex sparse-reward environments. The experimental results show that CCM outperforms state-of-the-art algorithms by addressing these two problems.

Chen Chen, Wulong Liu
-
[ Video ]

In this paper, we investigate learning temporal abstractions in cooperative multi-agent systems using the popular options framework. We address the planning problem for the decentralized POMDP represented by the multi-agent system by introducing a common information approach. We use the notion of common beliefs and broadcasting to solve an equivalent centralized POMDP problem. We propose the Distributed Option Critic (DOC) algorithm in its on- and off-policy forms, both of which use centralized option evaluation and decentralized intra-option improvement, and employ the common information approach for belief updates. We theoretically analyze the asymptotic convergence of DOC and test its validity on diverse multi-agent environments. Moreover, we study the impact of different broadcast schemes on overall performance while inducing cooperation in social dilemma games. Our experiments show empirically that DOC performs competitively against baselines and scales with the number of options.

Abhinav Gupta, Jhelum Chakravorty, Jikun Kang, Xue (Steve) Liu, Doina Precup
-
[ Video ]

Self-supervised learning and data augmentation have significantly reduced the performance gap between state and image-based reinforcement learning agents in continuous control tasks. However, it is still unclear whether current techniques can face the variety of visual conditions required by real-world environments. We propose a challenging benchmark that tests agents' visual generalization by adding graphical variety to existing continuous control domains. Our empirical analysis shows that current methods struggle to generalize across a diverse set of visual changes, and we examine the specific factors of variation that make these tasks difficult. We find that data augmentation techniques outperform self-supervised learning approaches, and that more significant image transformations provide better visual generalization.

Jake Grigsby, Yanjun Qi
-
[ Video ]

Most existing policy learning solutions require the learning agent to receive high-quality supervision signals, e.g., rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is either infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages weak supervision to perform policy learning efficiently. To handle this problem, we treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreement). Our way of leveraging the peer agent's information offers us a family of solutions that learn effectively from weak supervision with theoretical guarantees. Extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training (RL + BC) show that the proposed approach leads to substantial improvements, especially when the complexity or the noise of the learning environment grows.

Yang Liu
-
[ Video ]

Deep reinforcement learning (RL) agents are able to learn contact-rich manipulation tasks by maximizing a reward signal, but require large amounts of experience, especially in environments with many obstacles that complicate exploration. In contrast, motion planners use explicit models of the agent and environment to plan collision-free paths to faraway goals, but suffer from inaccurate models in tasks that require contacts with the environment. To combine the benefits of both approaches, we propose motion planner augmented RL (MoPA-RL) which augments the action space of an RL agent with the long-horizon planning capabilities of motion planners. Based on the magnitude of the action, our approach smoothly transitions between directly executing the action and invoking a motion planner. We evaluate our approach on various simulated manipulation tasks and compare it to alternative action spaces in terms of learning efficiency and safety. The experiments demonstrate that MoPA-RL increases learning efficiency, leads to a faster exploration of the environment, and results in safer policies that avoid collisions with the environment.
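
A hedged sketch of the augmented action space described above: small actions are executed directly, while large actions are interpreted as displacement targets handed to a motion planner; `motion_planner`, the threshold, and the environment interface are illustrative assumptions, not the authors' exact API.

```python
# Sketch: switch between direct execution and motion-planner execution based on action magnitude.
import numpy as np

def mopa_step(env, state, action, motion_planner, threshold=0.5):
    if np.linalg.norm(action, ord=np.inf) <= threshold:
        return env.step(action)                       # small action: execute directly
    target = state + action                           # large action: treat as a displacement target
    path = motion_planner.plan(start=state, goal=target)   # collision-free path from the planner
    transition = None
    for waypoint_action in path:                      # follow the planned path step by step
        transition = env.step(waypoint_action)
    return transition
```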

Youngwoon Lee, Karl Pertsch, Max Pflueger, Gaurav Sukhatme, Joseph Lim
-
[ Video ]

Advances in visual navigation methods have led to intelligent embodied navigation agents capable of learning meaningful representations from raw RGB images and performing a wide variety of tasks involving structural and semantic reasoning. However, most learning-based navigation policies are trained and tested in simulation environments. In order for these policies to be practically useful, they need to be transferred to the real world. In this paper, we propose an unsupervised domain adaptation method for visual navigation. Our method translates images in the target domain to the source domain such that the translation is consistent with the representations learned by the navigation policy. The proposed method outperforms several baselines across two different navigation tasks in simulation. We further show that our method can be used to transfer navigation policies learned in simulation to the real world.

Devendra Singh Chaplot, Yao-Hung Hubert Tsai, Yue Wu, LP Morency, Ruslan Salakhutdinov
-
[ Video ]

We introduce a method of learning an abstract state representation for Markov Decision Processes (MDPs) with rich observations. We begin by proving that a combination of three conditions is sufficient for a learned state abstraction to retain the Markov property. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. We evaluate our approach with a proof-of-concept visual gridworld experiment, where the learned representation captures the underlying structure of the domain and enables substantially improved learning performance over end-to-end deep RL, matching the performance achieved with hand-designed compact state information.
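
A hedged sketch of a combined training objective in the spirit of the procedure described above: an inverse-model term plus a temporal contrastive term over abstract states. The architectures and the exact contrastive formulation are illustrative assumptions.

```python
# Sketch: inverse-model loss + temporal contrastive loss over learned abstract states.
import torch
import torch.nn as nn
import torch.nn.functional as F

def abstraction_loss(encoder, inverse_model, contrast_head, obs, next_obs, actions):
    z, z_next = encoder(obs), encoder(next_obs)
    # (i) inverse dynamics: the abstraction must retain enough information to infer the action
    inverse_loss = F.cross_entropy(inverse_model(torch.cat([z, z_next], dim=-1)), actions)
    # (ii) temporal contrast: real (z, z') pairs vs. pairs with shuffled next states
    shuffled = z_next[torch.randperm(z_next.size(0))]
    real = contrast_head(torch.cat([z, z_next], dim=-1))
    fake = contrast_head(torch.cat([z, shuffled], dim=-1))
    logits = torch.cat([real, fake], dim=0).squeeze(-1)
    labels = torch.cat([torch.ones(real.size(0)), torch.zeros(fake.size(0))]).to(logits.device)
    contrastive_loss = F.binary_cross_entropy_with_logits(logits, labels)
    return inverse_loss + contrastive_loss
```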

Cameron Allen, Neev Parikh, George Konidaris
-
[ Video ]

The value function lies at the heart of Reinforcement Learning (RL), defining the long-term evaluation of a policy in a given state. In this paper, we propose the Policy-extended Value Function Approximator (PeVFA), which extends the conventional value function to be not only a function of the state but also of an explicit policy representation. Such an extension enables PeVFA to preserve the values of multiple policies, in contrast to a conventional approximator with capacity for only one policy. This induces a new characteristic of the value function, i.e., \emph{value generalization among policies}. From both theoretical and empirical perspectives, we study value generalization along the policy improvement path (called local generalization), and use PeVFA to derive a new form of Generalized Policy Iteration that improves the conventional learning process. Moreover, we propose a framework to learn the representation of an RL policy, studying several different approaches to learn an effective policy representation from policy network parameters and state-action pairs. In our experiments, Proximal Policy Optimization (PPO) with PeVFA significantly outperforms its vanilla counterpart on MuJoCo continuous control tasks, demonstrating the effectiveness of the value generalization offered by PeVFA and of policy representation learning.
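
A hedged sketch of a policy-extended value function approximator: the value network takes a policy representation alongside the state. The particular representation used below (flattened output-layer weights) is an illustrative assumption; the paper studies several learned alternatives.

```python
# Sketch: a value network conditioned on both the state and a policy representation.
import torch
import torch.nn as nn

class PeVFA(nn.Module):
    def __init__(self, state_dim, policy_repr_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + policy_repr_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state, policy_repr):
        return self.net(torch.cat([state, policy_repr], dim=-1))

def policy_representation(policy_net):
    """One simple illustrative choice: the flattened weight matrix of the policy's output layer."""
    output_weight = list(policy_net.parameters())[-2]   # assumes a Linear output layer (weight, bias)
    return output_weight.detach().flatten()
```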

Zhaopeng Meng, Jianye Hao, Chen Chen, Daniel Graves, Wulong Liu
-
Poster: Energy-based Surprise Minimization for Multi-Agent Value Factorization (Poster) [ Video ]
Karush Suri, XIAO QI SHI
-
[ Video ]

Researchers have attempted to leverage cutting-edge Deep Reinforcement Learning (DRL) techniques to develop autonomous trade execution systems that learn to minimize execution costs. While several researchers have reported success in developing such systems, they all back-test the trade execution policies on historical datasets. One of the biggest drawbacks of back-testing on historical data is that it cannot account for the permanent market impacts caused by interactions among various trading agents, or for real-world factors such as network latency and computational delays.

In this article, we investigate an agent-based market simulator as a back-testing tool. More specifically, we design agents which use the trade execution policies learned by two previously proposed Deep Reinforcement Learning algorithms, a modified Deep-Q Network (DQN) and Proximal Policy Optimization with Long-Short Term Memory networks (PPO LSTM), to execute trades and interact with each other in the market simulator.

Siyu Lin, Peter Beling
-
[ Video ]

We introduce Successor Landmarks: a novel framework for exploring large high-dimensional state-spaces and learning to navigate to distant goals. We exploit the capacity of successor features to capture local dynamics to define a novel similarity metric that measures the degree to which two state-action pairs lead to the same part of the state-space. With this, we are able to define a non-parametric landmark graph, which we use to facilitate exploration around "frontier" landmarks at the edge of the explored state-space. During evaluation, our landmark graph enables the agent to localize itself and execute trajectories it plans to long-horizon goals. Our experiments on ViZDoom show that Successor Landmarks enable a more robust navigation success rate across a range of goal-distances. Additionally, we demonstrate that our framework encourages exploration on a large map.

Sungryull Sohn, Jongwook Choi
-
Poster: Multi-task Reinforcement Learning with a Planning Quasi-Metric (Poster) [ Video ]
François Fleuret
-
[ Video ]

Attention mechanisms are generic inductive biases that have played a critical role in improving the state-of-the-art in supervised learning, unsupervised pre-training and generative modeling for multiple domains including vision, language and speech. However, they remain relatively under-explored for neural network architectures typically used in reinforcement learning (RL) from high dimensional inputs such as pixels. In this paper, we propose and study the effectiveness of augmenting a simple attention module in the convolutional encoder of an RL agent. Through experiments on the widely benchmarked DeepMind Control Suite environments, we demonstrate that our proposed module can (i) extract interpretable task-relevant information such as agent locations and movements without the need for data augmentations or contrastive losses; (ii) significantly improve the sample-efficiency and final performance of the agents. We hope our simple and effective approach will serve as a strong baseline for future research incorporating attention mechanisms in reinforcement learning and control.
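
A minimal sketch of augmenting a convolutional RL encoder with a simple spatial-attention module, in the spirit of the abstract; the specific module used in the paper may differ, and the architecture below is an illustrative assumption.

```python
# Sketch: a spatial attention mask applied to the feature map of a convolutional RL encoder.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Produces a (B, 1, H, W) attention mask and reweights the feature map with it."""
    def __init__(self, channels):
        super().__init__()
        self.attend = nn.Sequential(nn.Conv2d(channels, channels // 2, 1), nn.ReLU(),
                                    nn.Conv2d(channels // 2, 1, 1))

    def forward(self, features):
        mask = torch.sigmoid(self.attend(features))      # interpretable: where the encoder "looks"
        return features * mask, mask

class AttentiveEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
                                  nn.Conv2d(32, 64, 4, stride=2), nn.ReLU())
        self.attention = SpatialAttention(64)

    def forward(self, obs):
        features, mask = self.attention(self.conv(obs))
        return features.flatten(start_dim=1), mask       # the mask can be visualized for analysis
```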

Mandi Zhao, Qiyang Li, Aravind Srinivas, Ignasi Clavera Gilaberte, Kimin Lee
-
Poster: Quantifying Differences in Reward Functions (Poster) [ Video ]
Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell
-
Poster: DERAIL: Diagnostic Environments for Reward And Imitation Learning (Poster) [ Video ]
Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell
-
Poster: Exploring Zero-Shot Emergent Communication in Embodied Multi-Agent Populations (Poster) [ Video ]
Kalesha Bullard, Douwe Kiela, Joelle Pineau
-
Poster: Unlocking the Potential of Deep Counterfactual Value Networks (Poster) [ Video ]
Noam Brown
-
Poster: FactoredRL: Leveraging Factored Graphs for Deep Reinforcement Learning (Poster) [ Video ]
-
Poster: Reusability and Transferability of Macro Actions for Reinforcement Learning (Poster) [ Video ]
Yi Hsiang Chang, Henry Kuo
-
Poster: Interactive Visualization for Debugging RL (Poster) [ Video ]
Shuby Deshpande, Jeff Schneider, Benjamin Eysenbach
-
Poster: A Deep Value-based Policy Search Approach for Real-world Vehicle Repositioning on Mobility-on-Demand Platforms (Poster) [ Video ]
ZHIWEI QIN, Hongtu zhu, Jieping Ye
-
Poster: FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance (Poster) [ Video ]
Xiao-Yang Liu, Hongyang Yang
-
Poster: Visual Imitation with Reinforcement Learning using Recurrent Siamese Networks (Poster) [ Video ]
Glen Berseth, Florian Golemo, Chris Pal
-
Poster: Learning Accurate Long-term Dynamics for Model-based Reinforcement Learning (Poster) [ Video ]
Nathan Lambert, Roberto Calandra
-
Poster: XLVIN: eXecuted Latent Value Iteration Nets (Poster) [ Video ]
Andreea Deac, Petar Veličković, Jian Tang
-
[ Video ]

In reinforcement learning, the return, the weighted accumulation of future rewards, and the value, the expected return, serve as the objectives that guide learning of the policy. In classic RL, the return is defined as the exponentially discounted sum of future rewards. One key insight is that there can be many feasible ways to define the form of the return function (and thus the value) from which the same optimal policy can be derived, yet these different forms may render dramatically different speeds of learning this policy. In this paper, we study how to modify the form of the return function to enhance learning towards the optimal policy. We propose a general mathematical form for the return function and employ meta-learning to learn the optimal return function in an end-to-end manner. We test our methods on a specially designed maze environment and several Atari games, and our experimental results clearly indicate the advantages of automatically learning optimal return functions in reinforcement learning.

Qiwei Ye, Tie-Yan Liu
-
[ Video ]

We aim to help users communicate their intent to machines using flexible, adaptive interfaces that translate arbitrary user input into desired actions. In this work, we focus on assistive typing applications in which a user cannot operate a keyboard, but can instead supply other inputs, such as webcam images that capture eye gaze. Standard methods train a model on a fixed dataset of user inputs, then deploy a static interface that does not learn from its mistakes; in part, because extracting an error signal from user behavior can be challenging. We investigate a simple idea that would enable such interfaces to improve over time, with minimal additional effort from the user: online learning from implicit user feedback on the accuracy of the interface's actions. In the typing domain, we leverage backspaces as implicit feedback that the interface did not perform the desired action. We propose an algorithm called x-to-text (XT2) that trains a predictive model of this implicit feedback signal, and uses this model to fine-tune any existing, default interface for translating user input into actions that select words or characters. We evaluate XT2 through a small-scale online user study with 12 participants who type sentences by gazing at their desired words, and a large-scale observational study on handwriting samples from 60 users. The results show that XT2 learns to outperform a non-adaptive default interface, stimulates user co-adaptation to the interface, personalizes the interface to individual users, and can leverage offline data collected from the default interface to improve its initial performance and accelerate online learning.

Sid Reddy, Anca Dragan, Sergey Levine
-
[ Video ]

This paper presents a novel multi-step reinforcement learning algorithm, named Greedy Multi-Step Value Iteration (GM-VI), for the off-policy setting. GM-VI iteratively approximates the optimal value function of a given environment using a newly proposed multi-step bootstrapping technique, in which the step size is adaptively adjusted along each trajectory according to a greedy principle. With the improved multi-step information propagation mechanism, we show that the resulting VI process is capable of safely learning from an arbitrary behavior policy without additional off-policy correction. We further analyze the theoretical properties of the corresponding operator, showing that it converges to the globally optimal value function at a rate faster than the traditional Bellman Optimality Operator. Experiments reveal that the proposed method is reliable, easy to implement, and achieves state-of-the-art performance on a series of standard benchmark datasets.

Yuhui Wang, Xiaoyang Tan
-
[ Video ]

Learning to reach goal states and learning diverse skills through mutual information maximization have been proposed as principled frameworks for unsupervised reinforcement learning, allowing agents to acquire broadly applicable multi-task policies with minimal reward engineering. In this paper, we discuss how these two approaches — goal-conditioned RL (GCRL) and MI-based RL — can be generalized into a single family of methods, interpreting mutual information maximization and variational empowerment as representation learning methods that acquire functionally aware state representations for goal reaching. Starting from a simple observation that the standard GCRL is encapsulated by the optimization objective of variational empowerment, we can derive novel variants of GCRL and variational empowerment under a single, unified optimization objective, such as adaptive-variance GCRL and linear-mapping GCRL, and study the characteristics of representation learning each variant provides. Furthermore, through the lens of GCRL, we show that adapting powerful techniques from GCRL such as goal relabeling into the variational MI context as well as proper regularization on the variational posterior provides substantial gains in algorithm performance, and propose a novel evaluation metric named latent goal reaching (LGR) as an objective measure for evaluating empowerment algorithms akin to goal-based RL. Through principled mathematical derivations and careful experimental validations, our work lays a novel foundation from which representation learning can be evaluated and analyzed in goal-based RL.

Sergey Levine, Honglak Lee, Shixiang (Shane) Gu, Jongwook Choi
-
Poster: Robust Domain Randomised Reinforcement Learning through Peer-to-Peer Distillation (Poster) [ Video ]
Timothy Hospedales
-
Poster: ReaPER: Improving Sample Efficiency in Model-Based Latent Imagination (Poster) [ Video ]
Martin Bertran, Guillermo Sapiro, Mariano Phielipp
-
Poster: Model-Based Reinforcement Learning: A Compressed Survey (Poster) [ Video ]
Thomas Moerland
-
Poster: BeBold: Exploration Beyond the Boundary of Explored Regions (Poster) [ Video ]
Tianjun Zhang, Huazhe Xu, Xiaolong Wang, YI WU, Kurt Keutzer, Yuandong Tian
-
[ Video ]

A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when rewards are not provided and distances in the observation space are not meaningful. Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model as well as a dynamical distance function learned using model-free reinforcement learning. This approach trains entirely using offline, unlabeled data, making it practical to scale to large and diverse datasets. On several challenging robotic manipulation tasks with only offline, unlabeled data, we find that our algorithm compares favorably to prior model-based and model-free reinforcement learning methods. In ablation experiments, we additionally identify important factors for learning effective distances.

Suraj Nair, Frederik Ebert, Benjamin Eysenbach, Chelsea Finn, Sergey Levine
-
[ Video ]

For deep neural network accelerators, memory movement is energetically expensive and can bound computation. Therefore, optimal mapping of tensors to memory hierarchies is critical to performance. The growing complexity of neural networks calls for automated memory mapping instead of manual heuristic approaches, yet the search space of neural network computational graphs has previously been prohibitively large. We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces that combines graph neural networks, reinforcement learning, and evolutionary search. A set of fast, stateless policies guides the evolutionary search to improve its sample efficiency. We train and validate our approach directly on the Intel NNP-I chip for inference. EGRL outperforms policy-gradient, evolutionary search and dynamic programming baselines on BERT, ResNet-101 and ResNet-50. We additionally achieve a 28-78% speed-up compared to the native NNP-I compiler on all three workloads.

Somdeb Majumdar, Avrech Ben-David, Santiago Miret, Shie Mannor, Tamir Hazan, Hanlin Tang
-
[ Video ]

In many deep reinforcement learning settings, when an agent takes an action, it repeats the same action a predefined number of times without observing the states until the next action-decision point. This technique of action repetition has several merits in training the agent, but the data between action-decision points (i.e., intermediate frames) are, in effect, discarded. Since the amount of training data is inversely proportional to the interval of action repeats, large repeat intervals can have a negative impact on the sample efficiency of training. In this paper, we propose a simple but effective approach to alleviate this problem by introducing the concept of pseudo-actions. The key idea of our method is to make the transitions between action-decision points usable as training data by considering pseudo-actions. Pseudo-actions for continuous control tasks are obtained as the average of the action sequence straddling an action-decision point. For discrete control tasks, pseudo-actions are computed from learned action embeddings. This method can be combined with any model-free reinforcement learning algorithm that involves the learning of Q-functions. We demonstrate the effectiveness of our approach on both continuous and discrete control tasks in OpenAI Gym.
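
A hedged sketch of one plausible reading of the continuous-control pseudo-action construction described above: a transition window that straddles an action-decision point is paired with a weighted average of the two decided actions. This is an illustration only, not the authors' exact construction.

```python
# Sketch: pseudo-actions for transition windows that straddle an action-decision point.
import numpy as np

def pseudo_action_transitions(frames, decided_actions, action_repeat):
    """frames[t] is the observation at frame t; decided_actions[b] is applied for frames
    b*action_repeat ... (b+1)*action_repeat - 1. Returns (s, pseudo_a, s') tuples that each
    span one repeat window, including windows starting at intermediate frames."""
    transitions = []
    for t in range(len(frames) - action_repeat):
        b, offset = divmod(t, action_repeat)
        if offset == 0:
            pseudo_a = np.asarray(decided_actions[b])          # ordinary decision-point transition
        elif b + 1 < len(decided_actions):
            # window covers (action_repeat - offset) steps of action b and offset steps of action b+1
            w = offset / action_repeat
            pseudo_a = (1 - w) * np.asarray(decided_actions[b]) + w * np.asarray(decided_actions[b + 1])
        else:
            continue                                           # incomplete tail window: skip
        transitions.append((frames[t], pseudo_a, frames[t + action_repeat]))
    return transitions
```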

Yoshimasa Tsuruoka
-
[ Video ]

Actor-Critic (AC) algorithms are a powerful class of reinforcement learning (RL) algorithms. They can be seen as a two-player game between an actor, controlling the policy, that attempts to maximize future reward under the current value function, and a critic, controlling the value function, that tries to fit the expected future value of the states in the transitions collected by the actor. While AC was framed from the beginning as a two-level optimization problem, many recent methods do not pay attention to the temporal aspect of the interaction between actor and critic. We propose to analyze the AC game as a Stackelberg game, where one player is the leader and the other the follower, yielding two families of algorithms that connect and categorize previous AC methods. This further allows us to prove that a policy found in a Stackelberg equilibrium is at least as good as a policy in a Nash equilibrium of actor and critic.

Robert Müller
-
[ Video ]

Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on the state transition experience collected from the beginning of environment interactions. This implies that the time required to train the dynamics model - and the pause required between plan executions - grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it enables constant-time dynamics learning sessions between planning steps and only needs to store the most recent fixed-size portion of the state transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual learning alternatives that rely on fixed-capacity networks, and performs competitively with baselines that remember an ever-increasing coreset of past experience. We show that HyperCRL is effective for continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening.

Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti
-
Poster: Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies (Poster) [ Video ]
Yunhao Tang, Krzysztof Choromanski
-
Poster: Policy Guided Planning in Learned Latent Space (Poster) [ Video ]
, Doina Precup
-
[ Video ]

OpenAI's Gym library contains a large, diverse set of environments that are useful benchmarks in reinforcement learning, under a single elegant Python API (with tools to develop new compliant environments). The introduction of this library has proven a watershed moment for the reinforcement learning community, because it created an accessible set of benchmark environments that everyone could use (including wrappers for important existing libraries), and because a standardized API lets RL methods and environments from anywhere be trivially exchanged. This paper similarly introduces PettingZoo, a library of diverse sets of multi-agent environments under a single elegant Python API, with tools to easily make new compliant environments.

-
[ Video ]

We introduce DREAM, a regret-based deep reinforcement learning algorithm that converges to an equilibrium in imperfect-information multi-agent settings. Our primary contribution is an effective algorithm that, in contrast to other regret-based deep learning algorithms, does not require access to a perfect simulator of the game in order to achieve good performance. We show that DREAM empirically achieves state-of-the-art performance among model-free algorithms in popular benchmark games, and is even competitive with algorithms that do use a perfect simulator.

Eric Steinberger

Author Information

Pieter Abbeel (UC Berkeley & covariant.ai)

Pieter Abbeel is Professor and Director of the Robot Learning Lab at UC Berkeley [2008- ], Co-Director of the Berkeley AI Research (BAIR) Lab, Co-Founder of covariant.ai [2017- ], Co-Founder of Gradescope [2014- ], Advisor to OpenAI, Founding Faculty Partner AI@TheHouse venture fund, Advisor to many AI/Robotics start-ups. He works in machine learning and robotics. In particular his research focuses on making robots learn from people (apprenticeship learning), how to make robots learn through their own trial and error (reinforcement learning), and how to speed up skill acquisition through learning-to-learn (meta-learning). His robots have learned advanced helicopter aerobatics, knot-tying, basic assembly, organizing laundry, locomotion, and vision-based robotic manipulation. He has won numerous awards, including best paper awards at ICML, NIPS and ICRA, early career awards from NSF, Darpa, ONR, AFOSR, Sloan, TR35, IEEE, and the Presidential Early Career Award for Scientists and Engineers (PECASE). Pieter's work is frequently featured in the popular press, including New York Times, BBC, Bloomberg, Wall Street Journal, Wired, Forbes, Tech Review, NPR.

Chelsea Finn (Stanford)
Joelle Pineau (McGill University)

Joelle Pineau is an Associate Professor and William Dawson Scholar at McGill University where she co-directs the Reasoning and Learning Lab. She also leads the Facebook AI Research lab in Montreal, Canada. She holds a BASc in Engineering from the University of Waterloo, and an MSc and PhD in Robotics from Carnegie Mellon University. Dr. Pineau's research focuses on developing new models and algorithms for planning and learning in complex partially-observable domains. She also works on applying these algorithms to complex problems in robotics, health care, games and conversational agents. She serves on the editorial board of the Journal of Artificial Intelligence Research and the Journal of Machine Learning Research and is currently President of the International Machine Learning Society. She is a recipient of NSERC's E.W.R. Steacie Memorial Fellowship (2018), a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), a Senior Fellow of the Canadian Institute for Advanced Research (CIFAR) and in 2016 was named a member of the College of New Scholars, Artists and Scientists by the Royal Society of Canada.

David Silver (DeepMind)
Satinder Singh (University of Michigan)
Coline Devin (DeepMind)
Misha Laskin (UC Berkeley)
Kimin Lee (UC Berkeley)
Janarthanan Rajendran (University of Michigan)
Vivek Veeriah (University of Michigan)

More from the Same Authors