Timezone: »

Workshop
Deep Reinforcement Learning Workshop
Karol Hausman · Qi Zhang · Matthew Taylor · Martha White · Suraj Nair · Manan Tomar · Risto Vuorio · Ted Xiao · Zeyu Zheng · Manan Tomar

Fri Dec 09 08:25 AM -- 05:35 PM (PST) @ Virtual

In recent years, the use of deep neural networks as function approximators has enabled researchers to extend reinforcement learning techniques to solve increasingly complex control tasks. The emerging field of deep reinforcement learning has led to remarkable empirical results in rich and varied domains like robotics, strategy games, and multi-agent interactions. This workshop will bring together researchers working at the intersection of deep learning and reinforcement learning, and it will help interested researchers outside of the field gain a high-level view about the current state of the art and potential directions for future contributions.

 Fri 8:25 a.m. - 8:30 a.m. Opening Remarks 🔗 Fri 8:30 a.m. - 9:00 a.m. Tobias Gerstenberg (Invited Talk) Tobias Gerstenberg 🔗 Fri 9:00 a.m. - 9:15 a.m. ESCHER: ESCHEWING IMPORTANCE SAMPLING IN GAMES BY COMPUTING A HISTORY VALUE FUNCTION TO ESTIMATE REGRET (Poster)  link » Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promis- ing line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability. We show that the variance of the estimated regret of ESCHER is orders of magnitude lower than DREAM and other baselines. We then show that ESCHER outperforms the prior state of the art—DREAM and neural fictitious self play (NFSP)—on a number of games and the difference becomes dramatic as game size increases. In the very large game of dark chess, ESCHER is able to beat DREAM and NFSP in a head-to-head competition over 90% of the time. Link » Stephen McAleer · Gabriele Farina · Marc Lanctot · Tuomas Sandholm 🔗 Fri 9:15 a.m. - 9:30 a.m. Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training (Poster)  link »    Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories. Link » Jason Yecheng Ma · Shagun Sodhani · Dinesh Jayaraman · Osbert Bastani · Vikash Kumar · Amy Zhang 🔗 Fri 9:30 a.m. - 9:45 a.m. Is Model Ensemble Necessary? Model-based RL via a Single Model with Lipschitz Regularized Value Function (Poster)  link »    Probabilistic dynamics model ensemble is widely used in existing model-based reinforcement learning methods as it outperforms a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights on the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity. We find that, for a value function, the stronger the Lipschitz condition is, the smaller the gap between the true dynamics- and learned dynamics-induced Bellman operators is, thus enabling the converged value function to be closer to the optimal value function. Hence, we hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To validate this hypothesis, we devise two practical robust training mechanisms through computing the adversarial noise and regularizing the value network’s spectral norm to directly regularize the Lipschitz condition of the value functions. Empirical results show that combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with ensemble of the probabilistic dynamics models. These findings not only support the theoretical insight, but also provide a practical solution for developing computationally efficient model-based RL algorithms. Link » Ruijie Zheng · Xiyao Wang · Huazhe Xu · Furong Huang 🔗 Fri 9:45 a.m. - 10:00 a.m. Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes (Poster)  link »    The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the learnings from these works, we re-examine previous design choices and find that with appropriate choices: ResNets, cross-entropy based distributional backups, and feature normalization, offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up-to 80 million parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches. Link » Aviral Kumar · Rishabh Agarwal · XINYANG GENG · George Tucker · Sergey Levine 🔗 Fri 10:00 a.m. - 10:30 a.m. Jakob Foerster (Invited Talk) Jakob Foerster 🔗 Fri 11:00 a.m. - 11:30 a.m. Scientific Experiments in Reinforcement Learning (Opinion Talk) Scott Jordan 🔗 Fri 11:30 a.m. - 11:45 a.m. Transformers are Sample-Efficient World Models (Poster)  link » Deep reinforcement learning agents are notoriously sample inefficient, which considerably limits their application to real-world problems. Recently, many model-based methods have been designed to address this issue, with learning in the imagination of a world model being one of the most prominent approaches. However, while virtually unlimited interaction with a simulated environment sounds appealing, the world model has to be accurate over extended periods of time. Motivated by the success of Transformers in sequence modeling tasks, we introduce IRIS, a data-efficient agent that learns in a world model composed of a discrete autoencoder and an autoregressive Transformer. With the equivalent of only two hours of gameplay in the Atari 100k benchmark, IRIS achieves a mean human normalized score of 1.046, and outperforms humans on 10 out of 26 games, setting a new state of the art for methods without lookahead search. To foster future research on Transformers and world models for sample-efficient reinforcement learning, we release our codebase at this https URL. For the review process, we provide the code and visualizations in the supplementary materials. Link » Vincent Micheli · Eloi Alonso · François Fleuret 🔗 Fri 11:45 a.m. - 12:00 p.m. Scaling Laws for a Multi-Agent Reinforcement Learning Model (Poster)  link »    The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. A substantial amount of attention has been dedicated as a consequence to the description of scaling laws, although mostly for supervised learning and only to a reduced extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws we obtain a power law relating optimal size to compute similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. This scaling law implies that previously published state-of-the-art game-playing models are significantly smaller than their optimal size, given the respective compute budgets. We also show that large AlphaZero models are more sample efficient, performing better than smaller models with the same amount of training data. Link » Oren Neumann · Claudius Gros 🔗 Fri 12:00 p.m. - 12:30 p.m. Natasha Jaques (Opinion Talk) Natasha Jaques 🔗 Fri 1:30 p.m. - 2:00 p.m. The World is not Uniformly Distributed; Important Implications for Deep RL (Opinion Talk) Stephanie Chan 🔗 Fri 2:00 p.m. - 2:30 p.m. Amy Zhang (Invited Talk) Amy Zhang 🔗 Fri 3:00 p.m. - 3:30 p.m. Igor Mordatch (Invited Talk) Igor Mordatch 🔗 Fri 3:30 p.m. - 3:45 p.m. John Schulman (Implementation Talk) John Schulman 🔗 Fri 3:45 p.m. - 4:00 p.m. Danijar Hafner (Implementation Talk) Danijar Hafner 🔗 Fri 4:00 p.m. - 4:15 p.m. Kristian Hartikainen (Implementation Talk) Kristian Hartikainen 🔗 Fri 4:15 p.m. - 4:30 p.m. Ilya Kostrikov, Aviral Kumar (Implementation Talk) Ilya Kostrikov · Aviral Kumar 🔗 Fri 4:30 p.m. - 5:30 p.m. Panel Discussion 🔗 Fri 5:30 p.m. - 5:35 p.m. Closing Remarks 🔗 - Compositional Task Generalization with Modular Successor Feature Approximators (Poster)  link » Recently, the Successor Features and Generalized Policy Improvement (SF&GPI)framework has been proposed as a method for learning, composing and transferringpredictive knowledge and behavior. SF&GPI works by having an agent learnpredictive representations (SFs) that can be combined for transfer to new taskswith GPI. However, to be effective this approach requires state features that areuseful to predict, and these state-features are typically hand-designed. In thiswork, we present a novel neural network architecture, “Modular Successor FeatureApproximators” (MSFA), where modules both discover what is useful to predict,and learn their own predictive representations. We show that MSFA is able tobetter generalize compared to baseline architectures for learning SFs and a modularnetwork that discovers factored state representations. Link » Wilka Carvalho Carvalho 🔗 - Learning Dexterous Manipulation from Exemplar Object Trajectories and Pre-Grasps (Poster)  link » Learning diverse dexterous manipulation behaviors with assorted objects remains an open grand challenge. While policy learning methods offer a powerful avenue to attack this problem, they require extensive per-task engineering and algorithmic tuning. This paper seeks to escape these constraints, by developing a Pre-Grasp informed Dexterous Manipulation (PGDM) framework that generates diverse dexterous manipulation behaviors, without any task-specific reasoning or hyper-parameter tuning. At the core of PGDM is a well known robotics construct, pre-grasps (i.e. the hand-pose preparing for object interaction). This simple primitive is enough to induce efficient exploration strategies for acquiring complex dexterous manipulation behaviors. To exhaustively verify these claims, we introduce TCDM, a benchmark of 50 diverse manipulation tasks defined over multiple objects and dexterous manipulators. Tasks for TCDM are defined automatically using exem-plar object trajectories from various sources (animators, human behaviors, etc.), without any per-task engineering and/or supervision. Our experiments validate that PGDM’s exploration strategy, induced by a surprisingly simple ingredient (single pre-grasp pose), matches the performance of prior methods, which require expen-sive per-task feature/reward engineering, expert supervision, and hyper-parameter tuning. For animated visualizations, trained policies, and project code, please refer to https://sites.google.com/view/pregrasp/. Link » Sudeep Dasari · Vikash Kumar 🔗 - Neural All-Pairs Shortest Path for Reinforcement Learning (Poster)  link » Having an informative and dense reward function is an important requirement toefficiently solve goal-reaching tasks. While the natural reward for such tasks is abinary signal indicating success or failure, providing only a binary reward makeslearning very challenging given the sparsity of the feedback. Hence, introducingdense rewards helps to provide smooth gradients. However, these functions arenot readily available, and constructing them is difficult, as it often requires a lot oftime and domain-specific knowledge, and can unintentionally create spurious localminima. We propose a method that learns neural all-pairs shortest paths, used as adistance function to learn a policy for goal-reaching tasks, requiring zero domain-specific knowledge. In particular, our approach includes both a self-supervisedsignal from the temporal distance between state pairs of an episode, and a metric-based regularizer that leverages the triangle inequality for an additional connectivityinformation between state triples. This dynamical distance can be either used as acost function, or reshaped as a reward, and, differently from previous work, is fullyself-supervised, compatible with off-policy learning and robust to local minima. Link » Cristina Pinneri · Georg Martius · Andreas Krause 🔗 - VI2N: A Network for Planning Under Uncertainty based on Value of Information (Poster)  link »    Planning under uncertainty is an important issue in both neuroscience and computer science that as not been solved. By representing problems in Reinforcement Learning (RL) as Partially Observable Markov Decision Processes (POMDPs), they can be addressed from a theoretical perspective. While solving POMDPs is known to be NP-Hard, recent advances through deep learning have produced impressive neural network solvers, namely the Value Iteration Network (VIN) and the QMDP-Net. These solvers allow for increased learning and generalization to novel domains, but are not complete solutions to the RL problem. In this paper, we propose a new architecture, the VI$^2$N, a POMDP-solving neural network with a built-in Pairwise Heuristic that demonstrates the ability of imitation and reinforcement learning in novel domains where information gathering is necessary. This study shows the VI$^2$N to be at least as good as the state-of-the-art model on the tested environments. Link » Samantha Johnson · Michael Buice · Koosha Khalvati 🔗 - Efficient Multi-Horizon Learning for Off-Policy Reinforcement Learning (Poster)  link » Value estimates at multiple timescales can help create advanced discounting functions and allow agents to form more effective predictive models of their environment. In this work, we investigate learning over multiple horizons concurrently for off-policy deep reinforcement learning using an efficient architecture that combines a deeper network with the crucial components of Rainbow, a popular value-based off-policy algorithm. We use an advantage-based action selection method and our proposed agent learns over multiple horizons simultaneously while using either an exponential or hyperbolic discounting function to estimate the advantage that constitutes the acting policy. We test our approach on the Procgen benchmark, a collection of procedurally-generated environments, to demonstrate the effectiveness of this approach, specifically to evaluate the agent's performance in previously unseen scenarios. Link » Raja Farrukh Ali · Nasik Muhammad Nafi · Kevin Duong · William Hsu 🔗 - Analyzing the Sensitivity to Policy-Value Decoupling in Deep Reinforcement Learning Generalization (Poster)  link »    Existence of policy-value representation asymmetry negatively affects the generalization capability of the traditional actor-critic architecture that uses a shared representation of policy and value. Fully separated networks for policy and value avoid overfitting by addressing this representation asymmetry. However, two separate networks introduce high computational overhead. Previous work has also shown that partial separation can achieve the same level of generalization in most tasks while reducing this computational overhead. Thus, the questions arise Do we really need two separate networks? Is there any particular scenario where only full separation works? In this work, we attempt to analyze the generalization performance compared to the extent of decoupling. We compare four different degrees of subnetwork separation, namely: fully shared; early separated, lately separated, and fully separated on the RL generalization benchmark Procgen, a suite of 16 procedurally-generated environments. We show that unless the environment has a distinct or explicit source of value estimation, partial separation can easily capture the necessary policy-value representation asymmetry and achieve better generalization performance in unseen scenarios. Link » Nasik Muhammad Nafi · Raja Farrukh Ali · William Hsu 🔗 - Lagrangian Model Based Reinforcement Learning (Poster)  link »    One of the drawbacks of traditional RL algorithms has been their poor sample efficiency. In robotics, collecting large amounts of training data using actual robots is not practical. One approach to improve the sample efficiency of RL algorithms is model-based RL. Here we learn a model of the environment, essentially its transition dynamics and reward function, and use it to generate imaginary trajectories, which we then use to update the policy. Intuitively, learning better environment models should improve model-based RL. Recently there has been growing interest in developing better deep neural network based dynamics models for physical systems through better inductive biases. We investigate if such physics-informed dynamics models can also improve model-based RL. We focus on robotic systems undergoing rigid body motion. We utilize the structure of rigid body dynamics to learn Lagrangian neural networks and use them within a model-based RL algorithm. We find that our Lagrangian model-based RL approach achieves better average-return and sample efficiency compared to standard model-based RL as well as state-of-the-art model-free RL algorithms such as Soft-Actor-Critic, in complex environments. Link » Adithya Ramesh · Balaraman Ravindran 🔗 - Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines (Poster)  link »    Natural and formal languages provide an effective mechanism for humans to specify instructions and reward functions. We investigate how to generate policies via RL when reward functions are specified in a symbolic language captured by Reward Machines, an increasingly popular automaton-inspired structure. We are interested in the case where the mapping of environment state to the symbolic Reward Machine vocabulary is noisy. We formulate the problem of policy learning in Reward Machines with noisy symbolic abstractions as a special class of POMDP optimization problem, and investigate several methods to address the problem building on existing and new techniques, the latter focused on predicting Reward Machine state, rather than on grounding of individual symbols. We analyze these methods and evaluate them experimentally under varying degrees of uncertainty in the correct interpretation of the symbolic vocabulary. We verify the strength of our approach and the limitation of existing methods via an empirical investigation on both illustrative, toy domains and partially observable, deep RL domains. Link » Andrew Li · Zizhao Chen · Pashootan Vaezipoor · Toryn Klassen · Rodrigo Toro Icarte · Sheila McIlraith 🔗 - Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes (Poster)  link »    Lying on the heart of intelligent decision-making systems, how policy is represented and optimized is a fundamental problem. The root challenge in this problem is the large scale and the high complexity of policy space, which exacerbates the difficulty of policy learning especially in real-world scenarios. Towards a desirable surrogate policy space, recently policy representation in a low-dimensional latent space has shown its potential in improving both the evaluation and optimization of policy. The key question involved in these studies is by what criterion we should abstract the policy space for desired compression and generalization. However, both the theory on policy abstraction and the methodology on policy representation learning are less studied in the literature. In this work, we make very first efforts to fill up the vacancy. First, we propose a unified policy abstraction theory, containing three types of policy abstraction associated to policy features at different levels. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies, for more convenient use in learning policy representation. Further, we propose a policy representation learning approach based on deep metric learning. For the empirical study, we investigate the efficacy of the proposed policy metrics and representations, in characterizing policy difference and conveying policy generalization respectively. Our experiments are conducted in both policy optimization and evaluation problems, containing trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES) and off-policy evaluation (OPE). Somewhat naturally, the experimental results indicate that there is no a universally optimal abstraction for all downstream learning problems; while the influence-irrelevance policy abstraction can be a generally preferred choice. Link » Min Zhang · Hongyao Tang · Jianye Hao · YAN ZHENG 🔗 - Informative rewards and generalization in curriculum learning (Poster)  link »    Curriculum learning (CL) is central to human learning as much as reinforcement learning (RL) itself. However, CL agents trained using RL with function approximation produce limited generalization to later tasks in the curriculum. One contributing factor might be exploration itself. Exploration often induces the agent to visit task-irrelevant states, leading to training-induced non-stationarities. Thus, the value/policy networks utilize their limited capacity to fit targets for these irrelevant states. Consequently, this results in impaired generalization to later tasks. First, we propose to use an \emph{online} distillation method to alleviate this problem in CL. We show that one can use a learned, informative reward function to minimize exploration and, consequently, non-stationarities during the distillation process. Second, we show that minimizing exploration improves capacity utilization as measured by feature rank. Finally, we illuminate the links between exploration, non-stationarity, capacity, and generalization in the CL setting. In conclusion, we see this as a crucial step toward improving the generalization of Deep RL methods in Curriculum learning. Link » Rahul Siripurapu · Vihang Patil · Kajetan Schweighofer · Marius-Constantin Dinu · Markus Holzleitner · Hamid Eghbalzadeh · Luis Ferro · Thomas Schmied · Michael Kopp · Sepp Hochreiter 🔗 - Generalizable Point Cloud Reinforcement Learning for Sim-to-Real Dexterous Manipulation (Poster)  link » We propose a sim-to-real framework for dexterous manipulation which can generalize to new objects of the same category in the real world. The key of our framework is to train the manipulation policy with point cloud inputs and dexterous hands. We propose two new techniques to enable joint learning on multiple objects and sim-to-real generalization: (i) using imagined hand point clouds as augmented inputs; and (ii) designing novel contact-based rewards. We empirically evaluate our method using an Allegro Hand to grasp novel objects in both simulation and real world. To the best of our knowledge, this is the first policy learning-based framework that achieves such generalization results with dexterous hands. Our project page is available at \url{http://dexpc.github.io}. Link » Yuzhe Qin · Binghao Huang · Zhao-Heng Yin · Hao Su · Xiaolong Wang 🔗 - CLUTR: Curriculum Learning via Unsupervised Task Representation Learning (Poster)  link »    Reinforcement Learning (RL) algorithms are often known for sample inefficiency and difficult generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the sampled tasks. This is a non-stationary process where the task distribution evolves along with agent policies; creating an instability over time. While past works demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge, bottlenecking these approaches. To this end, we introduce CLUTR: a novel curriculum learning algorithm that decouples task representation and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by optimizing a minimax REGRET-based objective on a set of latent tasks sampled from this manifold. By keeping the task manifold fixed, we show that CLUTR successfully overcomes the non-stationarity problem and improves stability. Our experimental results show CLUTR outperforms PAIRED, a principled and popular UED method, in terms of generalization and sample efficiency in the challenging CarRacing and navigation environments: showing an 18x improvement on the F1 CarRacing benchmark. CLUTR also performs comparably to the non-UED state-of-the-art for CarRacing, outperforming it in nine of the 20 tracks. CLUTR also achieves a 33% higher solved rate than PAIRED on a set of 18 out-of-distribution navigation tasks. Link » Abdus Salam Azad · Izzeddin Gur · Aleksandra Faust · Pieter Abbeel · Ion Stoica 🔗 - The Emphatic Approach to Average-Reward Policy Evaluation (Poster)  link »    Off-policy policy evaluation has been a longstanding problem in reinforcement learning. This paper looks at this problem under the average-reward formulation with function approximation. Differential temporal-difference (TD) learning has been proposed recently and has shown great potential compared to previous average-reward learning algorithms. In the tabular setting, off-policy differential TD is guaranteed to converge. However, the convergence guarantee cannot be carried through the function approximation setting. To address the instability of off-policy differential TD, we investigate the emphatic approach proposed for the discounted formulation. Specifically, we introduce average emphatic trace for average-reward off-policy learning. We further show that without any variance reduction techniques, the new trace suffers from slow learning due to high variance of importance sampling ratios. Finally, we show that differential emphatic TD($\beta$), extended from the discounted setting, can save us from the high variance while introducing bias. Experimental results on a counterexample show that differential emphatic TD($\beta$) performs better than an existing competitive off-policy algorithm. Link » Jiamin He · Yi Wan · Rupam Mahmood 🔗 - Learning Exploration Policies with View-based Intrinsic Rewards (Poster)  link »    Efficient exploration in sparse-reward tasks is one of the biggest challenges in deep reinforcement learning. Common approaches introduce intrinsic rewards to motivate exploration. For example, visitation count and prediction-based curiosity utilize some measures of novelty to drive the agent to visit novel states in the environment. However, in partially-observable environments, these methods can easily be misled by relatively “novel” or noisy observations and get stuck around them. Motivated by humans’ exploration behavior of seeing around the environment to get information and avoid unnecessary actions, we consider enlarging the agent’s view area for efficient knowledge acquisition of the environment. In this work, we propose a novel intrinsic reward combining two components: the view-based bonus for ample view coverage and the classical count-based bonus for novel observation discovery. The resulting method, ViewX, achieves state-of-the-art performance on the 12 most challenging procedurally-generated tasks on MiniGrid. Additionally, ViewX efficiently learns an exploration policy in the task-agnostic setting, which generalizes well to unseen environments. When exploring new environments on MiniGrid and Habitat, our learned policy significantly outperforms the baselines in terms of scene coverage and extrinsic reward. Link » Yijie Guo · Yao Fu · Run Peng · Honglak Lee 🔗 - Scaling Covariance Matrix Adaptation MAP-Annealing to High-Dimensional Controllers (Poster)  link »    Pre-training a diverse set of robot controllers in simulation has enabled robots to adapt online to damage in robot locomotion tasks. However, finding diverse, high-performing controllers requires specialized hardware and extensive tuning of a large number of hyperparameters. On the other hand, the Covariance Matrix Adaptation MAP-Annealing algorithm, an evolution strategies (ES)-based quality diversity algorithm, does not have these limitations and has been shown to achieve state-of-the-art performance in standard benchmark domains. However, CMA-MAE cannot scale to modern neural network controllers due to its quadratic complexity. We leverage efficient approximation methods in ES to propose three new CMA-MAE variants that scale to very high dimensions. Our experiments show that the variants outperform ES-based baselines in benchmark robotic locomotion tasks, while being comparable with state-of-the-art deep reinforcement learning-based quality diversity algorithms. Source code and videos are available in the supplementary material. Link » Bryon Tjanaka · Matthew Fontaine · Aniruddha Kalkar · Stefanos Nikolaidis 🔗 - Policy Aware Model Learning via Transition Occupancy Matching (Poster)  link »    Model-based reinforcement learning (MBRL) is an effective paradigm for sample-efficient policy learning. The pre-dominant MBRL strategy iteratively learns the dynamics model by performing maximum likelihood (MLE) on the entire replay buffer and trains the policy using fictitious transitions from the learned model. Given that not all transitions in the replay buffer are equally informative about the task or the policy's current progress, this MLE strategy cannot be optimal and bears no clear relation to the standard RL objective. In this work, we propose Transition Occupancy Matching (TOM), a policy-aware model learning algorithm that maximizes a lower bound on the standard RL objective. TOM learns a policy-aware dynamics model by minimizing an $f$-divergence between the distribution of transitions that the current policy visits in the real environment and in the learned model; then, the policy can be updated using any pre-existing RL algorithm with log-transformed reward. TOM's practical implementation builds on tools from dual reinforcement learning and learns the optimal transition occupancy ratio between the current policy and the replay buffer; leveraging this ratio as importance weights, TOM amounts to performing MLE model learning on the correct, policy aware transition distribution. Crucially, TOM is a model learning sub-routine and is compatible with any backbone MBRL algorithm that implements MLE-based model learning. On the standard set of Mujoco locomotion tasks, we find TOM improves the learning speed of a standard MBRL algorithm and can reach the same asymptotic performance with as much as 50% fewer samples. Link » Jason Yecheng Ma · Kausik Sivakumar · Osbert Bastani · Dinesh Jayaraman 🔗 - On The Fragility of Learned Reward Functions (Poster)  link »    Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning mainly focus on achieving high final performance for agents trained alongside the reward function. However, many of these works fail to investigate whether the resulting learned reward model accurately captures the intended behavior. In this work, we focus on the $\textit{relearning}$ failures of learned reward models. We demonstrate that when they are reused to train randomly initialized policies by designing experiments on both tabular and continuous control environments. We found that the severity of relearning failure might be sensitive to changes in reward model design and the trajectory dataset. Finally, we discussed the potential limitations of our methods and emphases the need for more retraining-based evaluations in the literature. Link » Lev McKinney · Yawen Duan · Adam Gleave · David Krueger 🔗 - Temporary Goals for Exploration (Poster)  link »    Exploration has always been a crucial aspect of reinforcement learning. When facing long horizon sparse reward environments modern methods still struggle with effective exploration and generalize poorly. In the multi-goal reinforcement learning setting, out-of-distribution goals might appear similar to the achieved ones, but the agent can never accurately assess its ability to achieve them without attempting them. To enable faster exploration and improve generalization, we propose an exploration method that lets the agent temporarily pursue the most meaningful nearby goal. We demonstrate the performance of our method through experiments in four multi-goal continuous navigation environments including a 2D PointMaze, an AntMaze, and a discrete multi-goal foraging world. Link » Haoyang Xu · Jimmy Ba · Silviu Pitis · Harris Chan 🔗 - Revisiting Bellman Errors for Offline Model Selection (Poster)  link » It is well-known that the empirical Bellman errors are poor predictors of value function estimation accuracy and policy performance. This has led researchers to abandon offline model selection procedures based on Bellman errors and instead focus on directly estimating the expected return under different policies of interest. The problem with this approach is that it can be very difficult to use an offline dataset generated by one policy to estimate the expected returns of a different policy. In contrast, we argue that Bellman errors can be useful for offline model selection, and that the discouraging results in past literature has been due to estimating and utilizing them incorrectly. We propose a new algorithm, $\textit{Supervised Bellman Validation}$, that estimates the expected squared Bellman error better than the empirical Bellman errors. We demonstrate the relative merits of our method over competing methods through both theoretical results and empirical results on offline datasets from the Atari benchmark. We hope that our results will challenge current attitudes and spur future research into Bellman errors and their utility in offline model selection. Link » Joshua Zitovsky · Daniel de Marchi · Rishabh Agarwal · Michael Kosorok 🔗 - Unleashing The Potential of Data Sharing in Ensemble Deep Reinforcement Learning (Poster)  link »    This work studies a crucial but often overlooked element of ensemble methods in deep reinforcement learning: data sharing between ensemble members. We show that data sharing enables peer learning, a powerful learning process in which individual agents learn from each other's experience to significantly improve their performance. When given access to the experience of other ensemble members, even the worst agent can match or outperform the previously best agent, triggering a virtuous circle. However, we show that peer learning can be unstable when the agents' ability to learn is impaired due to overtraining on early data. We thus employ the recently proposed solution of periodic resets and show that it ensures effective peer learning. We perform extensive experiments on continuous control tasks from both dense states and pixels to demonstrate the strong effect of peer learning and its interaction with resets. Link » Zhixuan Lin · Pierluca D'Oro · Evgenii Nikishin · Aaron Courville 🔗 - What Makes Certain Pre-Trained Visual Representations Better for Robotic Learning? (Poster)  link » Deep learning for robotics is data-intensive, but collecting high-quality robotics data at scale is prohibitively expensive. One approach to mitigate this is to leverage visual representations pre-trained on relatively abundant non-robotic datasets. So far, existing works have focused on proposing pre-training strategies and assessing them via ablation studies, giving high-level knowledge of how pre-training design choices affect downstream performance. However, the significant gap in data and objective between the two stages motivates a more detailed understanding of what properties of better pre-trained visual representations enable their comparative advantage. In this work, we empirically analyze the representations of robotic manipulation data from several standard benchmarks under a variety of pre-trained models, correlating key metrics of the representations with closed-loop task performance after behavior cloning. We find evidence that suggests our proposed metrics have substantive predictive power for downstream robotic learning. Link » Kyle Hsu · Tyler Lum · Ruohan Gao · Shixiang (Shane) Gu · Jiajun Wu · Chelsea Finn 🔗 - Curiosity in Hindsight (Poster)  link »    Consider the problem of exploration in sparse-reward or reward-free environments, such as Montezuma's Revenge. The curiosity-driven paradigm dictates an intuitive technique: At each step, the agent is rewarded for how much the realized outcome differs from their predicted outcome. However, using predictive error as intrinsic motivation is prone to fail in stochastic environments, as the agent may become hopelessly drawn to high-entropy areas of the state-action space, such as a noisy TV. Therefore it is important to distinguish between aspects of world dynamics that are inherently predictable (for which errors reflect epistemic uncertainty) and aspects that are inherently unpredictable (for which errors reflect aleatoric uncertainty): The former should constitute a source of intrinsic reward, whereas the latter should not. In this work, we study a natural solution derived from structural causal models of the world: Our key idea is to learn representations of the future that capture precisely the unpredictable aspects of each outcome---not any more, not any less---which we use as additional input for predictions, such that intrinsic rewards do vanish in the limit. First, we propose incorporating such hindsight representations into the agent's model to disentangle "noise" from "novelty", yielding Curiosity in Hindsight: a simple and scalable generalization of curiosity that is robust to all types of stochasticity. Second, we implement this framework as a drop-in modification of any prediction-based exploration bonus, and instantiate it for the recently introduced BYOL-Explore algorithm as a prime example, resulting in the noise-robust "BYOL-Hindsight". Third, we illustrate its behavior under various stochasticities in a grid world, and find improvements over BYOL-Explore in hard-exploration Atari games with sticky actions. Importantly, we show state-of-the-art results in exploring Montezuma's Revenge with sticky actions, while preserving performance in the non-sticky setting. Link » Daniel Jarrett · Corentin Tallec · Florent Altché · Thomas Mesnard · Remi Munos · Michal Valko 🔗 - Train Offline, Test Online: A Real Robot Learning Benchmark (Poster)  link »    Three challenges limit the progress of robot learning research: robots are expensive (few labs can participate), everyone uses different robots (findings do not generalize across labs), and we lack internet-scale robotics data. We take on these challenges via a new benchmark: Train Offline, Test Online (TOTO). TOTO provides remote users with access to shared robots for evaluating methods on common tasks and an open-source dataset of these tasks for offline training. Its manipulation task suite requires challenging generalization to unseen objects, positions, and lighting. We present initial results on TOTO comparing five pretrained visual representations and four offline policy learning baselines, remotely contributed by five institutions. The real promise of TOTO, however, lies in the future: we release the benchmark for additional submissions from any user, enabling easy, direct comparison to several methods without the need to obtain hardware or collect data. Link » Gaoyue Zhou · Victoria Dean · Mohan Kumar Srirama · Aravind Rajeswaran · Jyothish Pari · Kyle Hatch · Aryan Jain · Tianhe Yu · Pieter Abbeel · Lerrel Pinto · Chelsea Finn · Abhinav Gupta 🔗 - A Framework for Predictable Actor-Critic Control (Poster)  link »    Reinforcement learning (RL) algorithms commonly provide a one-action plan per time step. Doing this allows the RL agent to quickly adapt and respond to stochastic environments yet it restricts the ability to predict the agent's future behavior. This paper proposes an actor-critic framework that predicts and follows an $n$-step plan. Committing to the next $n$ actions presents a trade-off between behavior predictability and reduced performance. In order to balance this trade-off, a dynamic plan-following criteria is proposed for determining when it is too costly to follow the preplanned actions and a replanning procedure should be initiated instead. Performance degradation bounds are presented for the proposed criteria when assuming access to accurate state-action values. Experimental results, using several robotics domains, suggest that the performance bounds are also satisfied in the general (approximation) case on expectancy. Additionally, the experimental section presents a study of the predictability versus performance degradation trade-off and demonstrates the benefits of applying the proposed plan-following criteria. Link » Josiah Coad · James Ault · Jeff Hykin · Guni Sharon 🔗 - Ensemble based uncertainty estimation with overlapping alternative predictions (Poster)  link »    A reinforcement learning model will predict an action in whatever state it is - even if there is no distinct outcome due to unseen states the model may not indicate that. Different methods for uncertainty estimation can be used to indicate this. Although, uncertainty estimation is a well understood approach in AI, the overlap of effects like alternative possible predictions (multiple feasible actions in a given state) in reinforcement learning is not so clear and to our knowledge, not so well documented in current literature. In this work we investigate uncertainty estimation on simplified scenarios in a gridworld environment. Using model ensemble based uncertainty estimation we propose an algorithm based on action count variance to deal with discrete action spaces and delta to ID action variance calculation to handle overlapping alternative predictions. To visualize the expressiveness we create heatmaps for different ID and ODD scenarios on gridworlds and propose an indicator for uncertainty. We can show, that the method will indicate potentially unsafe states when the agent is near unseen elements in the scenarios (OOD) and can distinguish between OOD and overlapping alternative predictions. Link » Dirk Eilers · Felippe Schmoeller Roza · Karsten Roscher 🔗 - Offline Reinforcement Learning on Real Robot with Realistic Data Sources (Poster)  link »    Offline Reinforcement Learning (ORL) provides a framework to train control policies from fixed sub-optimal datasets, making it suitable for safety-critical applications like robotics. Despite significant algorithmic advances and benchmarking in simulation, the evaluation of ORL algorithms on real-world robot learning tasks has been limited. Since real robots are sensitive to details like sensor noises, reset conditions, demonstration sources, and test time distribution, it remains a question whether ORL is a competitive solution to real robotic challenges and what would characterize such tasks. We aim to address this deficiency through an empirical study of representative ORL algorithms on four table-top manipulation tasks using a Franka-Panda robot arm. Our evaluation finds that for scenarios with sufficient in-domain data of high quality, specialized ORL algorithms can be competitive with the behavior cloning approach. However, for scenarios that require out-of-distribution generalization or task transfer, ORL algorithms can learn and generalize from offline heterogeneous datasets and outperform behavior cloning. Project URL: https://sites.google.com/view/real-orl-anon Link » Gaoyue Zhou · Liyiming Ke · Siddhartha Srinivasa · Abhinav Gupta · Aravind Rajeswaran · Vikash Kumar 🔗 - Feasible Adversarial Robust Reinforcement Learning for Underspecified Environments (Poster)  link »    Robust reinforcement learning (RL) considers the problem of learning policies that perform well in the worst case among a set of possible environment parameter values. In real-world environments, choosing the set of possible values for robust RL can be a difficult task. When that set is specified too narrowly, the agent will be left vulnerable to reasonable parameter values unaccounted for. When specified too broadly, the agent will be too cautious. In this paper, we propose Feasible Adversarial Robust RL (FARR), a novel problem formulation and objective for automatically determining the set of environment parameter values over which to be robust. FARR implicitly defines the set of feasible parameter values as those on which an agent could achieve a benchmark reward given enough training resources. By formulating this problem as a two-player zero-sum game, optimizing the FARR objective jointly produces an adversarial distribution over parameter values with feasible support and a policy robust over this feasible parameter set. We demonstrate that approximate Nash equilibria for this objective can be found using a variation of the PSRO algorithm. Furthermore, we show that an optimal agent trained with FARR is more robust to feasible adversarial parameter selection than with existing minimax, domain-randomization, and regret objectives in a parameterized gridworld and three MuJoCo control environments. Link » JB Lanier · Stephen McAleer · Pierre Baldi · Roy Fox 🔗 - Training Equilibria in Reinforcement Learning (Poster)  link » In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria---policies that are stable under further training---and can converge to equilibria that are strictly suboptimal. Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances like a flexible policy parametrization.We show theoretically that the core problem is that in partially observed environments, an agent's past actions induce a distribution on hidden states.Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher reward equilibrium, \emph{even when there exists a memoryless optimal policy}.Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and parameter noise helps policies escape suboptimal equilibria. Link » Lauro Langosco · David Krueger · Adam Gleave 🔗 - A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and Two-Player Zero-Sum Games (Poster)  link » Algorithms designed for single-agent reinforcement learning (RL) generally fail to converge to equilibria in two-player zero-sum (2p0s) games. On the other hand, game-theoretic algorithms for approximating Nash and regularized equilibria in 2p0s games are not typically competitive for RL and can be difficult to scale. As a result, algorithms for these two cases are generally developed and evaluated separately. In this work, we show that a single algorithm---a simple extension to mirror descent with proximal regularization that we call magnetic mirror descent (MMD)---can produce strong results in both settings, despite their fundamental differences. From a theoretical standpoint, we prove that MMD converges linearly to quantal response equilibria (i.e., entropy regularized Nash equilibria) in extensive-form games---this is the first time linear convergence has been proven for a first order solver. Moreover, applied as a tabular Nash equilibrium solver via self-play, we show empirically that MMD produces results competitive with CFR in both normal-form and extensive-form games---this is the first time that a standard RL algorithm has done so. Furthermore, for single-agent deep RL, on a small collection of Atari and Mujoco tasks, we show that MMD can produce results competitive with those of PPO. Lastly, for multi-agent deep RL, we show MMD can outperform NFSP in 3x3 Abrupt Dark Hex. Link » Samuel Sokota · Ryan D'Orazio · J. Zico Kolter · Nicolas Loizou · Marc Lanctot · Ioannis Mitliagkas · Noam Brown · Christian Kroer 🔗 - Replay Buffer With Local Forgetting for Adaptive Deep Model-Based Reinforcement Learning (Poster)  link »    One of the key behavioral characteristics used in neuroscience to determine whether the subject of study---be it a rodent or a human---exhibits model-based learning is effective adaptation to local changes in the environment. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to such changes. An explanation for this mismatch is that MBRL methods are typically designed with sample-efficiency on a single task in mind and the requirements for effective adaptation are substantially higher, both in terms of the learned world model and the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout relevant parts of the state-space. This is challenging for deep-learning-based world models due to catastrophic forgetting. And while a replay buffer can mitigate the effects of catastrophic forgetting, the traditional first-in-first-out replay buffer precludes effective adaptation due to maintaining stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer is able to overcome this limitation. By removing only samples from the buffer from the local neighbourhood of the newly observed samples, deep world models can be built that maintain their accuracy across the state-space, while also being able to effectively adapt to changes in the reward function. We demonstrate this by applying our replay-buffer variation to the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, showing for the first time that deep model-based methods are able to achieve effective adaptation. Link » Ali Rahimi-Kalahroudi · Janarthanan Rajendran · Ida Momennejad · Harm Van Seijen · Sarath Chandar 🔗 - Confidence-Conditioned Value Functions for Offline Reinforcement Learning (Poster)  link » Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. However, this can be alleviated if we instead are able to learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains. Link » Joey Hong · Aviral Kumar · Sergey Levine 🔗 - Aggressive Q-Learning with Ensembles: Achieving Both High Sample Efficiency and High Asymptotic Performance (Poster)  link »    Recent advances in model-free deep reinforcement learning (DRL) show that simple model-free methods can be highly effective in challenging high-dimensional continuous control tasks. In particular, Truncated Quantile Critics (TQC) achieves state-of-the-art asymptotic training performance on the MuJoCo benchmark with a distributional representation of critics; and Randomized Ensemble Double Q-Learning (REDQ) achieves high sample efficiency that is competitive with state-of-the-art model-based methods using a high update-to-data ratio and target randomization. In this paper, we propose a novel model-free algorithm, Aggressive Q-Learning with Ensembles (AQE), which improves the sample-efficiency performance of REDQ and the asymptotic performance of TQC, thereby providing overall state-of-the-art performance during all stages of training. Moreover, AQE is very simple, requiring neither distributional representation of critics nor target randomization. The effectiveness of AQE is further supported by our extensive experiments, ablations, and theoretical results. Link » Yanqiu Wu · Xinyue Chen · Che Wang · Yiming Zhang · Keith Ross 🔗 - Integrating Episodic and Global Bonuses for Efficient Exploration (Poster)  link » Exploration in environments which differ across episodes has received increasing attention in recent years. Current methods use some combination of global novelty bonuses, computed using the agent's entire training experience, and episodic novelty bonuses, computed using only experience from the current episode. However, the use of these two types of bonuses has been ad-hoc and poorly understood. In this work, we first shed light on the behavior these two kinds of bonuses on hard exploration tasks through easily interpretable examples. We find that the two types of bonuses succeed in different settings, with episodic bonuses being most effective when there is little shared structure between environments and global bonuses being effective when more structure is shared. We also find that combining the two bonuses leads to more robust behavior across both of these settings. Motivated by these findings, we then investigate different algorithmic choices for defining and combining function approximation-based global and episodic bonuses. This results in a new algorithm which sets a new state of the art across 18 tasks from the MiniHack suite used in prior work. Link » Mikael Henaff · Minqi Jiang · Roberta Raileanu 🔗 - Deconfounded Imitation Learning (Poster)  link »    Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent. This partial observability gives rise to hidden confounders in the causal graph, which lead to the failure to imitate. We break down the space of confounded imitation learning problems and identify three settings with different data requirements in which the correct imitation policy can be identified. We then introduce an algorithm for deconfounded imitation learning, which trains an inference model jointly with a latent-conditional policy. At test time, the agent alternates between updating its belief over the latent and acting under the belief. We show in theory and practice that this algorithm converges to the correct interventional policy, solves the confounding issue, and can under certain assumptions achieve an asymptotically optimal imitation performance. Link » Risto Vuorio · Pim de Haan · Johann Brehmer · Hanno Ackermann · Daniel Dijkman · Taco Cohen 🔗 - ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation Learning (Poster)  link »    Given a dataset of interactions with an environment of interest, a viable method to extract an agent policy is to estimate the maximum likelihood policy indicated by this data. This approach is commonly referred to as behavioral cloning (BC). In this work, we describe a key disadvantage of BC that arises due to the maximum likelihood objective function; namely that BC is mean-seeking with respect to the state-conditional expert action distribution when the learner's policy is represented with a Gaussian. To address this issue, we develop a modified version of BC, Adversarial Behavioral Cloning (ABC), that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training. We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms BC by being mode-seeking in nature. Link » Eddy Hudson · Ishan Durugkar · Garrett Warnell · Peter Stone 🔗 - Human-AI Coordination via Human-Regularized Search and Learning (Poster)  link »    We consider the problem of making AI agents that collaborate well with humans in partially observable fully cooperative environments given datasets of human behavior. Inspired by piKL, a human-data-regularized search method that improves upon a behavioral cloning policy without diverging far away from it, we develop a three-step algorithm that achieve strong performance in coordinating with real humans in the Hanabi benchmark. We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels. Then, we integrate the policy regularization idea into reinforcement learning to train a human-like best response to the human model. Finally, we apply regularized search on top of the best response policy at test time to handle out-of-distribution challenges when playing with humans. We evaluate our method in two large scale experiments with humans. First, we show that our method outperforms experts when playing with a group of diverse human players in ad-hoc teams. Second, we show that our method beats a vanilla best response to behavioral cloning baseline by having experts play repeatedly with the two agents. Link » Hengyuan Hu · David Wu · Adam Lerer · Jakob Foerster · Noam Brown 🔗 - Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks (Poster)  link » Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well-understood; in practice, how-ever, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning – accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function. Link » Jesse Farebrother · Joshua Greaves · Rishabh Agarwal · Charline Le Lan · Ross Goroshin · Pablo Samuel Castro · Marc Bellemare 🔗 - Return Augmentation gives Supervised RL Temporal Compositionality (Poster)  link »    Offline Reinforcement Learning (RL) methods that use supervised learning or sequence modeling (e.g., Decision Transformer) work by training a return-conditioned policy. A fundamental limitation of these approaches, as compared to value-based methods, is that they have trouble generalizing to behaviors that have a higher return than what was seen at training. Value-based offline-RL algorithms like CQL use bootstrapping to combine training data from multiple trajectories to learn strong behaviors from sub-optimal data. We set out to endow RL via Supervised Learning (RvS) methods with this form of temporal compositionality. To do this, we introduce SuperB, a dynamic programming algorithm for data augmentation that augments the returns in the offline dataset by combining rewards from intersecting trajectories. We show theoretically that SuperB can improve sample complexity and enable RvS to find optimal policies in cases where it previously fell behind the performance of value-based methods. Empirically, we find that SuperB improves the performance of RvS in several offline RL environments, surpassing the prior state-of-the-art RvS agents in AntMaze by orders of magnitude and offering performance competitive with value-based algorithms on the D4RL-gym tasks. Link » Keiran Paster · Silviu Pitis · Sheila McIlraith · Jimmy Ba 🔗 - Design Process is a Reinforcement Learning Problem (Poster)  link »    While reinforcement learning has been used widely in research during the past few years, it found fewer real-world applications than supervised learning due to some weaknesses that the RL algorithms suffer from, such as performance degradation in transitioning from the simulator to the real world. Here, we argue the design process is a reinforcement learning problem and can potentially be a proper application for RL algorithms as it is an offline process and conventionally is done in CAD software - a sort of simulator. This creates opportunities for using RL methods and, at the same time, raises challenges. While the design processes are so diverse, here we focus on the space layout planning (SLP), frame it as an RL problem under the Markov Decision Process, and use PPO to address the layout design problem. To do so, we developed an environment named RLDesigner, to simulate the SLP. The RLDesigner is an OpenAI Gym compatible environment that can be easily customized to define a diverse range of design scenarios. We publicly share the environment to encourage both RL and architecture communities to use it for testing different RL algorithms or in their design practice. The codes are available in the following GitHub repository [URL: we do not share the URL now due to the double-blind procedure, but we attach the codes as supplementary materials. We will share the repository URL after the review process]. Link » Reza Kakooee · Benjamin Dillenburger 🔗 - Bayesian Q-learning With Imperfect Expert Demonstrations (Poster)  link »    Guided exploration with expert demonstrations improves data efficiency for reinforcement learning, but current algorithms often overuse expert information. We propose a novel algorithm to speed up Q-learning with the help of a limited amount of imperfect expert demonstrations. The algorithm is based on a Bayesian framework to model suboptimal expert actions and derives Q-values' update rules by maximizing the posterior probability. It weighs expert information by the uncertainty of learnt Q-values and avoids excessive reliance on expert data, gradually reducing the usage of uninformative expert data. Experimentally, we evaluate our approach on a sparse-reward chain environment and six more complicated Atari games with delayed rewards. With the proposed methods, we can achieve better results than Deep Q-learning from Demonstrations (Hester et al., 2017) in most environments. Link » Fengdi Che · Xiru Zhu · Doina Precup · David Meger · Gregory Dudek 🔗 - Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting (Poster)  link » Deep reinforcement learning algorithms that learn policies by trial-and-error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained unclear. Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses such as non-stationarity, excessive action distribution shift, and overfitting. We perform thorough empirical analysis on state-based DeepMind control suite (DMC) tasks in a controlled and systematic way to show that statistical overfitting on the temporal-difference (TD) error is the main culprit that severely affects the performance of deep RL algorithms, and prior methods that lead to good performance do in fact, control the amount of statistical overfitting. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on a notion of validation temporal-difference error by utilizing any form of regularization techniques from supervised learning. We show that a simple online model selection method that targets the statistical overfitting issue is effective across state-based DMC and Gym tasks. Link » Qiyang Li · Aviral Kumar · Ilya Kostrikov · Sergey Levine 🔗 - Pre-Training for Robots: Leveraging Diverse Multitask Data via Offline Reinforcement Learning (Poster)  link » Recent progress in deep learning highlights the tremendous potential of utilizing diverse datasets for achieving effective generalization and makes it enticing to consider leveraging broad datasets for attaining more robust generalization in robotic learning as well. However, in practice we likely will want to learn a new skill in a new environment that is unlikely to be contained in the prior data. Therefore we ask: how can we leverage existing diverse offline datasets in combination with small amounts of task-specific data to solve new tasks, while still enjoying the generalization benefits of training on large amounts of data? In this paper, we demonstrate that end-to-end offline RL can be an effective approach for doing this, without the need for any representation learning or vision-based pre-training. We present pre-training for robots (PTR), a framework based on offline RL that attempts to effectively learn new tasks by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task, with as a few as 10 demonstrations. At its core, PTR applies an existing offline RL method such as conservative Q-learning (CQL), but extends it to include several crucial design decisions that enable PTR to actually work and outperform a variety of prior methods. To the best of our knowledge, PTR is the first offline RL method that succeeds at learning new tasks in a new domain on a real WidowX robot with as few as 10 task demonstrations, by effectively leveraging an existing dataset of diverse multi-task robot data collected in a variety of toy kitchens. We present an accompanying overview video at https://www.youtube.com/watch?v=yAWgyLJD5lY&ab_channel=PTRICLR Link » Anikait Singh · Aviral Kumar · Frederik Ebert · Yanlai Yang · Chelsea Finn · Sergey Levine 🔗 - Offline Reinforcement Learning from Heteroskedastic Data Via Support Constraints (Poster)  link » Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. We show that existing popular offline RL methods based on distribution constraints fail to learn from data with such non-uniform change in the variability of demonstrated behaviors, often due to the requirement to stay close to the behavior policy to the same extent across the state space. We demonstrate this failure mode both theoretically and experimentally. Ideally, the learned policy should be free to choose per-state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning and show that support constraints emerge when doing so. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method CQL (ReDS) is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation. Link » Anikait Singh · Aviral Kumar · Quan Vuong · Yevgen Chebotar · Sergey Levine 🔗 - Variance Double-Down: The Small Batch Size Anomaly in Multistep Deep Reinforcement Learning (Poster)  link »    In deep reinforcement learning, multi-step learning is almost unavoidable to achieve state-of-the-art performance. However, the increased variance that multistep learning brings makes it difficult to increase the update horizon beyond relatively small numbers. In this paper, we report the counterintuitive finding that decreasing the batch size parameter improves the performance of many standard deep RL agents that use multi-step learning. It is well-known that gradient variance decreases with increasing batch sizes, so obtaining improved performance by increasing variance on two fronts is a rather surprising finding. We conduct a broad set of experiments to better understand what we call the variance doubledown phenomenon. Link » Johan Obando Ceron · Marc Bellemare · Pablo Samuel Castro 🔗 - Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems (Poster)  link » When learning task-oriented dialogue (TOD) agents, one can naturally utilize reinforcement learning (RL) techniques to train dialogue strategies to achieve user-specific goals. Prior works mainly focus on adopting advanced RL techniques to train the TOD agents, while the design of the reward function is not well studied. This paper aims at answering the question of how to efficiently learn and leverage a reward function for training end-to-end TOD agents. Specifically, we introduce two generalized objectives for reward-function learning, inspired by the classical learning-to-rank literature. Further, we utilize the learned reward-function to guide the training of the end-to-end TOD agent. With the proposed techniques, we achieve competitive results on the end-to-end response-generation task on the Multiwoz 2.0 dataset. Link » Yihao Feng · Shentao Yang · Shujian Zhang · Jianguo Zhang · Caiming Xiong · Mingyuan Zhou · Huan Wang 🔗 - In the ZONE: Measuring difficulty and progression in curriculum generation (Poster)  link »    A common strategy in curriculum generation for reinforcement learning is to train a teacher network to generate tasks that enable student learning. But, what kind of tasks enables this? One answer is tasks belonging to a student's zone of proximal development (ZPD), a concept from developmental psychology. These are tasks that are not too easy and not too hard for the student. Albeit intuitive, ZPD is not well understood computationally. We propose ZONE, a novel computational framework that operationalizes ZPD. It formalizes ZPD through the language of Bayesian probability theory, revealing that tasks should be selected by difficulty (the student's probability of task success) and learning progression (the degree of change in the student's model parameters). ZONE instantiates two techniques that enforce the teacher to pick tasks within the student's ZPD. One is \textsc{Reject}, which rejects tasks outside of a difficulty scope, and the other is \textsc{Grad}, which prioritizes tasks that maximize the student's gradient norm. We apply these techniques to existing curriculum learning algorithms. We show that they improve the student’s generalization performance on discrete MiniGrid environments and continuous control MuJoCo domains with up to $9 \times$ higher success. ZONE also accelerates the student's learning by training with $10\times$ less data. Link » Rose Wang · Jesse Mu · Dilip Arumugam · Natasha Jaques · Noah Goodman 🔗 - Better state exploration using action sequence equivalence (Poster)  link »    Incorporating prior knowledge in reinforcement learning algorithms is mainly an open question. Even when insights about the environment dynamics are available, reinforcement learning is traditionally used in a \emph{tabula rasa} setting and must explore and learn everything from scratch.In this paper, we consider the problem of exploiting priors about action sequence equivalence: that is, when different sequences of actions produce the same effect.We propose a new local exploration strategy calibrated to minimize collisions and maximize new state visitations. We show that this strategy can be computed at little cost, by solving a convex optimization problem.By replacing the usual $\epsilon$-greedy strategy in a DQN, we demonstrate its potential in several environments with various dynamic structures. Link » Nathan Grinsztajn · Toby Johnstone · Johan Ferret · philippe preux 🔗 - Deep Learning of Intrinsically Motivated Options in the Arcade Learning Environment (Poster)  link »    In Reinforcement Learning, Intrinsic Motivation motivates directed behaviors through a wide range of reward-generating methods. Depending on the task and environment, these rewards can be useful, might complement each other, but can also break down entirely, as seen with the noisy TV problem for curiosity. We therefore argue that scalability and robustness, among others, are key desirable properties of a method to incorporate intrinsic rewards, which a simple weighted sum of reward lacks. In a tabular setting, Explore Options let the agent call an intrinsically motivated policy in order to learn from its trajectories. We introduce Deep Explore Options, revising Explore Options within the Deep Reinforcement Learning paradigm to tackle complex visual problems. Deep Explore Options can naturally learn from several unrelated intrinsic rewards, ignore harmful intrinsic rewards, learn to balance exploration, but also isolate exploitative and exploratory behaviors for independent usage. We test Deep Explore Options on hard and easy exploration games of the Atari Suite, following a benchmarking study to ensure fairness. Our empirical results show that they achieve similar results than weighted sum baselines, while maintaining their key properties. Link » Louis Bagot · Kevin Mets · Tom De Schepper · Steven Latre 🔗 - Guiding Exploration Towards Impactful Actions (Poster)  link »    To solve decision making tasks in unknown environments, artificial agents need to explore their surroundings. While simple tasks can be solved through naive exploration methods such as action noise, complex tasks require exploration objectives that direct the agent to novel states. However, current exploration objectives typically reward states purely based on how much the agent learns from them, regardless of whether the states are likely to be useful for solving later tasks. In this paper, we propose to guide exploration by empowerment to focus the agent on exploring regions in which it has a strong influence over its environment. We introduce a simple information-theoretic estimator of the agent's empowerment that is added as a reward term to any reinforcement learning method. On a novel BridgeWalk environment, we find that guiding exploration by empowerment helps the agent avoid falling into the unpredictable water, which substantially accelerates exploration and task learning. Experiments on Atari games demonstrate that the approach is general and often leads to improved performance. Link » Vaibhav Saxena · Jimmy Ba · Danijar Hafner 🔗 - Domain Invariant Q-Learning for model-free robust continuous control under visual distractions (Poster)  link »    End-to-end reinforcement learning on images showed significant performance progress in the recent years, especially with regularization to value estimation brought by data augmentation \citep{yarats2020image}. At the same time, domain randomization and representation learning helped push the limits of these algorithms in visually diverse environments, full of distractors and spurious noise, making RL more robust to unrelated visual features. We present DIQL, a method that combines risk invariant regularization and domain randomization to reduce out-of-distribution generalization gap for temporal-difference learning. In this work, we draw a link by framing domain randomization as a richer extension of data augmentation to RL and support its generalized use. Our model-free approach improve baselines performances without the need of additional representation learning objectives and with limited additional computational cost. We show that DIQL outperforms existing methods on complex visuo-motor control environment with high visual perturbation. In particular, our approach achieves state-of the-art performance on the Distracting Control Suite benchmark, where we evaluate the robustness to a number of visual perturbators, as well as OOD generalization and extrapolation capabilities. Link » Tom Dupuis · Jaonary Rabarisoa · Quoc Cuong PHAM · David Filliat 🔗 - Multi-Agent Policy Transfer via Task Relationship Modeling (Poster)  link »    Team adaptation to new cooperative tasks is a hallmark of human intelligence, which has yet to be fully realized in learning agents. Previous works on multi-agent transfer learning accommodate teams of different sizes, but heavily rely on the generalization ability of neural networks for adapting to unseen tasks. We posit that the relationship among tasks provides the key information for policy adaptation. To utilize such relationship for efficient transfer, we try to discover and exploit the knowledge among tasks from different teams, propose to learn effect-based task representations as a common latent space among tasks, and use it to build an alternatively fixed training scheme. We demonstrate that the task representation can capture the relationship among teams and generalize to unseen tasks. As a result, the proposed method can help transfer learned cooperation knowledge to new tasks after training on a few source tasks, and the learned transferred policies can also help solve tasks that are hard to learn from scratch. Link » Rong-Jun Qin · Feng Chen · Tonghan Wang · Lei Yuan · Xiaoran Wu · Yipeng Kang · Zongzhang Zhang · Chongjie Zhang · Yang Yu 🔗 - Foundation Models for History Compression in Reinforcement Learning (Poster)  link »    Agents interacting under partial observability require access to past observations via a memory mechanism in order to approximate the true state of the environment.Recent work suggests that leveraging language as abstraction provides benefits for creating a representation of past events.History Compression via Language Models (HELM) leverages a pretrained Language Model (LM) for representing the past. It relies on a randomized attention mechanism to translate environment observations to token embeddings.In this work, we show that the representations resulting from this attention mechanism can collapse under certain conditions. This causes blindness of the agent to subtle changes in the environment that may be crucial in solving a certain task. We propose a solution to this problem consisting of two parts. First, we improve upon HELM by substituting the attention mechanism with a feature-wise centering-and-scaling operation. Second, we take a step toward semantic history compression by leveraging foundation models, such as CLIP, to encode observations, which further improves performance. By combining foundation models, our agent is able to solve the challenging MiniGrid-Memory environment.Surprisingly, however, our experiments suggest that this is not due to the semantic enrichment of the representation presented to the LM, but rather due to the discriminative power provided by CLIP. Link » Fabian Paischer · Thomas Adler · Andreas Radler · Markus Hofmarcher · Sepp Hochreiter 🔗 - A Game-Theoretic Perspective of Generalization in Reinforcement Learning (Poster)  link »    Generalization in reinforcement learning (RL) is of importance for real deployment of RL algorithms. Various schemes are proposed to address the generalization issues, including transfer learning, multi-task learning, meta learning, as well as robust and adversarial reinforcement learning. However, there is not a unified formulation of various schemes and comprehensive comparisons of methods across different schemes. In this work, we propound GiRL, a game-theoretic framework for generalization in reinforcement learning, where a RL agent is trained against an adversary over a set of tasks, over which the adversary can manipulate the distributions within a given threshold. With different configurations, GiRL is capable of reducing the various schemes mentioned above. To solve GiRL, we adapt the widely-used method in game theory, policy space response oracle (PSRO) framework with three significant modifications as follows: i) we adopt model-agnostic meta learning (MAML) as the best-response oracle, ii) we propose a modified projected replicated dynamics, i.e., R-PRD, which ensures the computed meta-strategy for the adversary falls in the threshold, and iii) we also propose a protocol of few-shot learning for multiple strategies during testing. Extensive experiments on MuJoCo environments demonstrate that our proposed method outperforms state-of-the-art baselines, e.g., MAML. Link » Chang Yang · RUIYU WANG · Xinrun Wang · Zhen Wang 🔗 - Imitating Human Behaviour with Diffusion Models (Poster)  link »    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment. Link » Tim Pearce · Tabish Rashid · Anssi Kanervisto · David Bignell · Mingfei Sun · Raluca Georgescu · Sergio Valcarcel Macua · Shan Zheng Tan · Ida Momennejad · Katja Hofmann · Sam Devlin 🔗 - EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model (Poster)  link »    Unsupervised reinforcement learning (URL) poses a promising paradigm to learn useful behaviors in a task-agnostic environment without the guidance of extrinsic rewards to facilitate the fast adaptation of various downstream tasks. Previous works focused on the pre-training in a model-free manner while lacking the study of transition dynamics modeling that leaves a large space for the improvement of sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised Reinforcement Learning Framework with Multi-choice Dynamics model (EUCLID), which introduces a novel model-fused paradigm to jointly pre-train the dynamics model and unsupervised exploration policy in the pre-training phase, thus better leveraging the environmental samples and improving the downstream task sampling efficiency. However, constructing a generalizable model which captures the local dynamics under different behaviors remains a challenging problem. We introduce the multi-choice dynamics model that covers different local dynamics under different behaviors concurrently, which uses different heads to learn the state transition under different behaviors during unsupervised pre-training and selects the most appropriate head for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, basically solving the state-based URLB benchmark and reaching a mean normalized score of 104.0±1.2% in downstream tasks with 100k fine-tuning steps, which is equivalent to DDPG’s performance at 2M interactive steps with 20× more data. Codes and visualization videos are released on our homepage. Link » Yifu Yuan · Jianye Hao · Fei Ni · Yao Mu · YAN ZHENG · Yujing Hu · Jinyi Liu · Yingfeng Chen · Changjie Fan 🔗 - ERL-Re$^2$: Efficient Evolutionary Reinforcement Learning with Shared State Representation and Individual Policy Representation (Poster)  link »    Deep Reinforcement Learning (Deep RL) and Evolutionary Algorithm (EA) are two major paradigms of policy optimization with distinct learning principles, i.e., gradient-based v.s. gradient-free. An appealing research direction is integrating Deep RL and EA to devise new methods by fusing their complementary advantages. However, existing works on combining Deep RL and EA have two common drawbacks:1) the RL agent and EA agents learn their policies individually, neglecting efficient sharing of useful common knowledge; 2) parameter-level policy optimization guarantees no semantic level of behavior evolution for the EA side. In this paper, we propose Evolutionary Reinforcement Learning with Two-scale State Representation and Policy Representation (ERL-Re$^2$), a novel solution to the aforementioned two drawbacks. The key idea of ERL-Re$^2$ is two-scale representation: all EA and RL policies share the same nonlinear state representation while maintaining individual linear policy representations. The state representation conveys expressive common features of the environment learned by all the agents collectively; the linear policy representation provides a favorable space for efficient policy optimization, where novel behavior-level crossover and mutation operations can be performed. Moreover, the linear policy representation allows convenient generalization of policy fitness with the help of Policy-extended Value Function Approximator (PeVFA), further improving the sample efficiency of fitness estimation. The experiments on a range of continuous control tasks show that ERL-Re$^2$ consistently outperforms strong baselines and achieves significant improvement over both its Deep RL and EA components. Link » Pengyi Li · Hongyao Tang · Jianye Hao · YAN ZHENG · Xian Fu · Zhaopeng Meng 🔗 - Quantization-aware Policy Distillation (QPD) (Poster)  link »    Recent advancements have made Deep Reinforcement Learning (DRL) exceedingly more powerful, but the produced models remain very computationally complex and therefore difficult to deploy on edge devices.Compression methods such as quantization and distillation can be used to increase the applicability of DRL models on these low-power edge devices by decreasing the necessary precision and number of operations respectively. Training in low-precision is notoriously less stable however, which is amplified by the decrease in representational power when limiting the number of trainable parameters. We propose Quantization-aware Policy Distillation (QPD), which overcomes this instability by providing a smoother transition from high to low-precision network parameters. A new distillation loss specifically designed for the compression of actor-critic networks is also defined, resulting in a higher accuracy after compression. Our experiments show that these combined methods can effectively compress a network down to 0.5% of its original size, without any loss in performance. Link » Thomas Avé · Kevin Mets · Tom De Schepper · Steven Latre 🔗 - Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search (Poster)  link »    Complex reasoning problems contain states that vary in the computational cost required to determine a good action plan. Taking advantage of this property, we propose Adaptive Subgoal Search (AdaSubS), a search method that adaptively adjusts the planning horizon. To this end, AdaSubS generates diverse sets of subgoals at different distances. A verification mechanism is employed to filter out unreachable subgoals swiftly, allowing to focus on feasible further subgoals. In this way, AdaSubS benefits from the efficiency of planning with longer subgoals and the fine control with the shorter ones, and thus scales well to difficult planning problems. We show that AdaSubS significantly surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik's Cube, and inequality proving benchmark INT. Link » Michał Zawalski · Michał Tyrolski · Konrad Czechowski · Damian Stachura · Piotr Piękos · Tomasz Odrzygóźdź · Yuhuai Wu · Łukasz Kuciński · Piotr Miłoś 🔗 - Cyclophobic Reinforcement Learning (Poster)  link »    In environments with sparse rewards finding a good inductive bias for exploration is crucial to the agent’s success. However, there are two competing goals: novelty search and systematic exploration. While existing approaches such as curiousity-driven exploration find novelty, they sometimes do not systematically explore the whole state space, akin to depth-first-search vs breadth-first-search. In this paper, we propose a new intrinsic reward that is cyclophobic, i.e. it does not reward novelty, but punishes redundancy by avoiding cycles. Augmenting the cyclophobic intrinsic reward with a sequence of hierarchical representations based on the agent’s cropped observations we are able to achieve excellent results in the MiniGrid and MiniHack environments. Both are particularly hard, as they require complex interactions with different objects in order to be solved. Detailed comparisons with previous approaches and thorough ablation studies show that our newly proposed cyclophobic reinforcement learning is vastly more efficient than other state of the art methods. Link » Stefan Wagner · Peter Arndt · Jan Robine · Stefan Harmeling 🔗 - AsymQ: Asymmetric Q-loss to mitigate overestimation bias in off-policy reinforcement learning (Poster)  link » It is well-known that off-policy deep reinforcement learning algorithms suffer from overestimation bias in value function approximation. Existing methods to reduce overestimation bias often utilize multiple value function estimators. Consequently, these methods have a larger time and memory consumption. In this work, we propose a new class of policy evaluation algorithms dubbed, \textbf{AsymQ}, that use asymmetric loss functions to train the Q-value network. Departing from the symmetric loss functions such as mean squared error~(MSE) and Huber loss on the Temporal difference~(TD) error, we adopt asymmetric loss functions of the TD-error to impose a higher penalty on overestimation error. We present one such AsymQ loss called \textbf{Softmax MSE~(SMSE)} that can be implemented with minimal modifications to the standard policy evaluation. Empirically, we show that using SMSE loss helps reduce estimation bias, and subsequently improves policy performance when combined with standard reinforcement learning algorithms. With SMSE, even the Deep Deterministic Policy Gradients~(DDPG) algorithm can achieve performance comparable to that of state-of-the-art methods such as the Twin-Delayed DDPG (TD3) and Soft Actor Critic~(SAC) on challenging environments in the OpenAI Gym MuJoCo benchmark. We additionally demonstrate that the proposed SMSE loss can also boost the performance of Deep Q learning (DQN) in Atari games with discrete action spaces. Link » Qinsheng Zhang · Arjun Krishna · Sehoon Ha · Yongxin Chen 🔗 - Fine-tuning Offline Policies with Optimistic Action Selection (Poster)  link »    Offline reinforcement learning algorithms can train performant policies for hard tasks using previously-collected datasets. However, the quality of the offline dataset often limits the levels of performance possible. We consider the problem of improving offline policies through online fine-tuning. Offline RL requires a pessimistic training objective to mitigate distributional shift between the trained policy and the offline behavior policy, which will make the trained policy averse to picking novel actions. In contrast, online RL requires exploration, or optimism. Thus, fine-tuning online policies with the offline training objective is not ideal. Additionally, loosening the fine-tuning objective to allow for more exploration can potentially destroy the behaviors learned in the offline phase because of the sudden and significant change in the optimization objective. To mitigate this challenge, we propose a method to facilitate exploration during online fine-tuning that maintains the same training objective throughout both offline and online phases, while encouraging exploration. We accomplish this by changing the action-selection method to be more optimistic with respect to the Q-function. By choosing to take actions in the environment with higher expected Q-values, our method is able to explore and improve behaviors more efficiently, obtaining 56% more returns on average than the alternative approaches on several locomotion, navigation, and manipulation tasks. Link » Max Sobol Mark · Ali Ghadirzadeh · Xi Chen · Chelsea Finn 🔗 - SEM2: Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model (Poster)  link »    End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals. Recent advanced methods construct a latent world model to map the high dimensional observations into compact latent space. However, the latent states embedded by the world model proposed in previous works may contain a large amount of task-irrelevant information, resulting in low sampling efficiency and poor robustness to input perturbations. Meanwhile, the training data distribution is usually unbalanced, and the learned policy is hard to cope with the corner cases during the driving process. To solve the above challenges, we present a semantic masked recurrent world model (SEM2), which introduces a latent filter to extract key task-relevant features and reconstruct a semantic mask via the filtered features, and is trained with a multi-source data sampler, which aggregates common data and multiple corner case data in a single batch, to balance the data distribution. Extensive experiments on CARLA show that our method outperforms the state-of-the-art approaches in terms of sample efficiency and robustness to input permutations. Link » Zeyu Gao · Yao Mu · Ruoyan Shen · Chen Chen · Yangang Ren · Jianyu Chen · Shengbo Li · Ping Luo · Yanfeng Lu 🔗 - Policy Architectures for Compositional Generalization in Control (Poster)  link »    Several tasks in control, robotics, and planning can be specified through desired goal configurations for entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. However, learning and generalizing on complex tasks can be challenging due to variations in number of entities or compositions of goals. To address this challenge, we introduce the Entity-Factored Markov Decision Process (EFMDP), a formal framework for modeling the entity-based compositional structure in control tasks. Geometrical properties of the EFMDP framework provide theoretical motivation for policy architecture design, particularly Deep Sets and popular relational mechanisms such as graphs and self attention. These structured policy architectures are flexible and can be trained end-to-end with standard reinforcement and imitation learning algorithms. We study and compare the learning and generalization properties of these architectures on a suite of simulated robot manipulation tasks, finding that they achieve significantly higher success rates with less data compared to standard multilayer perceptrons. Structured policies also enable broader and more compositional generalization, producing policies that \textbf{extrapolate} to different numbers of entities than seen in training, and \textbf{stitch} together (i.e. compose) learned skills in novel ways. Video results can be found at \url{https://sites.google.com/view/comp-gen-anon}. Link » Allan Zhou · Vikash Kumar · Chelsea Finn · Aravind Rajeswaran 🔗 - Rethinking Learning Dynamics in RL using Adversarial Networks (Poster)  link »    Recent years have seen tremendous progress in methods of reinforcement learning. However, most of these approaches have been trained in a straightforward fashion and are generally not robust to adversity, especially in the meta-RL setting. To the best of our knowledge, our work is the first to propose an adversarial training regime for Multi-Task Reinforcement Learning, which requires no manual intervention or domain knowledge of the environments. Our experiments on multiple environments in the Multi-Task Reinforcement learning domain demonstrate that the adversarial process leads to a better exploration of numerous solutions and a deeper understanding of the environment. We also adapt existing measures of causal attribution to draw insights from the skills learned, facilitating easier re-purposing of skills for adaptation to unseen environments and tasks. Link » Ramnath Kumar · Tristan Deleu · Yoshua Bengio 🔗 - Look Back When Surprised: Stabilizing Reverse Experience Replay for Neural Approximation (Poster)  link »    Experience replay-based sampling techniques are essential to several reinforcement learning (RL) algorithms since they aid in convergence by breaking spurious correlations. The most popular techniques, such as uniform experience replay(UER) and prioritized experience replay (PER), seem to suffer from sub-optimal convergence and significant bias error, respectively. To alleviate this, we introduce a new experience replay method for reinforcement learning, called IntrospectiveExperience Replay (IER). IER picks batches corresponding to data points consecutively before the ‘surprising’ points. Our proposed approach is based on the theoretically rigorous reverse experience replay (RER), which can be shown to remove bias in the linear approximation setting but can be sub-optimal with neural approximation. We show empirically that IER is stable with neural function approximation and has a superior performance compared to the state-of-the-art techniques like uniform experience replay (UER), prioritized experience replay(PER), and hindsight experience replay (HER) on the majority of tasks. Link » Ramnath Kumar · Dheeraj Nagaraj 🔗 - Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction (Poster)  link »    Improving the sample efficiency of reinforcement learning algorithms requires effective exploration. Following the principle of $\textit{optimism in the face of uncertainty}$ (OFU), we train a separate exploration policy to maximize the approximate upper confidence bound of the critics in an off-policy actor-critic framework. However, this introduces extra differences between the replay buffer and the target policy regarding their stationary state-action distributions. To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy RL training. In particular, we correct the training distribution for both policies and critics. Empirically, we evaluate our proposed method in several challenging continuous control tasks and show superior performance compared to state-of-the-art methods. We also conduct extensive ablation studies to demonstrate the effectiveness and rationality of the proposed method. Link » Jiachen Li · Shuo Cheng · Zhenyu Liao · Huayan Wang · William Yang Wang · Qinxun Bai 🔗 - Abstract-to-Executable Trajectory Translation for One-Shot Task Generalization (Poster)  link »    Training long-horizon robotic policies in complex physical environments is essential for many applications, such as robotic manipulation. However, learning a policy that can generalize to unseen tasks is challenging. In this work, we propose to achieve one-shot task generalization by decoupling plan generation and plan execution. Specifically, our method solves complex long-horizon tasks in three steps: build a paired abstract environment by simplifying geometry and physics, generate abstract trajectories, and solve the original task by an abstract-to-executable trajectory translator. In the abstract environment, complex dynamics such as physical manipulation are removed, making abstract trajectories easier to generate. However, this introduces a large domain gap between abstract trajectories and the actual executed trajectories as abstract trajectories lack low-level details and aren’t aligned frame-to-frame with the executed trajectory. In a manner reminiscent of language translation, our approach leverages a seq-to-seq model to overcome the large domain gap between the abstract and executable trajectories, enabling the low-level policy to follow the abstract trajectory. Experimental results on various unseen long-horizon tasks with different robot embodiments demonstrate the practicability of our methods to achieve one-shot task generalization. Videos and more details can be found in the supplementary materials and project page: https://sites.google.com/view/abstract-to-executable/ Link » Stone Tao · Xiaochen Li · Tongzhou Mu · Zhiao Huang · Yuzhe Qin · Hao Su 🔗 - Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier (Poster)  link » Increasing the replay ratio, the number of updates of an agent's parameters per environment interaction, is an appealing strategy for improving the sample efficiency of deep reinforcement learning algorithms. In this work, we show that fully or partially resetting the parameters of deep reinforcement learning agents causes better replay ratio scaling capabilities to emerge. We push the limits of the sample efficiency of carefully-modified algorithms by training them using an order of magnitude more updates than usual, significantly improving their performance in the Atari 100k and DeepMind Control Suite benchmarks. We then provide an analysis of the design choices required for favorable replay ratio scaling to be possible and discuss inherent limits and tradeoffs. Link » Pierluca D'Oro · Max Schwarzer · Evgenii Nikishin · Pierre-Luc Bacon · Marc Bellemare · Aaron Courville 🔗 - Adversarial Policies Beat Professional-Level Go AIs (Poster)  link »    We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >80% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo---in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes. Link » Tony Wang · Adam Gleave · Nora Belrose · Tom Tseng · Michael Dennis · Yawen Duan · Viktor Pogrebniak · Joseph Miller · Sergey Levine · Stuart J Russell 🔗 - VARIATIONAL REPARAMETRIZED POLICY LEARNING WITH DIFFERENTIABLE PHYSICS (Poster)  link »    We study the problem of policy parameterization for reinforcement learning (RL) with high-dimensional continuous action space. Our goal is to find a good way to parameterize the policy of continuous RL as a multi-modality distribution. To this end, we propose to treat the continuous RL policy as a generative model over the distribution of optimal trajectories. We use a diffusion process-like strategy to model the policy and derive a novel variational bound which is the optimization objective to learn the policy. To maximize the objective by gradient descent, we introduce the Reparameterized Policy Gradient Theorem. This theorem elegantly connects classical method REINFORCE and trajectory return optimization for computing the gradient of a policy. Moreover, our method enjoys strong exploration ability due to the multi-modality policy parameterization; notably, when a strong differentiable world model presents, our method also enjoys the fast convergence speed of trajectory optimization. We evaluate our method on numerical problems and manipulation tasks within a differentiable simulator. Qualitative results show its ability to capture the multi-modality distribution of optimal trajectories, and quantitative results show that it can avoid local optima and outperforms baseline approaches. Link » Zhiao Huang · Litian Liang · Zhan Ling · Xuanlin Li · Chuang Gan · Hao Su 🔗 - Efficient Multi-Task Reinforcement Learning via Selective Behavior Sharing (Poster)  link »    The ability to leverage shared behaviors between tasks is critical for sample efficient multi-task reinforcement learning (MTRL). Prior approaches based on parameter sharing or policy distillation share behaviors uniformly across tasks and states or focus on learning one optimal policy. Therefore, they are fundamentally limited when tasks have conflicting behaviors because no one optimal policy exists. Our key insight is that, we can instead share exploratory behavior which can be helpful even when the optimal behaviors differ. Furthermore, as we learn each task, we can guide the exploration by sharing behaviors in a task and state dependent way. To this end, we propose a novel MTRL method, Q-switch Mixture of policies (QMP), that learns to selectively shares exploratory behavior between tasks by using a mixture of policies based on estimated discounted returns to gather training data. Experimental results in manipulation and locomotion tasks demonstrate that our method outperforms prior behavior sharing methods, highlighting the importance of task and state dependent sharing. Link » Grace Zhang · Ayush Jain · Injune Hwang · Shao-Hua Sun · Joseph Lim 🔗 - Contrastive Example-Based Control (Poster)  link » While there are many real-world problems that might benefit from reinforcement learning, these problems rarely fit into the MDP mold: interacting with the environment is often prohibitively expensive and specifying reward functions is challenging. Motivated by these challenges, prior work has developed data-driven approaches that learn entirely from samples from the transition dynamics and examples of high-return states. These methods typically learn a reward function from the high-return states, use that reward function to label the transitions, and then apply an offline RL algorithm to these transitions. While these methods can achieve good results on many tasks, they can be complex, carefully regularizing the reward function and using temporal difference updates. In this paper, we propose a simple and scalable approach to offline example-based control. Unlike prior approaches (e.g., ORIL, VICE, PURL) that learn a reward function, our method will learn an implicit model of multi-step transitions. We show that this implicit model can represent the Q-values for the example-based control problem. Thus, whereas a learned reward function must be combined with an RL algorithm to determine good actions, our model can directly be used to determine these good actions. Across a range of state-based and image-based offline control tasks, we find that our method outperforms baselines that use learned reward functions. Link » Kyle Hatch · Sarthak J Shetty · Benjamin Eysenbach · Tianhe Yu · Rafael Rafailov · Russ Salakhutdinov · Sergey Levine · Chelsea Finn 🔗 - A study of natural robustness of deep reinforcement learning algorithms towards adversarial perturbations (Poster)  link »    Deep reinforcement learning (DRL) has been shown to have numerous potential applications in the real world. However, DRL algorithms are still extremely sensitive to noise and adversarial perturbations, hence inhibiting the deployment of RL in many real-life applications. Analyzing the robustness of DRL algorithms to adversarial attacks is an important prerequisite to enabling the widespread adoption of DRL algorithms. Common perturbations on DRL frameworks during test time include perturbations to the observation and the action channel. Compared with observation channel attacks, action channel attacks are less studied; hence, few comparisons exist that compare the effectiveness of these attacks in DRL literature. In this work, we examined the effectiveness of these two paradigms of attacks on common DRL algorithms and studied the natural robustness of DRL algorithms towards various adversarial attacks in hopes of gaining insights into the individual response of each type of algorithm under different attack conditions. Link » Qisai Liu · Xian Yeow Lee · Soumik Sarkar 🔗 - Multi-skill Mobile Manipulation for Object Rearrangement (Poster)  link »    We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location where a stationary manipulation skill can not reach its target to manipulate. To this end, we propose that the manipulation skills should include mobility to have flexibility in interacting with the target object from multiple locations and at the same time the navigation skill could have multiple end points which lead to successful manipulation. We operationalize these ideas by implementing mobile manipulation skills rather than stationary ones and training a navigation skill trained with region goal instead of point goal. We evaluate our multi-skill mobile manipulation method M3 on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), and show superior performance as compared to the baselines. Link » Jiayuan Gu · Devendra Singh Chaplot · Hao Su · Jitendra Malik 🔗 - Visual Reinforcement Learning with Self-Supervised 3D Representations (Poster)  link »    A prominent approach to visual Reinforcement Learning (RL) is to learn an internal state representation using self-supervised methods, which has the potential benefit of improved sample-efficiency and generalization through additional learning signal and inductive biases. However, while the real world is inherently 3D, prior efforts have largely been focused on leveraging 2D computer vision techniques as auxiliary self-supervision. In this work, we present a unified framework for self-supervised learning of 3D representations for motor control. Our proposed framework consists of two phases: a \textit{pretraining} phase where a deep voxel-based 3D autoencoder is pretrained on a large object-centric dataset, and a \textit{finetuning} phase where the representation is jointly finetuned together with RL on in-domain data. We empirically show that our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods. Additionally, our learned policies transfer zero-shot to a real robot setup with only approximate geometric correspondence, and successfully solve motor control tasks that involve grasping and lifting from \textit{a single, uncalibrated RGB camera}. Videos are available at https://3d4rl.github.io/. Link » Yanjie Ze · Nicklas Hansen · Yinbo Chen · Mohit Jain · Xiaolong Wang 🔗 - One-shot Visual Imitation via Attributed Waypoints and Demonstration Augmentation (Poster)  link »    In this paper, we analyze the behavior of existing techniques and design new solutions for the problem of one-shot visual imitation. In this setting, an agent must solve a novel instance of a novel task given just a single visual demonstration. Our analysis reveals that current methods fall short because of three errors: the DAgger problem arising from purely offline training, last centimeter errors in interacting with objects, and mis-fitting to the task context rather than to the actual task. This motivates the design of our modular approach where we a) separate out task inference (what to do) from task execution (how to do it), and b) develop data augmentation and generation techniques to mitigate mis-fitting. The former allows us to leverage hand-crafted motor primitives for task execution which side-steps the DAgger problem and last centimeter errors, while the latter gets the model to focus on the task rather than the task context. Our model gets 100% and 48% success rates on two recent benchmarks, improving upon the current state-of-the-art by absolute 90% and 20% respectively. Link » Matthew Chang · Saurabh Gupta 🔗 - Building a Subspace of Policies for Scalable Continual Learning (Poster)  link »    The ability to continuously acquire new knowledge and skills is crucial for autonomous agents. Existing methods are typically based on either fixed-size models that struggle to learn a large number of diverse behaviors, or growing-size models that scale poorly with the number of tasks. In this work, we aim to strike a better balance between scalability and performance by designing a method whose size grows adaptively depending on the task sequence. We introduce Continual Subspace of Policies (CSP), a new approach that incrementally builds a subspace of policies for training a reinforcement learning agent on a sequence of tasks. The subspace's high expressivity allows CSP to perform well for many different tasks while growing more slowly than the number of tasks. Our method does not suffer from forgetting and also displays positive transfer to new tasks. CSP outperforms a number of popular baselines on a wide range of scenarios from two challenging domains, Brax (locomotion) and Continual World (robotic manipulation). Interactive visualizations of the subspace can be found at https://share.streamlit.io/continual-subspace/policies/main. Link » Jean-Baptiste Gaya · Thang Long Doan · Lucas Page-Caccia · Laure Soulier · Ludovic Denoyer · Roberta Raileanu 🔗 - Skill Machines: Temporal Logic Composition in Reinforcement Learning (Poster)  link »    A major challenge in reinforcement learning is specifying tasks in a manner that is both interpretable and verifiable. One common approach is to specify tasks through reward machines---finite state machines that encode the task to be solved. We introduce skill machines, a representation that can be learned directly from these reward machines that encode the solution to such tasks. We propose a framework where an agent first learns a set of base skills in a reward-free setting, and then combines these skills with the learned skill machine to produce composite behaviours specified by any regular language, such as linear temporal logics. This provides the agent with the ability to map from complex logical task specifications to near-optimal behaviours zero-shot. We demonstrate our approach in both a tabular and high-dimensional video game environment, where an agent is faced with several of these complex, long-horizon tasks. Our results indicate that the agent is capable of satisfying extremely complex task specifications, producing near optimal performance with no further learning. Finally, we demonstrate that the performance of skill machines can be improved with regular off-policy reinforcement learning algorithms when optimal behaviours are desired. Link » Geraud Nangue Tasse · Devon Jarvis · Steven James · Benjamin Rosman 🔗 - Learning Representations for Reinforcement Learning with Hierarchical Forward Models (Poster)  link »    Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may miss relevant information if important environmental changes take many steps to manifest. We propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL in a suite of 30 robotic control tasks with and without distractors and a task of our creation. We find that HKSL either converges to higher or optimal episodic returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency. Link » Trevor McInroe · Lukas Schäfer · Stefano Albrecht 🔗 - In-context Reinforcement Learning with Algorithm Distillation (Poster)  link »    We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data. Link » Michael Laskin · Luyu Wang · Junhyuk Oh · Emilio Parisotto · Stephen Spencer · Richie Steigerwald · DJ Strouse · Steven Hansen · Angelos Filos · Ethan Brooks · Maxime Gazeau · Himanshu Sahni · Satinder Singh · Volodymyr Mnih 🔗 - Time-Myopic Go-Explore: Learning A State Representation for the Go-Explore Paradigm (Poster)  link »    Very large state spaces with a sparse reward signal are difficult to explore. The lack of a sophisticated guidance results in a poor performance for numerous reinforcement learning algorithms. In these cases, the commonly used random exploration is often not helpful. The literature shows that this kind of environments require enormous efforts to systematically explore large chunks of the state space. Learned state representations can help here to improve the search by providing semantic context and build a structure on top of the raw observations. In this work we introduce a novel time-myopic state representation that clusters temporal close states together while providing a time prediction capability between them. By adapting this model to the Go-Explore paradigm (Ecoffet et al., 2021b), we demonstrate the first learned state representation that reliably estimates novelty instead of using the hand-crafted representation heuristic. Our method shows an improved solution for the detachment problem which still remains an issue at the Go-Explore Exploration Phase. We provide evidence that our proposed method covers the entire state space with respect to all possible time trajectories — without causing disadvantageous conflict-overlaps in the cell archive. Analogous to native Go-Explore, our approach is evaluated on the hard exploration environments MontezumaRevenge, Gravitar and Frostbite (Atari) in order to validate its capabilities on difficult tasks. Our experiments show that time-myopic Go-Explore is an effective alternative for the domain-engineered heuristic while also being more general. The source code of the method is available on GitHub. Link » Marc Höftmann · Jan Robine · Stefan Harmeling 🔗 - MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations (Poster)  link »    Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which forms the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 160%-250%more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Link » Nicklas Hansen · Yixin Lin · Hao Su · Xiaolong Wang · Vikash Kumar · Aravind Rajeswaran 🔗 - Scaling up and Stabilizing Differentiable Planning with Implicit Differentiation (Poster)  link »    Differentiable planning promises end-to-end differentiability and adaptivity.However, an issue prevents it from scaling up to larger-scale problems: theyneed to differentiate through forward iteration layers to compute gradients, which couples forward computation and backpropagation and needs to balance forward planner performance and computational cost of the backward pass.To alleviate this issue, we propose to differentiate through the Bellman fixed-point equation to decouple forward and backward passes for Value Iteration Network and its variants, which enables constant backward cost (in planning horizon) and flexible forward budget and helps scale up to large tasks.We study the convergence stability, scalability, and efficiency of the proposed implicit version of VIN and its variants and demonstrate their superiorities on a range of planning tasks: 2D navigation, visual navigation, and 2-DOF manipulation in configuration space and workspace. Link » Linfeng Zhao · Huazhe Xu · Lawson Wong 🔗 - Graph Inverse Reinforcement Learning from Diverse Videos (Poster)  link »    Research on Inverse Reinforcement Learning (IRL) from third-person videos has shown encouraging results on removing the need for manual reward design for robotic tasks. However, most prior works are still limited by training from a relatively restricted domain of videos. In this paper, we argue that the true potential of third-person IRL lies in increasing the diversity of videos for better scaling. To learn a reward function from diverse videos, we propose to perform graph abstraction on the videos followed by temporal matching in the graph space to measure the task progress. Our insight is that a task can be described by entity interactions that form a graph, and this graph abstraction can help remove irrelevant information such as textures, resulting in more robust reward functions. We evaluate our approach, GraphIRL, on cross-embodiment learning in X-MAGICAL and learning from human demonstrations for real-robot manipulation. We show significant improvements in robustness to diverse video demonstrations over previous approaches, and even achieve better results than manual reward design on a real robot pushing task. Videos are available at https://graphirl.github.io/. Link » Sateesh Kumar · Jonathan Zamora · Nicklas Hansen · Rishabh Jangir · Xiaolong Wang 🔗 - Simple Emergent Action Representations from Multi-Task Policy Training (Poster)  link »    Low-level sensory and motor signals in the high-dimensional spaces (e.g., image observations or motor torques) in deep reinforcement learning are complicated to understand or harness for downstream tasks directly. While sensory representations have been widely studied, the representations of actions that form motor skills are yet under exploration. In this work, we find that when a multi-task policy network takes as input states and task embeddings, a space based on the task embeddings emerges to contain meaningful action representations with moderate constraints. Within this space, interpolated or composed embeddings can serve as a high-level interface to instruct the agent to perform meaningful action sequences. Empirical results not only show that the proposed action representations have efficacy for intra-action interpolation and inter-action composition with limited or no learning, but also demonstrate their superior ability in task adaptation to strong baselines in Mujoco locomotion tasks. The evidence elucidates that learning action representations is a promising direction toward efficient, adaptable, and composable RL, forming the basis of abstract action planning and the understanding of motor signal space. Anonymous project page: https://sites.google.com/view/emergent-action-representation Link » Pu Hua · Yubei Chen · Huazhe Xu 🔗 - Adversarial Cheap Talk (Poster)  link » Adversarial attacks in reinforcement learning (RL) often assume highly-privileged access to the victim’s parameters, environment, or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP in which an Adversary can merely append deterministic messages to the Victim’s observation, resulting in a minimal range of influence. The Adversary cannot occlude ground truth, influence underlying environment dynamics or reward signals, introduce non-stationarity, add stochasticity, see the Victim’s actions, or access their parameters. Additionally, we present a simple meta-learning algorithm called Adversarial Cheap Talk (ACT) to train Adversaries in this setting. We demonstrate that an Adversary trained with ACT can still significantly influence the Victim’s training and testing performance, despite the highly constrained setting. Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms. More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner’s function approximation, or instead helping the Victim’s performance by outputting useful features. Finally, we show that an ACT Adversary can manipulate messages during train-time to directly and arbitrarily control the Victim at test-time. Link » Chris Lu · Timon Willi · Alistair Letcher · Jakob Foerster 🔗 - On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning (Poster)  link »    Reinforcement Learning (RL) algorithms can solve challenging control problems directly from image observations, but they often require millions of environment interactions to do so. Recently, model-based RL algorithms have greatly improved sample-efficiency by concurrently learning an internal model of the world, and supplementing real environment interactions with imagined rollouts for policy improvement. However, learning an effective model of the world from scratch is challenging, and in stark contrast to humans that rely heavily on world understanding and visual cues for learning new skills. In this work, we investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster. We propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models. By proper pretraining and concurrent cross-task online fine-tuning, we achieve substantial improvements over a baseline trained from scratch; we improve mean performance of model-based algorithm EfficientZero by 23%, and by as much as 73% in some instances. Link » yifan xu · Nicklas Hansen · Zirui Wang · Yung-Chieh Chan · Hao Su · Zhuowen Tu 🔗 - SPRINT: Scalable Semantic Policy Pre-training via Language Instruction Relabeling (Poster)  link »    We propose SPRINT, an approach for scalable offline policy pre-training based on natural language instructions. SPRINT pre-trains an agent’s policy to execute a diverse set of semantically meaningful skills that it can leverage to learn new tasks faster. Prior work on offline pre-training required tedious manual definition of pre-training tasks or learned semantically meaningless skills via random goal-reaching. Instead, our approach SPRINT (Scalable Pre-training via Relabeling Language INsTructions) leverages natural language instruction labels on offline agent experience, collected at scale (e.g., via crowd-sourcing), to define a rich set of tasks with minimal human effort. Furthermore, by using natural language to define tasks, SPRINT can use pre-trained large language models to automatically expand the initial task set. By relabeling and aggregating task instructions, even across multiple training trajectories, we can learn a large set of new skills during pre-training. In experiments using a realistic household simulator, we show that agents pre-trained with SPRINT learn new long-horizon household tasks substantially faster than with previous pre-training approaches. Link » Jesse Zhang · Karl Pertsch · Jiahui Zhang · Taewook Nam · Sung Ju Hwang · Xiang Ren · Joseph Lim 🔗 - Towards True Lossless Sparse Communication in Multi-Agent Systems (Poster)  link »    Communication enables agents to cooperate to achieve their goals. Learning when to communicate, i.e., sparse (in time) communication, and whom to message is particularly important when bandwidth is limited. Recent work in learning sparse individualized communication, however, suffers from high variance during training, where decreasing communication comes at the cost of decreased reward, particularly in cooperative tasks. We use the information bottleneck to reframe sparsity as a representation learning problem, which we show naturally enables lossless sparse communication at lower budgets than prior art. In this paper, we propose a method for true lossless sparsity in communication via Information Maximizing Gated Sparse Multi-Agent Communication (IMGS-MAC). Our model uses two individualized regularization objectives, an information maximization autoencoder and sparse communication loss, to create informative and sparse communication. We evaluate the learned communication language' through direct causal analysis of messages in non-sparse runs to determine the range of lossless sparse budgets, which allow zero-shot sparsity, and the range of sparse budgets that will inquire a reward loss, which is minimized by our learned gating function with few-shot sparsity. To demonstrate the efficacy of our results, we experiment in cooperative multi-agent tasks where communication is essential for success. We evaluate our model with both continuous and discrete messages. We focus our analysis on a variety of ablations to show the effect of message representations, including their properties, and lossless performance of our model. Link » Seth Karten · Mycal Tucker · Siva Kailas · Katia Sycara 🔗 - Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning (Poster)  link »    No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model. Link » Anton Bakhtin · David Wu · Adam Lerer · Jonathan Gray · Athul Jacob · Gabriele Farina · Alexander Miller · Noam Brown 🔗 - PnP-Nav: Plug-and-Play Policies for Generalizable Visual Navigation Across Robots (Poster)  link »    Learning provides a powerful tool for vision-based navigation, but the capabilities of learning-based policies are constrained by limited training data. If we could combine data from all available sources, including multiple kinds of robots, we could train more powerful navigation models. In this paper, we study how goal-conditioned policies for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots, and enable broad generalization across environments and embodiments. We analyze the necessary design decisions for effective data sharing across different robots, including the use of temporal context and standardized action spaces, and demonstrate that an omnipolicy trained from heterogeneous datasets outperforms policies trained on any single dataset. We curate 60 hours of navigation trajectories from 6 distinct robots, and deploy the trained omnipolicy on a range of new robots, including an underactuated quadrotor. We also find that training on diverse, multi-robot datasets leads to robustness against degradation in sensing and actuation. Using a pre-trained base navigational omnipolicy with broad generalization capabilities can bootstrap navigation applications on novel robots going forward, and we hope that PnP represents a step in that direction. Link » Dhruv Shah · Ajay Sridhar · Arjun Bhorkar · Noriaki Hirose · Sergey Levine 🔗 - Offline Reinforcement Learning for Customizable Visual Navigation (Poster)  link » Robotic navigation often requires not only reaching a distant goal, but also satisfying intermediate user preferences on the path, such as obeying the rules of the road or preferring some surfaces over others. Our goal in this paper is to devise a robotic navigation system that can utilize previously collect data to learn navigational strategies that are responsive to user-specified utility functions, such as preferring specific surfaces or staying in sunlight (e.g., to maintain solar power). To this end, we show how offline reinforcement learning can be used to learn reward-specific value functions for long-horizon navigation that can then be composed with planning methods to reach distant goals, while still remaining responsive to user-specified navigational preferences. This approach can utilize large amounts of previously collected data, which is relabeled with the task reward. This makes it possible to incorporate diverse data sources and enable effective generalization in the real world, without any simulation, task-specific data collection, or demonstrations. We evaluate our system, ReViND, using a large navigational dataset from prior work, without any data collection specifically for the reward functions that we test. We demonstrate that our system can control a real-world ground robot to navigate to distant goals using only offline training from this dataset, and exhibit behaviors that qualitatively differ based on the user-specified reward function. Link » Dhruv Shah · Arjun Bhorkar · Hrishit Leen · Ilya Kostrikov · Nicholas Rhinehart · Sergey Levine 🔗 - Multi-Source Transfer Learning for Deep Model-Based Reinforcement Learning (Poster)  link » A crucial challenge in reinforcement learning is to reduce the number of interactions with the environment that an agent requires to master a given task. Transfer learning proposes to address this issue by re-using knowledge from previously learned tasks. However, determining which source task qualifies as optimal for knowledge extraction, as well as the choice regarding which algorithm components to transfer, represent severe obstacles to its application in reinforcement learning. The goal of this paper is to alleviate these issues with modular multi-source transfer learning techniques. Our proposed methodologies automatically learn how to extract useful information from source tasks, regardless of the difference in state-action space and reward function. We support our claims with extensive and challenging cross-domain experiments for visual control. Link » Remo Sasso · Matthia Sabatelli · Marco Wiering 🔗 - Hyperbolic Deep Reinforcement Learning (Poster)  link »    We propose a new class of deep reinforcement learning (RL) algorithms that model latent representations in hyperbolic space. Sequential decision-making requires reasoning about the possible future consequences of current behavior. Consequently, capturing the relationship between key evolving features for a given task is conducive to recovering effective policies. To this end, hyperbolic geometry provides deep RL models with a natural basis to precisely encode this inherently hierarchical information. However, applying existing methodologies from the hyperbolic deep learning literature leads to fatal optimization instabilities due to the non-stationarity and variance characterizing RL gradient estimators. Hence, we design a new general method that counteracts such optimization challenges and enables stable end-to-end learning with deep hyperbolic representations. We empirically validate our framework by applying it to popular on-policy and off-policy RL algorithms on the Procgen and Atari 100K benchmarks, attaining near universal performance and generalization benefits. Given its natural fit, we hope future RL research will consider hyperbolic representations as a standard tool. Link » Edoardo Cetin · Benjamin Chamberlain · Michael Bronstein · jonathan j hunt 🔗 - Investigating Multi-task Pretraining and Generalization in Reinforcement Learning (Poster)  link » Deep reinforcement learning (RL) has achieved remarkable successes in complex single-task settings. However, learning policies that can perform multiple tasks and leverage prior experience to learn faster remains challenging. Despite previous attempts to improve on these areas, our understanding of multi-task training and generalization in reinforcement learning remains limited. In this work we propose to investigate the generalization capabilities of a popular actor-critic method, IMPALA. We build on previous work that has advocated for the use of modes and difficulties of Atari 2600 games as a benchmark for transfer learning in reinforcement learning. We do so by pretraining an agent on multiple flavours of the same game before finetuning on the remaining unseen ones. This protocol simplifies the multi-task pretraining phase by limiting negative interference between tasks and allows us to better understand the dynamics of multi-task training and generalization. We find that, given a fixed amount of pretraining data, agents trained with more variations of a game are able to generalize better. Surprisingly we observe that this advantage can be more pronounced after finetuning for 200M environment frames than when doing zero-shot transfer. This highlights the importance of the learned representation and that performance after finetuning might more appropriate to evaluate generalization in reinforcement learning. We also find that, even though small networks have remained popular to solve Atari 2600 games increasing the capacity of the value and policy network is critical to achieve good performance as we increase the number of pretraining modes and difficulties. Overall our findings emphasize key points that are crucial for efficient multi-task training and generalization in reinforcement learning. Link » Adrien Ali Taiga · Rishabh Agarwal · Jesse Farebrother · Aaron Courville · Marc Bellemare 🔗 - Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning (Poster)  link »    Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime due to the function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness that can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly-expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and we add a term maximizing action-values into the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions that are near the behavior policy. We show the expressiveness of the diffusion model-based policy, and the coupling of the behavior cloning and policy improvement under the diffusion model both contribute to the outstanding performance of Diffusion-QL. We illustrate the superiority of our method compared to prior works in a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks. Link » Zhendong Wang · jonathan j hunt · Mingyuan Zhou 🔗 - Efficient Exploration using Model-Based Quality-Diversity with Gradients (Poster)  link »    Exploration is a key challenge in Reinforcement Learning, especially in long-horizon, deceptive and sparse-reward environments. For such applications, population-based approaches have proven effective. Methods such as Quality-Diversity deals with this by encouraging novel solutions and producing a diversity of behaviours. However, these methods are driven by either undirected sampling (i.e. mutations) or use approximated gradients (i.e. Evolution Strategies) in the parameter space, which makes them highly sample-inefficient. In this paper, we propose a model-based Quality-Diversity approach, relying on gradients and learning in imagination. Our approach optimizes all members of a population simultaneously to maintain both performance and diversity efficiently by leveraging the effectiveness of QD algorithms as good data generators to train deep models. We demonstrate that it maintains the divergent search capabilities of population-based approaches while significantly improving their sample efficiency (5 times faster) and quality of solutions (2 times more performant). Link » Bryan Lim · Manon Flageat · Antoine Cully 🔗 - Choreographer: Learning and Adapting Skills in Imagination (Poster)  link »    Unsupervised skill learning aims to learn a rich repertoire of behaviors without external supervision, providing artificial agents with the ability to control and influence the environment. However, without appropriate knowledge and exploration, skills may provide control only over a restricted area of the environment, limiting their applicability. Furthermore, it is unclear how to leverage the learned skill behaviors for adapting to downstream tasks in a data-efficient manner. We present Choreographer, a model-based agent that exploits its world model to learn and adapt skills in imagination. Our method decouples the exploration and skill learning processes, being able to discover skills in the latent state space of the model. During adaptation, the agent uses a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Choreographer is able to learn skills both from offline data, and by collecting data simultaneously with an exploration policy. The skills can be used to effectively adapt to downstream tasks, as we show in the URL benchmark, where we outperform previous approaches from both pixels and states inputs. The skills also explore the environment thoroughly, finding sparse rewards more frequently, as shown in goal-reaching tasks from the DMC Suite and Meta-World. Project website: https://doubleblind-repos.github.io/ Link » Pietro Mazzaglia · Tim Verbelen · Bart Dhoedt · Alexandre Lacoste · Sai Rajeswar Mudumba 🔗 - Giving Robots a Hand: Broadening Generalization via Hand-Centric Human Video Demonstrations (Poster)  link » Videos of humans performing tasks are a promising data source for robotic manipulation because they are easy to collect in a wide range of scenarios and thus have the potential to significantly expand the generalization capabilities of vision-based robotic manipulators. Prior approaches to learning from human video demonstrations typically use third-person or egocentric data, but a central challenge that must be overcome there is the domain shift caused by the difference in appearance between human and robot morphologies. In this work, we largely reduce this domain gap by collecting hand-centric human video data (i.e., videos captured by a human demonstrator wearing a camera on their arm). To further close the gap, we simply crop out a portion of every visual observation such that the hand is no longer visible. We propose a framework for broadening the generalization of deep robotic imitation learning policies by incorporating unlabeled data in this format---without needing to employ any domain adaptation method, as the human embodiment is not visible in the frame. On a suite of six real robot manipulation tasks, our method substantially improves the generalization performance of manipulation policies acting on hand-centric image observations. Moreover, our method enables robots to generalize to both new environment configurations and new tasks that are unseen in the expert robot imitation data. Link » Moo J Kim · Jiajun Wu · Chelsea Finn 🔗 - Efficient Offline Policy Optimization with a Learned Model (Poster)  link »    MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, thus costing huge computing time. This paper investigates a few hypotheses where MuZero Unplugged may not work well under the offline RL settings, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; 4) with a low compute budget. We propose to use a regularized one-step look-ahead approach to tackle the above issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy improvements are towards the direction that maximizes the estimated advantage with regularization of the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it uses only 5.6% wall-clock time (i.e., 1 hour) compared to MuZero Unplugged (i.e., 17.8 hours) to achieve a 150% IQM normalized score with the same hardware and software stacks. Link » Zichen Liu · Siyi Li · Wee Sun Lee · Shuicheng Yan · Zhongwen Xu 🔗 - Emergent collective intelligence from massive-agent cooperation and competition (Poster)  link »    Inspired by organisms evolving through cooperation and competition between different populations on Earth, we study the emergence of artificial collective intelligence through massive-agent reinforcement learning. To this end, We propose a new massive-agent reinforcement learning environment, Lux, where dynamic and massive agents in two teams scramble for limited resources and fight off the darkness. In Lux, we build our agents through the standard reinforcement learning algorithm in curriculum learning phases and leverage centralized control via a pixel-to-pixel policy network. As agents co-evolve through self-play, we observe several stages of intelligence, from the acquisition of atomic skills to the development of group strategies. Since these learned group strategies arise from individual decisions without an explicit coordination mechanism, we claim that artificial collective intelligence emerges from massive-agent cooperation and competition. We further analyze the emergence of various learned strategies through metrics and ablation studies, aiming to provide insights for reinforcement learning implementations in massive-agent environments. Link » Hanmo Chen · Stone Tao · JIAXIN CHEN · Weihan Shen · Xihui Li · Chenghui Yu · Sikai Cheng · Xiaolong Zhu · Xiu Li 🔗 - Distance-Sensitive Offline Reinforcement Learning (Poster)  link »    In offline reinforcement learning (RL), one detrimental issue to policy learning is the error accumulation of deep \textit{Q} function in out-of-distribution (OOD) areas. Unfortunately, existing offline RL methods are often over-conservative, inevitably hurting generalization performance outside data distribution. In our study, one interesting observation is that deep \textit{Q} functions approximate well inside the convex hull of training data. Inspired by this, we propose a new method, \textit{DOGE (Distance-sensitive Offline RL with better GEneralization)}. DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining policy within data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm enjoys better generalization compared to state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach to existing methods that are solely based on data distribution or support constraints. Link » Li Jianxiong · Xianyuan Zhan · Haoran Xu · Xiangyu Zhu · Jingjing Liu · Ya-Qin Zhang 🔗 - Uncertainty-Driven Exploration for Generalization in Reinforcement Learning (Poster)  link » Value-based methods tend to outperform policy optimization methods when trained and tested in single environments; however, they significantly underperform when trained on multiple environments with similar characteristics and tested on new ones from the same distribution. We investigate the potential reasons behind the poor generalization performance of value-based methods and discover that exploration plays a crucial role in these settings. Exploration is helpful not only for finding optimal solutions to the training environments, but also for acquiring knowledge that helps generalization to other unseen environments. We show how to make value-based methods competitive with policy optimization methods in these settings by using uncertainty-driven exploration and distribtutional RL. Our algorithm is the first value-based method to achieve state-of-the-art on both Procgen and Crafter, two challenging benchmarks for generalization in RL. Link » Yiding Jiang · J. Zico Kolter · Roberta Raileanu 🔗 - Language Models Can Teach Themselves to Program Better (Poster)  link »    Recent Language Models (LMs) achieve breakthrough performance in code generation when trained on human-authored problems, even solving some competitive-programming problems. Self-play has proven useful in games such as Go, and thus it is natural to ask whether LMs can generate their own instructive programming problems to improve their performance. We show that it is possible for an LM to synthesize programming problems and solutions, which are filtered for correctness by a Python interpreter. The LM’s performance is then seen to improve when it is fine-tuned on its own synthetic problems and verified solutions; thus the model “improves itself” using the Python interpreter. Problems are specified formally as programming puzzles [Schuster et al., 2021], a code-based problem format where solutions can easily be verified for correctness by execution. In experiments on publicly-available LMs, test accuracy more than doubles. This RL approach demonstrates the potential for code LMs, with an interpreter, to generate instructive problems and improve their own performance. Link » Patrick Haluptzok · Matthew Bowers · Adam Kalai 🔗 - Graph Q-Learning for Combinatorial Optimization (Poster)  link »    Graph-structured data is ubiquitous throughout natural and social sciences, and Graph Neural Networks (GNNs) have recently been shown to be effective at solving prediction and inference problems on graph data. In this paper, we propose and demonstrate that GNNs can and should be applied to solve Combinatorial Optimization (CO) problems. Combinatorial Optimization (CO) concerns optimizing a function over a discrete solution space that is often intractably large. To learn to solve CO problems, we phrase specifying a candidate solution as a sequential decision-making problem, where the return is related to how close the candidate solution is to optimality. We use a GNN to learn a policy to iteratively build increasingly promising candidate solutions. We present preliminary evidence that GNNs trained through Q-Learning can solve CO problems with performance approaching state-of-the-art heuristic-based solvers, using only a fraction of the parameters and training time. Link » Victoria Magdalena Dax · Jiachen Li · Kevin Leahy · Mykel J Kochenderfer 🔗 - Transformer-based World Models Are Happy With 100k Interactions (Poster)  link »    Deep neural networks have been successful in many reinforcement learning settings. However, compared to human learners they are overly data hungry. To build a sample-efficient world model, we apply a transformer to real-world episodes in an autoregressive manner: not only the compact latent states and the taken actions but also the experienced or predicted rewards are fed into the transformer, so that it can attend flexibly to all three modalities at different time steps. The transformer allows our world model to access previous states directly, instead of viewing them through a compressed recurrent state. By utilizing the Transformer-XL architecture, it is able to learn long-term dependencies while staying computationally efficient. Our transformer-based world model (TWM) generates meaningful, new experience, which is used to train a policy that outperforms previous model-free and model-based reinforcement learning algorithms on the Atari 100k benchmark. Link » Jan Robine · Marc Höftmann · Tobias Uelwer · Stefan Harmeling 🔗 - Contrastive Value Learning: Implicit Models for Simple Offline RL (Poster)  link »    Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks. Link » Bogdan Mazoure · Benjamin Eysenbach · Ofir Nachum · Jonathan Tompson 🔗 - CASA: Bridging the Gap between Policy Improvement and Policy Evaluation with Conflict Averse Policy Iteration (Poster)  link »    We study the problem of model-free reinforcement learning, which is often solved following the principle of Generalized Policy Iteration (GPI). While GPI is typically an interplay between policy evaluation and policy improvement, most conventional model-free methods with function approximation assume the independence of GPI steps, despite of the inherent connections between them. In this paper, we present a method that attempts to eliminate the inconsistency between policy evaluation step and policy improvement step, leading to a conflict averse GPI solution with gradient-based functional approximation. Our method is capital to balancing exploitation and exploration between policy-based and value-based methods and is applicable to existed policy-based and value-based methods. We conduct extensive experiments to study theoretical properties of our method and demonstrate the effectiveness of our method on Atari 200M benchmark. Link » Changnan Xiao · Haosen Shi · Jiajun Fan · Shihong Deng · Haiyan Yin 🔗 - MAESTRO: Open-Ended Environment Design for Multi-Agent Reinforcement Learning (Poster)  link »    Open-ended learning methods that automatically generate a curriculum of increasingly challenging tasks serve as a promising avenue toward generally capable reinforcement learning (RL) agents. Existing methods adapt curricula independently over either environment parameters (in single-agent settings) or co-player policies (in multi-agent settings). However, the strengths and weaknesses of co-players can manifest themselves differently depending on environmental features. It is thus crucial to consider the dependency between the environment and co-player when shaping a curriculum in multi-agent domains. In this work, we use this insight and extend Unsupervised Environment Design (UED) to multi-agent environments. We then introduce Multi-Agent Environment-Space Response Oracles (MAESTRO), the first multi-agent UED approach for two-player zero-sum settings. MAESTRO efficiently produces adversarial, joint curricula over both environment parameters and co-player policies and attains minimax-regret guarantees at Nash equilibrium. Our experiments show that MAESTRO outperforms a number of strong baselines on competitive two-player environments, spanning discrete and continuous control. Link » Mikayel Samvelyan · Akbir Khan · Michael Dennis · Minqi Jiang · Jack Parker-Holder · Jakob Foerster · Roberta Raileanu · Tim Rocktäschel 🔗 - Pink Noise Is All You Need: Colored Noise Exploration in Deep Reinforcement Learning (Poster)  link »    In off-policy deep reinforcement learning with continuous action spaces, exploration is often implemented by injecting action noise into the action selection process. Popular algorithms based on stochastic policies, such as SAC or MPO, inject white noise by sampling actions from uncorrelated Gaussian distributions. In many tasks, however, white noise does not provide sufficient exploration, and temporally correlated noise is used instead. A common choice is Ornstein-Uhlenbeck (OU) noise, which is closely related to Brownian motion (red noise). Both red noise and white noise belong to the broad family of colored noise. In this work, we perform a comprehensive experimental evaluation on MPO and SAC to explore the effectiveness of other colors of noise as action noise. We find that pink noise, which is halfway between white and red noise, significantly outperforms white noise, OU noise, and other alternatives on a wide range of environments. Thus, we recommend it as the default choice for action noise in continuous control. Link » Onno Eberhard · Jakob Hollenstein · Cristina Pinneri · Georg Martius 🔗 - Evaluating Long-Term Memory in 3D Mazes (Poster)  link » Intelligent agents need to remember salient information to reason in partially-observed environments. For example, agents with a first-person view should remember the positions of relevant objects even if they go out of view. Similarly, to effectively navigate through rooms agents need to remember the floor plan of how rooms are connected. However, most benchmark tasks in reinforcement learning do not test long-term memory in agents, slowing down progress in this important research direction. In this paper, we introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents. Unlike existing benchmarks, Memory Maze measures long-term memory separate from confounding agent abilities and requires the agent to localize itself by integrating information over time. With Memory Maze, we propose an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation. Recording a human player establishes a strong baseline and verifies the need to build up and retain memories, which is reflected in their gradually increasing rewards within each episode. We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes, leaving room for future algorithmic designs to be evaluated on the Memory Maze. Link » Jurgis Pašukonis · Timothy Lillicrap · Danijar Hafner 🔗 - Visual Imitation Learning with Patch Rewards (Poster)  link » Visual imitation learning enables reinforcement learning agents to learn to behave from expert visual demonstrations such as videos or image sequences, without explicit, well-defined rewards. Previous reseaches either adopt supervised learning techniques or induce simple and coarse scalar rewards from pixels, neglecting the dense information contained in the image demonstrations.In this work, we propose to measure the expertise of various local regions of image samples, or called patches, and recover multi-dimensional patch rewards accordingly. Patch reward is a more precise rewarding characterization that serves as fine-grained expertise measurement and visual explainability tool.Specifically, we present Adversarial Imitation Learning with Patch Rewards (PatchAIL), which employs a patch-based discriminator to measure the expertise of different local parts from given images and provide patch rewards.The patch-based knowledge is also used to regularize the aggregated reward and stabilize the training.We evaluate our method on the standard pixel-based benchmark DeepMind Control Suite. The experiment results have demonstrated that PatchAIL outperforms baseline methods and provides valuable interpretations for visual demonstrations. Link » Minghuan Liu · Tairan He · Weinan Zhang · Shuicheng Yan · Zhongwen Xu 🔗 - Memory-Efficient Reinforcement Learning with Priority based on Surprise and On-policyness (Poster)  link »    In off-policy reinforcement learning, an agent collects transition data (a.k.a. experience tuples) from the environment and stores them in a replay buffer for the incoming parameter updates. Storing those tuples consumes a large amount of memory when the environment observations are given as images. Large memory consumption is especially problematic when reinforcement learning methods are applied in scenarios where the computational resources are limited. In this paper, we introduce a method to prune relatively unimportant experience tuples by a simple metric that estimates the importance of experiences and saves the overall memory consumption by the buffer. To measure the importance of experiences, we use $\textit{surprise}$ and $\textit{on-policyness}$. Surprise is quantified by the information gain the model can obtain from the experiences and on-policyness ensures that they are relevant to the current policy. In our experiments, we empirically show that our method can significantly reduce the memory consumption by the replay buffer without decreasing the performance in vision-based environments. Link » Ryosuke Unno · Yoshimasa Tsuruoka 🔗 - Learning a Domain-Agnostic Policy through Adversarial Representation Matching for Cross-Domain Policy Transfer (Poster)  link »    The low transferability of learned policies is one of the most critical problems limiting the applicability of learning-based solutions to decision-making tasks. In this paper, we present a way to align latent representations of states and actions between different domains by optimizing an adversarial objective. We train two models, a policy and a domain discriminator, with unpaired trajectories of proxy tasks through behavioral cloning as well as adversarial training. After the latent representations are aligned between domains, a domain-agnostic part of the policy trained with any method in the source domain can be immediately transferred to the target domain in a zero-shot manner. We empirically show that our simple approach achieves comparable performance to the latest methods in zero-shot cross-domain transfer. We also observe that our method performs better than other approaches in transfer between domains with different complexities, whereas other methods fail catastrophically. Link » Hayato Watahiki · Ryo Iwase · Ryosuke Unno · Yoshimasa Tsuruoka 🔗 - Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning (Poster)  link »    Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can change many pixels in the image, which can lead to drastic changes in the agent's latent representation of the image, causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled image representations exploiting the sequential nature of RL observations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure of the representation, we also find that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g.\ background colour) as well as unseen values of variables that affect the optimal policy (e.g.\ goal positions). Link » Mhairi Dunion · Trevor McInroe · Kevin Sebastian Luck · Josiah Hanna · Stefano Albrecht 🔗 - Toward Effective Deep Reinforcement Learning for 3D Robotic Manipulation: End-to-End Learning from Multimodal Raw Sensory Data (Poster)  link »    Sample-efficient reinforcement learning (RL) methods capable of learning directly from raw sensory data without the use of human-crafted representations would open up real-world applications in robotics and control. Recent advances in visual RL have shown that learning a latent representation together with existing RL algorithms closes the gap between state-based and image-based training. However, image-based training is still significantly sample-inefficient with respect to learning in 3D continuous control problems (for example, robotic manipulation) compared to state-based training. In this study, we propose an effective model-free off-policy RL method for 3D robotic manipulation that can be trained in an end-to-end manner from multimodal raw sensory data obtained from a vision camera and a robot's joint encoders, without the need for human-crafted representations. Notably, our method is capable of learning a latent multimodal representation and a policy in an efficient, joint, and end-to-end manner from multimodal raw sensory data. Our method, which we dub MERL: Multimodal End-to-end Reinforcement Learning, results in a simple but effective approach capable of significantly outperforming both current state-of-the-art visual RL and state-based RL methods with respect to sample efficiency, learning performance, and training stability in relation to 3D robotic manipulation tasks from DeepMind Control. Link » Samyeul Noh · Hyun Myung 🔗 - Momentum Boosted Episodic Memory for Improving Learning in Long-Tailed RL Environments (Poster)  link »    Conventional Reinforcement Learning (RL) algorithms assume the distribution of the data to be uniform or mostly uniform. However, this is not the case with most real-world applications like autonomous driving or in nature, where animals roam. Some objects are encountered frequently, and most of the remaining experiences occur rarely; the resulting distribution is called Zipfian. Taking inspiration from the theory of complementary learning systems, an architecture for learning from Zipfian distributions is proposed where long tail states are discovered in an unsupervised manner and states along with their recurrent activation are kept longer in episodic memory. The recurrent activations are then reinstated from episodic memory using a similarity search, giving weighted importance. The proposed architecture yields improved performance in a Zipfian task over conventional architectures. Our method outperforms IMPALA by a significant margin of 20.3% when maps/objects occur with a uniform distribution and by 50.2% on the rarest 20% of the distribution. Link » Dolton Fernandes · Pramod Kaushik · Harsh Shukla · Raju Bapi 🔗 - A Ranking Game for Imitation Learning (Poster)  link »    We propose a new framework for imitation learning---treating imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, near-optimal expert data can be difficult to obtain, and even in the limit of infinite data cannot imply a total ordering over trajectories as preferences can. On the other hand, learning from preferences alone is challenging as a large number of preferences are required to infer a high-dimensional reward function, though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences and vice versa. We instantiate the proposed ranking-game framework with a novel ranking loss giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. Link » Harshit Sushil Sikchi · Akanksha Saran · Wonjoon Goo · Scott Niekum 🔗 - Implicit Offline Reinforcement Learning via Supervised Learning (Poster)  link » Offline Reinforcement Learning (RL) via Supervised Learning is a simple and effective way to learn robotic skills from a dataset of varied behaviors. It is as simple as supervised learning and Behavior Cloning (BC) but takes advantage of the return information. On BC tasks, implicit models have been shown to match or outperform explicit ones. Despite the benefits of using implicit models to learn robotic skills via BC, Offline RL via Supervised Learning algorithms have been limited to explicit models. We show how implicit models leverage return information and match or outperform explicit algorithms to acquire robotic skills from fixed datasets. Furthermore, we show how closely related our implicit methods are to other popular RL via Supervised Learning algorithms. Link » Alexandre Piche · Rafael Pardinas · David Vazquez · Igor Mordatch · Igor Mordatch · Chris Pal 🔗 - Distributional deep Q-learning with CVaR regression (Poster)  link »    Reinforcement learning (RL) allows an agent interacting sequentially with an environment to maximize its long-term return, in expectation. In distributional RL (DRL), the agent is also interested in the probability distribution of the return, not just its expected value. This so-called distributional perspective of RL has led to new algorithms with improved empirical performance. In this paper, we recall the atomic DRL (ADRL) framework based on atomic distributions projected via the Wasserstein-2 metric. Then, we derive two new deep ADRL algorithms, namely SAD-Q-learning and MAD-Q-learning (both for the control task). Numerical experiments on various environments compare our approach against existing deep (distributional) RL methods. Link » Mastane Achab · REDA ALAMI · YASSER ABDELAZIZ DAHOU DJILALI · Kirill Fedyanin · Eric Moulines · Maxim Panov 🔗 - The Surprising Effectiveness of Latent World Models for Continual Reinforcement Learning (Poster)  link »    We study the use of model-based reinforcement learning methods, in particular, world models for continual reinforcement learning. In continual reinforcement learning, an agent is required to solve one task and then another sequentially while retaining performance and preventing \emph{forgetting} on past tasks. World models offer a \emph{task-agnostic} solution: they do not require knowledge of task changes. World models are a straight-forward baseline for continual reinforcement learning for three main reasons. Firstly, forgetting in the world model is prevented by persisting existing experience replay buffers across tasks, experience from previous tasks is replayed for learning the world model. Secondly, they are sample efficient. Thirdly and finally, they offer a task-agnostic exploration strategy through the uncertainty in the trajectories generated by the world model. We show that world models are a simple and effective continual reinforcement learning baseline. We study their effectiveness on Minigrid and Minihack continual reinforcement learning benchmarks and show that it outperforms state-of-the-art task-agnostic continual reinforcement learning methods. Link » Samuel Kessler · Piotr Miłoś · Jack Parker-Holder · S Roberts 🔗 - Understanding Hindsight Goal Relabeling Requires Rethinking Divergence Minimization (Poster)  link »    Hindsight goal relabeling has become a foundational technique for multi-goal reinforcement learning (RL). The idea is quite simple: any arbitrary trajectory can be seen as an expert demonstration for reaching the trajectory's end state. Intuitively, this procedure trains a goal-conditioned policy to imitate a sub-optimal expert. However, this connection between imitation and hindsight relabeling is not well understood. Modern imitation learning algorithms are described in the language of divergence minimization, and yet it remains an open problem how to recast hindsight goal relabeling into that framework. In this work, we develop a unified objective for goal-reaching that explains such a connection, from which we can derive goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles. Experimentally, we find that despite recent advances in goal-conditioned behaviour cloning (BC), multi-goal Q-learning can still outperform BC-like methods; moreover, a vanilla combination of both actually hurts model performance. Under our framework, we study when BC is expected to help, and empirically validate our findings. Our work further bridges goal-reaching and generative modeling, illustrating the nuances and new pathways of extending the success of generative models to RL. Link » Lunjun Zhang · Bradly Stadie 🔗 - Perturbed Quantile Regression for Distributional Reinforcement Learning (Poster)  link »    Distributional reinforcement learning aims to learn distribution of return under stochastic environments. Since the learned distribution of return contains rich information about the stochasticity of the environment, previous studies have relied on descriptive statistics, such as standard deviation, for optimism in the face of uncertainty. However, using the uncertainty from an empirical distribution can hinder convergence and performance when exploring with the certain criterion that has an one-sided tendency on risk in these methods. In this paper, we propose a novel distributional reinforcement learning that explores by randomizing risk criterion to reach a risk-neutral optimal policy. First, we provide a perturbed distributional Bellman optimality operator by distorting the risk measure in action selection. Second, we prove the convergence and optimality of the proposed method by using the weaker contraction property. Our theoretical results support that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return distribution. Finally, we empirically show that our method outperforms other existing distribution-based algorithms in various environments including 55 Atari games. Link » Taehyun Cho · Seungyub Han · Heesoo Lee · Kyungjae Lee · Jungwoo Lee 🔗 - Concept-based Understanding of Emergent Multi-Agent Behavior (Poster)  link »    This work studies concept-based interpretability in the context of multi-agent learning. Unlike supervised learning, where there have been efforts to understand a model's decisions, multi-agent interpretability remains under-investigated. This is in part due to the increased complexity of the multi-agent setting---interpreting the decisions of multiple agents over time is combinatorially more complex than understanding individual, static decisisons---but is also a reflection of the limited availability of tools for understanding multi-agent behavior. Interactions between agents, and coordination generally, remain difficult to gauge in MARL. In this work, we propose Concept Bottleneck Policies (CBPs) as a method for learning intrinsically interpretable, concept-based policies with MARL. We demonstrate that, by conditioning each agent's action on a set of human-understandable concepts, our method enables post-hoc behavioral analysis via concept intervention that is infeasible with standard policy architectures. Experiments show that concept interventions over CBPs reliably detect when agents have learned to coordinate with each other in environments that do not demand coordination, and detect those environments in which coordination is required. Moreover, we find evidence that CBPs can detect coordination failures (such as lazy agents) and expose the low-level inter-agent information that underpins emergent coordination. Finally, we demonstrate that our approach matches the performance of standard, non-concept-based policies; thereby achieving interpretability without sacrificing performance. Link » Niko Grupen · Shayegan Omidshafiei · Natasha Jaques · Been Kim 🔗 - Constrained Imitation Q-learning with Earth Mover’s Distance reward (Poster)  link »    We propose constrained Earth Mover's Distance (CEMD) Imitation Q-learning that combines the exploration power of Reinforcement Learning (RL) and the sample efficiency of Imitation Learning (IL). Sample efficiency makes Imitation Q-learning a suitable approach for robot learning. For Q-learning, immediate rewards can be efficiently computed by a greedy variant of Earth Mover's Distance (EMD) between the observed state-action pairs and state-actions in stored expert demonstrations. In CEMD, we constrain the otherwise non-stationary greedy EMD reward by proposing a greedy EMD upper bound estimate and a generic Q-learning lower bound. In PyBullet continuous control benchmarks, CEMD is more sample efficient, achieves higher performance and yields less variance than its competitors. Link » WENYAN Yang · Nataliya Strokina · Joni Pajarinen · Joni-kristian Kamarainen 🔗 - Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement (Poster)  link »    Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of underlying entities that take the value of object states. Worse, these entities are often unknown and must be inferred from sensory percepts. We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured inputs. By constructing a factorized transition graph over clusters of object representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, which outperforms current offline deep RL methods when evaluated on a set of simulated rearrangement and stacking tasks. Link » Michael Chang · Alyssa L Dayan · Franziska Meier · Tom Griffiths · Sergey Levine · Amy Zhang 🔗 - SoftTreeMax: Policy Gradient with Tree Search (Poster)  link »    Policy-gradient methods are widely used for learning control policies. They can be easily distributed to multiple workers and reach state-of-the-art results in many domains. Unfortunately, they exhibit large variance and subsequently suffer from high-sample complexity since they aggregate gradients over entire trajectories. At the other extreme, planning methods, like tree search, optimize the policy using single-step transitions that consider future lookahead. These approaches have been mainly considered for value-based algorithms. Planning-based algorithms require a forward model and are computationally intensive at each step, but are more sample efficient. In this work, we introduce SoftTreeMax, the first approach that integrates tree-search into policy gradient. Traditionally, gradients are computed for single state-action pairs. Instead, our tree-based policy structure leverages all gradients at the tree leaves in each environment step. This allows us to reduce the variance of gradients by three orders of magnitude and to benefit from better sample complexity compared with standard policy gradient. On Atari, SoftTreeMax demonstrates up to 5x better performance in faster run-time compared with distributed PPO. Link » Gal Dalal · Assaf Hallak · Shie Mannor · Gal Chechik 🔗 - Dynamic Collaborative Multi-Agent Reinforcement Learning Communication for Autonomous Drone Reforestation (Poster)  link »    We approach autonomous drone-based reforestation with a collaborative multi-agent reinforcement learning (MARL) setup. Agents can communicate as part of a dynamically changing network. We explore collaboration and communication on the back of a high-impact problem. Forests are the main resource to control rising CO2 conditions. Unfortunately, the global forest volume is decreasing at an unprecedented rate. Many areas are too large and hard to traverse to plant new trees. To efficiently cover as much area as possible, here we propose a Graph Neural Network (GNN) based communication mechanism that enables collaboration. Agents can share location information on areas needing reforestation, which increases viewed area and planted tree count. We compare our proposed communication mechanism with a multi-agent baseline without the ability to communicate. Results show how communication enables collaboration and increases collective performance, planting precision and the risk-taking propensity of individual agents. Link » Philipp Siedler 🔗 - Hypernetwork-PPO for Continual Reinforcement Learning (Poster)  link »    Continually learning new capabilities in different environments, and being ableto solve multiple complex tasks is of great importance for many robotics appli-cations. Modern reinforcement learning algorithms such as Proximal Policy Op-timization can successfully handle surprisingly difficult tasks, but are generallynot suited for multi-task or continual learning. Hypernetworks are a promisingapproach for avoiding catastrophic forgetting, and have previously been used suc-cessfully for continual model-learning in model-based RL. We propose HN-PPO,a continual model-free RL method employing a hypernetwork to learn multiplepolicies in a continual manner using PPO. We demonstrate our method on Door-Gym, and show that it is suitable for solving tasks involving complex dynamicssuch as door opening, while effectively protecting against catastrophic forgetting Link » Philemon Schöpf · Sayantan Auddy · Jakob Hollenstein · Antonio Rodriguez-sanchez 🔗 - DRL-EPANET: Deep reinforcement learning for optimal control at scale in Water Distribution Systems (Poster)  link »    Deep Reinforcement learning has known a revolution in recent years, it has allowed researchers to tackle a wide range of sequential decision problems that were inaccessible to previous methods. However, the use of this technique in Water Distribution Systems is still very shy. In this paper, we show that DRL can be coupled with the widely popular hydraulic simulator Epanet, and that DRL-Epanet can be used on a number of WDS problems that represent a challenge to current techniques. We take as a concrete example the problem of pressure control in WDS. We show that DRL-Epanet can scale to huge action spaces, and we demonstrate its effectiveness on a problem with more than 1 million possible actions at each time step. We also show that it can deal with uncertainty such as stochastic demands, contamination, or other risks, as an example, we take on the problem of pressure control in the presence of random pipe bursts. We show that the BDQ algorithm is able to learn in this setting and we improve it with an algorithmic modification BDQF (BDQ with Fixed actions) which achieves better rewards especially when allowed actions are sparse in the action space. Finally, we argue that DRL-Epanet can be used for real-time control in smart WDS, another advantage over current methods. Link » Anas Belfadil · David Modesto · Jose Martin H. 🔗 - Actor Prioritized Experience Replay (Poster)  link »    A widely-studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error. Although it has been shown that PER is one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms actor-critic algorithms in continuous control. We theoretically show that actor networks cannot be effectively trained with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a new branch of improvements to PER for actor-critic methods, which also regards issues with stability and recent findings behind the poor empirical performance of the algorithm. An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method obtains substantial gains over PER. Link » Baturay Saglam · Furkan Burak Mutlu · Doğan Can Çiçek · Suleyman Kozat 🔗 - Model and Method: Training-Time Attack for Cooperative Multi-Agent Reinforcement Learning (Poster)  link » The robustness of deep cooperative multi-agent reinforcement learning (MARL) is of great concern and limits the application to real-world risk-sensitive tasks. Adversarial attack is a promising direction to study and improve the robustness of MARL but is largely under-studied. Previous work focuses on deploy-time attacks which may exaggerate attack performance because the MARL learner even does not anticipate the attacker. In this paper, we propose training-time attacks where the learner is allowed to observe and adapt to poisoned experience. For the stealthiness of attacks, we contaminate action sampling and restrict the attack budget so that non-adversarial agents cannot distinguish attacks from exploration noise. We derive two specific attack methods by modeling the influence of action-sampling on experience replay and further on team performance. Experiments show that our methods significantly undermine MARL algorithms by subtly disturbing the exploration-exploitation balance during the learning process. Link » Siyang Wu · Tonghan Wang · Xiaoran Wu · Jingfeng ZHANG · Yujing Hu · Changjie Fan · Chongjie Zhang 🔗 - Converging to Unexploitable Policies in Continuous Control Adversarial Games (Poster)  link »    Fictitious Self-Play (FSP) is an iterative algorithm capable of learning approximate Nash equilibria in many types of two-player zero-sum games. In FSP, at each iteration, a best response is learned to the opponent's meta strategy. However, FSP can be slow to converge in continuous control games in which two embodied agents compete against one another. We propose Adaptive FSP (AdaptFSP), a deep reinforcement learning (RL) algorithm inspired by FSP. The main idea is that instead of training a best response only against the meta strategy, we additionally train against an adaptive deep RL agent that can adapt to the best response. In four test domains, two tabular cases--random normal-form matrix games, Leduc poker--and two continuous control tasks--Thou Shall Not Pass and a soccer environment--we show that AdaptFSP achieves lower exploitability more quickly than vanilla FSP. Link » Maxwell Goldstein · Noam Brown 🔗 - Do As You Teach: A Multi-Teacher Approach to Self-Play in Deep Reinforcement Learning (Poster)  link »    A long-running challenge in the reinforcement learning (RL) community has been to train a goal-conditioned agent in a sparse reward environment such that it could also generalize to other unseen goals. Empirical results in Fetch-Reach and a novel driving simulator demonstrate that our proposed algorithm, Multi-Teacher Asymmetric Self-Play, allows one agent (i.e., a teacher) to create a successful curriculum for another agent (i.e., the student). Surprisingly, results also show that training with multiple teachers actually helps the student learn faster. Our analysis shows that multiple teachers can provide better coverage of the state space, selecting diverse sets of goals, and better helping a student learn. Moreover, results show that completely new students can learn offline from the goals generated by teachers that trained with a previous student. This is crucial in the context of industrial robotics where repeatedly training a teacher agent is expensive and sometimes infeasible. Link » Chaitanya Kharyal · Tanmay Sinha · Vijaya Sai Krishna Gottipati · Srijita Das · Matthew Taylor 🔗 - On All-Action Policy Gradients (Poster)  link »    In this paper, we analyze the variance of stochastic policy gradient with many action samples per state (all-action SPG). We decompose the variance of SPG and derive an optimality condition for all-action SPG. The optimality condition shows when all-action SPG should be preferred over single-action counterpart and allows to determine a variance-minimizing sampling scheme in SPG estimation. Furthermore, we propose dynamics-all-action (DAA) module, an augmentation that allows for all-action sampling without manipulation of the environment. DAA addresses the problems associated with using a Q-network for all-action sampling and can be readily applied to any on-policy SPG algorithm. We find that using DAA with a canonical on-policy algorithm (PPO) yields better sample efficiency and higher policy returns on a variety of challenging continuous action environments. Link » Michal Nauman · Marek Cygan 🔗 - A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning (Poster)  link »    As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting. One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage-weighted regression and conditional behavioral cloning, are simple and stable. Critic regularization is more challenging to implement correctly and typically requires more compute, but has appealing lower-bound guarantees. Empirically, prior work alternates between claiming better results with one-step RL and critic regularization. In this paper, we draw a close connection between these methods: applying a multi-step critic regularization method with a large regularization coefficient yields the same policy as one-step RL. Practical implementations violate our assumptions and critic regularization is typically applied with small regularization coefficients. Nonetheless, our experiments nevertheless show that our analysis makes accurate, testable predictions about practical offline RL methods (CQL and one-step RL) with commonly-used hyperparameters. Link » Benjamin Eysenbach · Matthieu Geist · Russ Salakhutdinov · Sergey Levine 🔗 - The Benefits of Model-Based Generalization in Reinforcement Learning (Poster)  link »    Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone. Link » Kenny Young · Aditya Ramesh · Louis Kirsch · Jürgen Schmidhuber 🔗 - Training graph neural networks with policy gradients to perform tree search (Poster)  link »    Monte Carlo Tree Search has shown to be a well-performing approach for decision problems such as board games and Atari games, but relies on heuristic design decisions that are non-adaptive and not necessarily optimal for all problems. Learned policies and value functions can augment MCTS by leveraging the state information at the nodes in the search tree. However, these learned functions do not take the search tree structure into account and can be sensitive to value estimation errors. In this paper, we propose a new method that, using Reinforcement Learning, learns how to expand the search tree and make decisions using Graph Neural Networks. This enables the policy to fully leverage the search tree and learn how to search based on the specific problem. Firstly, we show in an environment where state information is limited that the policy is able to leverage information from the search tree. Concluding, we find that the method outperforms popular baselines on two diverse and problems known to require planning: Sokoban and the Travelling salesman problem. Link » Matthew Macfarlane · Diederik Roijers · Herke van Hoof 🔗 - Co-Imitation: Learning Design and Behaviour by Imitation (Poster)  link »    The co-adaptation of robots has been a long-standing research endeavour with the goal of adapting both body and behaviour of a system for a given task, inspired by the natural evolution of animals. Co-adaptation has the potential to eliminate costly manual hardware engineering as well as improve the performance of systems.The standard approach to co-adaptation is to use a reward function for optimizing behaviour and morphology. However, defining and constructing such reward functions is notoriously difficult and often a significant engineering effort.This paper introduces a new viewpoint on the co-adaptation problem, which we call co-imitation: finding a morphology and a policy that allow an imitator to closely match the behaviour of a demonstrator.To this end we propose a co-imitation methodology for adapting behaviour and morphology by matching state distributions of the demonstrator. Specifically, we focus on the challenging scenario with mismatched state- and action-spaces between both agents.We find that co-imitation increases behaviour similarity across a variety of tasks and settings, and demonstrate co-imitation by transferring human walking, jogging and kicking skills onto a simulated humanoid. Link » Chang Rajani · Karol Arndt · David Blanco-Mulero · Kevin Sebastian Luck · Ville Kyrki 🔗 - Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning (Poster)  link »    Exploration is critical for deep reinforcement learning in complex environments with high-dimensional observations and sparse rewards. To address this problem, recent approaches proposed to leverage intrinsic rewards to improve exploration, such as novelty-based exploration and prediction-based exploration. However, many intrinsic reward modules require sophisticated structures and representation learning, resulting in prohibitive computational complexity and unstable performance. In this paper, we propose Rewarding Episodic Visitation Discrepancy (REVD), a computation-efficient and quantified exploration method. More specifically, REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes. To estimate the divergence efficiently, a $k$-nearest neighbor estimator is utilized with a randomly-initialized state encoder. Finally, the REVD is tested on Atari games and PyBullet Robotics Environments. Extensive experiments demonstrate that REVD can significantly improve the sample efficiency of reinforcement learning algorithms and outperform the benchmarking methods. Link » Mingqi Yuan · Bo Li · Xin Jin · Wenjun Zeng 🔗 - BLaDE: Robust Exploration via Diffusion Models (Poster)  link »    We present Bootstrap your own Latents with Diffusion models for Exploration (BLaDE), a general approach for curiosity-driven exploration in complex, partially-observable and stochastic environments. BLaDE is a natural extension of Bootstrap Your Own Latents for Exploration (BYOL-Explore) which is a multi-step prediction-error method at the latent level that learns a world representation, the world dynamics, and provides an intrinsic-reward all-together by optimizing a single prediction loss with no additional auxiliary objective. Contrary to BYOL-Explore that predicts future latents from past latents and future open-loop actions, BLaDE predicts, via a diffusion model, future latents from past observations, future open-loop actions and a noisy version of future latents. Leaking information about future latents allows to obtain an intrinsic reward that does not depend on the variance of the distribution of future latents which makes the method agnostic to stochastic traps. Our experiments on different noisy versions of Montezuma's Revenge show that BLaDE handles stochasticity better than Random Network Distillation, Intrinsic Curiosity Module and BYOL-Explore without degrading the performance of BYOL-Explore in the non-noisy and fairly deterministic Link » Bilal Piot · Zhaohan Guo · Shantanu Thakoor · Mohammad Gheshlaghi Azar 🔗 - Learning Semantics-Aware Locomotion Skills from Human Demonstrations (Poster)  link »    The semantics of the environment, such as the terrain types and properties, reveal important information for legged robots to adjust their behaviors. In this work, we present a framework that uses semantic information from RGB images to adjust the speeds and gaits for quadrupedal robots, such that the robot can traversethrough complex offroad terrains. Due to the lack of high-fidelity offroad simulation, our framework needs to be trained directly in the real world, which brings unique challenges in sample efficiency and safety. To ensure sample efficiency, we pre-train the perception model on an off-road driving dataset. To avoid the risks of real-world policy exploration, we leverage human demonstration to train a speed policy that selects a desired forward speed from camera image. For maximum traversability, we pair the speed policy with a gait selector, which selects a robust locomotion gait for each forward speed. Using only 40 minutes of human demonstration data, our framework learns to adjust the speed and gait of the robot based on perceived terrain semantics, and enables the robot to walk over 6km safely and efficiently. Link » Yuxiang Yang · Xiangyun Meng · Wenhao Yu · Tingnan Zhang · Jie Tan · Byron Boots 🔗 - Imitation from Observation With Bootstrapped Contrastive Learning (Poster)  link » Imitation from observation is a paradigm that consists of training agents using observations of expert demonstrations without direct access to the actions. Depending on the problem configuration, these demonstrations can be sequences of states or raw visual observations.One of the most common procedures adopted to solve this problem is to train a reward function from the demonstrations, but this task still remains a significant challenge.We approach this problem with a method of agent behavior representation in a latent space using demonstration videos.Our approach exploits recent algorithms of contrastive learning of image and video and uses a bootstrapping method to progressively train a trajectory encoding function with respect to the variation of the agent policy. This function is then used to compute the rewards provided to a standard Reinforcement Learning (RL) algorithm.Our method uses only a limited number of videos produced by an expert and we do not have access to the expert policy function.Our experiments show promising results on a set of continuous control tasks and demonstrate that learning a behavior encoder from videos allows building an efficient reward function for the agent. Link » Medric Sonwa · Johanna Hansen · Eugene Belilovsky 🔗 - PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm (Poster)  link »    Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to $25\%$ larger hypervolume for challenging continuous control tasks compared to prior approaches using an order of magnitude fewer trainable parameters while achieving broad and dense Pareto front solutions. Link » Toygun Basaklar · Suat Gumussoy · Umit Ogras 🔗 - Improving Assistive Robotics with Deep Reinforcement Learning (Poster)  link » Assistive Robotics is a class of robotics concerned with aiding humans in daily care tasks that they may be inhibited from doing due to disabilities or age. While research has demonstrated that classical control methods can be used to design policies to complete these tasks, these methods can be difficult to generalize to a variety of instantiations of a task. Reinforcement learning can provide a solution to this issue, wherein robots are trained in simulation and their policies are transferred to real-world machines. In this work, we replicate a published baseline for training robots on three tasks in the Assistive Gym environment, and we explore the usage of a Recurrent Neural Network policy and Phasic Policy Gradient learning to augment the original work. Our baseline implementation meets or exceeds the baseline of the original work, however, we found that our explorations into the new methods was not as effective as we anticipated. We discuss the results of our baseline and analyze why our new methods were not as successful. Link » Yash Jakhotiya · Iman Haque 🔗 - Selectively Sharing Experiences Improves Multi-Agent Reinforcement Learning (Poster)  link »    We present a novel multi-agent RL approach, Selective Multi-Agent PER, in which agents share with other agents a limited number of transitions they observe during training. They follow a similar heuristic as is used in (single-agent) Prioritized Experience Replay, and choose those transitions based on their td-error. The intuition behind this is that even a small number of relevant experiences from other agents could help each agent learn. Unlike many other multi-agent RL algorithms, this approach allows for largely decentralized training, requiring only a limited communication channel between agents. We show that our approach outperforms baseline no-sharing decentralized training. Further, sharing only a small number of experiences outperforms sharing all experiences between agents, and the performance uplift from selective experience sharing is robust across a range of hyperparameters. Link » Matthias Gerstgrasser · Tom Danino · Sarah Keren 🔗 - Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning (Poster)  link »    The Vision Transformer architecture has shown to be competitive in the computer vision (CV) space where it has dethroned convolution-based networks in several benchmarks. Nevertheless, Convolutional Neural Networks (CNN) remain the preferential architecture for the representation module in Reinforcement Learning. In this work, we study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess data-efficiency gains from this training framework. We propose a new self-supervised learning method called TOV-VICReg that extends VICReg to better capture temporal relations between observations by adding a temporal order verification task. Furthermore, we evaluate the resultant encoders with Atari games in a sample-efficiency regime. Our results show that the vision transformer, when pretrained with TOV-VICReg, outperforms the other self-supervised methods but still struggles to overcome a CNN. Nevertheless, we were able to outperform a CNN in two of the ten games where we perform a 100k steps evaluation. Ultimately, we believe that such approaches in Deep Reinforcement Learning (DRL) might be the key to achieving new levels of performance as seen in natural language processing and computer vision. Link » Manuel Goulão · Arlindo L Oliveira 🔗 - Variance Reduction in Off-Policy Deep Reinforcement Learning using Spectral Normalization (Poster)  link »    Off-policy deep reinforcement learning algorithms like Soft Actor Critic (SAC) have achieved state-of-the-art results in several high dimensional continuous control tasks. Despite their success, they are prone to instability due to the \textit{deadly triad} of off-policy training, function approximation, and bootstrapping. Unstable training of off-policy algorithms leads to sample inefficient and sub-optimal asymptotic performance, thus preventing their real-world deployment. To mitigate these issues, previously proposed solutions have focused on advances like target networks to alleviate instability and the introduction of twin critics to address overestimation bias. However, these modifications fail to address the issue of noisy gradient estimation with excessive variance, resulting in instability and slow convergence. Our proposed method, Spectral Normalized Actor Critic (SNAC), regularizes the actor and the critics using spectral normalization to systematically bound the gradient norm. Spectral normalization constrains the magnitudes of the gradients resulting in smoother actor-critics with robust and sample-efficient performance thus making them suitable for deployment in stability-critical and compute-constrained applications. We present empirical results on several challenging reinforcement learning benchmarks and extensive ablation studies to demonstrate the effectiveness of our proposed method. Link » Payal Bawa · Rafael Oliveira · Fabio Ramos 🔗 - Planning Immediate Landmarks of Targets for Model-Free Skill Transfer across Agents (Poster)  link » In reinforcement learning applications, agents usually need to deal with various input/output features when specified with different state and action spaces by their developers or physical restrictions, indicating re-training from scratch and considerable sample inefficiency, especially when agents follow similar solution steps to achieve tasks.In this paper, we aim to transfer pre-trained skills to alleviate the above challenge. Specifically, we propose PILoT, i.e., Planning Immediate Landmarks of Targets. PILoT utilizes the universal decoupled policy optimization to learn a goal-conditioned state planner; then, we distill a goal-planner to plan immediate landmarks in a model-free style that can be shared among different agents. In our experiments, we show the power of PILoT on various transferring challenges, including few-shot transferring across action spaces and dynamics, from low-dimensional vector states to image inputs, from simple robot to complicated morphology; and we also illustrate PILoT provides a zero-shot transfer solution from a simple 2D navigation task to the harder Ant-Maze task. Link » Minghuan Liu · Zhengbang Zhu · Menghui Zhu · Yuzheng Zhuang · Weinan Zhang · Jianye Hao 🔗 - Guided Skill Learning and Abstraction for Long-Horizon Manipulation (Poster)  link »    To assist with everyday human activities, robots must solve complex long-horizon tasks and generalize to new settings. Recent deep reinforcement learning (RL) methods show promises in fully autonomous learning, but they struggle to reach long-term goals in large environments. On the other hand, Task and Motion Planning (TAMP) approaches excel at solving and generalizing across long-horizon tasks, thanks to their powerful state and action abstractions. But they assume predefined skill sets, which limits their real-world applications. In this work, we combine the benefits of these two paradigms and propose an integrated task planning and skill learning framework named LEAGUE (Learning and Abstraction with Guidance). LEAGUE leverages symbolic interface of a task planner to guide RL-based skill learning and creates abstract state space to enable skill reuse. More importantly, LEAGUE learns manipulation skills in-situ of the task planning system, continuously growing its capability and the set of tasks that it can solve. We demonstrate LEAGUE on three challenging simulated task domains and show that LEAGUE outperforms baselines by a large margin, and that the learned skills can be reused to accelerate learning in new tasks and domains. Additional resource is available at https://bit.ly/3eUOx4N. Link » Shuo Cheng · Danfei Xu 🔗 - Locally Constrained Representations in Reinforcement Learning (Poster)  link »    The success of Reinforcement Learning (RL) heavily relies on the ability to learn robust representations from the observations of the environment. In most cases, the representations learned purely by the reinforcement learning loss can differ vastly across states depending on how the value functions change. However, the representations learned need not be very specific to the task at hand. Relying only on the RL objective may yield representations that vary greatly across successive time steps. In addition, since the RL loss has a changing target, the representations learned would depend on how good the current values/policies are. Thus, disentangling the representations from the main task would allow them to focus more on capturing transition dynamics which can improve generalization. To this end, we propose locally constrained representations, where an auxiliary loss forces the state representations to be predictable by the representations of the neighbouring states. This encourages the representations to be driven not only by the value/policy learning but also self-supervised learning, which constrains the representations from changing too rapidly. We evaluate the proposed method on several known benchmarks and observe strong performance. Especially in continuous control tasks, our experiments show a significant advantage over a strong baseline. Link » Somjit Nath · Samira Ebrahimi Kahou 🔗 - Sample-efficient Adversarial Imitation Learning (Poster)  link »    Imitation learning, wherein learning is performed by demonstration, has been studied and advanced for sequential decision-making tasks in which a reward function is not predefined. However, imitation learning methods still require numerous expert demonstration samples to successfully imitate an expert's behavior. To improve sample efficiency, we utilize self-supervised representation learning, which can generate vast training signals from the given data. In this study, we propose a self-supervised representation-based adversarial imitation learning method to learn state and action representations that are robust to diverse distortions and temporally predictive, on non-image control tasks. Particularly, in comparison with existing self-supervised learning methods for tabular data, we propose a different corruption method for state and action representations robust to diverse distortions. The proposed method shows a 39% relative improvement over the existing adversarial imitation learning methods on MuJoCo in a setting limited to 100 expert state-action pairs. Moreover, we conduct comprehensive ablations and additional experiments using demonstrations with varying optimality to provide the intuitions of a range of factors. Link » Dahuin Jung · Hyungyu Lee · Sungroh Yoon 🔗 - Prioritizing Samples in Reinforcement Learning with Reducible Loss (Poster)  link »    Most reinforcement learning algorithms take advantage of an experience replay buffer to repeatedly train on samples the agent has observed in the past. This prevents catastrophic forgetting, however simply assigning equal importance to each of the samples is a naive strategy. In this paper, we propose a method to prioritize samples based on how much we can learn from a sample. We define the learn-ability of a sample as the steady decrease of the training loss associated with this sample over time. We develop an algorithm to prioritize samples with high learn-ability, while assigning lower priority to those that are hard-to-learn, typically caused by noise or stochasticity. We empirically show that our method is more robust than random sampling and also better than just prioritizing with respect to the training loss, i.e. the temporal difference loss, which is used in vanilla prioritized experience replay. Link » Shivakanth Sujit · Somjit Nath · Pedro Braga · Samira Ebrahimi Kahou 🔗 - PCRL: Priority Convention Reinforcement Learning for Microscopically Sequencable Multi-agent Problems (Poster)  link »    Reinforcement learning (RL) has played an important role in tackling the decision problems emerging from agent fields. However, RL still has challenges in tackling multi-agent large-discrete-action-space (LDAS) problems, possibly resulting from large agent numbers. At each decision step, a multi-agent LDAS problem is often faced with an unaffordable number of candidate actions. Existing work has mainly tackled these challenges utilizing indirect approaches such as continuation relaxation and sub-sampling, which may lack solution quality guarantees from continuation to discretization. In this work, we propose to embed agreed priority conventions into reinforcement learning (PCRL) to directly tackle the microscopically sequenceable multi-agent LDAS problems. Priority conventions include position-based agent priority to break symmetries and prescribed action priority to break ties. In a microscopically sequenceable multi-agent problem, the centralized planner, at each decision step of the whole system, generates an action vector (each component of the vector is for an agent and is generated in a micro-step) by considering the conventions. The action vector is generated sequentially when microscopically viewed, and such generation will not miss the optimal action vector, and can help RL's exploitation around the lexicographic-smallest optimal action vector. Proper learning schemes and action-selection schemes have been designed to make the embedding reality. The effectiveness and superiority of PCRL have been validated by experiments on multi-agent applications, including the multi-agent complete coverage planning application (involving up to $4^{18}>6.8\times 10^{10}$ candidate actions at each decision step) and the cooperative pong game (state-based and pixel-based, respectively), showing PCRL's LDAS dealing ability and high optimality-finding ability than the joint-action RL methods and heuristic algorithms. Link » Xing Zhou · Hao Gao · Xin Xu · Xinglong Zhang · Hongda Jia · Dongzi Wang 🔗 - A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning (Poster)  link »    With the increasing need for handling large state and action spaces, general function approximation has become a key technique in reinforcement learning problems. In this paper, we propose a unified framework that integrates both model-based and model-free reinforcement learning and subsumes nearly all Markov decision process (MDP) models in the existing literature for tractable RL. We propose a novel estimation function with decomposable structural properties for optimization-based exploration and use the functional Eluder dimension with respect to an admissible Bellman characterization function as a complexity measure of the model class. Under our framework, a new sample-efficient algorithm namely OPtimization-based ExploRation with Approximation (OPERA) is proposed, achieving regret bounds that match or improve over the best-known results for a variety of MDP models. In particular, for MDPs with low Witness rank, under a slightly stronger assumption, OPERA improves the state-of-the-art sample complexity results by a factor of $dH$. Our framework provides a generic interface to study and design new RL models and algorithms. Link » Zixiang Chen · Chris Junchi Li · Angela Yuan · Quanquan Gu · Michael Jordan 🔗 - Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective (Poster)  link »    While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representation of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL on policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods. While such sample efficient methods typically are computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time. Link » Raj Ghugare · Homanga Bharadhwaj · Benjamin Eysenbach · Sergey Levine · Ruslan Salakhutdinov 🔗 - Value-based CTDE Methods in Symmetric Two-team Markov Game: from Cooperation to Team Competition (Poster)  link »    In this paper, we identify the best training scenario to train a team of agents to compete against multiple possible strategies of opposing teams.We restrict ourselves to the case of a symmetric two-team Markov game which is a competition between two symmetric teams.We evaluate cooperative value-based methods in a mixed cooperative-competitive environment.We selected three training methods based on the centralised training and decentralised execution (CTDE) paradigm: QMIX, MAVEN and QVMix.To train such teams, we modified the StarCraft Multi-Agent Challenge environment to create competitive scenarios where both teams could learn and compete simultaneously in a partially observable environment.For each method, we considered three learning scenarios differentiated by the variety of team policies encountered during training.Our results suggest that training against multiple evolving strategies achieves the best results when, for scoring their performances, teams are faced with several strategies, whether the stationary strategy is better than all trained teams or not. Link » Pascal Leroy · Jonathan Pisane · Damien Ernst 🔗 - Reinforcement Learning in System Identification (Poster)  link »    System identification, also known as learning forward models, transfer functions, system dynamics, etc., has a long tradition both in science and engineering in different fields. Particularly, it is a recurring theme in Reinforcement Learning research, where forward models approximate the state transition function of a Markov Decision Process by learning a mapping function from current state and action to the next state. This problem is commonly defined as a Supervised Learning problem in a direct way. This common approach faces several difficulties due to the inherent complexities of the dynamics to learn, for example, delayed effects, high non-linearity, non-stationarity, partial observability and, more important, error accumulation when using bootstrapped predictions (predictions based on past predictions), over large time horizons. Here we explore the use of Reinforcement Learning in this problem. We elaborate on why and how this problem fits naturally and sound as a Reinforcement Learning problem, and present some experimental results that demonstrate RL is a promising technique to solve these kind of problems. Link » Jose Martin H. · Óscar Fernandez Vicente · Sergio Perez · Anas Belfadil · Cristina Ibanez-Llano · Freddy Perozo Rondón · Jose Valle · Javier Arechalde Pelaz 🔗 - Robust Option Learning for Adversarial Generalization (Poster)  link »    Compositional reinforcement learning is a promising approach for training policies to perform complex long-horizon tasks. Typically, a high-level task is decomposed into a sequence of subtasks and a separate policy is trained to perform each subtask. In this paper, we focus on the problem of training subtask policies in a way that they can be used to perform any task; here, a task is given by a sequence of subtasks. We aim to maximize the worst-case performance over all tasks as opposed to the average-case performance. We formulate the problem as a two agent zero-sum game in which the adversary picks the sequence of subtasks. We propose two RL algorithms to solve this game: one is an adaptation of existing multi-agent RL algorithms to our setting and the other is an asynchronous version which enables parallel training of subtask policies. We evaluate our approach on two multi-task environments with continuous states and actions and demonstrate that our algorithms outperform state-of-the-art baselines. Link » Kishor Jothimurugan · Steve Hsu · Osbert Bastani · Rajeev Alur 🔗 - Biological Neurons vs Deep Reinforcement Learning: Sample efficiency in a simulated game-world (Poster)  link »    How do synthetic biological systems and artificial neural networks compete in their performance in a game environment? Reinforcement learning has undergone significant advances, however remains behind biological neural intelligence in terms of sample efficiency. Yet most biological systems are significantly more complicated than most algorithms. Here we compare the inherent intelligence of in vitro biological neuronal networks to state-of-the-art deep reinforcement learning algorithms in the arcade game 'pong'. We employed DishBrain, a system that embodies in vitro neural networks with in silico computation using a high-density multielectrode array. We compared the learning curve and the performance of these biological systems against time-matched learning from DQN, A2C, and PPO algorithms. Agents were implemented in a reward-based environment of the Pong' game. Key learning characteristics of the deep reinforcement learning agents were tested with those of the biological neuronal cultures in the same game environment. We find that even these very simple biological cultures typically outperform deep reinforcement learning systems in terms of various game performance characteristics, such as the average rally length implying a higher sample efficiency. Furthermore, the human cell cultures proved to have the overall highest relative improvement in the average number of hits in a rally when comparing the initial 5 minutes and the last 15 minutes of each designed gameplay session. Link » Forough Habibollahi · Moein Khajehnejad · Amitesh Gaurav · Brett J. Kagan 🔗 - Inducing Functions through Reinforcement Learning without Task Specification (Poster)  link »    We report a bio-inspired approach for training a neural network through reinforcement learning to induce high level functions within the network. Based on the interpretation that animals have gained their cognitive functions such as object recognition — without ever being specifically trained for — as a result of maximizing their fitness to the environment, we place our agent in a custom environment where developing certain functions may facilitate decision making; the custom environment is designed as a partially observable Markov decision process in which an input image and the initial value of hidden variables are given to the agent at each time step. We show that our agent, which consists of a convolutional neural network, a recurrent neural network, and a multilayer perceptron, learns to classify the input image and to predict the hidden variables. The experimental results show that high level functions, such as image classification and hidden variable estimation, can be naturally and simultaneously induced without any pre-training or specifying them. Link » Junmo Cho · Donghwan Lee · Young-Gyu Yoon 🔗 - Learning Successor Feature Representations to Train Robust Policies for Multi-task Learning (Poster)  link » The deep reinforcement learning (RL) framework has shown great promise to tackle sequential decision-making problems, where the agent learns to behave optimally through interactions with the environment and receiving rewards. The ability of an RL agent to learn different reward functions concurrently has many benefits, such as the decomposition of task rewards and promoting skill reuse. In this paper, we consider the problem of continuous control for robot manipulation tasks with an explicit representation that promotes skill reuse while learning multiple tasks with similar reward functions. Our approach relies on two key concepts: successor features (SFs), a value function representation that decouples the dynamics of the environment from the rewards, and an actor-critic framework that incorporates the learned SFs representation.SFs form a natural bridge between model-based and model-free RL methods. We first show how to learn a decomposable representation required by SFs as a pre-training stage. The proposed architecture is able to learn decoupled state and reward feature representations for non-linear reward functions. We then evaluate the feasibility of integrating SFs into an actor-critic framework, which is more tailored for tasks solved with deep RL algorithms. The approach is empirically tested on non-trivial continuous control problems with compositional structure built into the reward functions of the tasks. Link » Melissa Mozifian · Dieter Fox · David Meger · Fabio Ramos · Animesh Garg 🔗 - Automated Dynamics Curriculums for Deep Reinforcement Learning (Poster)  link » Humans often make the dynamics of a task easier (e.g. using training wheels on a bicycle or a large voluminous surfboard) when first learning a skill before tackling the full task with more difficult dynamics (riding a bike without training wheels, surfing a smaller board). This can be thought of as a form of curriculum learning. However, this is not the paradigm currently used for training agents using reinforcement learning (RL). In many cases, agents are thrown into the final environment, and must learn a policy from scratch in the context of the final dynamics. While previous work on curriculum learning for deep RL has sought to address this problem by changing the tasks agents are solving, or the starting position of the agent, no work has derived a curriculum by modifying the dynamics of the final environment. Here, we study using assist - simplifying task dynamics - to accelerate and improve the learning process for RL agents. First, we modify the physics of theLunarLander-v2 and FetchReach-v1 environments to allow us to adjust the amount of assist provided with a single parameter $\alpha$, which scales the amount which an agent is nudged and hence assisted towards a known end goal during training. We then show that we can automatically learn schedules for assist using a population based training approach that results in faster agent convergence on the evaluation environment without any assist, and better performance across continuous control tasks using state of the art policy gradient algorithms (proximal policy optimization). We show that our method can also scale to off policy methods such as Deep Deterministic Policy Gradients. Furthermore, we show that for tasks with sparse rewards, assist is critical to agent learning as it allows exploration of high-reward areas and use of algorithms that fail to learn the task without assist. We also uncover that population based tuning approaches stabilize training of policy gradients without tuning of any additional hyperparameters. Link » Sean Metzger 🔗 - Supervised Q-Learning for Continuous Control (Poster)  link »    Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms rely on exploiting the value function being learned with the first-order update locally, which results in limited sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently operating argmax in continuous action space. It finds max-valued action within a small number of samples. The policy learning of ZOSPI has two steps: First, it samples actions and evaluates those actions with a learned value estimator, and then it learns to perform the action with the highest value through supervised learning. We further demonstrate such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on the continuous control benchmarks with a remarkable sample efficiency. Link » Hao Sun · Ziping Xu · Taiyi Wang · Meng Fang · Bolei Zhou 🔗 - MOPA: a Minimalist Off-Policy Approach to Safe-RL (Poster)  link »    Safety is one of the crucial concerns for the real-world application of reinforcement learning (RL). Previous works consider the safe exploration problem as Constrained Markov Decision Process (CMDP), where the policies are being optimized under constraints. However, when encountering any potential danger, human tends to stop immediately and rarely learns to behave safely in danger. Moreover, the off-policy learning nature of humans guarantees high learning efficiency in risky tasks. Motivated by human learning, we introduce a Minimalist Off-Policy Approach (MOPA) to address Safe-RL problem. We first define the Early Terminated MDP (ET-MDP) as a special type of MDPs that has the same optimal value function as its CMDP counterpart. An off-policy learning algorithm MOPA based on recurrent models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP. Experiments on various Safe-RL tasks show a substantial improvement over previous methods that directly solve CMDP, in terms of higher asymptotic performance and better learning efficiency. Link » Hao Sun · Ziping Xu · Zhenghao Peng · Meng Fang · Bo Dai · Bolei Zhou 🔗 - Novel Policy Seeking with Constrained Optimization (Poster)  link »    In problem-solving, we humans tend to come up with different novel solutions to the same problem. However, conventional reinforcement learning algorithms ignore such a feat and only aim at producing a set of monotonous policies that maximize the cumulative reward. The resulting policies usually lack diversity and novelty. In this work, we aim at enabling the learning algorithms with the capacity of solving the task with multiple solutions through a practical novel policy generation workflow that can generate a set of diverse and well-performing policies. Specifically, we begin by introducing a new metric to evaluate the difference between policies. On top of this well-defined novelty metric, we propose to rethink the novelty-seeking problem through the lens of constrained optimization, to address the dilemma between the task performance and the behavioral novelty in existing multi-objective optimization approaches, we then propose a practical novel policy-seeking algorithm, Interior Policy Differentiation (IPD), which is derived from the interior point method commonly known in the constrained optimization literature. Experimental comparisons on benchmark environments show IPD can achieve a substantial improvement over previous novelty-seeking methods in terms of both novelties of generated policies and their performances in the primal task. Link » Hao Sun · Zhenghao Peng · Bolei Zhou 🔗 - Toward Causal-Aware RL: State-Wise Action-Refined Temporal Difference (Poster)  link »    Although it is well known that exploration plays a key role in Reinforcement Learning (RL), prevailing exploration strategies for continuous control tasks in RL are mainly based on naive isotropic Gaussian noise regardless of the causality relationship between action space and the task and consider all dimensions of actions equally important. In this work, we propose to conduct interventions on the primal action space to discover the causal relationship between the action space and the task reward. We propose the method of State-Wise Action Refined (SWAR), which addresses the issue of action space redundancy and promote causality discovery in RL. We formulate causality discovery in RL tasks as a state-dependent action space selection problem and propose two practical algorithms as solutions. The first approach, TD-SWAR, detects task-related actions during temporal difference learning, while the second approach, Dyn-SWAR, reveals important actions through dynamic model prediction. Empirically, both methods provide approaches to understand the decisions made by RL agents and improve learning efficiency in action-redundant tasks. Link » Hao Sun · Taiyi Wang 🔗

#### Author Information

##### Risto Vuorio (University of Oxford)

I'm a PhD student in WhiRL at University of Oxford. I'm interested in reinforcement learning and meta-learning.