
Workshop
Deep Reinforcement Learning
Pieter Abbeel · Chelsea Finn · David Silver · Matthew Taylor · Martha White · Srijita Das · Yuqing Du · Andrew Patterson · Manan Tomar · Olivia Watkins

Mon Dec 13 08:55 AM -- 06:00 PM (PST)

In recent years, the use of deep neural networks as function approximators has enabled researchers to extend reinforcement learning techniques to solve increasingly complex control tasks. The emerging field of deep reinforcement learning has led to remarkable empirical results in rich and varied domains like robotics, strategy games, and multiagent interactions. This workshop will bring together researchers working at the intersection of deep learning and reinforcement learning, and it will help interested researchers outside of the field gain perspective about the current state of the art and potential directions for future contributions.

 Mon 8:55 a.m. - 9:00 a.m. Welcome and Introduction (Welcoming Notes) 🔗 Mon 9:00 a.m. - 9:12 a.m. Implicit Behavioral Cloning (Oral) []   link »    We find that across a wide range of robot policy learning scenarios, treating supervised policy learning with an implicit model generally performs better, on average, than commonly used explicit models. We present extensive experiments on this finding, and we provide both intuitive insight and theoretical arguments distinguishing the properties of implicit models compared to their explicit counterparts, particularly with respect to approximating complex, potentially discontinuous and multi-valued (set-valued) functions. On robotic policy learning tasks we show that implicit behavioral cloning policies with energy-based models (EBM) often outperform common explicit (Mean Square Error, or Mixture Density) counterparts, including on tasks with high-dimensional action spaces and visual image inputs. We find these policies provide competitive results or outperform state-of-the-art offline reinforcement learning methods on the challenging human-expert tasks from the D4RL benchmark suite, despite using no reward information. In the real world, robots with implicit policies can learn complex and remarkably subtle behaviors on contact-rich tasks from human demonstrations, including tasks with high combinatorial complexity and tasks requiring 1mm precision. Link » Pete Florence · Corey Lynch · Andy Zeng · Oscar Ramirez · Ayzaan Wahid · Laura Downs · Adrian Wong · Igor Mordatch · Jonathan Tompson 🔗 Mon 9:12 a.m. - 9:15 a.m. Implicit Behavioral Cloning Q&A (Q&A) Pete Florence · Corey Lynch · Andy Zeng · Oscar Ramirez · Ayzaan Wahid · Laura Downs · Adrian Wong · Igor Mordatch · Jonathan Tompson 🔗 Mon 9:15 a.m. - 9:27 a.m. DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization (Oral) []   link »    Despite overparameterization, deep networks trained via supervised learning are surprisingly easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. 
In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive aliasing, in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains, and robotic manipulation from images. Link » Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine 🔗 Mon 9:27 a.m. - 9:30 a.m. DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization Q&A (Q&A) Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine 🔗 Mon 9:30 a.m. - 9:42 a.m. HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation (Oral) []   link »    Discrete-continuous hybrid action space is a natural setting in many practical problems, such as robot control and game AI. However, most previous Reinforcement Learning (RL) works only demonstrate success in controlling with either discrete or continuous action space, while seldom taking into account the hybrid action space. One naive way to address hybrid action RL is to convert the hybrid action space into a unified homogeneous action space by discretization or continualization, so that conventional RL algorithms can be applied. 
However, this ignores the underlying structure of hybrid action space and also induces scalability issues and additional approximation difficulties, thus leading to degenerate results. In this paper, we propose Hybrid Action Representation (HyAR) to learn a compact and decodable latent representation space for the original hybrid action space. HyAR constructs the latent space and embeds the dependence between discrete action and continuous parameter via an embedding table and conditional Variational Auto-Encoder (VAE). To further improve the effectiveness, the action representation is trained to be semantically smooth through unsupervised environmental dynamics prediction. Finally, the agent learns its policy with conventional DRL algorithms in the learned representation space and interacts with the environment by decoding the hybrid action embeddings to the original action space. We evaluate HyAR in a variety of environments with discrete-continuous action space. The results demonstrate the superiority of HyAR when compared with previous baselines, especially for high-dimensional action spaces. Link » Boyan Li · Hongyao Tang · YAN ZHENG · Jianye Hao · Pengyi Li · Zhaopeng Meng · LI Wang 🔗 Mon 9:42 a.m. - 9:45 a.m. HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation Q&A (Q&A) Boyan Li · Hongyao Tang · YAN ZHENG · Jianye Hao · Pengyi Li · Zhaopeng Meng · LI Wang 🔗 Mon 9:45 a.m. - 9:57 a.m. Benchmarking the Spectrum of Agent Capabilities (Oral) []   link »    Evaluating the general abilities of intelligent agents requires complex simulation environments. Existing benchmarks typically evaluate only one narrow task per environment, requiring researchers to perform expensive training runs on many different environments. We introduce Crafter, an open world survival game with visual inputs that evaluates a wide range of general abilities within a single environment. Agents either learn from the provided reward signal or through intrinsic objectives and are evaluated by semantically meaningful achievements that can be unlocked during each episode, such as discovering resources and crafting tools. 
Consistently unlocking all achievements requires strong generalization, deep exploration, and long-term reasoning. We experimentally verify that Crafter is of appropriate difficulty to drive future research and provide baseline scores of reward agents and unsupervised agents. Furthermore, we observe sophisticated behaviors emerging from maximizing the reward signal, such as building tunnel systems, bridges, houses, and plantations. We hope that Crafter will accelerate research progress by quickly evaluating a wide spectrum of abilities. Link » Danijar Hafner 🔗 Mon 9:57 a.m. - 10:00 a.m. Benchmarking the Spectrum of Agent Capabilities Q&A (Q&A) Danijar Hafner 🔗 Mon 10:00 a.m. - 10:25 a.m. 
Invited Talk: Laura Schulz - In praise of folly: Goals, play, and human cognition (Talk)    Work on autonomous agents in AI, robotics, and machine learning is often explicitly inspired by comparisons with children's play, and computational researchers and developmentalists alike tend to assume that play is rewarding because through play, agents can reduce uncertainty, increase expected information gain, and improve their predictive models of the world. I will review some developmental research consistent with this picture -- and then suggest that this account fails to capture much of what is distinctive about human play. I note that, rather than being characterized by progress towards rational goal-directed action, play typically involves manipulated utility functions, in which people willingly incur unnecessary costs to achieve arbitrary rewards. I will suggest that such "pretend" utilities may be critical not because they support better predictive models of the world but because they support ideas and plans that can change the world. That is, I propose that the reward value of play for humans is not about learning but about thinking. Laura Schulz 🔗 Mon 10:25 a.m. - 10:30 a.m. Laura Schulz Talk Q&A (Q&A) Laura Schulz 🔗 Mon 10:30 a.m. - 11:00 a.m. Break 🔗 Mon 11:00 a.m. - 11:25 a.m. Opinion Contributed Talk: Wilka Carvalho (Talk) Wilka Carvalho 🔗 Mon 11:25 a.m. - 11:30 a.m. Wilka Carvalho Talk Q&A (Q&A) Wilka Carvalho 🔗 Mon 11:30 a.m. - 11:42 a.m. Adaptive Scheduling of Data Augmentation for Deep Reinforcement Learning (Oral) []   link »    We consider data augmentation techniques to improve data efficiency and generalization performance in reinforcement learning (RL). Our empirical study on OpenAI Procgen shows that the timing of augmentation is critical, and to maximize test performance, an augmentation needs to be applied either during the entire RL training, or after the end of RL training. 
More specifically, if the regularization imposed by augmentation is helpful only in testing, it is better, in terms of sample and computation complexity, to postpone the augmentation until after training than to use it during training, since such an augmentation often disturbs the training process. Conversely, an augmentation providing regularization useful in training needs to be used during the whole training period to fully utilize its benefit in terms of not only generalization but also data efficiency. Based on our findings, we propose a mechanism to fully exploit a set of augmentations, which identifies an augmentation (including no augmentation) to maximize RL training performance, and then utilizes all the augmentations by network distillation to maximize test performance. Our experiment empirically justifies the proposed method compared to other automatic augmentation mechanisms. Link » Byungchan Ko · Jungseul Ok 🔗 Mon 11:42 a.m. - 11:45 a.m. Adaptive Scheduling of Data Augmentation for Deep Reinforcement Learning Q&A (Oral) Byungchan Ko · Jungseul Ok 🔗 Mon 11:45 a.m. - 11:57 a.m. Offline Meta-Reinforcement Learning with Online Self-Supervision (Oral) []   link »    Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adapting, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We do not want to remove this distributional shift by simply adopting a conservative exploration strategy, because learning an exploration strategy enables an agent to collect better data for faster adaptation. Instead, we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. 
Because online collection requires no reward labels, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks. Link » Vitchyr Pong · Ashvin Nair · Laura Smith · Catherine Huang · Sergey Levine 🔗 Mon 11:57 a.m. - 12:00 p.m. Offline Meta-Reinforcement Learning with Online Self-Supervision Q&A (Q&A) Vitchyr Pong · Ashvin Nair · Laura Smith · Catherine Huang · Sergey Levine 🔗 Mon 12:00 p.m. - 12:25 p.m. Invited Talk: George Konidaris - Signal to Symbol (via Skills) (Talk)    I will discuss a route to general AI where a general-purpose agent (which must have a complex, high-dimensional sensorimotor space) first autonomously learns abstract, task-specific representations - that reflect the complexity of the particular task the agent is currently solving, and not the agent itself - and then applies an appropriate generic solution method to the resulting abstract task. I will argue that such a representation can be learned via a combination of state- and action-abstractions. I will present my group's recent progress on learning abstract actions in the form of high-level options or skills. I will then consider the question of how to learn a compatible abstract state representation, taking a constructivist approach, where the computation the representation is required to support - here, planning using a set of (learned or given) skills - is precisely defined, and then its properties are used to build a representation capable of doing so by construction. The result is a formal link between state and action abstractions. 
I will present an example of a robot autonomously learning a (sound and complete) abstract representation directly from sensorimotor data, and then using it to plan. George Konidaris 🔗 Mon 12:25 p.m. - 12:30 p.m. George Konidaris Talk Q&A (Q&A) George Konidaris 🔗 Mon 12:30 p.m. - 1:30 p.m. Poster Session (in Gather Town) (Poster Session) https://eventhosts.gather.town/GrQnG6tF9h3vCzc4/drl-room-1 [Imitation Learning, Offline RL, and multi-agent RL] https://eventhosts.gather.town/s29R5zpCnjeoXGeE/drl-room-2 [Optimization and Architecture Design] https://eventhosts.gather.town/aP9gDWlxGZS6LK9H/drl-room-3 [Skills, multitask, meta-learning, and generalization] https://eventhosts.gather.town/I7ABtQn714hO5Z17/drl-room-4 [Representation, model-based, exploration, hierarchy, benchmarks] NOTE: not all papers fit cleanly into these categories 🔗 Mon 1:30 p.m. - 1:55 p.m. Opinion Contributed Talk: Sergey Levine (Talk) Sergey Levine 🔗 Mon 1:55 p.m. - 2:00 p.m. Sergey Levine Talk Q&A (Q&A) Sergey Levine 🔗 Mon 2:00 p.m. - 2:30 p.m. Panel Discussion 1 (Panel Discussion) 🔗 Mon 2:30 p.m. - 2:55 p.m. Invited Talk: Dale Schuurmans - Understanding Deep Value Estimation (Talk)    Estimating long term returns given short data trajectories remains a core technique in deep reinforcement learning. Remarkably, deep reinforcement learning in-the-wild often succeeds even when theoretical assumptions needed to guarantee good performance are neglected. I will discuss two recent investigations that shed some light on this phenomenon. First, I will discuss some findings about the implicit biases embodied by different value estimation algorithms, and why apparently unsound methods can still exhibit generalization advantages. Then I will discuss some recent ideas about how the risk of self-delusion in value estimation can be reduced through temporal grounding. These observations do not close the investigation, but do offer alternative prospects for improving deep value estimation in practice. 
Dale Schuurmans 🔗 Mon 2:55 p.m. - 3:00 p.m. Dale Schuurmans Talk Q&A (Q&A) Dale Schuurmans 🔗 Mon 3:00 p.m. - 3:30 p.m. Break 🔗 Mon 3:30 p.m. - 3:57 p.m. Invited Talk: Karol Hausman - Reinforcement Learning as a Data Sponge (Talk)    Modern supervised learning methods have shown that utilizing large amounts of diverse data can lead to remarkable results, especially when it comes to generalization. On the other hand, reinforcement learning approaches, while powerful, remain fairly limited in terms of the diversity of data they can effectively utilize, making it difficult to apply them to diverse, open-ended settings such as real-world robotics. In this talk, I'll discuss the steps we are taking towards making reinforcement learning a better "data sponge" and how they allow us to apply such methods to real robots at scale. Karol Hausman 🔗 Mon 3:55 p.m. - 4:00 p.m. Karol Hausman Talk Q&A (Q&A) Karol Hausman 🔗 Mon 4:00 p.m. - 4:30 p.m. NeurIPS RL Competitions Results Presentations (Presentations) Rohin Shah · Liam Paull · Tabitha Lee · Tim Rocktäschel · Heinrich Küttler · Sharada Mohanty · Manuel Wuethrich 🔗 Mon 4:30 p.m. - 4:55 p.m. Invited Talk: Kenji Doya - Natural and Artificial Reinforcement Learning (Talk)    Reinforcement learning started from an analogy of how animals learn behaviors from reward and punishment, and has made remarkable progress with its successful combination with deep neural networks. We will first look into the brain’s mechanisms for reinforcement learning and then some recent advances in reinforcement learning theory and algorithms. We will finally discuss how recent theoretical advances may help deepen our understanding of the brain's mechanisms of reinforcement learning. Kenji Doya 🔗 Mon 4:55 p.m. - 5:00 p.m. Kenji Doya Talk Q&A (Q&A) Kenji Doya 🔗 Mon 5:00 p.m. - 6:00 p.m. 
Panel Discussion 2 (Panel Discussion) 🔗 - Self-Imitation Learning from Demonstrations (Poster) []   link »    Despite the numerous breakthroughs achieved with Reinforcement Learning (RL), solving environments with sparse rewards remains a challenging task that requires sophisticated exploration. Learning from Demonstrations (LfD) remedies this issue by guiding the agent’s exploration towards states experienced by an expert. Naturally, the benefits of this approach hinge on the quality of demonstrations, which are rarely optimal in realistic scenarios. Modern LfD algorithms lack provable robustness to suboptimal demonstrations and introduce additional hyperparameters to control the influence of demonstrations. To address these issues, we extend Self-Imitation Learning (SIL), a recent RL algorithm that exploits the agent’s past good experience, to the LfD setup by initializing its replay buffer with demonstrations. We denote our algorithm as SIL from Demonstrations (SILfD). Our theoretical analysis highlights that SILfD is safe to apply to demonstrations of any degree of suboptimality and automatically adjusts the influence of demonstrations throughout the training. Our empirical investigation shows the superiority of SIL over existing LfD algorithms in settings of suboptimal demonstrations and sparse rewards. Link » Georgiy Pshikhachev · Dmitry Ivanov · Vladimir Egorov · Aleksei Shpilman 🔗 - Understanding and Preventing Capacity Loss in Reinforcement Learning (Poster) []   link » The reinforcement learning (RL) problem is rife with sources of non-stationarity that can destabilize or inhibit learning progress. We identify a key mechanism by which this occurs in agents using neural networks as function approximators: capacity loss, whereby networks trained to predict a sequence of target values lose their ability to quickly fit new functions over time. 
We demonstrate that capacity loss occurs in a broad range of RL agents and environments, and is particularly damaging to learning progress in sparse-reward tasks. We then present a simple regularizer, Initial Feature Regularization (InFeR), that mitigates this phenomenon by regressing a subspace of features towards its value at initialization, improving performance over a state-of-the-art model-free algorithm in the Atari 2600 suite. Finally, we study how this regularization affects different notions of capacity and evaluate other mechanisms by which it may improve performance. Link » Clare Lyle · Mark Rowland · Will Dabney 🔗 - Variance-Seeking Meta-Exploration to Handle Out-of-Distribution Tasks (Poster) []   link » Meta-Reinforcement Learning (meta-RL) has the potential to improve the sample efficiency of reinforcement learning algorithms. Through training an agent on multiple meta-RL tasks, the agent is able to learn a policy based on past experience, and leverage this to solve new, unseen tasks. Accordingly, meta-RL promises to solve real-world problems, such as real-time heating, ventilation and air-conditioning (HVAC) control without accurate simulators of the target building. In this paper, we propose a meta-RL method which trains an agent on first order models to efficiently learn and adapt to the internal dynamics of a real-world building. We recognise that meta-agents trained on first order simulator models do not perform well on second order models, owing to the meta-RL assumption that the test tasks should be from within the same distribution as the training tasks. 
In response, we propose a novel exploration method called variance seeking meta-exploration which enables a meta-RL agent to perform well on complex tasks outside of its training distribution. Our method programs the agent to prefer exploring task-dependent state-action pairs, and in turn, allows it to adapt efficiently to challenging second order models which bear greater resemblance to real-world problems. Link » Yashvir Singh Grewal · Sarah Goodwin 🔗 - A Closer Look at Gradient Estimators with Reinforcement Learning as Inference (Poster) []   link »    The concept of reinforcement learning as inference (RLAI) has led to the creation of a variety of popular algorithms in deep reinforcement learning. Unfortunately, most research in this area relies on wider algorithmic innovations not necessarily relevant to such frameworks. Additionally, many seemingly unimportant modifications made to these algorithms actually produce inconsistencies with the original inference problem posed by RLAI. Taking a divergence minimization perspective, this work considers some of the practical merits and theoretical issues created by the choice of loss function minimized in the policy update in off-policy reinforcement learning. Our results show that while the choice of divergence rarely has a major effect on the sample efficiency of the algorithm, it can have important practical repercussions on ease of implementation, computational efficiency, and restrictions to the distribution over actions. Link » Wilder Lavington · Michael Teng · Mark Schmidt · Frank Wood 🔗 - From One Hand to Multiple Hands: Imitation Learning for Dexterous Manipulation from Single-Camera Teleoperation (Poster) []   link »    We introduce a novel single-camera teleoperation system for learning dexterous manipulation. Our system allows human operators to collect 3D demonstrations efficiently with only an iPad and a computer. 
These demonstrations are then used for imitation learning on complex multi-finger robot hand manipulation tasks. One key contribution of our system is that we construct a customized robot hand for each user in the physical simulator, which is a manipulator with the same kinematic structure and shape as the operator's hand. This not only avoids unstable human-robot hand retargeting during data collection, but also provides a more intuitive and personalized interface for different users to operate on. Once the data collection is done, the customized robot hand trajectories can be converted to different specified robot hands (models that are manufactured and commercialized) to generate training demonstrations. Using the data collected on the customized hand, our imitation learning results show a large improvement over pure RL on multiple specified robot hands. Link » Yuzhe Qin · Hao Su · Xiaolong Wang 🔗 - Attention-based Partial Decoupling of Policy and Value for Generalization in Reinforcement Learning (Poster) []   link »    In this work, we introduce Attention-based Partially Decoupled Actor-Critic (APDAC), an actor-critic architecture for generalization in reinforcement learning, which partially separates the policy and the value function. To learn directly from images, traditional actor-critic architectures use a shared network to represent the policy and value function. While a shared representation for policy and value allows parameter and feature sharing, it can also lead to overfitting that catastrophically hurts generalization performance. On the other hand, two separate networks for policy and value can help to avoid overfitting and reduce the generalization gap, but at the cost of added complexity both in terms of architecture design and hyperparameter tuning. APDAC provides an intermediate tradeoff that combines the strengths of both architectures by sharing the initial part of the network and separating the later parts for policy and value. 
It also incorporates an attention mechanism to propagate relevant features to the separate policy and value blocks. Our empirical analysis shows that APDAC significantly outperforms the PPO baseline and achieves comparable performance with respect to the recent state-of-the-art method IDAAC on the challenging RL generalization benchmark Procgen. Link » Nasik Nafi · Creighton Glasscock · Bill Hsu 🔗 - Imitation Learning from Observations under Transition Model Disparity (Poster) []   link » Learning to perform tasks by leveraging a dataset of expert observations, also known as imitation learning from observations (ILO), is an important paradigm for learning skills without access to the expert reward function or the expert actions. We consider ILO in the setting where the expert and the learner agents operate in different environments, with the source of the discrepancy being the transition dynamics model. Recent methods for scalable ILO utilize adversarial learning to match the state-transition distributions of the expert and the learner, an approach that becomes challenging when the dynamics are dissimilar. In this work, we propose an algorithm that trains an intermediary policy in the learner environment and uses it as a surrogate expert for the learner. The intermediary policy is learned such that the state transitions generated by it are close to the state transitions in the expert dataset. To derive a practical and scalable algorithm, we employ concepts from prior work on estimating the support of a probability distribution. Experiments using MuJoCo locomotion tasks highlight that our method compares favorably to the baselines for ILO with transition dynamics mismatch. 
Link » Tanmay Gangwani · Yuan Zhou · Jian Peng 🔗 - Vision-Guided Quadrupedal Locomotion in the Wild with Multi-Modal Delay Randomization (Poster) []   link » Developing robust vision-guided controllers for quadrupedal robots in complex environments, with various obstacles, dynamic surroundings and uneven terrains, is very challenging. While Reinforcement Learning (RL) provides a promising paradigm for agile locomotion skills with vision inputs in simulation, it is still very challenging to deploy the RL policy in the real world. Our key insight is that aside from the domain gap in visual appearance between the simulation and the real world, the latency from the control pipeline is also a major cause of difficulty. In this paper, we propose Multi-Modal Delay Randomization (MMDR) to address this issue when training RL agents. Specifically, we simulate the latency of real hardware by using past observations, sampled with randomized periods, for both proprioception and vision. We train the RL policy for end-to-end control in a physical simulator without any predefined controller or reference motion, and directly deploy it on the real A1 quadruped robot running in the wild. We evaluate our method in different outdoor environments with complex terrains and obstacles. We demonstrate the robot can smoothly maneuver at a high speed, avoid the obstacles, and show significant improvement over the baselines. Link » Minghao Zhang · Ruihan Yang · Yuzhe Qin · Xiaolong Wang 🔗 - Learning from demonstrations with SACR2: Soft Actor-Critic with Reward Relabeling (Poster) []   link »    In recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. However, a well-known caveat of DRL algorithms is their inefficiency, requiring huge amounts of data to converge. 
Off-policy algorithms tend to be more sample-efficient, and can additionally benefit from any off-policy data stored in the replay buffer. Expert demonstrations are a popular source for such data: the agent is exposed to successful states and actions early on, which can accelerate the learning process and improve performance. In the past, multiple ideas have been proposed to make good use of the demonstrations in the buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We carry out a study to evaluate several of these ideas in isolation, to see which of them have the most significant impact. We also present a new method, based on a reward bonus given to demonstrations and successful episodes. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. The base algorithm for our experiments is the popular Soft Actor-Critic (SAC), a state-of-the-art off-policy algorithm for continuous action spaces. Our experiments focus on robotics, specifically on a reaching task for a robotic arm in simulation. We show that our method SACR2 based on reward relabeling improves the performance on this task, even in the absence of demonstrations. Link » Jesús Bujalance Martín · Raphael Chekroun · Fabien Moutarde 🔗 - Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification (Poster) []   link »    The idea of conservatism has led to significant progress in offline reinforcement learning (RL) where an agent learns from pre-collected datasets. However, it is still an open question to resolve offline RL in the more practical multi-agent setting as many real-world scenarios involve interaction among multiple agents. 
Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer to multi-agent settings directly. Surprisingly, when conservatism-based algorithms are applied to the multi-agent setting, the performance degrades significantly with an increasing number of agents. Towards mitigating the degradation, we identify a key issue: the landscape of the value function can be non-concave, and policy gradient improvements are prone to local optima. Multiple agents exacerbate the problem since a suboptimal policy of any agent could lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), to tackle this critical challenge via an effective combination of first-order policy gradient and zeroth-order optimization methods for the actor to better optimize the conservative value function. Despite the simplicity, OMAR significantly outperforms strong baselines with state-of-the-art performance in multi-agent continuous control benchmarks. Link » Ling Pan · Longbo Huang · Tengyu Ma · Huazhe Xu 🔗 - Generalisation in Lifelong Reinforcement Learning through Logical Composition (Poster) []   link »    We leverage logical composition in reinforcement learning to create a framework that enables an agent to autonomously determine whether a new task can be immediately solved using its existing abilities, or whether a task-specific skill should be learned. In the latter case, the proposed algorithm also enables the agent to learn the new task faster by generating an estimate of the optimal policy. 
Importantly, we provide two main theoretical results: we give bounds on the performance of the transferred policy on a new task, and we give bounds on the necessary and sufficient number of tasks that need to be learned throughout an agent's lifetime to generalise over a distribution. We verify our approach in a series of experiments, where we perform transfer learning both after learning a set of base tasks, and after learning an arbitrary set of tasks. We also demonstrate that as a side effect of our transfer learning approach, an agent can produce an interpretable Boolean expression of its understanding of the current task. Finally, we demonstrate our approach in the full lifelong setting where an agent receives tasks from an unknown distribution and, starting from zero skills, is able to quickly generalise over the task distribution after learning only a few tasks, a number that is sub-logarithmic in the size of the task space. Link » Geraud Nangue Tasse · Steven James · Benjamin Rosman 🔗 - DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations (Poster) []   link »    Top-performing Model-Based Reinforcement Learning (MBRL) agents, such as Dreamer, learn the world model by reconstructing the image observations. Hence, they often fail to discard task-irrelevant details and struggle to handle visual distractions. To address this issue, previous work has proposed to contrastively learn the world model, but the performance tends to be inferior in the absence of distractions. In this paper, we seek to enhance robustness to distractions for MBRL agents by learning better representations in the world model. For this, prototypical representations seem to be a good candidate, as they have yielded more accurate and robust results than contrastive approaches in computer vision. 
However, it remains elusive how prototypical representations can benefit temporal dynamics learning in MBRL, since they treat each image independently without capturing temporal structures. To this end, we propose to learn the prototypes from the recurrent states of the world model, thereby distilling temporal structures from past observations and actions into the prototypes. The resulting model, DreamerPro, successfully combines Dreamer with prototypes, making large performance gains on the DeepMind Control suite both in the standard setting and when there are complex background distractions. Link » Fei Deng · Ingook Jang · Sungjin Ahn 🔗 - Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates (Poster) []   link »    Temporal-Difference (TD) learning methods, such as Q-Learning, have proven effective at learning a policy to perform control tasks. One issue with methods like Q-Learning is that the value update introduces bias when predicting the TD target of an unfamiliar state. Estimation noise becomes a bias after the max operator in the policy improvement step, and carries over to value estimations of other states, causing Q-Learning to overestimate the Q value. Algorithms like Soft Q-Learning (SQL) introduce the notion of a soft-greedy policy, which reduces the estimation bias via soft updates in early stages of training. However, the inverse temperature $\beta$ that controls the softness of an update is usually set by a hand-designed heuristic, which can be inaccurate at capturing the uncertainty in the target estimate. Under the belief that $\beta$ is closely related to the (state dependent) model uncertainty, Entropy Regularized Q-Learning (EQL) further introduces a principled scheduling of $\beta$ by maintaining a collection of the model parameters that characterizes model uncertainty. 
In this paper, we present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state-space Markov Decision Processes. We also provide a principled numerical scheduling of $\beta$, extended from SQL and using model uncertainty, during the optimization process. We show the theoretical guarantees and the effectiveness of this update method in experiments on several discrete control environments. Link » Litian Liang · Yaosheng Xu · Stephen McAleer · Dailin Hu · Alexander Ihler · Pieter Abbeel · Roy Fox 🔗 - Improving Actor-Critic Reinforcement Learning via Hamiltonian Monte Carlo Method (Poster) []   link » Actor-critic RL is widely used in various robotic control tasks. By viewing actor-critic RL from the perspective of variational inference (VI), the policy network is trained to obtain the approximate posterior of actions given the optimality criteria. However, in practice, actor-critic RL may yield suboptimal policy estimates due to the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate the policy network of actor-critic RL with HMC, which we term Hamiltonian Policy. As such we propose to evolve actions from the base policy according to HMC, and our proposed method has many benefits. First, HMC can improve the policy distribution to better approximate the posterior and hence reduce the amortization gap. Second, HMC can also guide the exploration more to the regions of action spaces with higher Q values, enhancing the exploration efficiency. Further, instead of directly applying HMC to RL, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. Finally, in safe RL problems, we find that the proposed method can not only improve the achieved return, but also reduce safety constraint violations by discarding potentially unsafe actions. 
With comprehensive empirical experiments on continuous control baselines, including MuJoCo and PyBullet Roboschool, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous actor-critic methods. Code is available online. Link » Duo XU 🔗 - Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers (Poster) []   link »    We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method that leverages both proprioceptive states and visual observations for locomotion control. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We transfer our learned policy from simulation to a real robot by running it indoors and in the wild with unseen obstacles and terrain. Our method not only significantly improves over baselines, but also achieves far better generalization performance, especially when transferred to the real robot. Our project page with videos is at https://LocoTransformer.github.io/ . 
Link » Ruihan Yang · Minghao Zhang · Nicklas Hansen · Huazhe Xu · Xiaolong Wang 🔗 - Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation (Poster) []   link » Learning to solve precision-based manipulation tasks from visual feedback using Reinforcement Learning (RL) could drastically reduce the engineering efforts required by traditional robot systems. However, performing fine-grained motor control from visual inputs alone is challenging, especially with a static third-person camera as often used in previous work. We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist. While the third-person camera is static, the egocentric camera enables the robot to actively control its vision to aid in precise manipulation. To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism that models spatial attention from one view to another (and vice-versa), and use the learned features as input to an RL policy. Our method improves learning over strong single-view and multi-view baselines, and successfully transfers to a set of challenging manipulation tasks on a real robot with uncalibrated cameras, no access to state information, and a high degree of task variability. In a hammer manipulation task, our method succeeds in 75% of trials versus 38% and 13% for multi-view and single-view baselines, respectively. Link » Rishabh Jangir · Nicklas Hansen · Xiaolong Wang 🔗 - Learning Value Functions from Undirected State-only Experience (Poster) []   link »    This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e., (s, s', r) tuples). We first theoretically characterize the applicability of Q-learning in this setting. 
We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space. This theoretical result motivates the design of Latent Action Q-learning or LAQ, an offline RL method that can learn effective value functions from state-only experience. Latent Action Q-learning (LAQ) learns value functions using Q-learning on discrete latent actions obtained through a latent-variable future prediction model. We show that LAQ can recover value functions that have high correlation with value functions learned using ground truth actions. Value functions learned using LAQ lead to sample efficient acquisition of goal-directed behavior, can be used with domain-specific low-level controllers, and facilitate transfer across embodiments. Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods. Link » Matthew Chang · Arjun Gupta · Saurabh Gupta 🔗 - Target Entropy Annealing for Discrete Soft Actor-Critic (Poster) []   link »    Soft Actor-Critic (SAC) is considered the state-of-the-art algorithm in continuous action space settings. It uses the maximum entropy framework for efficiency and stability, and applies a heuristic temperature Lagrange term to tune the temperature $\alpha$, which determines how "soft" the policy should be. It is counter-intuitive that empirical evidence shows SAC does not perform well in discrete domains. In this paper we investigate the possible explanations for this phenomenon and propose Target Entropy Scheduled SAC (TES-SAC), an annealing method for the target entropy parameter applied on SAC. Target entropy is a constant in the temperature Lagrange term and represents the target policy entropy in discrete SAC. 
We compare our method on Atari 2600 games against SAC with different constant target entropies, and analyze how our scheduling affects SAC. Link » Yaosheng Xu · Dailin Hu · Litian Liang · Stephen McAleer · Pieter Abbeel · Roy Fox 🔗 - Learning Action Translator for Meta Reinforcement Learning on Sparse-Reward Tasks (Poster) []   link »    Meta reinforcement learning (meta-RL) aims to learn a policy solving a set of training tasks simultaneously and quickly adapting to new tasks. It requires massive amounts of data drawn from training tasks to infer the common structure shared among tasks. Without heavy reward engineering, the sparse rewards in long-horizon tasks exacerbate the problem of sample efficiency in meta-RL. Another challenge in meta-RL is the discrepancy of difficulty level among tasks, which might cause one easy task to dominate learning of the shared policy and thus preclude policy adaptation to new tasks. In this work, we introduce a novel objective function to learn an action translator among training tasks. We theoretically verify that the value of the transferred policy with the action translator can be close to the value of the source policy. We propose to combine the action translator with context-based meta-RL algorithms for better data collection and more efficient exploration during meta-training. Our approach of policy transfer empirically improves the sample efficiency and performance of meta-RL algorithms on sparse-reward tasks. Link » Yijie Guo · Qiucheng Wu · Honglak Lee 🔗 - Follow the Object: Curriculum Learning for Manipulation Tasks with Imagined Goals (Poster) []   link » Learning robot manipulation through deep reinforcement learning in environments with sparse rewards is a challenging task. In this paper we address this problem by introducing a notion of imaginary object goals. 
For a given manipulation task, the object of interest is first trained to reach a desired target position on its own, without being manipulated, through physically realistic simulations. The object policy is then leveraged to build a predictive model of plausible object trajectories providing the robot with a curriculum of incrementally more difficult object goals to reach during training. The proposed algorithm, Follow the Object (FO), has been evaluated on 7 MuJoCo environments requiring increasing degrees of exploration, and has achieved higher success rates compared to alternative algorithms. In particularly challenging learning scenarios, e.g. where the object's initial and target positions are far apart, our approach can still learn a policy whereas competing methods currently fail. Link » Ozsel Kilinc · Giovanni Montana 🔗 - The Reflective Explorer: Online Meta-Exploration from Offline Data in Realistic Robotic Tasks (Poster) []   link »    Reinforcement learning is difficult to apply to real-world problems due to high sample complexity, the need to adapt to frequent distribution shifts and the complexities of learning from high-dimensional inputs, such as images. Over the last several years, meta-learning has emerged as a promising approach to tackle these problems by explicitly training an agent to quickly adapt to new tasks. However, such methods still require huge amounts of data during training and are difficult to optimize in high-dimensional domains. One potential solution is to consider offline or batch meta-reinforcement learning (RL), that is, learning from existing datasets without additional environment interactions during training. In this work we develop the first offline model-based meta-RL algorithm that operates from images in tasks with sparse rewards. 
Our approach has three main components: a novel strategy to construct meta-exploration trajectories from offline data, which allows agents to learn a meaningful meta-test-time task inference strategy; representation learning via variational filtering; and latent conservative model-free policy optimization. We show that our method completely solves a realistic meta-learning task involving robot manipulation, while naive combinations of previous approaches fail. Link » Rafael Rafailov · · Tianhe Yu · Avi Singh · Mariano Phielipp · Chelsea Finn 🔗 - BLAST: Latent Dynamics Models from Bootstrapping (Poster) []   link »    State-of-the-art world models such as DreamerV2 have significantly improved the capabilities of model-based reinforcement learning. However, these approaches typically rely on reconstruction losses to shape their latent representations of the environment, which are known to fail in environments with high-fidelity visual observations. When learning latent dynamics models without reconstruction loss using only the signal present in the reward signal, the performance of these methods also drops dramatically. We present a simple modification to DreamerV2 without reconstruction loss inspired by the recent self-supervised learning method Bootstrap Your Own Latent. The combination of adding a stop-gradient to the posterior, using a powerful auto-regressive model for the prior, and using a slowly updating target encoder, which we call BLAST, allows the world model to learn from signals present in both the reward and observations, improving efficiency on our tested environment as well as being significantly more robust to visual distractors. Link » Keiran Paster · Lev McKinney · Sheila McIlraith · Jimmy Ba 🔗 - Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning (Poster) []   link »    Reinforcement learning can train policies that effectively perform complex tasks. 
However, for long-horizon tasks, the performance of these methods degrades with horizon, often necessitating reasoning over and composing lower-level skills. Hierarchical reinforcement learning aims to enable this by providing a bank of low-level skills as action abstractions. Hierarchies can further improve on this by abstracting the state space as well. We posit that a suitable state abstraction should depend on the capabilities of the available lower-level policies. We propose Value Function Spaces: a simple approach that produces such a representation by using the value functions corresponding to each lower-level skill. These value functions capture the affordances of the scene, thus forming a representation that compactly abstracts task-relevant information and robustly ignores distractors. Empirical evaluations for maze-solving and robotic manipulation tasks demonstrate that our approach improves long-horizon performance and enables better zero-shot generalization than alternative model-free and model-based methods. Link » shah · Ted Xiao · Alexander Toshev · Sergey Levine · brian ichter 🔗 - Count-Based Temperature Scheduling for Maximum Entropy Reinforcement Learning (Poster) []   link »    Maximum Entropy Reinforcement Learning (MaxEnt RL) algorithms such as Soft Q-Learning (SQL) and Soft Actor–Critic trade off reward and policy entropy, which has the potential to improve training stability and robustness. Most MaxEnt RL methods, however, use a constant tradeoff coefficient (temperature), contrary to the intuition that the temperature should be high early in training to avoid overfitting to noisy value estimates and decrease later in training as we increasingly trust high value estimates to truly lead to good rewards. Moreover, our confidence in value estimates is state-dependent, increasing every time we use more evidence to update an estimate. 
In this paper, we present a simple state-based temperature scheduling approach, and instantiate it for SQL as Count-Based Soft Q-Learning (CBSQL). We evaluate our approach on a toy domain as well as in several Atari 2600 domains and show promising results. Link » Dailin Hu · Pieter Abbeel · Roy Fox 🔗 - Data Sharing without Rewards in Multi-Task Offline Reinforcement Learning (Poster) []   link » Offline reinforcement learning (RL) bears the promise to learn effective control policies from static datasets but is thus far unable to learn from large databases of heterogeneous experience. The multi-task version of offline RL enables the possibility of learning a single policy that can tackle multiple tasks and allows the algorithm to share offline data across tasks. Recent works indicate that sharing data between tasks can be highly beneficial in multi-task learning. However, these benefits come at a cost -- for data to be shared between tasks, each transition must be annotated with reward labels corresponding to other tasks. This is particularly expensive and unscalable, since the manual effort in annotating reward grows quadratically with the number of tasks. Can we retain the benefits of data sharing without requiring reward relabeling for every task pair? In this paper, we show that, perhaps surprisingly, under a binary-reward assumption, simply utilizing data from other tasks with constant reward labels can not only provide a substantial improvement over only using the single-task data and previously proposed success classifiers, but it can also reach comparable performance to baselines that take advantage of the oracle multi-task reward information. We also show that this performance can be further improved by selectively deciding which transitions to share, again without introducing any additional models or classifiers. We discuss how these approaches relate to each other and baseline strategies under various assumptions on the dataset. 
Our empirical results show that it leads to improved performance across a range of different multi-task offline RL scenarios, including robotic manipulation from visual inputs and ant-maze navigation. Link » Tianhe Yu · Aviral Kumar · Yevgen Chebotar · Chelsea Finn · Sergey Levine · Karol Hausman 🔗 - StarCraft II Unplugged: Large Scale Offline Reinforcement Learning (Poster) []   link »    StarCraft II is one of the most challenging reinforcement learning (RL) environments; it is partially observable, stochastic, and multi-agent, and mastering StarCraft II requires strategic planning over long-time horizons with real-time low-level execution. It also has an active human competitive scene. StarCraft II is uniquely suited for advancing offline RL algorithms, both because of its challenging nature and because a massive dataset of millions of StarCraft II games played by human players has been released by Blizzard. This paper leverages that and establishes a benchmark, which we call StarCraft II Unplugged, that introduces unprecedented challenges for offline reinforcement learning. We define a dataset (a subset of Blizzard’s release), tools standardising an API for ML methods, and an evaluation protocol. We also present baseline agents, including behaviour cloning, and offline variants of V-trace actor-critic and MuZero. We find that the variants of those algorithms with behaviour value estimation and single step policy improvement work best and exceed 90% win rate against previously published AlphaStar behaviour cloning agents. 
Link » Michael Mathieu · Sherjil Ozair · Srivatsan Srinivasan · CAGLAR Gulcehre · Shangtong Zhang · Ray Jiang · Tom Paine · Konrad Żołna · Julian Schrittwieser · David Choi · Petko I Georgiev · Daniel Toyama · Roman Ring · Igor Babuschkin · Timo Ewalds · sergomezcol · Aaron van den Oord · Wojciech Czarnecki · Nando de Freitas · Oriol Vinyals 🔗 - Learning Robust Dynamics through Variational Sparse Gating (Poster) []   link »    Latent dynamics models learn an abstract representation of an environment based on collected experience. Such models are the core of recent advances in model-based reinforcement learning. For example, world models can imagine unseen trajectories, potentially improving sample efficiency. Planning in the real world requires agents to understand long-term dependencies between actions and events, and account for varying degrees of change, e.g. due to a change in background or viewpoint. Moreover, in a typical scene, only a subset of objects change their state. These changes are often quite sparse, which suggests incorporating such an inductive bias in a dynamics model. In this work, we introduce the variational sparse gating mechanism, which enables an agent to sparsely update a latent dynamics model state. We also present a simplified version, which, unlike prior models, has a single stochastic recurrent state. Finally, we introduce a new ShapeHerd environment, in which an agent needs to push shapes into a goal area. This environment is partially observable and requires models to remember the previously observed objects and explore the environment to discover unseen objects. Our experiments show that the proposed methods significantly outperform leading model-based reinforcement learning methods on this environment, while also yielding competitive performance on tasks from the DeepMind Control Suite. 
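As a rough illustration of the sparse-update idea in the abstract above, and not the authors' variational sparse gating mechanism (whose details are not given here), the sketch below applies a per-dimension binary gate to a latent state: gated dimensions take a candidate value and the rest are copied unchanged. The function name `sparse_gated_update`, the Bernoulli gate with a fixed probability `p_update`, and the plain-list state are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def sparse_gated_update(
    prev_state: List[float],
    candidate: List[float],
    p_update: float,
    rng: random.Random,
) -> Tuple[List[float], List[int]]:
    """Update only a random subset of latent dimensions.

    Hypothetical helper: each dimension draws a binary gate
    g ~ Bernoulli(p_update). Where g == 1 the candidate value is taken;
    where g == 0 the previous value is kept.
    """
    if len(prev_state) != len(candidate):
        raise ValueError("prev_state and candidate must have the same length")
    # One independent binary gate per latent dimension.
    gates = [1 if rng.random() < p_update else 0 for _ in prev_state]
    # Convex combination with a 0/1 weight: copy or overwrite.
    new_state = [
        g * c + (1 - g) * h for g, c, h in zip(gates, candidate, prev_state)
    ]
    return new_state, gates


if __name__ == "__main__":
    rng = random.Random(0)
    prev = [0.0] * 8
    cand = [1.0] * 8
    state, gates = sparse_gated_update(prev, cand, p_update=0.25, rng=rng)
    print("gates:", gates)
    print("state:", state)
```

With a small `p_update`, most dimensions carry over from one step to the next, which is one simple way to encode the assumption that only a few aspects of a scene change at a time. In a learned model the gate would presumably be produced by a network rather than a fixed probability.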
Link » Arnav Kumar Jain · Shivakanth Sujit · Shruti Joshi · Vincent Michalski · Danijar Hafner · Samira Ebrahimi Kahou 🔗 - Should I Run Offline Reinforcement Learning or Behavioral Cloning? (Poster) []   link » Offline reinforcement learning (RL) algorithms can acquire effective policies by utilizing only previously collected experience, without any online interaction. While it is widely understood that offline RL is able to extract good policies even from highly suboptimal data, in practice offline RL is often used with data that resembles demonstrations. In this case, one can also use behavioral cloning (BC) algorithms, which mimic a subset of the dataset via supervised learning. It seems natural to ask: When should we prefer offline RL over BC? In this paper, our goal is to characterize environments and dataset compositions where offline RL leads to better performance than BC. In particular, we characterize the properties of environments that allow offline RL methods to perform better than BC methods even when only provided with expert data. Additionally, we show that policies trained on suboptimal data that is sufficiently noisy can attain better performance than even BC algorithms with expert data, especially on long-horizon problems. We validate our theoretical results via extensive experiments on both diagnostic and high-dimensional domains including robot manipulation, maze navigation, and Atari games when learning from a variety of data sources. We observe that modern offline RL methods trained on suboptimal, noisy data in sparse reward domains outperform cloning the expert data in several practical problems. Link » Aviral Kumar · Joey Hong · Anikait Singh · Sergey Levine 🔗 - DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization (Poster) []   link » Despite overparameterization, deep networks trained via supervised learning are surprisingly easy to optimize and exhibit excellent generalization. 
One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive aliasing, in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains, and robotic manipulation from images. Link » Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine 🔗 - Deep RePReL--Combining Planning and Deep RL for acting in relational domains (Poster) []   link »    We consider the problem of combining a symbolic planner and a Deep RL agent to achieve the best of both worlds -- the generalization ability of the planner with the effective learning ability of Deep RL. 
To this end, we extend RePReL, previous work by Kokel et al. (ICAPS 2021), to Deep RL. As we demonstrate in experiments in two relational worlds, this combination enables effective learning, transfer and generalization when compared to the use of only Deep RL. Link » Harsha Kokel · Arjun Manoharan · Sriraam Natarajan · Balaraman Ravindran · Prasad Tadepalli 🔗 - Fast Inference and Transfer of Compositional Task for Few-shot Task Generalization (Poster) []   link »    We tackle real-world problems with complex structures beyond the pixel-based game or simulator. We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph that defines a set of subtasks and their dependencies that are unknown to the agent. Unlike previous meta-RL methods that try to directly infer an unstructured task embedding, our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks, and uses it as a prior to improve the task inference in testing. Our experimental results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks than various existing algorithms such as meta reinforcement learning, hierarchical reinforcement learning, and other heuristic agents. Link » Sungryull Sohn · Hyunjae Woo · Jongwook Choi · Izzeddin Gur · Aleksandra Faust · Honglak Lee 🔗 - Benchmark for Out-of-Distribution Detection in Deep Reinforcement Learning (Poster) []   link »    Reinforcement Learning (RL) based solutions are being adopted in a variety of domains including robotics, health care and industrial automation. Most attention is given to the cases where these solutions work well, but they fail when presented with out-of-distribution inputs. RL policies share the same faults as most machine learning models. 
Out of distribution detection for RL is generally not well covered in the literature, and there is a lack of benchmarks for this task. In this work we propose a benchmark to evaluate OOD detection methods in a Reinforcement Learning setting, by modifying the physical parameters of non-visual standard environments or corrupting the state observation for visual environments. We discuss ways to generate custom RL environments that can produce OOD data, and evaluate three uncertainty methods for the OOD detection task. Our results show that ensemble methods have the best OOD detection performance with a lower standard deviation across multiple environments. Link » Aaqib Parvez Mohammed · Matias Valdenegro-Toro 🔗 - Learning from Guided Play: A Scheduled Hierarchical Approach for Improving Exploration in Adversarial Imitation Learning (Poster) []   link »    Effective exploration continues to be a significant challenge that prevents the deployment of reinforcement learning for many physical systems. This is particularly true for systems with continuous and high-dimensional state and action spaces, such as robotic manipulators. The challenge is accentuated in the sparse rewards setting, where the low-level state information required for the design of dense rewards is unavailable. Adversarial imitation learning (AIL) can partially overcome this barrier by leveraging expert-generated demonstrations of optimal behaviour and providing, essentially, a replacement for dense reward information. Unfortunately, the availability of expert demonstrations does not necessarily improve an agent’s capability to explore effectively and, as we empirically show, can lead to inefficient or stagnated learning. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of, in addition to a main task, multiple auxiliary tasks. 
Subsequently, a hierarchical model is used to learn each task reward and policy through a modified AIL procedure, in which exploration of all tasks is enforced via a scheduler composing different tasks together. This affords many benefits: improved learning efficiency for main tasks with challenging bottleneck transitions, learning can be completed without true reward functions, reusable expert data between tasks, and transfer learning through the reuse of learned auxiliary task models becomes possible. Our experimental results in a challenging multitask robotic manipulation domain indicate that our method compares favourably to supervised imitation learning and to a state-of-the-art AIL method. Link » Trevor Ablett · Bryan Chan · Jonathan Kelly 🔗 - Off-Policy Correction For Multi-Agent Reinforcement Learning (Poster) []   link »    Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite apparent similarity to the single-agent case, multi-agent problems are often harder to train and analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm, which extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace utilizes importance sampling as an off-policy correction method, which allows distributing the computations with no impact on the quality of training. Furthermore, our algorithm is theoretically grounded -- we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all its tasks and exceeds state-of-the-art results on some of them. 
Link » Michał Zawalski · Błażej Osiński · Henryk Michalewski · Piotr Miłoś 🔗 - Bayesian Exploration for Lifelong Reinforcement Learning (Poster) []   link »    A central question in reinforcement learning (RL) is how to leverage prior knowledge to accelerate learning in new tasks. We propose a Bayesian exploration method for lifelong reinforcement learning (BLRL) that aims to learn a Bayesian posterior that distills the common structure shared across different tasks. We further derive a sample complexity analysis of BLRL in the finite MDP setting. To scale our approach, we propose a variational Bayesian Lifelong Learning (VBLRL) algorithm that is based on Bayesian neural networks, can be combined with recent model-based RL methods, and exhibits backward transfer. Experimental results on three challenging domains show that our algorithms adapt to new tasks faster than state-of-the-art lifelong RL methods. Link » Haotian Fu · Shangqun Yu · Michael Littman · George Konidaris 🔗 - A Modern Self-Referential Weight Matrix That Learns to Modify Itself (Poster) []   link »    The weight matrix (WM) of a neural network (NN) is its program. The programs of many traditional NNs are learned through gradient descent in some error function, then remain fixed. The WM or program of a self-referential NN, however, can keep rapidly modifying all of itself during runtime. In principle, such NNs can meta-learn to learn, and meta-meta-learn to meta-learn to learn, and so on, in the sense of recursive self-improvement. Here we revisit such NNs, building upon recent successes of fast weight programmers (FWPs) and closely related linear Transformers. We propose a scalable self-referential WM (SRWM) that uses outer products and the delta update rule to modify itself. We evaluate our SRWM in a multi-task reinforcement learning setting with procedurally generated ProcGen game environments. Our experiments demonstrate both practical applicability and competitive performance of the SRWM. 
Link » Kazuki Irie · Imanol Schlag · Róbert Csordás · Jürgen Schmidhuber 🔗 - Distributional Decision Transformer for Offline Hindsight Information Matching (Poster) []   link »    How to extract as much learning signal as possible from each trajectory has been a key problem in reinforcement learning (RL), where sample inefficiency has posed serious challenges for practical applications. Recent works have shown that using expressive policy function approximators and conditioning on future trajectory information -- such as future states in hindsight experience replay (HER) or returns-to-go in Decision Transformer (DT) -- enables efficient learning of context-conditioned policies, where at times online RL can be fully replaced by offline behavioral cloning (BC), e.g. sequence modeling. Inspired by the distributional and state-marginal matching literature in RL, we demonstrate that all these approaches are essentially doing hindsight information matching (HIM) -- training policies that can output the rest of a trajectory that matches given statistics of future state information. We first present Distributional Decision Transformer (DDT) and its practical instantiation, Categorical DT, and show that this simple modification to DT can enable effective offline state-marginal matching that generalizes well to unseen, even synthetic multi-modal, reward or state-feature distributions. We perform experiments on Gym's MuJoCo continuous control benchmarks and empirically validate its performance. Additionally, we present and test another simple modification to DT called Unsupervised DT (UDT), show its connection to distribution matching, inverse RL and representation learning, and empirically demonstrate its effectiveness for offline imitation learning. 
To the best of our knowledge, DDT and UDT together constitute the first successes for offline state-marginal matching and inverse-RL imitation learning, allowing us to propose first benchmarks for these two important subfields and greatly expand the role of powerful sequence modeling architectures in modern RL. Link » Hiroki Furuta · Yutaka Matsuo · Shixiang (Shane) Gu 🔗 - Offline Policy Selection under Uncertainty (Poster) []   link » The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their expected values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose a Bayesian approach for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints. Empirically, despite being Bayesian, the credible intervals obtained are competitive with state-of-the-art frequentist approaches in confidence interval estimation. More importantly, we show how the belief distribution may be used to rank policies with respect to arbitrary downstream policy selection metrics, and empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower bound value estimates. 
Link » Sherry Yang · Bo Dai · Ofir Nachum · George Tucker · Dale Schuurmans 🔗 - Learning Transferable Motor Skills with Hierarchical Latent Mixture Policies (Poster) []   link »    For robots operating in the real world, it is desirable to learn reusable behaviours that can effectively be transferred and adapted to numerous tasks and scenarios. We propose an approach to learn abstract motor skills from data using a hierarchical mixture latent variable model. In contrast to existing work, our method exploits a three-level hierarchy of both discrete and continuous latent variables, to capture a set of high-level behaviours while allowing for variance in how they are executed. We demonstrate in manipulation domains that the method can effectively cluster offline data into distinct, executable behaviours, while retaining the flexibility of a continuous latent variable model. The resulting skills can be transferred and fine-tuned on new tasks, unseen objects, and from state to vision-based policies, yielding better sample efficiency and asymptotic performance compared to existing skill- and imitation-based methods. We further analyse how and when the skills are most beneficial: they encourage directed exploration to cover large regions of the state space relevant to the task, making them most effective in challenging sparse-reward settings. 
Link » Dushyant Rao · Fereshteh Sadeghi · Leonard Hasenclever · Markus Wulfmeier · Martina Zambelli · Giulia Vezzani · Dhruva Tirumala · Yusuf Aytar · Josh Merel · Nicolas Heess · Raia Hadsell 🔗 - Wish you were here: Hindsight Goal Selection for long-horizon dexterous manipulation (Poster) []   link »    Complex sequential tasks in continuous-control settings often require agents to successfully traverse a set of "narrow passages" in their state space. Solving such tasks with a sparse reward in a sample-efficient manner poses a challenge to modern reinforcement learning (RL) due to the associated long-horizon nature of the problem and the lack of sufficient positive signal during learning. Various tools have been applied to address this challenge. When available, large sets of demonstrations can guide agent exploration. Hindsight relabelling, on the other hand, does not require additional sources of information. However, existing strategies explore based on task-agnostic goal distributions, which can render the solution of long-horizon tasks impractical. In this work, we extend hindsight relabelling mechanisms to guide exploration along task-specific distributions implied by a small set of successful demonstrations. We evaluate the approach on four complex, single and dual arm, robotics manipulation tasks against strong suitable baselines. The method requires far fewer demonstrations to solve all tasks and achieves a significantly higher overall performance as task complexity increases. Finally, we investigate the robustness of the proposed solution with respect to the quality of input representations and the number of demonstrations. 
Link » Todor Davchev · Oleg Sushkov · Jean-Baptiste Regli · Stefan Schaal · Yusuf Aytar · Markus Wulfmeier · Jonathan Scholz 🔗 - Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization (Poster) []   link »    We present Reward-Switching Policy Optimization (RSPO), a paradigm to discover diverse strategies in complex RL environments by iteratively finding novel policies that are both locally optimal and sufficiently different from existing ones. To encourage the learning policy to consistently converge towards a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards via a trajectory-based novelty measurement during the optimization process. When a sampled trajectory is sufficiently distinct, RSPO performs standard policy optimization with extrinsic rewards. For trajectories with high likelihood under existing policies, RSPO utilizes an intrinsic diversity reward to promote exploration. Experiments show that RSPO is able to discover a wide spectrum of strategies in a variety of domains, ranging from single-agent particle-world tasks and MuJoCo continuous control to multi-agent stag-hunt games and StarCraft II challenges. Link » Zihan Zhou · Wei Fu · Bingliang Zhang · Yi Wu 🔗 - Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks (Poster) []   link »    The human brain is remarkably sample efficient, capable of learning complex behaviors by meaningfully combining previous experiences to simulate novel ones, even when few experiences are available. To improve sample efficiency in continuous control tasks, we take inspiration from this learning phenomenon. We propose Neighborhood Mixup Experience Replay (NMER), a modular replay buffer that interpolates transitions with their closest neighbors in normalized state-action space. 
NMER preserves a locally linear approximation of the transition manifold by only interpolating transitions with similar state-action features. Under NMER, a given transition’s set of state-action neighbors is dynamic and episode agnostic, in turn encouraging greater policy generalizability via cross-episode interpolation. We combine our approach with recent off-policy reinforcement learning algorithms and evaluate on several continuous control environments. We observe that NMER improves sample efficiency over other state-of-the-art replay buffers, enabling agents to effectively recombine previous experience and learn from limited data. Link » Ryan Sander · Wilko Schwarting · Tim Seyde · Igor Gilitschenski · Sertac Karaman · Daniela Rus 🔗 - Cross-Domain Imitation Learning via Optimal Transport (Poster) []   link »    Cross-domain imitation learning studies how to leverage expert demonstrations of one agent to train an imitation agent with a different embodiment or morphology. Comparing trajectories and stationary distributions between the expert and imitation agents is challenging because they live on different systems that may not even have the same dimensionality. We propose Gromov-Wasserstein Imitation Learning (GWIL), a method for cross-domain imitation that uses the Gromov-Wasserstein distance to align and compare states between the different spaces of the agents. Our theory formally characterizes the scenarios where GWIL preserves optimality, revealing its possibilities and limitations. We demonstrate the effectiveness of GWIL in non-trivial continuous control domains ranging from simple rigid transformation of the expert domain to arbitrary transformation of the state-action space. 
Link » Arnaud Fickinger · Samuel Cohen · Stuart Russell · Brandon Amos 🔗 - Lifting the veil on hyper-parameters for value-based deep reinforcement learning (Poster) []   link »    Successful applications of deep reinforcement learning (deep RL) combine algorithmic design and careful hyper-parameter selection. The former often comes from iterative improvements over existing algorithms, while the latter is either inherited from prior methods or tuned for the specific method being introduced. Although critical to a method's performance, the effect of the various hyper-parameter choices is often overlooked in favour of algorithmic advances. In this paper, we perform an initial empirical investigation into a number of often-overlooked hyper-parameters for value-based deep RL agents, demonstrating their varying levels of importance. We conduct this study on a varied set of classic control environments, which helps highlight the effect each environment has on an algorithm's hyper-parameter sensitivity. Link » João Madeira Araújo · Johan Obando Ceron · Pablo Samuel Castro 🔗 - Reward Uncertainty for Exploration in Preference-based Reinforcement Learning (Poster) []   link »    Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL methods are able to learn a more flexible reward model based on human preferences by actively incorporating human feedback, i.e. a teacher's preferences between two clips of behaviors. However, poor feedback-efficiency remains a problem in current preference-based RL algorithms, as tailored human feedback is very expensive. To handle this issue, previous methods have mainly focused on improving query selection and policy initialization. At the same time, recent exploration methods have proven to be a recipe for improving sample-efficiency in RL. We present an exploration method specifically for preference-based RL algorithms. 
Our main idea is to design an intrinsic reward by measuring novelty based on the learned reward. Specifically, we utilize disagreement across an ensemble of learned reward models. Our intuition is that disagreement among the learned reward models reflects uncertainty in tailored human feedback and could be useful for exploration. Our experiments show that reward uncertainty exploration improves both feedback- and sample-efficiency of preference-based RL algorithms on complex robot manipulation tasks from Meta-World benchmarks, compared with other existing exploration methods that measure the novelty of state visitation. Link » Xinran Liang · Katherine Shu · Kimin Lee · Pieter Abbeel 🔗 - TransDreamer: Reinforcement Learning with Transformer World Models (Poster) []   link »    The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks, and thus an important question is how an MBRL framework can benefit from the recent advances of transformers and what the challenges are in doing so. In this paper, we propose a transformer-based MBRL agent, called TransDreamer. We first introduce the Transformer State-Space Model, a world model that leverages a transformer for dynamics predictions. We then share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent. In experiments, we apply the proposed model to 2D visual RL and 3D first-person visual RL tasks, both requiring long-range memory access for memory-based reasoning. We show that the proposed model outperforms Dreamer in these complex tasks. Link » · Jaesik Yoon · Yi-Fu Wu · Sungjin Ahn 🔗 - Learning Parameterized Task Structure for Generalization to Unseen Entities (Poster) []   link »    Real world tasks are hierarchical and compositional. 
Tasks can be composed of multiple subtasks (or sub-goals) that are dependent on each other. These subtasks are defined in terms of entities (e.g., "apple", "pear") that can be recombined to form new subtasks (e.g., "pickup apple", and "pickup pear"). To solve these tasks efficiently, an agent must infer subtask dependencies (e.g. an agent must execute "pickup apple" before "place apple in pot"), and generalize the inferred dependencies to new subtasks (e.g. "place apple in pot" is similar to "place apple in pan"). Moreover, an agent may also need to solve unseen tasks, which can involve unseen entities. To this end, we formulate parameterized subtask graph inference (PSGI), a method for modeling subtask dependencies using first-order logic with factored entities. To facilitate this, we learn parameter attributes in a zero-shot manner, which are used as quantifiers (e.g. is_pickable(X)) for the factored subtask graph. We show this approach accurately learns the latent structure on hierarchical and compositional tasks more efficiently than prior work, and show PSGI can generalize by modelling structure on subtasks unseen during adaptation. Link » Anthony Liu · Sungryull Sohn · Honglak Lee 🔗 - The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models (Poster) []   link »    Reward hacking---where RL agents exploit gaps in misspecified proxy rewards---has been widely observed, but not yet systematically studied. To understand reward hacking, we construct four RL environments with different misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, and observation space noise. Typically, more capable agents are able to better exploit reward misspecifications, causing them to attain higher proxy reward and lower true reward. 
Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To encourage further research on reward misspecification, we propose an anomaly detection task for aberrant policies and offer several baseline detectors. Link » Alex Pan · Kush Bhatia · Jacob Steinhardt 🔗 - Learning a Subspace of Policies for Online Adaptation in Reinforcement Learning (Poster) []   link »    Deep Reinforcement Learning (RL) is mainly studied in a setting where the training and the testing environments are similar. But in many practical applications, these environments may differ. For instance, in control systems, the robot(s) on which a policy is learned might differ from the robot(s) on which a policy will run. This difference can be caused by internal factors (e.g., calibration issues, system attrition, defective modules) or by external changes (e.g., weather conditions). There is a need to develop RL methods that generalize well to variations of the training conditions. In this article, we consider the simplest yet hard-to-tackle generalization setting, where the test environment is unknown at train time, forcing the agent to adapt to the system's new dynamics. This online adaptation process can be computationally expensive (e.g., fine-tuning) and cannot rely on meta-RL techniques since there is just a single train environment. To address this, we propose an approach where we learn a subspace of policies within the parameter space. This subspace contains an infinite number of policies that are trained to solve the training environment while having different parameter values. As a consequence, two policies in that subspace process information differently and exhibit different behaviors when facing variations of the train environment. 
Our experiments carried out over a large variety of benchmarks compare our approach with baselines, including diversity-based methods. In comparison, our approach is simple to tune, does not need any extra component (e.g., discriminator) and learns policies able to gather a high reward on unseen environments. Link » Jean-Baptiste Gaya · Laure Soulier · Ludovic Denoyer 🔗 - Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning (Poster) []   link »    Accurate value estimates are important for off-policy reinforcement learning. Algorithms based on temporal difference learning typically are prone to an over- or underestimation bias building up over time. In this paper, we propose a general method called Adaptively Calibrated Critics (ACC) that uses the most recent high variance but unbiased on-policy rollouts to alleviate the bias of the low variance temporal difference targets. We apply ACC to Truncated Quantile Critics [22], which is an algorithm for continuous control that allows regulation of the bias with a hyperparameter tuned per environment. The resulting algorithm adaptively adjusts the parameter during training rendering hyperparameter search unnecessary and sets a new state of the art on the OpenAI gym continuous control benchmark among all algorithms that do not tune hyperparameters for each environment. Additionally, we demonstrate that ACC is quite general by further applying it to TD3 [11] and showing an improved performance also in this setting. Link » Nicolai Dorka · Joschka Boedecker · Wolfram Burgard 🔗 - Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations (Poster) []   link »    We study the problem of offline Imitation Learning (IL) where an agent aims to learn an optimal expert behavior policy without additional online environment interactions. Instead, the agent is provided with a static offline dataset of state-action-next state transition triples from both optimal and non-optimal expert behaviors. 
This strictly offline imitation learning problem arises in many real-world problems, where environment interactions and expert annotations are costly. Prior works that address the problem either require that expert data occupies the majority of the offline dataset, or need to learn a reward function and perform offline reinforcement learning (RL) based on the learned reward function. In this paper, we propose an imitation learning algorithm that addresses the problem without additional steps of reward learning and offline RL training for the case when demonstrations contain a large proportion of suboptimal data. Building upon behavioral cloning (BC), we introduce an additional discriminator to distinguish expert and non-expert data, and we propose a cooperation strategy to boost the performance of both tasks. This results in a new policy learning objective and, surprisingly, we find that it is equivalent to a generalized BC objective, where the outputs of the discriminator serve as the weights of the BC loss function. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than policies learned by baseline algorithms. Link » Haoran Xu · Xianyuan Zhan · Honglei Yin · 🔗 - Task-driven Discovery of Perceptual Schemas for Generalization in Reinforcement Learning (Poster) []   link »    Deep reinforcement learning (Deep RL) has recently seen significant progress in developing algorithms for generalization. However, most algorithms target a single type of generalization setting. In this work, we study generalization across three disparate task structures: (a) tasks composed of spatial and temporal compositions of regularly occurring object motions; (b) tasks composed of active perception of and navigation towards regularly occurring 3D objects; and (c) tasks composed of navigating through sequences of regularly occurring object-configurations. 
These diverse task structures all share an underlying idea of compositionality: task completion always involves combining reoccurring segments of task-oriented perception and behavior. We hypothesize that an agent can generalize within a task structure if it can discover representations that capture these reoccurring task-segments. For our tasks, this corresponds to representations for recognizing individual object motions, for navigation towards 3D objects, and for navigating through object-configurations. Taking inspiration from cognitive science, we term representations for reoccurring segments of an agent's experience, "perceptual schemas". We propose Composable Perceptual Schemas (CPS), which learns a composable state representation where perceptual schemas are distributed across multiple, relatively small recurrent "subschema" modules. Our main technical novelty is an expressive attention function that enables subschemas to dynamically attend to features shared across all positions in the agent's observation. Our experiments indicate our feature-attention mechanism enables CPS to generalize better than recurrent architectures that attend to observations with spatial attention. Link » Wilka Carvalho · Andrew Lampinen · Kyriacos Nikiforou · Felix Hill · Murray Shanahan 🔗 - Meta Arcade: A Configurable Environment Suite for Deep Reinforcement Learning and Meta-Learning (Poster) []   link »    Most approaches to deep reinforcement learning (DRL) attempt to solve a single task at a time. As a result, most existing research benchmarks consist of individual games or suites of games that have common interfaces but little overlap in their perceptual features, objectives, or reward structures. To facilitate research into knowledge transfer among trained agents (e.g. via multi-task and meta-learning), more environment suites that provide configurable tasks with enough commonality to be studied collectively are needed. 
In this paper we present Meta Arcade, a tool to easily define and configure custom 2D arcade games that share common visuals, state spaces, action spaces, game components, and scoring mechanisms. Meta Arcade differs from prior environments in that both task commonality and configurability are prioritized: entire sets of games can be constructed from common elements, and these elements are adjustable through exposed parameters. We include a suite of 24 predefined games that collectively illustrate the possibilities of this framework and discuss how these games can be configured for research applications. We provide several experiments that illustrate how Meta Arcade could be used, including single-task benchmarks of predefined games, sample curriculum-based approaches that change game parameters over a set schedule, and an exploration of transfer learning between games. Link » Edward Staley · Jared Markowitz · Kapil Katyal 🔗 - Hindsight Foresight Relabeling for Meta-Reinforcement Learning (Poster) []   link »    Meta-reinforcement learning (meta-RL) algorithms allow for agents to learn new behaviors from small amounts of experience, mitigating the sample inefficiency problem in RL. However, while meta-RL agents can adapt quickly to new tasks at test time after experiencing only a few trajectories, the meta-training process is still sample-inefficient. Prior works have found that in the multi-task RL setting, relabeling past transitions and thus sharing experience among tasks can improve sample efficiency and asymptotic performance. We apply this idea to the meta-RL setting and devise a new relabeling method called Hindsight Foresight Relabeling (HFR). We construct a relabeling distribution using the combination of "hindsight", which is used to relabel trajectories using reward functions from the training task distribution, and "foresight", which takes the relabeled trajectories and computes the utility of each trajectory for each task. 
HFR is easy to implement and readily compatible with existing meta-RL algorithms. We find that HFR improves performance when compared to other relabeling methods on a variety of meta-RL tasks. Link » Michael Wan · Jian Peng · Tanmay Gangwani 🔗 - CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery (Poster) []   link »    We introduce Contrastive Intrinsic Control (CIC) - an algorithm for unsupervised skill discovery that maximizes the mutual information between skills and state transitions. In contrast to most prior approaches, CIC uses a decomposition of the mutual information that explicitly incentivizes diverse behaviors by maximizing state entropy. We derive a novel lower bound estimate for the mutual information which combines a particle estimator for state entropy to generate diverse behaviors and contrastive learning to distill these behaviors into distinct skills. We evaluate our algorithm on the Unsupervised Reinforcement Learning Benchmark, which consists of a long reward-free pre-training phase followed by a short adaptation phase to downstream tasks with extrinsic rewards. We find that CIC improves on prior unsupervised skill discovery methods by $91\%$ and the next-leading overall exploration algorithm by $26\%$ in terms of downstream task performance. Link » Misha Laskin · Hao Liu · Xue Bin Peng · Denis Yarats · Aravind Rajeswaran · Pieter Abbeel 🔗 - Continuous Control With Ensemble Deep Deterministic Policy Gradients (Poster) []   link »    The growth of deep reinforcement learning (RL) has brought multiple exciting tools and methods to the field. This rapid expansion makes it important to understand the interplay between individual elements of the RL toolbox. We approach this task from an empirical perspective by conducting a study in the continuous control setting. 
We present multiple insights of fundamental nature, including: a commonly used additive action noise is not required for effective exploration and can even hinder training; the performance of policies trained using existing methods varies significantly across training runs, epochs of training, and evaluation runs; the critics' initialization plays the major role in ensemble-based actor-critic exploration, while the training is mostly invariant to the actors' initialization; a strategy based on posterior sampling explores better than the approximated UCB combined with the weighted Bellman backup; the weighted Bellman backup alone cannot replace the clipped double Q-Learning. As a conclusion, we show how existing tools can be brought together in a novel way, giving rise to the Ensemble Deep Deterministic Policy Gradients (ED2) method, to yield state-of-the-art results on continuous control tasks from $\mbox{OpenAI Gym MuJoCo}$. From the practical side, ED2 is conceptually straightforward, easy to code, and does not require knowledge outside of the existing RL toolbox. Link » Piotr Januszewski · Mateusz Olko · Michał Królikowski · Jakub Swiatkowski · Marcin Andrychowicz · Łukasz Kuciński · Piotr Miłoś 🔗 - What Would the Expert $do(\cdot)$?: Causal Imitation Learning (Poster) []   link »    We develop algorithms for imitation learning from data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch onto, leading to poor policy performance. By utilizing the effect of past states on current states, we are able to break up these spurious correlations, an application of the econometric technique of instrumental variable regression. 
This insight leads to two novel algorithms, one of a generative-modeling flavor ($\texttt{DoubIL}$) that can utilize access to a simulator and one of a game-theoretic flavor ($\texttt{ResiduIL}$) that can be run offline. Both approaches are able to find policies that match the result of a query to an unconfounded expert. We find both algorithms compare favorably to non-causal approaches on simulated control problems. Link » Gokul Swamy · Sanjiban Choudhury · James Bagnell · Steven Wu 🔗 - Grounding Aleatoric Uncertainty in Unsupervised Environment Design (Poster) []   link »    In reinforcement learning (RL), adaptive curricula have proven highly effective for learning policies that generalize well under a wide variety of changes to the environment. Recently, the framework of Unsupervised Environment Design (UED) generalized notions of curricula for RL in terms of generating entire environments, leading to the development of new methods with robust minimax-regret properties. However, in partially-observable or stochastic settings (those featuring aleatoric uncertainty), optimal policies may depend on the ground-truth distribution over the aleatoric features of the environment. Such settings are potentially problematic for curriculum learning, which necessarily shifts the environment distribution used during training with respect to the fixed ground-truth distribution in the intended deployment environment. We formalize this phenomenon as curriculum-induced covariate shift, and describe how, when the distribution shift occurs over such aleatoric environment parameters, it can lead to learning suboptimal policies. We then propose a method which, given black-box access to a simulator, corrects this resultant bias by aligning the advantage estimates to the ground-truth distribution over aleatoric parameters. This approach leads to a minimax-regret UED method, SAMPLR, with Bayes-optimal guarantees. 
Link » Minqi Jiang · Michael Dennis · Jack Parker-Holder · Andrei Lupu · Heiner Kuttler · Edward Grefenstette · Tim Rocktäschel · Jakob Foerster 🔗 - SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning (Poster) []   link »    Preference-based reinforcement learning (RL) has shown potential for teaching agents to perform the target tasks without a costly, pre-defined reward function by learning the reward with a supervisor’s preference between the two agent behaviors. However, preference-based learning often requires a large amount of human feedback, making it difficult to apply this approach to various applications. This data-efficiency problem, on the other hand, has been typically addressed by using unlabeled samples or data augmentation techniques in the context of supervised learning. Motivated by the recent success of these approaches, we present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation. In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor. To further improve the label-efficiency of reward learning, we introduce a new data augmentation that temporally crops consecutive subsequences from the original behaviors. Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the state-of-the-art preference-based method on a variety of locomotion and robotic manipulation tasks. Link » Jongjin Park · Younggyo Seo · Jinwoo Shin · Honglak Lee · Pieter Abbeel · Kimin Lee 🔗 - Task-Induced Representation Learning (Poster) []   link »    A major bottleneck for applying deep reinforcement learning to real-world problems is its sample inefficiency, particularly when training policies from high-dimensional inputs such as images. 
A number of recent works use unsupervised representation learning approaches to improve sample efficiency. Yet, such unsupervised approaches are fundamentally unable to distinguish between task-relevant and irrelevant information. Thus, in visually complex scenes they learn representations that model lots of task-irrelevant details and hence lead to slower downstream task learning. Our insight: to determine which parts of the scene are important and should be modeled, we can exploit task information, such as rewards or demonstrations, from previous tasks. To this end, we formalize the problem of task-induced representation learning (TARP), which aims to leverage such task information in offline experience from prior tasks for learning compact representations that focus on modelling only task-relevant aspects. Through a series of experiments in visually complex environments we compare different approaches for leveraging task information within the TARP framework with prior unsupervised representation learning techniques and (1) find that task-induced representations allow for more sample-efficient learning of unseen tasks and (2) formulate a set of best practices for task-induced representation learning. Link » Jun Yamada · Karl Pertsch · Anisha Gunjal · Joseph Lim 🔗 - OVD-Explorer: A General Information-theoretic Exploration Approach for Reinforcement Learning (Poster) []   link »    Many exploration strategies are built upon the optimism in the face of uncertainty (OFU) principle for reinforcement learning. However, without considering the aleatoric uncertainty, existing methods may over-explore the state-action pairs with large randomness and hence are non-robust. In this paper, we explicitly capture the aleatoric uncertainty from a distributional perspective and propose an information-theoretic exploration method named Optimistic Value Distribution Explorer (OVD-Explorer). 
OVD-Explorer follows the OFU principle, but more importantly, it avoids exploring the areas with high aleatoric uncertainty through maximizing the mutual information between the policy and the upper bounds of the policy's returns. Furthermore, to make OVD-Explorer tractable for continuous RL, we derive a closed form solution, and integrate it with SAC, which, to our knowledge, for the first time alleviates the negative impact on exploration caused by aleatoric uncertainty for continuous RL. Empirical evaluations on the commonly used MuJoCo benchmark and a novel GridChaos task demonstrate that OVD-Explorer can alleviate over-exploration and outperform state-of-the-art methods. Link » Jinyi Liu · Zhi Wang · YAN ZHENG · Jianye Hao · Junjie Ye · Chenjia Bai · Pengyi Li 🔗 - GrASP: Gradient-Based Affordance Selection for Planning (Poster) []   link » The ability to plan using a learned model is arguably a key component of intelligence. There are several challenges in realising such a component in large-scale reinforcement learning (RL) problems. One such challenge is dealing effectively with continuous action spaces when using tree-search planning (e.g., it is not feasible to consider every action even at just the root node of the tree). In this paper, we present a method for discovering affordances useful for planning, that is, for learning which small number of actions/options from a continuous space of actions/options to consider in the tree-expansion process during planning. We consider affordances that are goal-and-state-conditional mappings to actions/options as well as unconditional affordances that simply select actions/options available in all states. Our discovery method is gradient-based: we compute gradients through the planning procedure to update the parameters of the function that represents affordances. 
Our empirical work shows that it is indeed feasible to learn both primitive-action and option affordances in this way and that model-based RL, while simultaneously learning affordances and a value-equivalent model, can outperform model-free RL. Link » Vivek Veeriah · Zeyu Zheng · Richard L Lewis · Satinder Singh 🔗 - Beyond Target Networks: Improving Deep $Q$-learning with Functional Regularization (Poster) []   link »    A majority of recent successes in deep Reinforcement Learning are based on minimization of the squared Bellman error. The training is often unstable due to fast-changing target $Q$-values, and target networks are employed to stabilize it by using an additional set of lagging parameters. Despite their advantages, target networks could inhibit the propagation of newly-encountered rewards, which may ultimately slow down the training. In this work, we address this issue by augmenting the squared Bellman error with a functional regularizer. Unlike with target networks, the regularization here is explicit, which not only enables us to use up-to-date parameters but also to control the regularization. This leads to a fast yet stable training method. Across a range of Atari environments, we demonstrate empirical improvements over target-network based methods in terms of both sample efficiency and performance. In summary, our approach provides a fast and stable alternative to replace the standard squared Bellman error. Link » Alexandre Piche · Joe Marino · Gian Maria Marconi · Valentin Thomas · Chris Pal · Emtiyaz Khan 🔗 - No DICE: An Investigation of the Bias-Variance Tradeoff in Meta-Gradients (Poster) []   link »    Meta-gradients provide a general approach for optimizing the meta-parameters of reinforcement learning (RL) algorithms. Estimation of meta-gradients is central to the performance of these meta-algorithms, and has been studied in the setting of MAML-style short-horizon meta-RL problems. 
In this context, prior work has investigated the estimation of the Hessian of the RL objective, as well as tackling the problem of credit assignment to pre-adaptation behavior by making a sampling correction. However, we show that Hessian estimation, implemented for example by DiCE and its variants, always adds bias and can also add variance to meta-gradient estimation. DiCE-like approaches are therefore unlikely to lie on the Pareto frontier of the bias-variance tradeoff and should not be pursued in the context of meta-gradients for RL. Meanwhile, the sampling correction has not been studied in the important long-horizon setting, where the inner optimization trajectories must be truncated for computational tractability. We study the bias and variance tradeoff induced by truncated backpropagation in combination with a weighted sampling correction. While prior work has implicitly chosen points in this bias-variance space, we disentangle the sources of bias and variance and present an empirical study which relates existing estimators to each other. Link » Risto Vuorio · Jacob Beck · Greg Farquhar · Jakob Foerster · Shimon Whiteson 🔗 - Block Contextual MDPs for Continual Learning (Poster) []   link »    In reinforcement learning (RL), when defining a Markov Decision Process (MDP), the environment dynamics is implicitly assumed to be stationary. This assumption of stationarity, while simplifying, can be unrealistic in many scenarios. In the continual reinforcement learning scenario, the sequence of tasks is another source of nonstationarity. In this work, we propose to examine this continual reinforcement learning setting through the Block Contextual MDP (BC-MDP) framework, which enables us to relax the assumption of stationarity. This framework challenges RL algorithms to handle both nonstationarity and rich observation settings and, by additionally leveraging smoothness properties, enables us to study generalization bounds for this setting. 
Finally, we take inspiration from adaptive control to propose a novel algorithm that addresses the challenges introduced by this more realistic BC-MDP setting, allows for zero-shot adaptation at evaluation time, and achieves strong performance on several nonstationary environments. Link » Shagun Sodhani · Franziska Meier · Joelle Pineau · Amy Zhang 🔗 - PFPN: Continuous Control of Physically Simulated Characters using Particle Filtering Policy Network (Poster) []   link »    Data-driven methods for physics-based character control using reinforcement learning have been successfully applied to generate high-quality motions. However, existing approaches typically rely on Gaussian distributions to represent the action policy, which can prematurely commit to suboptimal actions when solving high-dimensional continuous control problems for highly-articulated characters. In this paper, to improve the learning performance of physics-based character controllers, we propose a framework that considers a particle-based action policy as a substitute for Gaussian policies. We exploit particle filtering to dynamically explore and discretize the action space, and track the posterior policy represented as a mixture distribution. The resulting policy can replace the unimodal Gaussian policy which has been the staple for character control problems, without changing the underlying model architecture of the reinforcement learning algorithm used to perform policy optimization. We demonstrate the applicability of our approach on various motion capture imitation tasks. Baselines using our particle-based policies achieve better imitation performance and speed of convergence as compared to corresponding implementations using Gaussians, and are more robust to external perturbations during character control. 
Link » Pei Xu · Ioannis Karamouzas 🔗 - Recurrent Off-policy Baselines for Memory-based Continuous Control (Poster) []   link »    When the environment is partially observable (PO), a deep reinforcement learning (RL) agent must learn a suitable temporal representation of the entire history in addition to a strategy to control. This problem is not novel, and there have been model-free and model-based algorithms proposed for this problem. However, inspired by recent success in model-free image-based RL, we noticed the absence of a model-free baseline for history-based RL that (1) uses full history and (2) incorporates recent advances in off-policy continuous control. Therefore, we implement recurrent versions of DDPG, TD3, and SAC (RDPG, RTD3, and RSAC) in this work, evaluate them on short-term and long-term PO domains, and investigate key design choices. Our experiments show that RDPG and RTD3 can surprisingly fail on some domains and that RSAC is the most reliable, reaching near-optimal performance on nearly all domains. However, one task that requires systematic exploration still proved to be difficult, even for RSAC. These results show that model-free RL can learn good temporal representation using only reward signals; the primary difficulty seems to be computational cost and exploration. To facilitate future research, we have made our PyTorch implementation publicly available. Link » Zhihan Yang · Nguyen Hai 🔗 - A Framework for Efficient Robotic Manipulation (Poster) []   link »    Recent advances in unsupervised representation learning significantly improved the sample efficiency of training Reinforcement Learning policies in simulated environments. However, similar gains have not yet been seen for real-robot learning. In this work, we focus on enabling data-efficient real-robot learning from pixels. 
We present a Framework for Efficient Robotic Manipulation (FERM), a method that utilizes data augmentation and unsupervised learning to achieve sample-efficient training of real-robot arm policies from sparse rewards. While contrastive pre-training, data augmentation, and demonstrations are alone insufficient for efficient learning, our main contribution is showing that the combination of these disparate techniques results in a simple yet data-efficient method. We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels, such as reaching, picking, moving, pulling a large object, flipping a switch, and opening a drawer in just 30 minutes of mean real-world training time. Link » Albert Zhan · Ruihan Zhao · Lerrel Pinto · Pieter Abbeel · Misha Laskin 🔗 - Transfer RL across Observation Feature Spaces via Model-Based Regularization (Poster) []   link »    In many reinforcement learning (RL) applications, the observation space is specified by human developers and restricted by physical realizations, and may thus be subject to dramatic changes over time (e.g. increased number of observable features). However, when the observation space changes, the previous policy usually fails due to the mismatch of input features, and therefore one has to train another policy from scratch, which is computationally and sample inefficient. In this paper, we propose a novel algorithm that extracts the latent-space dynamics in the source task, and transfers the dynamics model to the target task with a model-based regularizer. Theoretical analysis shows that the transferred dynamics model helps with representation learning in the target task. Our algorithm works for drastic changes of observation space (e.g. from vector-based observation to image-based observation), without any inter-task mapping or any prior knowledge of the target task. 
Empirical results show that our algorithm significantly improves the efficiency and stability of learning in the target task. Link » Yanchao Sun · Ruijie Zheng · Xiyao Wang · Andrew Cohen · Furong Huang 🔗 - Embodiment perspective of reward definition for behavioural homeostasis (Poster) []   link »    In this work, we propose a neural homeostat, a neural machine that stabilises the internal physiological state through interactions with the environment. Based on this framework, we demonstrate that behavioural homeostasis with low-level continuous motor control emerges from an embodied agent using only rewards computed by the agent's local information. Using the bodily state of the embodied agent as the reward source, the complexity of the reward definition is 'outsourced' into the coupled dynamics of the bodily state and the environment. Therefore, our definition of the reward is simple, but the optimised behaviour of the agent can be surprisingly complex. Our contributions are 1) an extension of homeostatic reinforcement learning to enable continuous motor control using deep reinforcement learning; 2) a comparison of homeostatic reward definitions from previous studies, where we found that homeostatic rewards using the difference of the drive function performed best; and 3) a demonstration of the emergence of adaptive behaviour from low-level motor control through direct optimisation of the homeostatic objective. Link » Naoto Yoshida · Yasuo Kuniyoshi 🔗 - Communication-Efficient Actor-Critic Methods for Homogeneous Markov Games (Poster) []   link »    Recent success in cooperative multi-agent reinforcement learning (MARL) relies on centralized training and policy sharing. Centralized training eliminates the issue of non-stationarity in MARL yet induces large communication costs, and policy sharing is empirically crucial to efficient learning in certain tasks yet lacks theoretical justification. 
In this paper, we formally characterize a subclass of cooperative Markov games where agents exhibit a certain level of homogeneity such that policy sharing provably incurs no suboptimality. This enables us to develop the first consensus-based decentralized actor-critic method where the consensus update is applied to both the actors and the critics while ensuring convergence. We also develop practical algorithms based on our decentralized actor-critic method to reduce the communication cost during training, while still yielding policies comparable with centralized training. Link » Dingyang Chen · Yile Li · Qi Zhang 🔗 - URLB: Unsupervised Reinforcement Learning Benchmark (Poster) []   link »    Deep Reinforcement Learning (RL) has emerged as a powerful paradigm to solve a range of complex yet specific control tasks. Training generalist agents that can quickly adapt to new tasks remains an outstanding challenge. Recent advances in unsupervised RL have shown that pre-training RL agents with self-supervised intrinsic rewards can result in efficient adaptation. However, these algorithms have been hard to compare and develop due to the lack of a unified benchmark. To this end, we introduce the Unsupervised Reinforcement Learning Benchmark (URLB). URLB consists of two phases: reward-free pre-training and downstream task adaptation with extrinsic rewards. Building on the DeepMind Control Suite, we provide twelve continuous control tasks from three domains for evaluation and open-source code for eight leading unsupervised RL methods. We find that the implemented baselines make progress but are not able to solve URLB and propose directions for future research. 
Link » Misha Laskin · Denis Yarats · Hao Liu · Kimin Lee · Albert Zhan · Kevin Lu · Catherine Cang · Lerrel Pinto · Pieter Abbeel 🔗 - Offline Reinforcement Learning with In-sample Q-Learning (Poster) []   link » Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This tradeoff is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function, without any explicit policy. Then, we extract the policy via advantage-weighted behavioral cloning, which also avoids querying out-of-sample actions. We dub our method in-sample Q-learning (IQL). 
IQL is easy to implement, computationally efficient, and only requires fitting an additional critic with an asymmetric L2 loss. IQL demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong fine-tuning performance using online interaction after offline initialization. Link » Ilya Kostrikov · Ashvin Nair · Sergey Levine 🔗 - Wasserstein Distance Maximizing Intrinsic Control (Poster) []   link »    This paper deals with the problem of learning a skill-conditioned policy that acts meaningfully in the absence of a reward signal. Mutual information based objectives have shown some success in learning skills that reach a diverse set of states in this setting. These objectives include a KL-divergence term, which is maximized by visiting distinct states even if those states are not far apart in the MDP. This paper presents an approach that rewards the agent for learning skills that maximize the Wasserstein distance of their state visitation from the start state of the skill. It shows that such an objective leads to a policy that covers more distance in the MDP than diversity based objectives, and validates the results on a variety of Atari environments. Link » Ishan Durugkar · Steven Hansen · Stephen Spencer · Volodymyr Mnih 🔗 - Augmenting Reinforcement Learning with Behavior Primitives for Diverse Manipulation Tasks (Poster) []   link »    Realistic manipulation tasks require a robot to interact with an environment with a prolonged sequence of motor actions. While deep reinforcement learning methods have recently emerged as a promising paradigm for automating manipulation behaviors, they usually fall short in long-horizon tasks due to the exploration burden. 
This work introduces MAnipulation Primitive-augmented reinforcement LEarning (MAPLE), a learning framework that augments standard reinforcement learning algorithms with a pre-defined library of behavior primitives. These behavior primitives are robust functional modules specialized in achieving manipulation goals, such as grasping and pushing. To use these heterogeneous primitives, we develop a hierarchical policy that involves the primitives and instantiates their executions with input parameters. We demonstrate that MAPLE outperforms baseline approaches by a significant margin on a suite of simulated manipulation tasks. We also quantify the compositional structure of the learned behaviors and highlight our method's ability to transfer policies to new task variants and to physical hardware. Link » Soroush Nasiriany · Huihan Liu · Yuke Zhu 🔗 - Strength Through Diversity: Robust Behavior Learning via Mixture Policies (Poster) []   link »    Efficiency in robot learning is highly dependent on hyperparameters. Robot morphology and task structure differ widely and finding the optimal setting typically requires sequential or parallel repetition of experiments, strongly increasing the interaction count. We propose a training method that only relies on a single trial by enabling agents to select and combine controller designs conditioned on the task. Our Hyperparameter Mixture Policies (HMPs) feature diverse sub-policies that vary in distribution types and parameterization, reducing the impact of design choices and unlocking synergies between low-level components. We demonstrate strong performance on the DeepMind Control Suite, Meta-World tasks and a simulated ANYmal robot, showing that HMPs yield robust, data-efficient learning. 
Link » Tim Seyde · Wilko Schwarting · Igor Gilitschenski · Markus Wulfmeier · Daniela Rus 🔗 - Long-Term Credit Assignment via Model-based Temporal Shortcuts (Poster) []   link »    This work explores the question of long-term credit assignment in reinforcement learning. Assigning credit over long distances has historically been difficult in both reinforcement learning and recurrent neural networks, where discounting or gradient truncation respectively are often necessary for feasibility, but limit the model's ability to reason over longer time scales. We propose LVGTS, a novel model-based algorithm that bridges the gap between the two fields. By using backpropagation through a latent model and temporal shortcuts to directly propagate gradients, LVGTS assigns credit from the future to the possibly distant past regardless of the use of discounting or gradient truncation. We show, on simple but carefully-designed problems, that our approach is able to perform effective credit assignment even in the presence of distractions. Link » Michel Ma · Pierluca D'Oro · Yoshua Bengio · Pierre-Luc Bacon 🔗 - C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks (Poster) []   link »    Goal-conditioned reinforcement learning (RL) has shown great success recently at solving a wide range of tasks (e.g., navigation, robotic manipulation). However, learning to reach distant goals remains a central challenge to the field, and the task is particularly hard without any expert demonstrations and reward shaping. In this paper, we propose to solve the distant goal-reaching task by using search at training time to generate a curriculum of intermediate states. Specifically, we introduce the algorithm Classifier Planning (C-Planning) by framing the learning of the goal-conditioned policies as variational inference. 
C-Planning naturally follows expectation maximization (EM): the E step corresponds to planning an optimal sequence of waypoints using graph search, while the M step corresponds to learning a goal-conditioned policy to reach those waypoints. One essential difficulty of designing such an algorithm is accurately modeling the distribution over waypoints to sample from. In C-Planning, we propose to sample the waypoints using contrastive methods to learn a value function. Unlike prior methods that combine goal-conditioned RL with graph search, ours performs search only during training and not testing, significantly decreasing the compute costs of deploying the learned policy. Empirically, we demonstrate that our method not only improves the sample efficiency of prior methods but also successfully solves temporally-extended navigation and manipulation tasks that prior goal-conditioned RL methods (including those based on graph search) fail to solve. Link » Tianjun Zhang · Ben Eysenbach · Russ Salakhutdinov · Sergey Levine · Joseph Gonzalez 🔗 - General Characterization of Agents by States they Visit (Poster) []   link »    Behavioural characterizations (BCs) of decision-making agents, or their policies, are used to study outcomes of training algorithms and as part of the algorithms themselves to encourage unique policies, match expert policy or restrict changes to policy per update. However, previously presented solutions are not applicable in general, either due to lack of expressive power, computational constraint or constraints on the policy or environment. Furthermore, many BCs rely on the actions of policies. We discuss and demonstrate how these BCs can be misleading, especially in stochastic environments, and propose a novel solution based on what states policies visit. We run experiments to evaluate the quality of the proposed BC against baselines and evaluate their use in studying training algorithms, novelty search and trust-region policy optimization. 
Link » Anssi Kanervisto · Ville Hautamäki 🔗 - Targeted Environment Design from Offline Data (Poster) []   link »    In reinforcement learning (RL) the use of simulators is ubiquitous, allowing cheaper and safer agent training than training directly in the real target environment. However, this approach relies on the simulator being a sufficiently accurate reflection of the target environment, which is difficult to achieve in practice. Accordingly, recent methods have proposed an alternative paradigm, utilizing offline datasets from the target environment to train an agent, avoiding online access to either the target or any simulated environment but leading to poor generalization outside the support of the offline data. Here, we propose to combine these two paradigms to leverage both offline datasets and synthetic simulators. We formalize our approach as offline targeted environment design (OTED), which automatically learns a distribution over simulator parameters to match a provided offline dataset, and then uses the learned simulator to train an RL agent in standard online fashion. We derive an objective for learning the simulator parameters which corresponds to minimizing a divergence between the target offline dataset and the state-action distribution induced by the simulator. We evaluate our method on standard offline RL benchmarks and show that it yields impressive results compared to existing approaches, thus successfully leveraging both offline datasets and simulators for better RL. Link » Izzeddin Gur · Ofir Nachum · Aleksandra Faust 🔗 - GPU-Podracer: Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning (Poster) []   link »    Deep reinforcement learning (DRL) has revolutionized learning and actuation in applications such as game playing and robotic control. The cost of data collection, i.e., generating transitions from agent-environment interactions, remains a major challenge for wider DRL adoption in complex real-world problems. 
Following a cloud-native paradigm to train DRL agents on a GPU cloud platform is a promising solution. In this paper, we present a scalable and elastic library GPU-podracer for cloud-native deep reinforcement learning, which efficiently utilizes millions of GPU cores to carry out massively parallel agent-environment interactions. At a high level, GPU-podracer employs a tournament-based ensemble scheme to orchestrate the training process on hundreds or even thousands of GPUs, scheduling the interactions between a leaderboard and a training pool with hundreds of pods. At a low level, each pod simulates agent-environment interactions in parallel by fully utilizing nearly 7,000 GPU CUDA cores in a single GPU. Our GPU-podracer library features high scalability, elasticity and accessibility by following the development principles of containerization, microservices and MLOps. Using an NVIDIA DGX SuperPOD cloud, we conduct extensive experiments on various tasks in locomotion and stock trading and show that GPU-podracer outperforms Stable Baselines3 and RLlib, e.g., GPU-podracer achieves nearly linear scaling. Link » Xiao-Yang Liu · Zhuoran Yang · Zhaoran Wang · Anwar Walid · Jian Guo · Michael Jordan 🔗 - Behavior Predictive Representations for Generalization in Reinforcement Learning (Poster) []   link »    Deep reinforcement learning (RL) agents trained on a few environments often struggle to generalize to unseen environments, even when such environments are semantically equivalent to training environments. Such agents learn representations that overfit the characteristics of the training environments. We posit that generalization can be improved by assigning similar representations to scenarios with similar sequences of long-term optimal behavior. To do so, we propose behavior predictive representations (BPR) that capture long-term optimal behavior. 
BPR trains an agent to predict latent state representations multiple steps into the future such that these representations can predict the optimal behavior at the future steps. We demonstrate that BPR provides large gains on a jumping task from pixels, a problem designed to test generalization. Link » Siddhant Agarwal · Aaron Courville · Rishabh Agarwal 🔗 - Fast and Data-Efficient Training of Rainbow: an Experimental Study on Atari (Poster) []   link »    Across the Arcade Learning Environment, Rainbow achieves a level of performance competitive with humans and modern RL algorithms. However, attaining this level of performance requires large amounts of data and hardware resources, making research in this area computationally expensive and use in practical applications often infeasible. This paper's contribution is threefold: We (1) propose an improved version of Rainbow, seeking to drastically reduce Rainbow's data, training time, and compute requirements while maintaining its competitive performance; (2) we empirically demonstrate the effectiveness of our approach through experiments on the Arcade Learning Environment, and (3) we conduct a number of ablation studies to investigate the effect of the individual proposed modifications. Our improved version of Rainbow reaches a median human normalized score close to classic Rainbow's, while using 20 times less data and requiring only 7.5 hours of training time on a single GPU. We also provide our full implementation including pre-trained models. Link » Dominik Schmidt · Thomas Schmied 🔗 - Implicit Behavioral Cloning (Poster) []   link »    We find that across a wide range of robot policy learning scenarios, treating supervised policy learning with an implicit model generally performs better, on average, than commonly used explicit models. 
We present extensive experiments on this finding, and we provide both intuitive insight and theoretical arguments distinguishing the properties of implicit models compared to their explicit counterparts, particularly with respect to approximating complex, potentially discontinuous and multi-valued (set-valued) functions. On robotic policy learning tasks we show that implicit behavioral cloning policies with energy-based models (EBM) often outperform common explicit (Mean Square Error, or Mixture Density) counterparts, including on tasks with high-dimensional action spaces and visual image inputs. We find these policies provide competitive results or outperform state-of-the-art offline reinforcement learning methods on the challenging human-expert tasks from the D4RL benchmark suite, despite using no reward information. In the real world, robots with implicit policies can learn complex and remarkably subtle behaviors on contact-rich tasks from human demonstrations, including tasks with high combinatorial complexity and tasks requiring 1mm precision. Link » Pete Florence · Corey Lynch · Andy Zeng · Oscar Ramirez · Ayzaan Wahid · Laura Downs · Adrian Wong · Igor Mordatch · Jonathan Tompson 🔗 - Policy Gradients Incorporating the Future (Poster) []   link »    Reasoning about the future -- understanding how decisions in the present time affect outcomes in the future -- is one of the central challenges for reinforcement learning (RL), especially in highly-stochastic or partially observable environments. While predicting the future directly is hard, in this work we introduce a method that allows an agent to "look into the future" without explicitly predicting it. Namely, we propose to allow an agent, during its training on past experience, to observe what actually happened in the future at that time, while enforcing an information bottleneck to avoid the agent overly relying on this privileged information. 
Coupled with recent advances in variational inference and a latent-variable autoregressive model, this gives our agent the ability to utilize rich and useful information about the future trajectory dynamics in addition to the present. Our method, Policy Gradients Incorporating the Future (PGIF), is easy to implement and versatile, being applicable to virtually any policy gradient algorithm. We apply our proposed method to a number of off-the-shelf RL algorithms and show that PGIF is able to achieve higher reward faster in a variety of online and offline RL domains, as well as sparse-reward and partially observable environments. Link » David Venuto · Elaine Lau · Doina Precup · Ofir Nachum 🔗 - TempoRL: Temporal Priors for Exploration in Off-Policy Reinforcement Learning (Poster) []   link »    Effective exploration is a crucial challenge in deep reinforcement learning. Behavioral priors have been shown to tackle this problem successfully, at the expense of reduced generality and restricted transferability. We thus propose temporal priors as a non-Markovian generalization of behavioral priors for guiding exploration in reinforcement learning. Critically, we focus on state-independent temporal priors, which exploit the idea of temporal consistency and are generally applicable and capable of transferring across a wide range of tasks. We show how dynamically sampling actions from a probabilistic mixture of policy and temporal prior can accelerate off-policy reinforcement learning in unseen downstream tasks. We provide empirical evidence that our approach improves upon strong baselines in long-horizon continuous control tasks under sparse reward settings. 
Link » Marco Bagatella · Sammy Christen · Otmar Hilliges 🔗 - Dynamic Mirror Descent based Model Predictive Control for Accelerating Robot Learning (Poster) []   link »    Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches to get the best from both: asymptotic performance of Mf-RL and high sample-efficiency of Mb-RL. Inspired by these works, we propose a hierarchical framework that integrates online learning for the Mb-trajectory optimization with off-policy methods for the Mf-RL. In particular, two loops are proposed, where the Dynamic Mirror Descent based Model Predictive Control (DMD-MPC) is used as the inner loop Mb-RL to obtain an optimal sequence of actions. These actions are in turn used to significantly accelerate the outer loop Mf-RL. We show that our formulation is generic for a broad class of MPC based policies and objectives, and includes some of the well-known Mb-Mf approaches. We finally introduce a new algorithm: Mirror-Descent Model Predictive RL (M-DeMoRL), which uses Cross-Entropy Method (CEM) with elite fractions for the inner loop. Our experiments show faster convergence of the proposed hierarchical approach on benchmark MuJoCo tasks. We also demonstrate hardware training for trajectory tracking in a 2R leg, and hardware transfer for robust walking in a quadruped. We show that the inner-loop Mb-RL significantly decreases the number of training iterations required in the real system, thereby validating the proposed approach. Link » Utkarsh A Mishra · Soumya Samineni · Shalabh Bhatnagar · Shishir N Y 🔗 - Exploring through Random Curiosity with General Value Functions (Poster) []   link »    Exploration in reinforcement learning through intrinsic rewards has previously been addressed by approaches based on state novelty or artificial curiosity. In partially observable settings where observations look alike, state novelty can lead to intrinsic reward vanishing prematurely. 
On the other hand, curiosity-based approaches require modeling precise environment dynamics which are potentially quite complex. Here we propose random curiosity with general value functions (RC-GVF), an intrinsic reward function that connects state novelty and artificial curiosity. Instead of predicting the entire environment dynamics, RC-GVF predicts temporally extended values through general value functions (GVFs) and uses the prediction error as an intrinsic reward. In this way, our approach generalizes a popular approach called random network distillation (RND) by encouraging behavioral diversity and reduces the need for additional maximum entropy regularization. Our experiments on four procedurally generated partially observable environments indicate that our approach is competitive with RND and could be beneficial in environments that require behavioural exploration. Link » Aditya Ramesh · Louis Kirsch · Sjoerd van Steenkiste · Jürgen Schmidhuber 🔗 - Maximum Entropy Model-based Reinforcement Learning (Poster) []   link »    Recent advances in reinforcement learning have demonstrated its ability to solve hard agent-environment interaction tasks on a super-human level. However, the application of reinforcement learning methods to practical, real-world tasks is currently limited by the sample inefficiency of most state-of-the-art RL algorithms, i.e., the need for a vast number of training episodes. For example, the OpenAI Five algorithm that beat human players in Dota 2 was trained for thousands of years of game time. Several approaches exist that tackle the issue of sample inefficiency; they either offer a more efficient usage of already gathered experience or aim to gain a more relevant and diverse experience via a better exploration of an environment. However, to our knowledge, no such approach exists for model-based algorithms, which have shown high sample efficiency in solving hard control tasks with high-dimensional state spaces. 
This work connects exploration techniques and model-based reinforcement learning. We have designed a novel exploration method that takes into account features of the model-based approach. We also demonstrate through experiments that our method significantly improves the performance of the model-based algorithm Dreamer. Link » Oleg Svidchenko · Aleksei Shpilman 🔗 - Exponential Family Model-Based Reinforcement Learning via Score Matching (Poster) []   link »    We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. SMRL achieves $\tilde O(d\sqrt{H^3T})$ regret, where $H$ is the length of each episode and $T$ is the total number of interactions. Link » Gene Li · Junbo Li · Nathan Srebro · Zhaoran Wang · Zhuoran Yang 🔗 - Imitation Learning from Pixel Observations for Continuous Control (Poster) []   link »    We study imitation learning from visual observations only for controlling dynamical systems with continuous states and actions. This setting is attractive due to the large amount of video data available from which agents could learn. However, it is challenging due to (i) not observing the actions and (ii) the high-dimensional visual space. In this setting, we explore recipes for imitation learning based on adversarial learning and optimal transport. These recipes enable us to scale these methods to attain expert-level performance on visual continuous control tasks in the DeepMind control suite. We investigate the tradeoffs of these approaches and present a comprehensive evaluation of the key design choices. 
To encourage reproducible research in this area, we provide an easy-to-use implementation for benchmarking visual imitation learning, including our methods and expert demonstrations. Link » Samuel Cohen · Brandon Amos · Marc Deisenroth · Mikael Henaff · Eugene Vinitsky · Denis Yarats 🔗 - Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning (Poster) []   link »    We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward and propose an adjustment of standard imitation learning algorithms to fit this setup. In addition, we discuss the problem of distribution shift between the expert data and the online environment when partial observability is present in the data. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks. Link » Guy Tennenholtz · Assaf Hallak · Gal Dalal · Shie Mannor · Gal Chechik · Uri Shalit 🔗 - Latent Geodesics of Model Dynamics for Offline Reinforcement Learning (Poster) []   link »    Model-based offline reinforcement learning approaches generally rely on bounds of model error. While contemporary methods achieve such bounds through an ensemble of models, we propose to estimate them using a data-driven latent metric. Particularly, we build upon recent advances in Riemannian geometry of generative models to construct a latent metric of an encoder-decoder based forward model. 
Our proposed metric measures both the quality of out-of-distribution samples and the discrepancy of examples in the data. We show that our metric can be viewed as a combination of two metrics, one relating to proximity and the other to epistemic uncertainty. Finally, we leverage our metric in a pessimistic model-based framework, showing a significant improvement upon contemporary model-based offline reinforcement learning benchmarks. Link » Guy Tennenholtz · Nir Baram · Shie Mannor 🔗 - An Empirical Study of Non-Uniform Sampling in Off-Policy Reinforcement Learning for Continuous Control (Poster) []   link »    Off-policy reinforcement learning (RL) algorithms can take advantage of samples generated from all previous interactions with the environment through "experience replay". Such methods outperform almost all on-policy and model-based alternatives in complex tasks where a structured or well parameterized model of the world does not exist. This makes them desirable for practitioners who lack domain-specific knowledge, but who still require high sample efficiency. However, this high performance can come at a cost. Because of additional hyperparameters introduced to efficiently learn function approximators, off-policy RL can perform poorly on new problems. To address parameter sensitivity, we show how the correct choice of non-uniform sampling for experience replay can stabilize model performance under varying environmental conditions and hyper-parameters. Link » Nicholas Ioannidis · Jonathan Lavington · Mark Schmidt 🔗 - On Using Hamiltonian Monte Carlo Sampling for Reinforcement Learning Problems in High-dimension (Poster) []   link »    Value function based reinforcement learning (RL) algorithms, for example, $Q$-learning, learn optimal policies from datasets of actions, rewards, and state transitions. 
However, when the underlying state transition dynamics are stochastic and evolve on a high-dimensional space, generating independent and identically distributed (IID) data samples for creating these datasets poses a significant challenge due to the intractability of the associated normalizing integral. In these scenarios, Hamiltonian Monte Carlo (HMC) sampling offers a computationally tractable way to generate data for training RL algorithms. In this paper, we introduce a framework, called Hamiltonian $Q$-Learning, that demonstrates, both theoretically and empirically, that $Q$ values can be learned from a dataset generated by HMC samples of actions, rewards, and state transitions. Furthermore, to exploit the underlying low-rank structure of the $Q$ function, Hamiltonian $Q$-Learning uses a matrix completion algorithm for reconstructing the updated $Q$ function from $Q$ value updates over a much smaller subset of state-action pairs. Thus, by providing an efficient way to apply $Q$-learning in stochastic, high-dimensional settings, the proposed approach broadens the scope of RL algorithms for real-world applications. Link » Udari Madhushani · Biswadip Dey · Naomi Leonard · Amit Chakraborty 🔗 - Skill Preferences: Learning to Extract and Execute Robotic Skills from Human Feedback (Poster) []   link »    A promising approach to solving challenging long-horizon tasks has been to extract behavior priors (skills) by fitting generative models to large offline datasets of demonstrations. However, such generative models inherit the biases of the underlying data and result in poor and unusable skills when trained on imperfect demonstration data. To better align skill extraction with human intent we present Skill Preferences (SkiP), an algorithm that learns a model over human preferences and uses it to extract human-aligned skills from offline data. After extracting human-preferred skills, SkiP also utilizes human feedback to solve downstream tasks with RL. 
We show that SkiP enables a simulated kitchen robot to solve complex multi-step manipulation tasks and substantially outperforms prior leading RL algorithms with human preferences as well as leading skill extraction algorithms without human preferences. Link » Xiaofei Wang · Kimin Lee · Kourosh Hakhamaneshi · Pieter Abbeel · Misha Laskin 🔗 - That Escalated Quickly: Compounding Complexity by Editing Levels at the Frontier of Agent Capabilities (Poster) []   link »    Deep Reinforcement Learning (RL) has recently produced impressive results in a series of settings such as games and robotics. However, a key challenge that limits the utility of RL agents for real-world problems is the agent's ability to generalize to unseen variations (or levels). To train more robust agents, the field of Unsupervised Environment Design (UED) seeks to produce a curriculum by updating both the agent and the distribution over training environments. Recent advances in UED have come from promoting levels with high regret, which provides theoretical guarantees in equilibrium and empirically has been shown to produce agents capable of zero-shot transfer to unseen human-designed environments. However, current methods require either learning an environment-generating adversary, which remains a challenging optimization problem, or curating a curriculum from randomly sampled levels, which is ineffective if the search space is too large. In this paper we instead propose to evolve a curriculum, by making edits to previously selected levels. Our approach, which we call Adversarially Compounding Complexity by Editing Levels (ACCEL), produces levels at the frontier of an agent's capabilities, resulting in curricula that start simple but become increasingly complex. ACCEL maintains the theoretical benefits of prior works, while outperforming them empirically when transferring to complex out-of-distribution environments. 
Link » Jack Parker-Holder · Minqi Jiang · Michael Dennis · Mikayel Samvelyan · Jakob Foerster · Edward Grefenstette · Tim Rocktäschel 🔗 - The Information Geometry of Unsupervised Reinforcement Learning (Poster) []   link »    How can a reinforcement learning (RL) agent prepare to solve downstream tasks if those tasks are not known a priori? One approach is unsupervised skill discovery, a class of algorithms that learn a set of policies without access to a reward function. Such algorithms bear a close resemblance to representation learning algorithms (e.g., contrastive learning) in supervised learning, in that both are pretraining algorithms that maximize some approximation to a mutual information objective. While prior work has shown that the set of skills learned by such methods can accelerate downstream RL tasks, prior work offers little analysis into whether these skill learning algorithms are optimal, or even what notion of optimality would be appropriate to apply to them. In this work, we show that unsupervised skill discovery algorithms based on mutual information maximization do not learn skills that are optimal for every possible reward function. However, we show that the distribution over skills provides an optimal initialization minimizing regret against adversarially-chosen reward functions, assuming a certain type of adaptation procedure. Our analysis also provides a geometric perspective on these skill learning methods. Link » Ben Eysenbach · Russ Salakhutdinov · Sergey Levine 🔗 - Mismatched No More: Joint Model-Policy Optimization for Model-Based RL (Poster) []   link »    Many model-based reinforcement learning (RL) methods follow a similar template: fit a model to previously observed data, and then use data from that model for RL or planning. 
However, models that achieve better training performance (e.g., lower MSE) are not necessarily better for control: an RL agent may seek out the small fraction of states where an accurate model makes mistakes, or it might act in ways that do not expose the errors of an inaccurate model. As noted in prior work, there is an objective mismatch: models are useful if they yield good policies, but they are trained to maximize their accuracy, rather than the performance of the policies that result from them. In this work we propose a model-learning objective that directly optimizes a model to be useful for model-based RL. This objective, which depends on samples from the learned model, is a (global) lower bound on the expected return in the real environment. We jointly optimize the policy and model using this one objective, thus mending the objective mismatch in prior work. The resulting algorithm (MnM) is conceptually similar to a GAN: a classifier distinguishes between real and fake transitions, the model is updated to produce transitions that look realistic, and the policy is updated to avoid states where the model predictions are unrealistic. Our theory justifies the intuition that the best dynamics for learning a good policy are not necessarily the correct dynamics. Link » Ben Eysenbach · Alexander Khazatsky · Sergey Levine · Russ Salakhutdinov 🔗 - Graph Backup: Data Efficient Backup Exploiting Markovian Data (Poster) []   link »    Bootstrapped value estimation has become a widely adopted ingredient for modern reinforcement learning algorithms. These methods compute a target value based on observed data and predictions for future values. The approximation error of the target value, which comes from stochastic dynamics and inaccurate predictions, can significantly affect the data efficiency of RL algorithms. 
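For readers unfamiliar with bootstrapped targets, the following minimal sketch shows the standard one-step and n-step targets that such value-estimation methods build on. It illustrates the baseline idea only and is not the graph backup operator proposed in the paper; the function name `n_step_target` and the default discount `gamma=0.99` are assumptions made for illustration.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of observed rewards plus a bootstrapped tail value.

    rewards:         rewards observed along one trajectory segment, in order.
    bootstrap_value: predicted value of the state reached after the segment.
    With a single reward this is the usual one-step TD target.
    """
    target = bootstrap_value
    # Fold rewards in from the end: target = r_t + gamma * (target so far).
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


if __name__ == "__main__":
    # One-step target: 1.0 + 0.5 * 2.0
    print(n_step_target([1.0], bootstrap_value=2.0, gamma=0.5))
    # Two-step target: 1.0 + 0.5 * (1.0 + 0.5 * 4.0)
    print(n_step_target([1.0, 1.0], bootstrap_value=4.0, gamma=0.5))
```

Longer reward sequences lean more on observed data and less on the predicted tail value, which is the trade-off between prediction error and environment stochasticity that the abstract refers to.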
Multi-step methods, such as n-step Q-learning and TD(lambda), leverage the chain structure of the data, alleviating the effect of inaccurate predictions and allowing credit assignment across a longer time horizon. However, the main limitation of such multi-step methods is that they treat each trajectory independently and thus fail to exploit the graph structure of certain MDPs, resulting in an inadequate estimate of the target value that misses the intersections between multiple trajectories. In this paper, we propose to treat the transition data of an MDP as a graph, and define a novel backup operator exploiting this graph structure. Compared to multi-step backup, our graph backup method allows counterfactual credit assignment, and can reduce the variance that comes from stochastic environment dynamics. Our empirical evaluation on MiniGrid and MinAtar shows graph backup can greatly improve data efficiency compared to one-step and multi-step backup. Link » Zhengyao Jiang · Tianjun Zhang · Rob Kirk · Tim Rocktäschel · Edward Grefenstette 🔗 - Offline Meta-Reinforcement Learning with Online Self-Supervision (Poster) []   link » Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adapting, and also meta-trains a policy that quickly adapts to data from a new task. 
Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We do not want to remove this distributional shift by simply adopting a conservative exploration strategy, because learning an exploration strategy enables an agent to collect better data for faster adaptation. Instead, we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. By not requiring reward labels for online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks. Link » Vitchyr Pong · Ashvin Nair · Laura Smith · Catherine Huang · Sergey Levine 🔗 - Unsupervised Learning of Temporal Abstractions using Slot-based Transformers (Poster) []   link » The discovery of reusable sub-routines simplifies decision-making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in a purely unsupervised fashion through observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about sub-routine boundary points in light of new incoming information. 
In this work we propose SloTTAr, a fully parallel approach that integrates sequence processing Transformers with a Slot Attention module for learning about sub-routines in an unsupervised fashion. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, while being up to $30\times$ faster on existing benchmarks. Link » Anand Gopalakrishnan · Kazuki Irie · Jürgen Schmidhuber · Sjoerd van Steenkiste 🔗 - Modern Hopfield Networks for Return Decomposition for Delayed Rewards (Poster) []   link »    Delayed rewards, which are separated from their causative actions by irrelevant actions, hamper learning in reinforcement learning (RL). Real-world problems in particular often contain such delayed and sparse rewards. Recently, return decomposition for delayed rewards (RUDDER) employed pattern recognition to remove or reduce delay in rewards, which dramatically simplifies the learning task of the underlying RL method. RUDDER was realized using a long short-term memory (LSTM). The LSTM was trained to identify important state-action pair patterns responsible for the return. Reward was then redistributed to these important state-action pairs. However, training the LSTM is often difficult and requires a large number of episodes. In this work, we replace the LSTM with the recently proposed continuous modern Hopfield networks (MHN) and introduce Hopfield-RUDDER. MHN are powerful trainable associative memories with large storage capacity. They require only a few training samples and excel at identifying and recognizing patterns. We use this property of MHN to identify important state-action pairs that are associated with low or high return episodes and directly redistribute reward to them. However, in partially observable environments, Hopfield-RUDDER requires additional information about the history of state-action pairs. 
Therefore, we evaluate several methods for compressing history and introduce reset-max history, a lightweight history compression using the max-operator in combination with a reset gate. We experimentally show that Hopfield-RUDDER is able to outperform LSTM-based RUDDER on various 1D environments with small numbers of episodes. Finally, we show in preliminary experiments that Hopfield-RUDDER scales to highly complex environments with the Minecraft ObtainDiamond task from the MineRL NeurIPS challenge. Link » Michael Widrich · Markus Hofmarcher · Vihang Patil · Angela Bitto · Sepp Hochreiter 🔗 - Learning Two-Player Mixture Markov Games: Kernel Function Approximation and Correlated Equilibrium (Poster) []   link » We consider learning the Nash equilibrium in two-player Markov Games with nonlinear function approximation, where the action-value function is approximated by a function in a Reproducing Kernel Hilbert Space (RKHS). The key challenge is how to do exploration in the high-dimensional function space. We propose novel online learning algorithms to find the Nash Equilibrium by minimizing the duality gap. At the core of our algorithms are upper and lower confidence bounds that are derived based on the principle of optimism in the face of uncertainty. We prove that our algorithm is able to attain an $O(\sqrt{T})$ regret with polynomial computational complexity, under very mild assumptions on the reward function and the underlying dynamics of the Markov games. This work provides one of the first results of desirable complexities in the learning of two-player Markov games with nonlinear function approximation in the kernel mixture settings, and its implications for function approximation via deep neural networks. 
Link » Chris Junchi Li · Dongruo Zhou · Quanquan Gu · Michael Jordan 🔗 - Interactive Robust Policy Optimization for Multi-Agent Reinforcement Learning (Poster) []   link » As machine learning is applied more to real-world problems like robotics, control of autonomous vehicles, drones, and recommendation systems, it becomes essential to consider the notion of agency where multiple agents with local observations start impacting each other and interact to achieve their goals. Multi-agent reinforcement learning (MARL) is concerned with developing learning algorithms that can discover effective policies in multi-agent environments. In this work, we develop algorithms for addressing two critical challenges in MARL: non-stationarity and robustness. We show that naive independent reinforcement learning does not preserve the strategic game-theoretic interaction between the agents, and we present a way to realize the classical infinite order recursion reasoning in a reinforcement learning setting. We refer to this framework as Interactive Policy Optimization (IPO) and derive four MARL algorithms using centralized-training-decentralized-execution that generalize the widely used single-agent policy gradient methods to multi-agent settings. Finally, we provide a method to estimate opponent's parameters in adversarial settings using maximum likelihood and integrate IPO with an adversarial learning framework to train agents robust to destabilizing disturbances from the environment/adversaries and for better sim2real transfer from simulated multi-agent environments to the real world. Link » Videh Nema · Balaraman Ravindran 🔗 - Stability Analysis in Mixed-Autonomous Traffic with Deep Reinforcement Learning (Poster) []   link »    With the development of deep neural networks and artificial intelligence, Autonomous Driving Systems (ADS) are developing rapidly. As Autonomous Vehicles (AVs) are commercialized, non-AVs and AVs will drive simultaneously on the road. 
The stability of autonomous vehicles can significantly affect the entire road condition. In this study, we use a Deep Reinforcement Learning (DRL) approach to make an AV learn reasonable lane-changing and acceleration control that keeps the desired velocity. To improve the AV's learning efficiency, we provide minimal state information and replace the lane-changing action space with a lower-level one. Accordingly, we use TD3 with a modified action selection method. Finally, the driving performance of the TD3-based AV and the LC2013-based vehicle is compared in various environments. The TD3-based AV performed better than the LC2013-based vehicle. Link » Dongsu Lee · Minhae Kwon 🔗 - Understanding the Effects of Dataset Composition on Offline Reinforcement Learning (Poster) []   link »    The promise of Offline Reinforcement Learning (RL) lies in learning policies from fixed datasets, without interacting with the environment. Because the agent cannot interact, the dataset is one of the most essential ingredients of the algorithm and has a large influence on the performance of the learned policy. Studies on how dataset composition influences various Offline RL algorithms are currently missing. To that end, we conducted a comprehensive empirical analysis of the effect of dataset composition on the performance of Offline RL algorithms for discrete action environments. The performance is studied through two metrics of the datasets, Trajectory Quality (TQ) and State-Action Coverage (SACo). Our analysis suggests that variants of the off-policy Deep-Q-Network family rely on the dataset to exhibit high SACo. In contrast, algorithms that constrain the learned policy towards the data generating policy perform well across datasets, if they exhibit high TQ or SACo or both. For datasets with high TQ, Behavior Cloning outperforms or performs similarly to the best Offline RL algorithms. 
Link » Kajetan Schweighofer · Markus Hofmarcher · Marius-Constantin Dinu · Philipp Renz · Angela Bitto · Vihang Patil · Sepp Hochreiter 🔗 - Learning Efficient Multi-Agent Cooperative Visual Exploration (Poster) []   link »    We consider the task of visual indoor exploration with multiple agents, where the agents need to cooperatively explore the entire indoor region using as few steps as possible. Classical planning-based methods often suffer from particularly expensive computation at each inference step and a limited expressiveness of cooperation strategy. By contrast, reinforcement learning (RL) has become a trending paradigm for tackling this challenge due to its modeling capability of arbitrarily complex strategies and minimal inference overhead. We extend the state-of-the-art single-agent RL solution, Active Neural SLAM (ANS), to the multi-agent setting by introducing a novel RL-based global-goal planner, Spatial Coordination Planner (SCP), which leverages spatial information from each individual agent in an end-to-end manner and effectively guides the agents to navigate towards different spatial goals with high exploration efficiency. SCP consists of a transformer-based relation encoder to capture intra-agent interactions and a spatial action decoder to produce accurate goals. In addition, we also implement a few multi-agent enhancements to process local information from each agent for an aligned spatial representation and more precise local planning. Our final solution, Multi-Agent Active Neural SLAM (MAANS), combines all these techniques and substantially outperforms 4 different planning-based methods and various RL baselines in the photo-realistic physical testbed, Habitat. 
Link » Chao Yu · Jiaxuan Gao · Huazhong Yang · Yu Wang · Yi Wu 🔗 - Mean-Variance Efficient Reinforcement Learning by Expected Quadratic Utility Maximization (Poster) []   link »    Risk management is critical in decision making, and the mean-variance (MV) trade-off is one of the most common criteria. However, in reinforcement learning (RL) for sequential decision making under uncertainty, most of the existing methods for MV control suffer from computational difficulties caused by the double sampling problem. In this paper, in contrast to strict MV control, we consider learning MV efficient policies that achieve Pareto efficiency regarding the MV trade-off. To achieve this purpose, we train an agent to maximize the expected quadratic utility function, a common objective of risk management in finance and economics. We call our approach direct expected quadratic utility maximization (EQUM). The EQUM does not suffer from the double sampling issue because it does not include gradient estimation of variance. We confirm that the maximizer of the objective in the EQUM directly corresponds to an MV efficient policy under a certain condition. We conduct experiments with benchmark settings to demonstrate the effectiveness of the EQUM. Link » Masahiro Kato · Kei Nakagawa · Kenshi Abe · Tetsuro Morimura 🔗 - Learning compositional tasks from language instructions (Poster) []   link »    Systematic compositionality, the ability to combine learned knowledge and skills to solve novel tasks, is a key aspect of generalization in humans that allows us to understand and perform tasks described by novel language utterances. While progress has been made in supervised learning settings, no work has yet studied compositional generalization of a reinforcement learning agent following natural language instructions in an embodied environment. 
We develop a set of tasks in a photo-realistic simulated kitchen environment that allow us to study the degree to which a behavioral policy captures the systematicity in language by studying its zero-shot generalization performance on held out natural language instructions. We show that our agent, which leverages a novel additive action-value decomposition in tandem with attention-based subgoal prediction, is able to exploit composition in text instructions to generalize to unseen tasks. Link » Lajanugen Logeswaran · Wilka Carvalho · Honglak Lee 🔗 - Large Scale Coordination Transfer for Cooperative Multi-Agent Reinforcement Learning (Poster) []   link »    Multi-agent environments with large numbers of agents are difficult to solve due to the complexity associated with drawing sufficient samples for learning. While recent work has addressed the possibility of using transfer learning to improve sample complexities of reinforcement learning algorithms, methods for transferring knowledge in multi-agent domains across differing numbers of agents have rarely been considered. To address the bottleneck with sampling from large scale environments, we propose a joint critic structure motivated by graph convolutional networks and coordination graphs that allows for the direct transfer of parameters into environments with varying numbers of agents. We further consider fine-tuning the transferred policy and critic networks on the target domain and provide the motivation for doing so in cooperative environments where agent behavior is determined by a subset of the total population. Finally, we provide empirical results validating our claims on such environments, including popular multi-agent benchmark environments. 
Link » Ethan Wang · Binghong Chen · Le Song 🔗 - Return Dispersion as an Estimator of Learning Potential for Prioritized Level Replay (Poster) []   link » Prioritized Level Replay (PLR) has been shown to induce adaptive curricula that improve the sample-efficiency and generalization of reinforcement learning policies in environments featuring multiple tasks or levels. PLR selectively samples training levels weighted by a function of recent temporal-difference (TD) errors experienced on each level. We explore the dispersion of returns as an alternative prioritization criterion to address certain issues with TD error scores. Link » Iryna Korshunova · Minqi Jiang · Jack Parker-Holder · Tim Rocktäschel · Edward Grefenstette 🔗 - Status-quo policy gradient in Multi-Agent Reinforcement Learning (Poster) []   link »    Individual rationality, which involves maximizing expected individual return, does not always lead to optimal individual or group outcomes in multi-agent problems. For instance, in social dilemma situations, Reinforcement Learning (RL) agents trained to maximize individual rewards converge to mutual defection that is individually and socially sub-optimal. In contrast, humans evolve individually and socially optimal strategies in such social dilemmas. Inspired by ideas from human psychology that attribute this behavior in humans to the status-quo bias, we present a status-quo loss (SQLoss) and the corresponding policy gradient algorithm that incorporates this bias in an RL agent. We demonstrate that agents trained with SQLoss learn individually as well as socially optimal behavior in several social dilemma matrix games. To apply SQLoss to games where cooperation and defection are determined by a sequence of non-trivial actions, we present GameDistill, an algorithm that reduces a multi-step game with visual input to a matrix game. We empirically show how agents trained with SQLoss on GameDistill-reduced versions of Coin Game and StagHunt evolve optimal policies. 
Finally, we show that SQLoss extends to a 4-agent setting by demonstrating the emergence of cooperative behavior in the popular Braess' paradox. Link » Pinkesh Badjatiya · Mausoom Sarkar · Nikaash Puri · Jayakumar Subramanian · Abhishek Sinha · Siddharth Singh · Balaji Krishnamurthy 🔗 - Deep Reinforcement Learning Explanation via Model Transforms (Poster) []   link »    Understanding the emerging behaviors of deep reinforcement learning agents may be difficult because such agents are often trained using highly complex and expressive models. In recent years, most approaches developed for explaining agent behaviors rely on domain knowledge or on an analysis of the agent’s learned policy. For some domains, relevant knowledge may not be available or may be insufficient for producing meaningful explanations. We suggest using formal model abstractions and transforms, previously used mainly for expediting the search for optimal policies, to automatically explain discrepancies that may arise between the behavior of an agent and the behavior that is anticipated by an observer. We formally define this problem of Reinforcement Learning Policy Explanation (RLPE), suggest a class of transforms which can be used for explaining emergent behaviors, and suggest methods for searching efficiently for an explanation. We demonstrate the approach on standard benchmarks. Link » Sarah Keren · Yoav Kolumbus · Jeffrey S Rosenschein · David Parkes · Mira Finkelstein 🔗 - A Meta-Gradient Approach to Learning Cooperative Multi-Agent Communication Topology (Poster) []   link » In cooperative multi-agent reinforcement learning (MARL), agents often can only partially observe the environment state, and thus communication is crucial to achieving coordination. Communicating agents must simultaneously learn to whom to communicate (i.e., communication topology) and how to interpret the received message for decision-making. 
Although agents can efficiently learn communication interpretation by end-to-end backpropagation, learning communication topology is much trickier since the binary decisions of whether to communicate impede end-to-end differentiation. As evidenced in our experiments, existing solutions, such as reparameterization tricks and reformulating topology learning as reinforcement learning, often fall short. This paper introduces a meta-learning framework that aims to discover and continually adapt the update rules for communication topology learning. Empirical results show that our meta-learning approach outperforms existing alternatives in a range of cooperative MARL tasks and demonstrates a reasonably strong ability to generalize to tasks different from meta-training. Preliminary analyses suggest that, interestingly, the discovered update rules occasionally resemble human-designed rules such as policy gradients, yet remain qualitatively different in most cases. Link » Qi Zhang · Dingyang Chen 🔗 - A Family of Cognitively Realistic Parsing Environments for Deep Reinforcement Learning (Poster) []   link »    The hierarchical syntactic structure of natural language is a key feature of human cognition that enables us to recursively construct arbitrarily long sentences supporting communication of complex, relational information. In this work, we describe a framework in which learning cognitively-realistic left-corner parsers can be formalized as a Reinforcement Learning problem, and introduce a family of cognitively realistic chart-parsing environments to evaluate potential psycholinguistic implications of RL algorithms. We report how several baseline Q-learning and Actor Critic algorithms, both tabular and neural, perform on subsets of the Penn Treebank corpus. We observe a sharp increase in difficulty as parse trees get slightly more complex, indicating that hierarchical reinforcement learning might be required to solve this family of environments. 
Link » Adrian Brasoveanu · Rohan Pandey · Maximilian Alfano-Smith 🔗 - OstrichRL: A Musculoskeletal Ostrich Simulation to Study Bio-mechanical Locomotion (Poster) []   link »    Muscle-actuated control is a research topic of interest spanning different fields, in particular biomechanics, robotics and graphics. This type of control is particularly challenging because models are often overactuated, and dynamics are delayed and non-linear. It is however a very well tested and tuned actuation model that has undergone millions of years of evolution and that involves interesting properties exploiting passive forces of muscle-tendon units and efficient energy storage and release. To facilitate research on muscle-actuated simulation, we release a 3D musculoskeletal simulation of an ostrich based on the MuJoCo simulator. Ostriches are one of the fastest bipeds on earth and are therefore an excellent model for studying muscle-actuated bipedal locomotion. The model is based on CT scans and dissections used to gather actual muscle data such as insertion sites, lengths and pennation angles. Along with this model, we also provide a set of reinforcement learning tasks, including reference motion tracking and a reaching task with the neck. The reference motion data are based on motion capture clips of various behaviors which we pre-processed and adapted to our model. This paper describes how the model was built and iteratively improved using the tasks. We evaluate the accuracy of the muscle actuation patterns by comparing them to experimentally collected electromyographic data from locomoting birds. We believe that this work can be a useful bridge between the biomechanics, reinforcement learning, graphics and robotics communities, by providing a fast and easy to use simulation. 
Link » Vittorio La Barbera · Fabio Pardo · Yuval Tassa · Petar Kormushev · John Hutchinson 🔗 - Hybrid Imitative Planning with Geometric and Predictive Costs in Offroad Environments (Poster) []   link »    Mobile robots tasked with reaching user-specified goals in open-world outdoor environments must contend with numerous challenges, including complex perception and unexpected obstacles and terrains. Prior work has addressed such problems with geometric methods that reconstruct obstacles, as well as learning-based methods. While geometric methods provide good generalization, they can be brittle in outdoor environments that violate their assumptions (e.g., tall grass). On the other hand, learning-based methods can learn to directly select collision-free paths from raw observations, but are difficult to integrate with standard geometry-based pipelines. This creates an unfortunate "either-or" dichotomy -- either use learning and lose out on well-understood geometric navigational components, or do not use it, in favor of extensively hand-tuned geometry-based cost maps. The main idea of our approach is to reject this dichotomy by designing the learning and non-learning-based components in a way such that they can be easily and effectively combined and created without labeling any data. Both components contribute to a planning criterion: the learned component contributes predicted traversability as rewards, while the geometric component contributes obstacle cost information. We instantiate and comparatively evaluate our system in a high-fidelity simulator. We show that this approach inherits complementary gains from both components: the learning-based component enables the system to quickly adapt its behavior, and the geometric component often prevents the system from making catastrophic errors. 
Link » Daniel Shin · shah · Ali Agha · Nick Rhinehart · Sergey Levine 🔗 - Accelerated Deep Reinforcement Learning of Terrain-Adaptive Locomotion Skills (Poster) []   link » Learning locomotion skills on dynamic terrains allows creating realistic animations without recording motion capture data. The simulated character is trained to navigate varying terrains avoiding obstacles with balance and agility. Model-free reinforcement learning has been used to develop such skills for simulated characters. In particular, a mixture of actor-critic experts (MACE) was recently shown to enable learning of such complex skills by promoting specialization and incorporating human knowledge. However, this approach still requires access to a very large number of training interactions and explorations with a computationally expensive simulator. We demonstrate how to accelerate model-free reinforcement learning to acquire terrain-adaptive locomotion skills, as well as decrease the need for large-scale exploration. We first generalize model-based value expansion (MVE) to a mixture of actor-critic experts, showing the conditions under which the method accelerates learning in this generalized setting. This motivates combining MACE with MVE, resulting in the MACE-MVE algorithm. We then propose learning to predict future terrains, character states, rewards, and the probability of falling down via convolutional networks to speed up learning using generalized MVE. We analyze our approach empirically, showing that it can substantially speed up learning of such challenging skills. Finally, we study the effect of various design choices to control for uncertainty and manage dynamics fidelity. Link » Khaled Refaat · Kai Ding 🔗 - CoMPS: Continual Meta Policy Search (Poster) []   link »    We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the agent's goal is to achieve high reward over any sequence of tasks quickly. 
Prior meta-reinforcement learning algorithms have demonstrated promising results in accelerating the acquisition of new tasks. However, they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, continual meta-policy search (CoMPS), that removes this limitation by meta-training in an incremental fashion, over each task in a sequence, without revisiting prior tasks. CoMPS continuously repeats two subroutines: learning a new task using RL and using the experience from RL to perform completely offline meta-learning to prepare for subsequent task learning. We find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement methods on several sequences of challenging continuous control tasks. Link » Glen Berseth · Zhiwei Zhang · Grace Zhang · Chelsea Finn · Sergey Levine 🔗 - Continuous Control with Action Quantization from Demonstrations (Poster) []   link » In Reinforcement Learning (RL), discrete actions, as opposed to continuous actions, result in less complex exploration problems and the immediate computation of the maximum of the action-value function which is central to dynamic programming-based methods. In this paper, we propose a novel method: Action Quantization from Demonstrations (AQuaDem) to learn a discretization of continuous action spaces by leveraging the priors of demonstrations. This dramatically reduces the exploration problem, since the actions faced by the agent not only are in a finite number but also are plausible in light of the demonstrator’s behavior. By discretizing the action space we can apply any discrete action deep RL algorithm to the continuous control problem. 
We evaluate the proposed method on three different setups: RL with demonstrations, RL with play data --demonstrations of a human playing in an environment but not solving any specific task-- and Imitation Learning. For all three setups, we only consider human data, which is more challenging than synthetic data. We found that AQuaDem consistently outperforms state-of-the-art continuous control methods, both in terms of performance and sample efficiency. Link » Robert Dadashi · Leonard Hussenot · Damien Vincent · Anton Raichuk · Matthieu Geist · Olivier Pietquin 🔗 - Investigation of Independent Reinforcement Learning Algorithms in Multi-Agent Environments (Poster) []   link »    Independent reinforcement learning algorithms have no theoretical guarantees for finding the best policy in multi-agent settings. However, in practice, prior works have reported good performance with independent algorithms in some domains and bad performance in others. Moreover, a comprehensive study of the strengths and weaknesses of independent algorithms is lacking in the literature. In this paper, we carry out an empirical comparison of the performance of independent algorithms on four PettingZoo environments that span the three main categories of multi-agent environments, i.e., cooperative, competitive, and mixed. We show that in fully-observable environments, independent algorithms can perform on par with multi-agent algorithms in cooperative and competitive settings. For the mixed environments, we show that agents trained via independent algorithms learn to perform well individually, but fail to learn to cooperate with allies and compete with enemies. We also show that adding recurrence improves the learning of independent algorithms in cooperative partially observable environments. 
Link » Ken Ming Lee · Sriram Ganapathi · Mark Crowley 🔗 - Expert Human-Level Driving in Gran Turismo Sport Using Deep Reinforcement Learning with Image-based Representation (Poster) []   link »    When humans play virtual racing games, they use visual environmental information on the game screen to understand the rules within the environments. In contrast, a state-of-the-art realistic racing game AI agent that outperforms human players does not use image-based environmental information but the compact and precise measurements provided by the environment. In this paper, a vision-based control algorithm is proposed and compared with human player performances under the same conditions in realistic racing scenarios using Gran Turismo Sport (GTS), which is known as a high-fidelity realistic racing simulator. In the proposed method, the environmental information that constitutes part of the observations in conventional state-of-the-art methods is replaced with feature representations extracted from game screen images. We demonstrate that the proposed method performs expert human-level vehicle control under high-speed driving scenarios even with game screen images as high-dimensional inputs. Additionally, it outperforms the built-in AI in GTS in a time trial task, and its score places it among the top 10% of approximately 28,000 human players. Link » Ryuji Imamura · Takuma Seno · Kenta Kawamoto · Michael Spranger 🔗 - MHER: Model-based Hindsight Experience Replay (Poster) []   link »    Solving multi-goal reinforcement learning (RL) problems with sparse rewards is generally challenging. Existing approaches have utilized goal relabeling on collected experiences to alleviate issues arising from sparse rewards. However, these methods are still limited in efficiency and cannot make full use of experiences. 
In this paper, we propose Model-based Hindsight Experience Replay (MHER), which exploits experiences more efficiently by leveraging environmental dynamics to generate virtual achieved goals. Replacing original goals with virtual goals generated from interaction with a trained dynamics model leads to a novel relabeling method, model-based relabeling (MBR). Based on MBR, MHER performs both reinforcement learning and supervised learning for efficient policy improvement. Theoretically, we also prove that the supervised part in MHER, i.e., goal-conditioned supervised learning with MBR data, optimizes a lower bound on the multi-goal RL objective. Experimental results in several point-based tasks and simulated robotics environments show that MHER achieves significantly higher sample efficiency than previous model-free and model-based multi-goal methods. Link » Yang Rui · Meng Fang · Lei Han · Yali Du · Feng Luo · Xiu Li 🔗 - On the Transferability of Deep-Q Networks (Poster) []   link »    Transfer Learning (TL) is an efficient machine learning paradigm that allows overcoming some of the hurdles that characterize the successful training of deep neural networks, ranging from long training times to the need for large datasets. While exploiting TL is a well established and successful training practice in Supervised Learning (SL), its applicability in Deep Reinforcement Learning (DRL) is rarer. In this paper, we study the level of transferability of three different variants of Deep-Q Networks on popular DRL benchmarks as well as on a set of novel, carefully designed control tasks. Our results show that transferring neural networks in a DRL context can be particularly challenging and is a process which in most cases results in negative transfer. In attempting to understand why Deep-Q Networks transfer so poorly, we gain novel insights into the training dynamics that characterize this family of algorithms. 
Link » Matthia Sabatelli · Pierre Geurts 🔗 - Adaptive Scheduling of Data Augmentation for Deep Reinforcement Learning (Poster) []   link »    We consider data augmentation techniques to improve data efficiency and generalization performance in reinforcement learning (RL). Our empirical study on OpenAI Procgen shows that the timing of augmentation is critical, and to maximize test performance, an augmentation needs to be applied either during the entire RL training, or after the end of RL training. More specifically, if the regularization imposed by augmentation is helpful only in testing, it is better, in terms of sample and computation complexity, to postpone the augmentation until after training than to use it during training, since such an augmentation often disturbs the training process. Conversely, an augmentation providing regularization useful in training needs to be used during the whole training period to fully utilize its benefit in terms of not only generalization but also data efficiency. Based on our findings, we propose a mechanism to fully exploit a set of augmentations, which identifies an augmentation (including no augmentation) to maximize RL training performance, and then utilizes all the augmentations by network distillation to maximize test performance. Our experiment empirically justifies the proposed method compared to other automatic augmentation mechanisms. Link » Byungchan Ko · Jungseul Ok 🔗 - Skill-based Meta-Reinforcement Learning (Poster) []   link »    While deep reinforcement learning methods have shown impressive results in robot learning, their sample inefficiency makes the learning of complex, long-horizon behaviors with real robot systems infeasible. To mitigate this issue, meta-reinforcement learning methods aim to enable fast learning on novel tasks by learning how to learn. Yet, the application has been limited to short-horizon tasks with dense rewards. 
To enable learning long-horizon behaviors, recent works have explored leveraging prior experience in the form of offline datasets without reward or task annotations. While these approaches yield improved sample efficiency, millions of interactions with environments are still required to solve complex tasks. In this work, we devise a method that enables meta-learning on long-horizon, sparse-reward tasks, allowing us to solve unseen target tasks with orders of magnitude fewer environment interactions. Our core idea is to leverage prior experience extracted from offline datasets during meta-learning. Specifically, we propose to (1) extract reusable skills and a skill prior from offline datasets, (2) meta-train a high-level policy that learns to efficiently compose learned skills into long-horizon behaviors, and (3) rapidly adapt the meta-trained policy to solve an unseen target task. Experimental results on continuous control tasks in navigation and manipulation demonstrate that the proposed method can efficiently solve long-horizon novel target tasks by combining the strengths of meta-learning and the usage of offline datasets, while prior approaches in RL, meta-RL, and multi-task RL require substantially more environment interactions to solve the tasks. Link » Taewook Nam · Shao-Hua Sun · Karl Pertsch · Sung Ju Hwang · Joseph Lim 🔗 - Introducing Symmetries to Black Box Meta Reinforcement Learning (Poster) []   link »    Meta reinforcement learning (RL) attempts to discover new RL algorithms automatically from environment interaction. In so-called black-box approaches, the policy and the learning algorithm are jointly represented by a single neural network. These methods are very flexible, but they tend to underperform in terms of generalisation to new, unseen environments. In this paper, we explore the role of symmetries in meta-generalisation. 
We show that a recent successful meta RL approach that meta-learns an objective for backpropagation-based learning exhibits certain symmetries (specifically the reuse of the learning rule, and invariance to input and output permutations) that are not present in typical black-box meta RL systems. We hypothesise that these symmetries can play an important role in meta-generalisation. Building off recent work in black-box supervised meta learning, we develop a black-box meta RL system that exhibits these same symmetries. We show through careful experimentation that incorporating these symmetries can lead to algorithms with a greater ability to generalise to unseen action & observation spaces, tasks, and environments. Link » Louis Kirsch · Sebastian Flennerhag · Hado van Hasselt · Abe Friesen · Junhyuk Oh · Yutian Chen 🔗 - A Graph Policy Network Approach for Volt-Var Control in Power Distribution Systems (Poster) []   link »    Volt-var control (VVC) is the problem of operating power distribution systems within healthy regimes by controlling actuators in power systems. Existing works have mostly adopted the conventional routine of representing the power systems (a graph with tree topology) as vectors to train deep reinforcement learning (RL) policies. We propose a framework that combines RL with graph neural networks and study the benefits and limitations of graph-based policy in the VVC setting. Our results show that graph-based policies converge to the same rewards asymptotically, but at a slower rate than their vector-representation counterparts. We conduct further analysis on the impact of both observations and actions: on the observation end, we examine the robustness of graph-based policy on two typical data acquisition errors in power systems, namely sensor communication failure and measurement misalignment. 
On the action end, we show that actuators have various impacts on the system, thus using a graph representation induced by power systems topology may not be the optimal choice. In the end, we conduct a case study to demonstrate that the choice of readout function architecture and graph augmentation can further improve training performance and robustness. Link » Xian Yeow Lee · Soumik Sarkar 🔗 - Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models (Poster) []   link »    Modeling the world can benefit robot learning by providing a rich training signal for shaping an agent's latent state space. However, learning world models in unconstrained environments over high-dimensional observation spaces such as images is challenging. One source of difficulty is the presence of irrelevant but hard-to-model background distractions, and unimportant visual details of task-relevant entities. We address this issue by learning a recurrent latent dynamics model which contrastively predicts the next observation. This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions. We outperform alternatives such as bisimulation methods which impose state-similarity measures derived from divergence in future reward or future optimal actions. We obtain state-of-the-art results on the Distracting Control Suite, a challenging benchmark for pixel-based robotic control. Link » Nitish Srivastava · Walter Talbott · Shuangfei Zhai · Joshua Susskind 🔗 - Component Transfer Learning for Deep RL Based on Abstract Representations (Poster) []   link »    In this work we investigate a specific transfer learning approach for deep reinforcement learning in the context where the internal dynamics between two tasks are the same but the visual representations differ. 
We learn a low-dimensional encoding of the environment, meant to capture summarizing abstractions, from which the internal dynamics and value functions are learned. Transfer is then obtained by freezing the learned internal dynamics and value functions, thus reusing the shared low-dimensional embedding space. When retraining the encoder for transfer, we make several observations: (i) in some cases, there are local minima that have small losses but a mismatching embedding space, resulting in poor task performance and (ii) in the absence of local minima, the output of the encoder converges in our experiments to the same embedding space, which leads to a fast and efficient transfer as compared to learning from scratch. The local minima are caused by the reduced degree of freedom of the optimization process caused by the frozen models. We also find that the transfer performance is heavily reliant on the base model; some base models often result in a successful transfer, whereas other base models often result in a failing transfer. Link » Geoffrey Driessel · Vincent Francois-Lavet 🔗 - ShinRL: A Library for Evaluating RL Algorithms from Theoretical and Practical Perspectives (Poster) []   link »    We present ShinRL, an open-source library specialized for the evaluation of reinforcement learning (RL) algorithms from both theoretical and practical perspectives. Existing RL libraries typically allow users to evaluate practical performances of deep RL algorithms through returns. Nevertheless, these libraries are not necessarily useful for analyzing if the algorithms perform as theoretically expected, such as if Q learning really achieves the optimal Q function. In contrast, ShinRL provides an RL environment interface that can compute metrics for delving into the behaviors of RL algorithms, such as the gap between learned and the optimal Q values and state visitation frequencies. 
In addition, we introduce a solver interface for evaluating both theoretically justified algorithms (e.g., dynamic programming and tabular RL) and practically effective ones (i.e., deep RL, typically with some additional extensions and regularizations) in a consistent fashion. As a case study, we show how combining these two features of ShinRL makes it easier to analyze the behavior of deep Q learning. Furthermore, we demonstrate that ShinRL can be used to empirically validate some recent theoretical findings such as the effect of KL regularization for value iteration [Kozuno et al., 2019] and for deep Q learning [Vieillard et al., 2020a], and the robustness of entropy-regularized policies to adversarial rewards [Husain et al., 2021]. Link » Toshinori Kitamura · Ryo Yonetani 🔗 - HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation (Poster) []   link » Discrete-continuous hybrid action space is a natural setting in many practical problems, such as robot control and game AI. However, most previous Reinforcement Learning (RL) works only demonstrate success in controlling with either a discrete or a continuous action space, and seldom take into account the hybrid action space. One naive way to address hybrid action RL is to convert the hybrid action space into a unified homogeneous action space by discretization or continualization, so that conventional RL algorithms can be applied. However, this ignores the underlying structure of hybrid action space and also induces the scalability issue and additional approximation difficulties, thus leading to degenerated results. In this paper, we propose Hybrid Action Representation (HyAR) to learn a compact and decodable latent representation space for the original hybrid action space. HyAR constructs the latent space and embeds the dependence between discrete action and continuous parameter via an embedding table and conditional Variational Auto-Encoder (VAE). 
To further improve the effectiveness, the action representation is trained to be semantically smooth through unsupervised environmental dynamics prediction. Finally, the agent then learns its policy with conventional DRL algorithms in the learned representation space and interacts with the environment by decoding the hybrid action embeddings to the original action space. We evaluate HyAR in a variety of environments with discrete-continuous action space. The results demonstrate the superiority of HyAR when compared with previous baselines, especially for high-dimensional action spaces. Link » Boyan Li · Hongyao Tang · YAN ZHENG · Jianye Hao · Pengyi Li · Zhaopeng Meng · LI Wang 🔗 - Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL (Poster) []   link » Offline Reinforcement Learning (RL) aims to extract near-optimal policies from imperfect offline data without additional environment interactions. Extracting policies from diverse offline datasets has the potential to expand the range of applicability of RL by making the training process safer, faster, and more streamlined. We investigate how to improve the performance of offline RL algorithms, its robustness to the quality of offline data, as well as its generalization capabilities. To this end, we introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE). Our algorithm is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary. When combined together, they substantially improve the performance and generalization of offline RL policies. In the widely studied D4RL offline RL benchmark, we find that MABE achieves higher average performance compared to prior model-free and model-based algorithms. In experiments that require cross-domain generalization, we find that MABE outperforms prior methods. 
Link » Catherine Cang · Aravind Rajeswaran · Pieter Abbeel · Misha Laskin 🔗 - Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning (Poster) []   link »    Reinforcement Learning (RL) agents in the real world must satisfy safety constraints in addition to maximizing a reward objective. Model-based RL algorithms hold promise for reducing unsafe real-world actions: they may synthesize policies that obey all constraints using simulated samples from a learned model. However, imperfect models can result in real-world constraint violations even for actions that are predicted to satisfy all constraints. We propose CAP, a model-based safe RL framework that accounts for potential modeling errors by capturing model uncertainty and adaptively exploiting it to balance the reward and the cost objectives. First, CAP inflates predicted costs using an uncertainty-based penalty. Theoretically, we show that policies that satisfy this conservative cost constraint are guaranteed to also be feasible in the true environment. We further show that this guarantees the safety of all intermediate solutions during RL training. Further, CAP adaptively tunes this penalty during training using true cost feedback from the environment. We evaluate this conservative and adaptive penalty-based approach for model-based safe RL extensively on state and image-based environments. Our results demonstrate substantial gains in sample-efficiency while incurring fewer violations than prior safe RL algorithms. Link » Yecheng Ma · Andrew Shen · Osbert Bastani · Dinesh Jayaraman 🔗 - Math Programming based Reinforcement Learning for Multi-Echelon Inventory Management (Poster) []   link »    Reinforcement Learning has led to considerable breakthroughs in diverse areas such as robotics, games and many others. But the application of RL to complex real world decision making problems remains limited. 
Many problems in Operations Management (inventory and revenue management, for example) are characterized by large action spaces and stochastic system dynamics. These characteristics make the problem considerably harder to solve for existing RL methods that rely on enumeration techniques to solve per step action problems. To resolve these issues, we develop Programmable Actor Reinforcement Learning (PARL), a policy iteration method that uses techniques from integer programming and sample average approximation. Analytically, we show that, for a given critic, the learned policy in each iteration converges to the optimal policy as the underlying samples of the uncertainty go to infinity. Practically, we show that a properly selected discretization of the underlying uncertain distribution can yield near-optimal actor policy even with very few samples from the underlying uncertainty. We then apply our algorithm to real-world inventory management problems with complex supply chain structures and show that PARL outperforms state-of-the-art RL and inventory optimization methods in these settings. We find that PARL outperforms the commonly used base stock heuristic by 51.3% and RL based methods by up to 9.58% on average across different supply chain environments. Link » Pavithra Harsha · Ashish Jagmohan · Jayant Kalagnanam · Brian Quanz · Divya Singhvi 🔗 - Implicitly Regularized RL with Implicit Q-values (Poster) []   link »    The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms for which RL agents behave following a (soft)-greedy policy w.r.t. $Q$. It is a powerful tool that allows action selection without a model of the environment and even without explicitly modeling the policy. Yet, this scheme can only be used in discrete action tasks, with small numbers of actions, as the softmax cannot be computed exactly otherwise. 
In particular, the use of function approximation to deal with continuous action spaces in modern actor-critic architectures intrinsically prevents the exact computation of a softmax. We propose to alleviate this issue by parametrizing the $Q$-function implicitly, as the sum of a log-policy and of a value function. We use the resulting parametrization to derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value. We provide a theoretical analysis of our algorithm: from an Approximate Dynamic Programming perspective, we show its equivalence to a regularized version of value iteration, accounting for both entropy and Kullback-Leibler regularization, and that enjoys beneficial error propagation results. We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods. Link » Nino Vieillard · Marcin Andrychowicz · Anton Raichuk · Olivier Pietquin · Matthieu Geist 🔗 - Towards Automatic Actor-Critic Solutions to Continuous Control (Poster) []   link »    Model-free off-policy actor-critic methods are an efficient solution to complex continuous control tasks. However, these algorithms rely on a number of design tricks and hyperparameters, making their application to new domains difficult and computationally expensive. This paper creates an evolutionary approach that automatically tunes these design decisions and eliminates the RL-specific hyperparameters from the Soft Actor-Critic algorithm. Our design is sample efficient and provides practical advantages over baseline approaches, including improved exploration, generalization over multiple control frequencies, and a robust ensemble of high-performance policies. Empirically, we show that our agent outperforms well-tuned hyperparameter settings in popular benchmarks from the DeepMind Control Suite. 
We then apply it to less common control tasks outside of simulated robotics to find high-performance solutions with minimal compute and research effort. Link » Jake Grigsby · Jin Yong Yoo · Yanjun Qi 🔗 - Transferring Dexterous Manipulation from GPU Simulation to a Remote Real-World Trifinger (Poster) []   link »    We present a system for learning a challenging dexterous manipulation task involving moving a cube to an arbitrary 6-DoF pose with only three fingers, trained with NVIDIA's IsaacGym simulator. We show empirical benefits, both in simulation and sim-to-real transfer, of using keypoints as opposed to position+quaternion representations for the object pose in 6-DoF for policy observations and in reward calculation to train a model-free reinforcement learning agent. By utilizing domain randomization strategies along with the keypoint representation of the pose of the manipulated object, we achieve a high success rate of 83% on a remote TriFinger system maintained by the organizers of the Real Robot Challenge. With the aim of assisting further research in learning in-hand manipulation, we make the codebase of our system, along with trained checkpoints that come with billions of steps of experience, available at https://sites.google.com/view/s2r2 Link » Arthur Allshire · Mayank Mittal · Varun Lodaya · Viktor Makoviychuk · Denys Makoviichuk · Felix Widmaier · Manuel Wuethrich · Stefan Bauer · Ankur Handa · Animesh Garg 🔗 - Hierarchical Few-Shot Imitation with Skill Transition Models (Poster) []   link »    A desirable property of autonomous agents is the ability to both solve long-horizon problems and generalize to unseen tasks. Recent advances in data-driven skill learning have shown that extracting behavioral priors from offline data can enable agents to solve challenging long-horizon tasks with reinforcement learning. However, generalization to tasks unseen during behavioral prior training remains an outstanding challenge. 
To this end, we present Few-shot Imitation with Skill Transition Models (FIST), an algorithm that extracts skills from offline data and utilizes them to generalize to unseen tasks given a few demonstrations at test-time. FIST learns an inverse skill dynamics model and utilizes a semi-parametric approach for imitation. We show that FIST is capable of generalizing to new tasks and substantially outperforms prior baselines in navigation experiments requiring traversing unseen parts of a large maze and 7-DoF robotic arm experiments requiring manipulating previously unseen objects in a kitchen. Link » Kourosh Hakhamaneshi · Ruihan Zhao · Albert Zhan · Pieter Abbeel · Misha Laskin 🔗 - Accelerating Robotic Reinforcement Learning via Parameterized Action Primitives (Poster) []   link »    Despite the potential of reinforcement learning (RL) for building general-purpose robotic systems, training RL agents to solve robotics tasks still remains challenging due to the difficulty of exploration in purely continuous action spaces. Addressing this problem is an active area of research with the majority of focus on improving RL methods via better optimization or more efficient exploration. An alternate but important component to consider improving is the interface of the RL algorithm with the robot. In this work, we manually specify a library of robot action primitives (RAPS), parameterized with arguments that are learned by an RL policy. These parameterized primitives are expressive, simple to implement, enable efficient exploration and can be transferred across robots, tasks and environments. We perform a thorough empirical study across challenging tasks in three distinct domains with image input and a sparse terminal reward. 
We find that our simple change to the action interface substantially improves both the learning efficiency and task performance irrespective of the underlying RL algorithm, significantly outperforming prior methods which learn skills from offline expert data. Link » Murtaza Dalal · Deepak Pathak · Russ Salakhutdinov 🔗 - Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL (Poster) []   link »    Evaluating the worst-case performance of a reinforcement learning (RL) agent under the strongest/optimal adversarial perturbations on state observations (within some constraints) is crucial for understanding the robustness of RL agents. However, finding the optimal adversary is challenging, in terms of both whether we can find the optimal attack and how efficiently we can find it. Existing works on adversarial RL either use heuristics-based methods that may not find the strongest adversary, or directly train an RL-based adversary by treating the agent as a part of the environment, which can find the optimal adversary but may become intractable in a large state space. In this paper, we propose a novel attacking algorithm which has an RL-based "director" searching for the optimal policy perturbation, and an "actor" crafting state perturbations following the directions from the director (i.e. the actor executes targeted attacks). Our proposed algorithm, PA-AD, is theoretically optimal against an RL agent and significantly improves the efficiency compared with prior RL-based works in environments with large or pixel state spaces. Empirical results show that our proposed PA-AD universally outperforms state-of-the-art attacking methods in a wide range of environments. Our method can be easily applied to any RL algorithms to evaluate and improve their robustness. 
Link » Yanchao Sun · Ruijie Zheng · Yongyuan Liang · Furong Huang 🔗 - Automatic Curricula via Expert Demonstrations (Poster) []   link »    We propose Automatic Curricula via Expert Demonstrations (ACED), a reinforcement learning (RL) approach that combines the ideas of imitation learning and curriculum learning in order to solve challenging robotic manipulation tasks with sparse reward functions. Curriculum learning solves complicated RL tasks by introducing a sequence of auxiliary tasks with increasing difficulty, yet how to automatically design effective and generalizable curricula remains a challenging research problem. ACED extracts curricula from a small amount of expert demonstration trajectories by dividing demonstrations into sections and initializing training episodes to states sampled from different sections of demonstrations. Through moving the reset states from the end to the beginning of demonstrations as the learning agent improves its performance, ACED not only learns challenging manipulation tasks with unseen initializations and goals, but also discovers novel solutions that are distinct from the demonstrations. In addition, ACED can be naturally combined with other imitation learning methods to utilize expert demonstrations in a more efficient manner, and we show that a combination of ACED with behavior cloning allows pick-and-place tasks to be learned with as few as 1 demonstration and block stacking tasks to be learned with 20 demonstrations. Link » Siyu Dai · Andreas Hofmann · Brian Williams 🔗 - Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning (Poster) []   link »    We present DrQ-v2, a model-free reinforcement learning (RL) algorithm for visual continuous control. DrQ-v2 builds on DrQ, an off-policy actor-critic approach that uses data augmentation to learn directly from pixels. We introduce several improvements that yield state-of-the-art results on the DeepMind Control Suite. 
Notably, DrQ-v2 is able to solve complex humanoid locomotion tasks directly from pixel observations, previously unattained by model-free RL. DrQ-v2 is conceptually simple, easy to implement, and has a significantly better computational footprint compared to prior work, with the majority of tasks taking just 8 hours to train on a single GPU. Finally, we publicly release DrQ-v2's implementation to provide RL practitioners with a strong and computationally efficient baseline. Link » Denis Yarats · Rob Fergus · Alessandro Lazaric · Lerrel Pinto 🔗 - Benchmarking the Spectrum of Agent Capabilities (Poster) []   link » Evaluating the general abilities of intelligent agents requires complex simulation environments. Existing benchmarks typically evaluate only one narrow task per environment, requiring researchers to perform expensive training runs on many different environments. We introduce Crafter, an open world survival game with visual inputs that evaluates a wide range of general abilities within a single environment. Agents either learn from the provided reward signal or through intrinsic objectives and are evaluated by semantically meaningful achievements that can be unlocked during each episode, such as discovering resources and crafting tools. Consistently unlocking all achievements requires strong generalization, deep exploration, and long-term reasoning. We experimentally verify that Crafter is of appropriate difficulty to drive future research and provide baseline scores of reward agents and unsupervised agents. Furthermore, we observe sophisticated behaviors emerging from maximizing the reward signal, such as building tunnel systems, bridges, houses, and plantations. We hope that Crafter will accelerate research progress by quickly evaluating a wide spectrum of abilities. 
Link » Danijar Hafner 🔗 - Policy Optimization via Optimal Policy Evaluation (Poster) []   link »    Off-policy methods are the basis of a large number of effective Policy Optimization (PO) algorithms. In this setting, Importance Sampling (IS) is typically employed as a what-if analysis tool, with the goal of estimating the performance of a target policy, given samples collected with a different behavioral policy. However, in Monte Carlo simulation, IS represents a variance minimization approach. In this field, a suitable behavioral distribution is employed for sampling, allowing diminishing the variance of the estimator below the one achievable when sampling from the target distribution. In this paper, we analyze IS in these two guises, showing the connections between the two objectives. We illustrate that variance minimization can be used as a performance improvement tool, with the advantage, compared with direct off-policy learning, of implicitly enforcing a trust region. We make use of these theoretical findings to build a PO algorithm, Policy Optimization via Optimal Policy Evaluation (PO2PE), that employs variance minimization as an inner loop. Finally, we present empirical evaluations on continuous RL benchmarks, with a particular focus on the robustness to small batch sizes. Link » Alberto Maria Metelli · Samuele Meta · Marcello Restelli 🔗 - A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning (Poster) []   link »    We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state, in order to plan and to generalize better out-of-distribution. The agent's architecture uses a set representation and a bottleneck mechanism, forcing the number of entities to which the agent attends at each planning step to be small. In experiments, we investigate the bottleneck mechanism with sets of customized environments featuring different dynamics. 
We consistently observe that the design allows agents to learn to plan effectively, by attending to the relevant objects, leading to better out-of-distribution generalization. Link » Mingde Zhao · Zhen Liu · Sitao Luan · Shuyuan Zhang · Doina Precup · Yoshua Bengio 🔗 - Discriminator Augmented Model-Based Reinforcement Learning (Poster) []   link »    By planning through a learned dynamics model, model-based reinforcement learning (MBRL) offers the prospect of good performance with little environment interaction. However, it is common in practice for the learned model to be inaccurate, impairing planning and leading to poor performance. This paper aims to improve planning with an importance sampling framework that accounts and corrects for discrepancy between the true and learned dynamics. This framework also motivates an alternative objective for fitting the dynamics model: to minimize the variance of value estimation during planning. We derive and implement this objective, which encourages better prediction on trajectories with larger returns. We observe empirically that our approach improves the performance of current MBRL algorithms on two stochastic control problems, and provide a theoretical basis for our method. Link » Allan Zhou · Archit Sharma · Chelsea Finn 🔗
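Two of the poster abstracts above (PO2PE and Discriminator Augmented MBRL) build on importance sampling, both as a way to evaluate one distribution using samples from another and as a variance reduction tool. The following is a minimal toy sketch of that idea, not the method of either paper: it estimates a rare-event expectation under a standard normal "target" distribution, once by sampling from the target directly and once by sampling from a shifted "behavioral" distribution and reweighting. The distributions, the function `f`, and all function names are assumptions chosen purely for illustration.

```python
import math
import random
import statistics


def normal_pdf(x, mean, std=1.0):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def f(x):
    """Toy 'return': 1 for an event that is rare under the target N(0, 1)."""
    return 1.0 if x > 3.0 else 0.0


def on_policy_estimate(n, rng):
    """Plain Monte Carlo: sample from the target distribution N(0, 1)."""
    return sum(f(rng.gauss(0.0, 1.0)) for _ in range(n)) / n


def importance_sampling_estimate(n, rng, behavior_mean=3.0):
    """Sample from a behavioral N(behavior_mean, 1) and reweight each
    sample by the density ratio target(x) / behavioral(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(behavior_mean, 1.0)
        weight = normal_pdf(x, 0.0) / normal_pdf(x, behavior_mean)
        total += weight * f(x)
    return total / n


def compare(n=1000, repeats=200, seed=0):
    """Repeat both estimators and report their mean and spread."""
    rng = random.Random(seed)
    on_policy = [on_policy_estimate(n, rng) for _ in range(repeats)]
    off_policy = [importance_sampling_estimate(n, rng) for _ in range(repeats)]
    return {
        # Closed form for P(X > 3) with X ~ N(0, 1).
        "true_value": 0.5 * math.erfc(3.0 / math.sqrt(2.0)),
        "on_policy_mean": statistics.mean(on_policy),
        "on_policy_std": statistics.stdev(on_policy),
        "is_mean": statistics.mean(off_policy),
        "is_std": statistics.stdev(off_policy),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.6f}")
```

Both estimators target the same quantity, but in this toy setting the reweighted estimator has a much smaller spread across repetitions, because the behavioral distribution places its samples where `f` is non-zero. A poorly chosen behavioral distribution can just as easily increase the variance, which is why the choice of sampling distribution matters.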

#### Author Information

##### Pieter Abbeel (UC Berkeley & Covariant)

Pieter Abbeel is Professor and Director of the Robot Learning Lab at UC Berkeley [2008- ], Co-Director of the Berkeley AI Research (BAIR) Lab, Co-Founder of covariant.ai [2017- ], Co-Founder of Gradescope [2014- ], Advisor to OpenAI, Founding Faculty Partner of the AI@TheHouse venture fund, and Advisor to many AI/Robotics start-ups. He works in machine learning and robotics. In particular, his research focuses on how to make robots learn from people (apprenticeship learning), how to make robots learn through their own trial and error (reinforcement learning), and how to speed up skill acquisition through learning-to-learn (meta-learning). His robots have learned advanced helicopter aerobatics, knot-tying, basic assembly, organizing laundry, locomotion, and vision-based robotic manipulation. He has won numerous awards, including best paper awards at ICML, NIPS and ICRA, early career awards from NSF, DARPA, ONR, AFOSR, Sloan, TR35, and IEEE, and the Presidential Early Career Award for Scientists and Engineers (PECASE). Pieter's work is frequently featured in the popular press, including the New York Times, BBC, Bloomberg, Wall Street Journal, Wired, Forbes, Tech Review, and NPR.

##### Olivia Watkins (UC Berkeley)

I'm currently exploring several areas of machine learning, including reinforcement learning, computer vision, and their applications to robotics.