Track: Track 2 Session 3

Wed 11 Dec. 10:05 - 10:20 PST

Oral

A neurally plausible model learns successor representations in partially observable environments

Eszter Vértes · Maneesh Sahani

Animals need to devise strategies to maximize returns while interacting with their environment based on incoming noisy sensory observations. Task-relevant states, such as the agent's location within an environment or the presence of a predator, are often not directly observable but must be inferred using available sensory information. Successor representations (SR) have been proposed as a middle-ground between model-based and model-free reinforcement learning strategies, allowing for fast value computation and rapid adaptation to changes in the reward function or goal locations. Indeed, recent studies suggest that features of neural responses are consistent with the SR framework. However, it is not clear how such representations might be learned and computed in partially observed, noisy environments. Here, we introduce a neurally plausible model using \emph{distributional successor features}, which builds on the distributed distributional code for the representation and computation of uncertainty, and which allows for efficient value function computation in partially observed environments via the successor representation. We show that distributional successor features can support reinforcement learning in noisy environments in which direct learning of successful policies is infeasible.

Wed 11 Dec. 10:20 - 10:25 PST

Spotlight

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Ofir Nachum · Yinlam Chow · Bo Dai · Lihong Li

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, our algorithm eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.

Wed 11 Dec. 10:25 - 10:30 PST

Spotlight

VIREL: A Variational Inference Framework for Reinforcement Learning

Mattie Fellows · Anuj Mahajan · Tim G. J. Rudner · Shimon Whiteson

Applying probabilistic models to reinforcement learning (RL) enables the uses of powerful optimisation tools such as variational inference in RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, e.g., the lack of mode capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in maximum entropy RL based approaches, and a lack of analysis when function approximators are used. We propose VIREL, a theoretically grounded probabilistic inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP, generalising existing approaches. VIREL also benefits from a mode-seeking form of KL divergence, the ability to learn deterministic optimal polices naturally from inference, and the ability to optimise value functions and policies in separate, iterative steps. In applying variational expectation-maximisation to VIREL, we thus show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step. We then derive a family of actor-critic methods fromVIREL, including a scheme for adaptive exploration. Finally, we demonstrate that actor-critic algorithms from this family outperform state-of-the-art methods based on soft value functions in several domains.

Wed 11 Dec. 10:30 - 10:35 PST

Spotlight

Unsupervised Curricula for Visual Meta-Reinforcement Learning

Allan Jabri · Kyle Hsu · Abhishek Gupta · Benjamin Eysenbach · Sergey Levine · Chelsea Finn

Meta-reinforcement learning algorithms leverage experience across many tasks to learn fast and effective reinforcement learning (RL) algorithms. However, current meta-RL methods depend critically on a manually-defined distribution of meta-training tasks, and hand-crafting these task distributions is challenging and time-consuming. We develop an unsupervised algorithm for inducing an adaptive meta-training task distribution, i.e. an automatic curriculum, by modeling unsupervised interaction in a visual environment. Crucially, the task distribution is scaffolded by the meta-learner's behavior, with density-based exploration driving the evolution of the task distribution. We formulate unsupervised meta-RL with an information-theoretic objective optimized via expectation-maximization over trajectory-level latent variables. Repeating this procedure leads to iterative reorganization of behavior, allowing the task distribution to adapt as the meta-learner becomes more competent. In our experiments on vision-based navigation and manipulation domains, we show that our algorithm allows for unsupervised meta-learning of skills that transfer to downstream tasks specified by human-provided reward functions, as well as pre-training for more efficient meta-learning on user-defined task distributions. To understand the nature of the curricula, we provide visualizations and analysis of the task distributions discovered throughout the learning process, finding that the emergent tasks span a range of environment-specific exploratory and exploitative behavior.

Wed 11 Dec. 10:35 - 10:40 PST

Spotlight

Policy Continuation with Hindsight Inverse Dynamics

Hao Sun · Zhizhong Li · Xiaotong Liu · Bolei Zhou · Dahua Lin

Solving goal-oriented tasks is an important but challenging problem in reinforcement learning (RL). For such tasks, the rewards are often sparse, making it difficult to learn a policy effectively. To tackle this difficulty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This approach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay. Enabling the learning process in a self-imitated manner and thus can be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general, which can work in isolation or be combined with other on-policy and off-policy algorithms. On two multi-goal tasks GridWorld and FetchReach, PCHID significantly improves the sample efficiency as well as the final performance.

Wed 11 Dec. 10:40 - 10:45 PST

Spotlight

Learning Reward Machines for Partially Observable Reinforcement Learning

Rodrigo Toro Icarte · Ethan Waldie · Toryn Klassen · Rick Valenzano · Margarita Castro · Sheila McIlraith

Reward Machines (RMs), originally proposed for specifying problems in Reinforcement Learning (RL), provide a structured, automata-based representation of a reward function that allows an agent to decompose problems into subproblems that can be efficiently learned using off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.