Session
Reinforcement Learning, Algorithms, Applications
Off-policy evaluation for slate recommendation
Adith Swaminathan · Akshay Krishnamurthy · Alekh Agarwal · Miro Dudik · John Langford · Damien Jose · Imed Zitouni
This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator. A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased---these conditions are weaker than prior heuristics for slate evaluation---and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.
Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes
Taylor Killian · Samuel Daulton · Finale Doshi-Velez · George Konidaris
We introduce a new formulation of the Hidden Parameter Markov Decision Process (HiP-MDP), a framework for modeling families of related tasks using low-dimensional latent embeddings. We replace the original Gaussian Process-based model with a Bayesian Neural Network. Our new framework correctly models the joint uncertainty in the latent weights and the state space and has more scalable inference, thus expanding the scope of the HiP-MDP to applications with higher dimensions and more complex dynamics.
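For intuition only, here is a minimal sketch of the modeling structure the abstract describes: a shared transition model that takes the state, the action, and a low-dimensional latent task embedding as joint input. All dimensions, names, and the dropout-based uncertainty stand-in are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper).
STATE_DIM, ACTION_DIM, LATENT_DIM, HIDDEN = 4, 2, 3, 32

# Shared network weights (random here; in the HiP-MDP they would be learned
# jointly with the per-task latent embeddings).
W1 = rng.normal(size=(STATE_DIM + ACTION_DIM + LATENT_DIM, HIDDEN))
W2 = rng.normal(size=(HIDDEN, STATE_DIM))

def predict_next_state(state, action, w_b, n_samples=50, drop_p=0.1):
    """Predict s' for a task instance b whose dynamics are indexed by the
    latent embedding w_b.  Dropout masks stand in for the Bayesian neural
    network's weight uncertainty (a crude Monte Carlo approximation)."""
    x = np.concatenate([state, action, w_b])
    preds = []
    for _ in range(n_samples):
        h = np.tanh(x @ W1)
        h = h * (rng.random(HIDDEN) > drop_p)      # sample epistemic noise
        preds.append(h @ W2)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)   # mean prediction and spread

mean, std = predict_next_state(np.zeros(STATE_DIM), np.ones(ACTION_DIM),
                               w_b=rng.normal(size=LATENT_DIM))
```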
Inverse Reward Design
Dylan Hadfield-Menell · Smitha Milli · Pieter Abbeel · Stuart J Russell · Anca Dragan
Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific scenarios (driving on clean roads), and make sure that the reward will lead to the right behavior in \emph{those} scenarios. Inevitably, agents encounter \emph{new} scenarios (snowy roads), and optimizing the reward can lead to undesired behavior (driving too fast). Our insight in this work is that reward functions are merely \emph{observations} about what the designer \emph{actually} wants, and that they should be interpreted in the context in which they were designed. We introduce \emph{Inverse Reward Design} (IRD) as the problem of inferring the true reward based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach takes a step towards alleviating negative side effects and preventing reward hacking.
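A schematic way to read the inference problem (our notation, which may differ from the paper's exact observation model): the designed reward $\tilde w$ is treated as evidence about the true reward $w^*$, and designed rewards are likely to the extent that the behavior they induce in the training MDP $\tilde M$ scores well under the true reward.

\[
P(w^{*} \mid \tilde w, \tilde M) \;\propto\; P(\tilde w \mid w^{*}, \tilde M)\, P(w^{*}),
\qquad
P(\tilde w \mid w^{*}, \tilde M) \;\propto\; \exp\!\big(\beta\, \mathbb{E}\big[\, w^{*\top}\phi(\xi)\,\big]\big),
\]

where $\xi$ is drawn from the (near-)optimal behavior for $\tilde w$ in $\tilde M$ and $\phi$ are reward features. The resulting posterior over $w^*$ is what the approximate methods target, and its uncertainty is what drives the risk-averse planning in test MDPs.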
Dynamic Safe Interruptibility for Decentralized Multi-Agent Reinforcement Learning
El Mahdi El-Mhamdi · Rachid Guerraoui · Hadrien Hendrikx · Alexandre Maurer
In reinforcement learning, agents learn by taking actions and observing their outcomes. Sometimes, it is desirable for a human operator to \textit{interrupt} an agent in order to prevent dangerous situations from happening. Yet, as part of their learning process, agents may link these interruptions, which impact their reward, to specific states and deliberately avoid them. The situation is particularly challenging in a multi-agent context because agents might not only learn from their own past interruptions, but also from those of other agents. Orseau and Armstrong~\cite{orseau2016safely} defined \emph{safe interruptibility} for one learner, but their work does not naturally extend to multi-agent systems. This paper introduces \textit{dynamic safe interruptibility}, an alternative definition better suited to decentralized learning problems, and studies this notion in two learning frameworks: \textit{joint action learners} and \textit{independent learners}. We give realistic sufficient conditions on the learning algorithm to enable dynamic safe interruptibility in the case of joint action learners, yet show that these conditions are not sufficient for independent learners. We show, however, that if agents can detect interruptions, it is possible to prune the observations to ensure dynamic safe interruptibility even for independent learners.
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
Christoph Dann · Tor Lattimore · Emma Brunskill
Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two setups that has been missing in the literature. We demonstrate the benefits of the new framework for finite-state episodic MDPs with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.
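For concreteness, the Uniform-PAC requirement can be paraphrased as follows: with a single high-probability event, the number of $\epsilon$-suboptimal episodes is controlled simultaneously for every $\epsilon$,

\[
\Pr\Big(\;\forall \epsilon > 0:\;\; \big|\{\, k : V^{*} - V^{\pi_k} > \epsilon \,\}\big| \;\le\; F\big(\tfrac{1}{\epsilon}, \log\tfrac{1}{\delta}\big)\;\Big) \;\ge\; 1 - \delta,
\]

where $\pi_k$ is the policy played in episode $k$ and $F$ is polynomial in its arguments. Because the guarantee holds for all $\epsilon$ at once, one can instantiate it at a fixed $\epsilon$ to recover an $(\epsilon,\delta)$-PAC bound, or sum over the suboptimality levels to recover a high-probability regret bound, which is the bridge between the two setups described above.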
Repeated Inverse Reinforcement Learning
Kareem Amin · Nan Jiang · Satinder Singh
We introduce a novel repeated Inverse Reinforcement Learning problem: the agent has to act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks in which it surprises the human by acting suboptimally with respect to how the human would have acted. Each time the human is surprised, the agent is provided a demonstration of the desired behavior by the human. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.
Learning multiple visual domains with residual adapters
Sylvestre-Alvise Rebuffi · Hakan Bilen · Andrea Vedaldi
There is a growing interest in learning data representations that work well for many different types of problems and data. In this paper, we look in particular at the task of learning a single visual representation that can be successfully utilized in the analysis of very different types of images, from dog breeds to stop signs and digits. Inspired by recent work on learning networks that predict the parameters of another network, we develop a tunable deep network architecture that, by means of adapter residual modules, can be steered on the fly to diverse visual domains. Our method achieves a high degree of parameter sharing while maintaining or even improving the accuracy of domain-specific representations. We also introduce the Visual Decathlon Challenge, a benchmark that evaluates the ability of representations to capture simultaneously ten very different visual domains and measures their ability to recognize uniformly well across all of them.
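A rough sketch of the idea of an adapter residual module: a small, domain-specific 1x1 convolution added in residual form on top of a frozen, domain-agnostic layer, so that switching domains only swaps the adapter. Sizes, names, and the exact placement below are illustrative assumptions, not the paper's precise architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 64, 8, 8                               # channels, height, width (illustrative)

# Frozen, shared 1x1 convolution standing in for a layer of the domain-agnostic backbone.
shared_filter = rng.normal(size=(C, C)) * 0.05

def make_adapter(c, scale=0.01):
    """Domain-specific parameters: a small 1x1 convolution initialized near zero,
    so a fresh domain initially behaves like the shared backbone."""
    return rng.normal(size=(c, c)) * scale

def adapted_layer(x, adapter):
    """y = shared(x) + adapter(shared(x)): the adapter is a residual correction
    that can be swapped per domain while the shared weights stay fixed."""
    h = np.einsum('oc,chw->ohw', shared_filter, x)      # shared 1x1 conv
    return h + np.einsum('oc,chw->ohw', adapter, h)     # domain-specific residual

x = rng.normal(size=(C, H, W))
y_digits = adapted_layer(x, make_adapter(C))            # one adapter per visual domain
y_dogs   = adapted_layer(x, make_adapter(C))
```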
Natural Value Approximators: Learning when to Trust Past Estimates
Zhongwen Xu · Joseph Modayil · Hado van Hasselt · Andre Barreto · David Silver · Tom Schaul
Neural networks have a smooth initial inductive bias, such that small changes in input do not lead to large changes in output. However, in reinforcement learning domains with sparse rewards, value functions have non-smooth structure with a characteristic asymmetric discontinuity whenever rewards arrive. We propose a mechanism that learns an interpolation between a direct value estimate and a projected value estimate computed from the encountered reward and the previous estimate. This reduces the need to learn about discontinuities, and thus improves the value function approximation. Furthermore, as the interpolation is learned and state-dependent, our method can deal with heterogeneous observability. We demonstrate that this one change leads to significant improvements on multiple Atari games, when applied to the state-of-the-art A3C algorithm.
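In schematic form (our notation, up to indexing conventions), the natural value estimate blends the network's direct estimate $v_\theta(s_t)$ with a projection of the previous estimate through the observed reward, gated by a learned, state-dependent coefficient $\beta$:

\[
\hat v_t \;=\; \beta(s_t)\,\frac{\hat v_{t-1} - r_t}{\gamma} \;+\; \big(1-\beta(s_t)\big)\, v_\theta(s_t),
\]

where $r_t$ is the reward received on the transition into $s_t$ and $\gamma$ is the discount factor. When a reward arrives, the projected term absorbs the resulting discontinuity, so the function approximator no longer has to represent the sharp jump itself; where past estimates are unreliable (e.g., under poor observability), the gate can fall back on the direct estimate.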
EX2: Exploration with Exemplar Models for Deep Reinforcement Learning
Justin Fu · John Co-Reyes · Sergey Levine
Deep reinforcement learning algorithms have been shown to learn complex tasks using highly general policy classes. However, sparse reward problems remain a significant challenge. Exploration methods based on novelty detection have been particularly successful in such settings but typically require generative or predictive models of the observations, which can be difficult to train when the observations are very high-dimensional and complex, as in the case of raw images. We propose a novelty detection algorithm for exploration that is based entirely on discriminatively trained exemplar models, where classifiers are trained to discriminate each visited state against all others. Intuitively, novel states are easier to distinguish against other states seen during training. We show that this kind of discriminative modeling corresponds to implicit density estimation, and that it can be combined with count-based exploration to produce competitive results on a range of popular benchmark tasks, including state-of-the-art results on challenging egocentric observations in the vizDoom benchmark.
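The connection to density estimation can be sketched as follows (our paraphrase of the discrete case): if $D_x$ is the optimal discriminator between the single exemplar $x$ and the distribution $p$ of previously visited states, mixed in equal proportion, then at the exemplar

\[
D_x(x) \;=\; \frac{1}{1 + p(x)} \qquad\Longrightarrow\qquad p(x) \;=\; \frac{1 - D_x(x)}{D_x(x)},
\]

so a state the classifier separates easily ($D_x(x)$ close to 1) is assigned low implied density and hence a large novelty bonus, which can then be fed into a count-based exploration scheme.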
Regret Minimization in MDPs with Options without Prior Knowledge
Ronan Fruit · Matteo Pirotta · Alessandro Lazaric · Emma Brunskill
The option framework integrates temporal abstraction into the reinforcement learning model through the introduction of macro-actions (i.e., options). Recent works leveraged the mapping of Markov decision processes (MDPs) with options to semi-MDPs (SMDPs) and introduced SMDP versions of exploration-exploitation algorithms (e.g., \rmaxsmdp and \ucrlsmdp) to analyze the impact of options on the learning performance. Nonetheless, the PAC-SMDP sample complexity of \rmaxsmdp can hardly be translated into equivalent PAC-MDP theoretical guarantees, while \ucrlsmdp requires prior knowledge of the parameters characterizing the distributions of the cumulative reward and duration of each option, which are hardly available in practice. In this paper we remove this limitation by combining the SMDP view together with the inner Markov structure of options into a novel algorithm whose regret performance matches \ucrlsmdp's up to an additive regret term. We show scenarios where this term is negligible and the advantage of temporal abstraction is preserved. We also report preliminary empirical results supporting the theoretical findings.
Successor Features for Transfer in Reinforcement Learning
Andre Barreto · Will Dabney · Remi Munos · Jonathan Hunt · Tom Schaul · David Silver · Hado van Hasselt
Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks. We propose a transfer framework for the scenario where the reward function changes from one task to the other but the environment's dynamics remain the same. Our approach rests on two key ideas: "successor features", a value function representation that decouples the dynamics of the environment from the rewards, and "generalized policy improvement", a generalization of dynamic programming's policy improvement step that considers a set of policies rather than a single one. Put together, the two ideas lead to an approach that integrates seamlessly within the reinforcement learning framework and allows the free exchange of information between tasks. The proposed method also provides performance guarantees for the transferred policy even before any learning has taken place. We derive two theorems that place our approach on firm theoretical ground and present experiments that show that it successfully promotes transfer in practice, significantly outperforming alternative methods in a sequence of navigation tasks and in the control of a simulated two-joint robotic arm.
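The two ideas compose cleanly: if rewards factor as $r = \phi \cdot w$, each stored policy's value on a new task $w$ is $Q^{\pi_i}(s,a) = \psi^{\pi_i}(s,a) \cdot w$, and generalized policy improvement acts greedily with respect to the maximum over the stored policies. A minimal tabular sketch (shapes and names are ours, not the authors' code):

```python
import numpy as np

def gpi_action(successor_features, w, state):
    """Generalized policy improvement over a library of policies.

    successor_features: array of shape (n_policies, n_states, n_actions, d),
        psi^{pi_i}(s, a) = E[ sum_t gamma^t phi_t | s, a, pi_i ].
    w: reward weights of the new task, shape (d,); assumes r = phi . w.
    """
    # Q^{pi_i}(s, a) = psi^{pi_i}(s, a) . w  for every stored policy i.
    q_all = successor_features[:, state] @ w           # (n_policies, n_actions)
    # Act greedily w.r.t. the best stored policy at this state.
    return int(np.argmax(q_all.max(axis=0)))

# Tiny usage example with random numbers standing in for learned quantities.
rng = np.random.default_rng(0)
psi = rng.normal(size=(3, 5, 4, 8))     # 3 policies, 5 states, 4 actions, d = 8
w_new = rng.normal(size=8)              # reward weights of the new task
a = gpi_action(psi, w_new, state=2)
```

Because $\psi^{\pi_i}$ is independent of the reward, evaluating all stored policies on a new task reduces to the single dot product above, which is what yields the pre-learning performance guarantees mentioned in the abstract.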
Overcoming Catastrophic Forgetting by Incremental Moment Matching
Sang-Woo Lee · Jin-Hwa Kim · Jaehyun Jun · Jung-Woo Ha · Byoung-Tak Zhang
Catastrophic forgetting is a problem in which a neural network loses the information learned on a first task after being trained on a second task. Here, we propose incremental moment matching (IMM) to resolve this problem. IMM incrementally matches the moments of the posterior distributions of the neural networks trained for the first and the second task, respectively. To make the search space of the posterior parameters smooth, the IMM procedure is complemented by various transfer learning techniques, including weight transfer, an L2 penalty between the old and the new parameters, and a variant of dropout using the old parameters. We analyze our approach on various datasets, including MNIST, CIFAR-10, Caltech-UCSD Birds, and Lifelog. Experimental results show that IMM achieves state-of-the-art performance on a variety of datasets and can balance the information between an old and a new network.
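As a rough illustration of the moment-matching step, under our own simplifying assumption of diagonal Gaussian posteriors: mean-IMM averages the two networks' parameters, while mode-IMM weights them by their approximate precisions, so parameters that one task constrains tightly dominate the merge. This is a sketch of the general recipe, not the authors' exact procedure.

```python
import numpy as np

def mean_imm(theta1, theta2, alpha=0.5):
    """Mean-IMM: mixture mean of the two posteriors, i.e., a simple weighted
    average of the two networks' parameters."""
    return alpha * theta1 + (1.0 - alpha) * theta2

def mode_imm(theta1, theta2, prec1, prec2, alpha=0.5):
    """Mode-IMM: approximate mode of the mixture under diagonal Gaussian
    posteriors; higher-precision (lower-variance) parameters on one task
    pull the merged network toward that task."""
    num = alpha * prec1 * theta1 + (1.0 - alpha) * prec2 * theta2
    den = alpha * prec1 + (1.0 - alpha) * prec2
    return num / den

# Toy example: two "networks" represented as flat parameter vectors.
rng = np.random.default_rng(0)
theta_a, theta_b = rng.normal(size=100), rng.normal(size=100)
prec_a = rng.uniform(0.1, 10.0, size=100)
prec_b = rng.uniform(0.1, 10.0, size=100)
merged = mode_imm(theta_a, theta_b, prec_a, prec_b)
```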
Fair Clustering Through Fairlets
Flavio Chierichetti · Ravi Kumar · Silvio Lattanzi · Sergei Vassilvitskii
We study the question of fair clustering under the {\em disparate impact} doctrine, where each protected class must have approximately equal representation in every cluster. We formulate the fair clustering problem under both the $k$-center and the $k$-median objectives, and show that even with two protected classes the problem is challenging, as the optimum solution violates common conventions---for instance a point may no longer be assigned to its nearest cluster center! En route we introduce the concept of fairlets, which are minimal sets that satisfy fair representation while approximately preserving the clustering objective. We show that any fair clustering problem can be decomposed into first finding appropriate fairlets, and then using existing machinery for traditional clustering algorithms. While finding good fairlets can be NP-hard, we proceed to obtain efficient approximation algorithms based on minimum cost flow. We empirically demonstrate the \emph{price of fairness} by comparing the value of fair clustering on real-world datasets with sensitive attributes.
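For intuition in the simplest balanced case, where the two protected classes must appear 1:1 in every cluster, a fairlet is just a red-blue pair, and clustering then operates on one representative point per fairlet. The toy sketch below uses a greedy nearest-neighbor pairing purely for illustration; the paper constructs fairlets via minimum-cost flow.

```python
import numpy as np

def pair_fairlets(red, blue):
    """Greedy (1,1)-fairlet construction for equal-size classes: match each
    red point to its nearest unused blue point (illustration only; the paper
    uses a minimum-cost-flow formulation)."""
    assert len(red) == len(blue)
    unused = list(range(len(blue)))
    fairlets = []
    for r in red:
        dists = [np.linalg.norm(r - blue[j]) for j in unused]
        j = unused.pop(int(np.argmin(dists)))
        fairlets.append((r, blue[j]))
    return fairlets

rng = np.random.default_rng(0)
red_pts = rng.normal(size=(10, 2))
blue_pts = rng.normal(size=(10, 2)) + 1.0
fairlets = pair_fairlets(red_pts, blue_pts)

# One point of each fairlet serves as its representative; any standard
# k-center / k-median routine can then cluster the representatives, and every
# resulting cluster automatically inherits the 1:1 balance.
representatives = np.array([r for r, b in fairlets])
```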
Fitting Low-Rank Tensors in Constant Time
Kohei Hayashi · Yuichi Yoshida
In this paper, we develop an algorithm that approximates the residual error of Tucker decomposition, one of the most popular tensor decomposition methods, with a provable guarantee. Given an order-$K$ tensor $X\in\mathbb{R}^{N_1\times\cdots\times N_K}$, our algorithm randomly samples a constant number $s$ of indices for each mode and creates a ``mini'' tensor $\tilde{X}\in\mathbb{R}^{s\times\cdots\times s}$, whose elements are given by the intersection of the sampled indices on $X$. Then, we show that the residual error of the Tucker decomposition of $\tilde{X}$ is sufficiently close to that of $X$ with high probability. This result implies that we can figure out how well a low-rank tensor can be fitted to $X$ \emph{in constant time}, regardless of the size of $X$. This is useful for guessing a favorable rank for the Tucker decomposition. Finally, we demonstrate that the sampling method works quickly and accurately using multiple real datasets.
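The sampling construction described above is simple to write down. The sketch below samples $s$ indices per mode, builds the mini tensor from their intersection, and reports the relative residual of a low-rank Tucker fit; we use a truncated HOSVD as a cheap, self-contained stand-in for an exact Tucker/HOOI solver, so the residual is only an approximation of the quantity the paper analyzes.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: move that axis to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_multiply(tensor, matrix, mode):
    """Multiply `tensor` along axis `mode` by `matrix` (shape: new_dim x old_dim)."""
    moved = np.tensordot(matrix, np.moveaxis(tensor, mode, 0), axes=(1, 0))
    return np.moveaxis(moved, 0, mode)

def tucker_residual(x, ranks):
    """Relative residual of a rank-(r_1,...,r_K) Tucker fit via truncated HOSVD."""
    approx = x
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        proj = u[:, :r] @ u[:, :r].T           # projector onto the top-r mode subspace
        approx = mode_multiply(approx, proj, mode)
    return np.linalg.norm(x - approx) / np.linalg.norm(x)

def sampled_residual(x, ranks, s=20, seed=0):
    """Estimate the residual from a random s x ... x s sub-tensor, as in the
    abstract: sample s indices per mode and keep their intersection."""
    rng = np.random.default_rng(seed)
    idx = [rng.choice(n, size=min(s, n), replace=False) for n in x.shape]
    mini = x[np.ix_(*idx)]
    return tucker_residual(mini, ranks)

x = np.random.default_rng(1).normal(size=(60, 60, 60))
print(sampled_residual(x, ranks=(5, 5, 5), s=20))
```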