Offline reinforcement learning (RL) is a re-emerging area of study that aims to learn behaviors using only logged data, such as data from previous experiments or human demonstrations, without further environment interaction. It has the potential to enable tremendous progress on a number of real-world decision-making problems where active data collection is expensive (e.g., robotics, drug discovery, dialogue generation, recommendation systems) or unsafe/dangerous (e.g., healthcare, autonomous driving, or education). Such a paradigm promises to resolve a key challenge in bringing reinforcement learning algorithms out of constrained lab settings and into the real world. The first edition of the offline RL workshop, held at NeurIPS 2020, focused on, and helped drive, algorithmic development in offline RL. This year we propose to shift the focus from algorithm design to bridging the gap between offline RL research and real-world offline RL. Our aim is to create a space for discussion between researchers and practitioners on topics of importance for enabling offline RL methods in the real world. To that end, we have revised the topics and themes of the workshop, invited new speakers working on application-focused areas, and, building on last year's lively panel discussion, invited last year's panelists to take part in a retrospective panel on how their perspectives have changed.
For details on submission please visit: https://offline-rl-neurips.github.io/2021 (Submission deadline: October 6, Anywhere on Earth)
Speakers:
Aviv Tamar (Technion - Israel Inst. of Technology)
Angela Schoellig (University of Toronto)
Barbara Engelhardt (Princeton University)
Sham Kakade (University of Washington/Microsoft)
Minmin Chen (Google)
Philip S. Thomas (UMass Amherst)
Schedule:
Tue 9:00 a.m. - 9:10 a.m. | Opening Remarks | Rishabh Agarwal · Aviral Kumar
Tue 9:10 a.m. - 9:40 a.m. | Learning to Explore From Data (Talk) | Aviv Tamar
Tue 9:40 a.m. - 9:45 a.m. | Q&A for Aviv Tamar
Tue 9:45 a.m. - 9:55 a.m. | Contributed Talk 1: What Matters in Learning from Offline Human Demonstrations for Robot Manipulation | Ajay Mandlekar
Tue 10:00 a.m. - 10:10 a.m. | Contributed Talk 2: What Would the Expert do?: Causal Imitation Learning | Gokul Swamy
Tue 10:15 a.m. - 10:25 a.m. | Contributed Talk 3: Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation | Yunzong Xu · Akshay Krishnamurthy · David Simchi-Levi
Tue 10:30 a.m. - 10:40 a.m. | Contributed Talk 4: PulseRL: Enabling Offline Reinforcement Learning for Digital Marketing Systems via Conservative Q-Learning | Luckeciano Carvalho Melo
Tue 10:40 a.m. - 11:45 a.m. | Poster Session 1
Tue 11:45 a.m. - 11:46 a.m. | Speaker Intro | Rishabh Agarwal · Aviral Kumar
Tue 11:46 a.m. - 12:16 p.m. | Offline RL for Robotics (Talk) | Angela Schoellig
Tue 12:16 p.m. - 12:21 p.m. | Q&A for Angela Schoellig
Tue 12:21 p.m. - 12:22 p.m. | Speaker Intro | Rishabh Agarwal · Aviral Kumar
Tue 12:22 p.m. - 12:52 p.m. | Generalization Theory in Offline RL (Talk) | Sham Kakade
Tue 12:52 p.m. - 12:57 p.m. | Q&A for Sham Kakade
Tue 1:00 p.m. - 2:00 p.m. | Invited Speaker Panel (Discussion Panel) | Sham Kakade · Minmin Chen · Philip Thomas · Angela Schoellig · Barbara Engelhardt · Doina Precup · George Tucker
Tue 2:00 p.m. - 3:00 p.m. | Retrospective Panel (Discussion Panel) | Sergey Levine · Nando de Freitas · Emma Brunskill · Finale Doshi-Velez · Nan Jiang · Rishabh Agarwal
Tue 3:00 p.m. - 3:01 p.m. | Speaker Intro | Aviral Kumar · George Tucker
Tue 3:01 p.m. - 3:31 p.m. | Offline RL for Recommendation Systems (Talk) | Minmin Chen
Tue 3:31 p.m. - 3:36 p.m. | Q&A for Minmin Chen
Tue 4:06 p.m. - 4:07 p.m. | Speaker Intro | Aviral Kumar · George Tucker
Tue 4:07 p.m. - 4:37 p.m. | Offline Reinforcement Learning for Hospital Patients When Every Patient is Different (Talk) | Barbara Engelhardt
Tue 4:37 p.m. - 4:42 p.m. | Q&A for Barbara Engelhardt
Tue 4:42 p.m. - 4:43 p.m. | Speaker Intro
Tue 4:43 p.m. - 5:13 p.m. | Advances in (High-Confidence) Off-Policy Evaluation (Talk) | Philip Thomas
Tue 5:13 p.m. - 5:19 p.m. | Q&A for Philip Thomas
Tue 5:19 p.m. - 5:20 p.m. | Closing Remarks & Poster Session
Tue 5:20 p.m. - 6:20 p.m. | Poster Session 2

Posters:
Offline Reinforcement Learning with Soft Behavior Regularization (Poster)
Most prior approaches to offline reinforcement learning (RL) utilize \textit{behavior regularization}, typically augmenting existing off-policy actor critic algorithms with a penalty measuring divergence between the policy and the offline data. However, these approaches lack guaranteed performance improvement over the behavior policy. In this work, starting from the performance difference between the learned policy and the behavior policy, we derive a new policy learning objective that can be used in the offline setting and that corresponds to the advantage function of the behavior policy multiplied by a state-marginal density ratio. We propose a practical way to compute the density ratio and demonstrate its equivalence to a state-dependent behavior regularization. Unlike the state-independent regularization used in prior approaches, this \textit{soft} regularization allows more freedom of policy deviation at high-confidence states, leading to better performance and stability. We thus term our resulting algorithm Soft Behavior-regularized Actor Critic (SBAC). Our experimental results show that SBAC matches or outperforms the state-of-the-art on a set of continuous control locomotion and manipulation tasks.
Haoran Xu · Xianyuan Zhan · Li Jianxiong · Honglei Yin
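For context on the derivation sketched above, the standard performance-difference identity that such an objective typically starts from can be written as follows (our illustration; the precise objective and density-ratio estimator are the paper's own):
$$
J(\pi) - J(\beta) \;=\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi},\, a \sim \pi(\cdot\mid s)}\!\left[A^{\beta}(s,a)\right]
\;=\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\beta},\, a \sim \pi(\cdot\mid s)}\!\left[\frac{d^{\pi}(s)}{d^{\beta}(s)}\,A^{\beta}(s,a)\right],
$$
where $\beta$ is the behavior policy, $d^{\pi}$ the discounted state-marginal distribution of $\pi$, and $A^{\beta}$ the advantage function of the behavior policy; the second form makes the state-marginal density ratio mentioned in the abstract explicit.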
Instance-dependent Offline Reinforcement Learning: From tabular RL to linear MDPs (Poster)
We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data coming from a policy $\mu$. In particular, we consider the sample complexity problem of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds based on different data-coverage assumptions, and their upper bounds are expressed via covering coefficients that lack an explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches the instance-dependent quantity in equation (1) of the paper. We also prove an information-theoretic lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
Ming Yin · Yu-Xiang Wang

DCUR: Data Curriculum for Teaching via Samples with Reinforcement Learning (Poster)
Deep reinforcement learning (RL) has shown great empirical successes, but suffers from brittleness and sample inefficiency. A potential remedy is to use a previously-trained policy as a source of supervision. In this work, we refer to these policies as teachers and study how to transfer their expertise to new student policies by focusing on data usage. We propose a framework, Data CUrriculum for Reinforcement learning (DCUR), which first trains teachers using online deep RL, and stores the logged environment interaction history. Then, students learn by running either offline RL or by using teacher data in combination with a small amount of self-generated data. DCUR’s central idea involves defining a class of data curricula which, as a function of training time, limits the student to sampling from a fixed subset of the full teacher data. We test teachers and students using state-of-the-art deep RL algorithms across a variety of data curricula. Results suggest that the choice of data curricula significantly impacts student learning, and that it is beneficial to limit the data during early training stages while gradually letting the data availability grow over time. We identify when the student can learn offline and match teacher performance without relying on specialized offline RL algorithms. Furthermore, we show that collecting a small fraction of online data provides complementary benefits with the data curriculum. Supplementary material is available at https://sites.google.com/view/anon-dcur/. |
Daniel Seita · Abhinav Gopal · Mandi Zhao · John Canny

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation (Poster)
Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation. Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available. Upon acceptance, we will open-source our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Additional results and videos at https://sites.google.com/view/offline-demo-study |
Ajay Mandlekar · Danfei Xu · Josiah Wong · Chen Wang · Li Fei-Fei · Silvio Savarese · Yuke Zhu · Roberto Martín-Martín

TiKick: Toward Playing Multi-agent Football Full Games from Single-agent Demonstrations (Poster)
Deep reinforcement learning (DRL) has achieved super-human performance on complex video games (e.g., StarCraft II and Dota II). However, current DRL systems still suffer from challenges of multi-agent coordination, sparse rewards, stochastic environments, etc. In seeking to address these challenges, we employ a football video game, e.g., Google Research Football (GRF), as our testbed and develop an end-to-end learning-based AI system (denoted as TiKick) to complete this challenging task. In this work, we first generated a large replay dataset from the self-playing of single-agent experts, which are obtained from league training. We then developed a new offline algorithm to learn a powerful multi-agent AI from the fixed single-agent dataset. To the best of our knowledge, Tikick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or experiment on toy academic scenarios. Extensive experiments further show that our pre-trained model can accelerate the training process of the modern multi-agent algorithm and our method achieves state-of-the-art performances on various academic scenarios. |
Shiyu Huang · Wenze Chen · Longfei Zhang · Shizhen Xu · Ziyang Li · Fengming Zhu · Deheng Ye · Ting Chen · Jun Zhu

d3rlpy: An Offline Deep Reinforcement Learning Library (Poster)
In this paper, we introduce d3rlpy, an open-sourced offline deep reinforcement learning (RL) library for Python. d3rlpy supports a number of offline deep RL algorithms as well as online algorithms via a user-friendly API. To assist deep RL research and development projects, d3rlpy provides practical and unique features such as data collection, exporting policies for deployment, preprocessing and postprocessing, distributional Q-functions, multi-step learning and a convenient command-line interface. Furthermore, d3rlpy additionally provides a novel graphical interface that enables users to train offline RL algorithms without coding programs. Lastly, the implemented algorithms are benchmarked with D4RL datasets to ensure the implementation quality. |
Takuma Seno · Michita Imai
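As a rough illustration of the kind of workflow d3rlpy targets, a minimal offline training script might look like the sketch below. The class and function names follow the v1-era API and are assumptions that may differ across library versions; consult the d3rlpy documentation for the exact calls.

```python
# Hypothetical minimal d3rlpy workflow (v1-era names are assumptions, not a definitive interface).
from d3rlpy.datasets import get_cartpole   # small bundled offline dataset for quick experiments
from d3rlpy.algos import DiscreteCQL

dataset, env = get_cartpole()              # logged transitions plus the matching gym environment
algo = DiscreteCQL()                       # an offline RL algorithm with default hyperparameters
algo.fit(dataset, n_epochs=1)              # train purely from the logged data, no environment interaction
algo.save_policy("policy.pt")              # export the greedy policy for deployment
```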
PulseRL: Enabling Offline Reinforcement Learning for Digital Marketing Systems via Conservative Q-Learning (Poster)
Digital Marketing Systems (DMS) are the primary point of contact between a digital business and its customers. In this context, the communication channel optimization problem poses a valuable and still-open challenge for DMS. Due to its interactive nature, Reinforcement Learning (RL) appears as a promising formulation for this problem. However, the standard RL setting learns from interacting with the environment, which is costly and dangerous for production systems. Furthermore, it also fails to learn from historical interactions due to the distributional shift between the collection and learning policies. To address this, we present PulseRL, an offline RL-based production system for communication channel optimization built upon the Conservative Q-Learning (CQL) framework. The PulseRL architecture comprises the whole engineering pipeline (data processing, training, deployment, and monitoring), scaling to handle millions of users. Using CQL, PulseRL learns from historical logs, and its learning objective reduces the shift problem by mitigating the overestimation bias from out-of-distribution actions. We conducted experiments in a real-world DMS. Results show that PulseRL surpasses RL baselines by a significant margin in the online evaluation. They also validate the theoretical properties of CQL in a complex scenario with high sampling error and non-linear function approximation.
Luckeciano Carvalho Melo · Luana G B Martins · Bryan Lincoln de Oliveira · Bruno Brandão · Douglas Winston Soares · Telma Lima
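PulseRL is described as building on Conservative Q-Learning; for readers who have not seen CQL, the discrete-action form of its conservative regularizer added to a standard TD loss can be sketched as below. This is a generic illustration of the published CQL objective with placeholder networks and random data, not PulseRL's production code.

```python
# Sketch of a discrete-action CQL-style regularizer added to a TD loss (illustrative only).
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                     # (batch, num_actions)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q of the actions actually logged
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values
    td_loss = F.mse_loss(q_data, target)
    # Conservative term: push Q down on all actions (logsumexp) and up on actions seen in the data.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * conservative

# Toy check with linear Q-networks and random offline transitions.
torch.manual_seed(0)
q, tq = torch.nn.Linear(4, 3), torch.nn.Linear(4, 3)
batch = (torch.randn(8, 4), torch.randint(0, 3, (8,)),
         torch.randn(8), torch.randn(8, 4), torch.zeros(8))
print(cql_loss(q, tq, batch))
```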
Latent Geodesics of Model Dynamics for Offline Reinforcement Learning (Poster)
Model-based offline reinforcement learning approaches generally rely on bounds of model error. While contemporary methods achieve such bounds through an ensemble of models, we propose to estimate them using a data-driven latent metric. Particularly, we build upon recent advances in Riemannian geometry of generative models to construct a latent metric of an encoder-decoder based forward model. Our proposed metric measures both the quality of out of distribution samples as well as the discrepancy of examples in the data. We show that our metric can be viewed as a combination of two metrics, one relating to proximity and the other to epistemic uncertainty. Finally, we leverage our metric in a pessimistic model-based framework, showing a significant improvement upon contemporary model-based offline reinforcement learning benchmarks. |
Guy Tennenholtz · Nir Baram · Shie Mannor

Domain Knowledge Guided Offline Q Learning (Poster)
Offline reinforcement learning (RL) is a promising method for applications where direct exploration is not possible but a decent initial model is expected for the online stage. In practice, offline RL can underperform because of overestimation attributed to distributional shift between the training data and the learned policy. A common approach to mitigating this issue is to constrain the learned policies so that they remain close to the fixed batch of interactions. This method is typically used without considering the application context. However, domain knowledge is available in many real-world cases and may be utilized to effectively handle the issue of out-of-distribution actions. Incorporating domain knowledge in training avoids additional function approximation to estimate the behavior policy and results in easy-to-interpret policies. To encourage the adoption of offline RL in practical applications, we propose the Domain Knowledge guided Q learning (DKQ). We show that DKQ is a conservative approach, where the unique fixed point still exists and is upper bounded by the standard optimal Q function. DKQ also leads to lower chance of overestimation. In addition, we demonstrate the benefit of DKQ empirically via a novel, real-world case study - guided family tree building, which appears to be the first application of offline RL in genealogy. The results show that guided by proper domain knowledge, DKQ can achieve similar offline performance as standard Q learning and is better aligned with the behavior policy revealed from the data, indicating a lower risk of overestimation on unseen actions. Further, we demonstrate the efficiency and flexibility of DKQ with a classical control problem. |
Xiaoxuan Zhang · Sijia Zhang · Yen-Yun Yu

Understanding the Effects of Dataset Characteristics on Offline Reinforcement Learning (Poster)
In the real world, acting on the environment with a weak policy can be expensive or very risky, which hampers real-world applications of reinforcement learning. Offline Reinforcement Learning (RL) can learn policies from a given dataset without interacting with the environment. However, the dataset is the only source of information for an Offline RL algorithm and determines the performance of the learned policy. We still lack studies on how dataset characteristics influence different Offline RL algorithms. Therefore, we conducted a comprehensive empirical analysis of how dataset characteristics affect the performance of Offline RL algorithms for discrete action environments. A dataset is characterized by two metrics: (1) the Trajectory Quality (TQ), measured by the average dataset return, and (2) the State-Action Coverage (SACo), measured by the number of unique state-action pairs. We found that variants of the off-policy Deep Q-Network family require datasets with high SACo to perform well. Algorithms that constrain the learned policy towards the given dataset perform well for datasets with high TQ or SACo. For datasets with high TQ, Behavior Cloning outperforms or performs similarly to the best Offline RL algorithms.
Kajetan Schweighofer · Markus Hofmarcher · Marius-Constantin Dinu · Philipp Renz · Angela Bitto · Vihang Patil · Sepp Hochreiter

Unsupervised Learning of Temporal Abstractions using Slot-based Transformers (Poster)
The discovery of reusable sub-routines simplifies decision-making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in a purely unsupervised fashion through observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about sub-routine boundary points in light of new incoming information. In this work we propose SloTTAr, a fully parallel approach that integrates sequence processing Transformers with a Slot Attention module for learning about sub-routines in an unsupervised fashion. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, while being up to 30x faster on existing benchmarks. |
Anand Gopalakrishnan · Kazuki Irie · Jürgen Schmidhuber · Sjoerd van Steenkiste

Counter-Strike Deathmatch with Large-Scale Behavioural Cloning (Poster)
This paper describes an AI agent that plays the modern first-person-shooter (FPS) video game 'Counter-Strike: Global Offensive' (CSGO) from pixel input. The agent, a deep neural network, matches the performance of the medium-difficulty built-in AI on the deathmatch game mode whilst adopting a humanlike play style. Previous research has mostly focused on games with convenient APIs and low-resolution graphics, allowing them to be run cheaply at scale. This is not the case for CSGO, with system requirements 100$\times$ that of previously studied FPS games. This limits the quantity of on-policy data that can be generated, precluding many reinforcement learning algorithms. Our solution uses behavioural cloning: training on a large noisy dataset scraped from human play on online servers (5.5 million frames or 95 hours), and smaller datasets of clean expert demonstrations. This scale is an order of magnitude larger than prior work on imitation learning in FPS games. To introduce this challenging environment to the AI community, we open source code and datasets.
Tim Pearce · Jun Zhu

Modern Hopfield Networks for Return Decomposition for Delayed Rewards (Poster)
Delayed rewards, which are separated from their causative actions by irrelevant actions, hamper learning in reinforcement learning (RL). Especially real world problems often contain such delayed and sparse rewards. Recently, return decomposition for delayed rewards (RUDDER) employed pattern recognition to remove or reduce delay in rewards, which dramatically simplifies the learning task of the underlying RL method. RUDDER was realized using a long short-term memory (LSTM). The LSTM was trained to identify important state-action pair patterns, responsible for the return. Reward was then redistributed to these important state-action pairs. However, training the LSTM is often difficult and requires a large number of episodes. In this work, we replace the LSTM with the recently proposed continuous modern Hopfield networks (MHN) and introduce Hopfield-RUDDER. MHN are powerful trainable associative memories with large storage capacity. They require only few training samples and excel at identifying and recognizing patterns. We use this property of MHN to identify important state-action pairs that are associated with low or high return episodes and directly redistribute reward to them. However, in partially observable environments, Hopfield-RUDDER requires additional information about the history of state-action pairs. Therefore, we evaluate several methods for compressing history and introduce reset-max history, a lightweight history compression using the max-operator in combination with a reset gate. We experimentally show that Hopfield-RUDDER is able to outperform LSTM-based RUDDER on various 1D environments with small numbers of episodes. Finally, we show in preliminary experiments that Hopfield-RUDDER scales to highly complex environments with the Minecraft ObtainDiamond task from the MineRL NeurIPS challenge. |
Michael Widrich · Markus Hofmarcher · Vihang Patil · Angela Bitto · Sepp Hochreiter

Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage (Poster)
We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint over the models to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where the additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low- rank MDP with representation learning where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP where the partial coverage condition is defined using density-ratio based concentrability coefficients associated with individual factors. |
Masatoshi Uehara · Wen Sun

Importance of Representation Learning for Off-Policy Fitted Q-Evaluation (Poster)
The goal of offline policy evaluation (OPE) is to evaluate target policies based on logged data with a possibly much different distribution. One of the most popular empirical approaches to OPE is fitted Q-evaluation (FQE). With linear function approximation, several works have found that FQE (and other OPE methods) exhibit exponential error amplification in the problem horizon, except under very strong assumptions. Given the empirical success of deep FQE, in this work we examine the effect of implicit regularization through deep architectures and loss functions on the divergence and performance of FQE. We find that divergence does occur with simple feed-forward architectures, but can be mitigated using various architectures and algorithmic techniques, such as ResNet architectures, learning a shared representation between multiple target policies, and hypermodels. Our results suggest interesting directions for future work, including analyzing the effect of architecture on stability of fixed-point updates which are ubiquitous in modern reinforcement learning. |
Xian Wu · Nevena Lazic · Dong Yin · Cosmin Paduraru
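Since the abstract above assumes familiarity with fitted Q-evaluation, here is a minimal sketch of the basic FQE loop it builds on; the network, data, and evaluation policy below are toy placeholders, not the architectures or benchmarks studied in the paper.

```python
# Minimal fitted Q-evaluation (FQE) sketch for a fixed evaluation policy (illustrative).
import torch
import torch.nn as nn

def fqe(transitions, eval_policy, state_dim, num_actions, iters=50, gamma=0.99, lr=1e-3):
    s, a, r, s_next, done = transitions
    q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    for _ in range(iters):
        with torch.no_grad():
            a_next = eval_policy(s_next)   # actions the evaluation policy would take at s'
            target = r + gamma * (1 - done) * q(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
        pred = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((pred - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return q  # estimate of Q^{pi_e}; averaging it over start states gives the OPE value

# Toy usage with random "logged" transitions and a simple fixed evaluation policy.
torch.manual_seed(0)
data = (torch.randn(256, 4), torch.randint(0, 2, (256,)), torch.randn(256),
        torch.randn(256, 4), torch.zeros(256))
q_hat = fqe(data, eval_policy=lambda s: (s[:, 0] > 0).long(), state_dim=4, num_actions=2)
```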
Offline Contextual Bandits for Wireless Network Optimization (Poster)
The explosion in mobile data traffic together with the ever-increasing expectations for higher quality of service call for the development of new AI algorithms for wireless network optimization. In this paper, we investigate how to learn policies that can automatically adjust the configuration parameters of every cell in the network in response to the changes in the user demand. Our solution combines existent methods for offline learning and adapts them in a principled way to overcome crucial challenges arising in this context. Empirical results suggest that our proposed method will achieve important performance gains when deployed in the real network while satisfying practical constraints on computational efficiency. |
Miguel Suau

Robust On-Policy Data Collection for Data-Efficient Policy Evaluation (Poster)
This paper considers how to complement offline reinforcement learning (RL) data with additional data collection for the task of policy evaluation. In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest. Prior work on offline policy evaluation typically only considers a static dataset. We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset. We show that simply running the evaluation policy – on-policy data collection – is sub-optimal for this setting. We then introduce two new data collection strategies for policy evaluation, both of which consider previously collected data when collecting future data so as to reduce distribution shift (or sampling error) in the entire dataset collected. Our empirical results show that compared to on-policy sampling, our strategies produce data with lower sampling error and generally lead to lower mean-squared error in policy evaluation for any total dataset size. We also show that these strategies can start from initial off-policy data, collect additional data, and then use both the initial and new data to produce low mean-squared error policy evaluation without using off-policy corrections. |
Rujie Zhong · Josiah Hanna · Lukas Schäfer · Stefano Albrecht

Doubly Pessimistic Algorithms for Strictly Safe Off-Policy Optimization (Poster)
Safety in reinforcement learning (RL) has become increasingly important in recent years. Yet, many of existing solutions fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems. In this paper, we study offline RL in the presence of safety requirements: from a dataset collected a priori and without direct access to the true environment, learn an optimal policy that is guaranteed to respect the safety constraints. We first address this problem by modeling the safety requirement as an unknown cost function of states and actions, whose expected value with respect to the policy must fall below a certain threshold. We then present an algorithm in the context of finite-horizon Markov decision processes (MDPs), termed Safe-DPVI that performs in a doubly pessimistic manner when 1) it constructs a conservative set of safe policies; and 2) when it selects a good policy from that conservative set. Without assuming the sufficient coverage of the dataset or any structure for the underlying MDPs, we establish a data-dependent upper bound on the suboptimality gap of the \emph{safe} policy Safe-DPVI returns. We then specialize our results to linear MDPs with appropriate assumptions on dataset being well-explored. Both data-dependent and specialized upper bounds nearly match that of state-of-the-art unsafe offline RL algorithms, with an additional multiplicative factor $\frac{\sum_{h=1}^H\alpha_{h}}{H}$, where $\alpha_h$ characterizes the safety constraint at time-step $h$. We further present numerical simulations that corroborate our theoretical findings.
Sanae Amani · Lin Yang

Offline RL with Resource Constrained Online Deployment (Poster)
Offline reinforcement learning is used to train policies in scenarios where real-time access to the environment is expensive or impossible. As a natural consequence of these harsh conditions, an agent may lack the resources to fully observe the online environment before taking an action. We dub this situation the \newterm{resource-constrained} setting. This leads to situations where the offline dataset (available for training) can contain fully processed features (using powerful language models, image models, complex sensors, etc.) which are not available when actions are actually taken online. This disconnect leads to an interesting and unexplored problem in offline RL: \textbf{Is it possible to use a richly processed offline dataset to train a policy which has access to fewer features in the online environment?} In this work, we introduce and formalize this novel resource-constrained problem setting. We highlight the performance gap between policies trained using the full offline dataset and policies trained using limited features. We address this performance gap with a \newterm{policy transfer algorithm} which first trains a teacher agent using the offline dataset where features are fully available, and then transfers this knowledge to a student agent that only uses the resource-constrained features. To better capture the challenge of this setting, we propose a data collection procedure: Resource Constrained-Datasets for RL (RC-D4RL). We evaluate our transfer algorithm on RC-D4RL and the popular D4RL benchmarks and observe consistent improvement over the baseline (TD3+BC without transfer). |
Jayanth Reddy Regatti · Aniket Anand Deshmukh · Young Jung · Abhishek Gupta · Urun Dogan

Personalization for Web-based Services using Offline Reinforcement Learning (Poster)
Large-scale Web-based services present opportunities for improving UI policies based on observed user interactions. We investigate both the sequential and non-sequential formulations, highlighting their benefits and drawbacks. In the sequential setting, we address challenges of learning such policies through model-free offline Reinforcement Learning (RL) with off-policy training. Deployed in a production system for user authentication in a major social network, it significantly improves long-term objectives. We articulate practical challenges, compare several ML techniques, provide insights on training and evaluation of RL models, and discuss generalizations. |
Pavlos A Apostolopoulos · Zehui Wang · Hanson Wang · Chad Zhou · Kittipat Virochsiri · Norm Zhou · Igor Markov

Offline Reinforcement Learning with Implicit Q-Learning (Poster)
Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This tradeoff is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function, without any explicit policy. Then, we extract the policy via advantage-weighted behavioral cloning, which also avoids querying out-of-sample actions. We dub our method implicit Q-learning (IQL). IQL is easy to implement, computationally efficient, and only requires fitting an additional critic with an asymmetric L2 loss. |
Ilya Kostrikov · Ashvin Nair · Sergey Levine
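To make the expectile idea in the abstract concrete, a stripped-down version of the two IQL regression losses looks roughly like this; it is a simplified sketch of the published method, and the architectures, target-network updates, and advantage-weighted cloning step are left out.

```python
# Sketch of the two regression losses at the heart of IQL (simplified, illustrative).
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared error: positive errors weighted by tau, negative by (1 - tau).
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1 - tau))
    return (weight * diff ** 2).mean()

def iql_losses(q_fn, v_fn, q_target_fn, batch, gamma=0.99, tau=0.7):
    s, a, r, s_next, done = batch
    # V regresses toward an upper expectile of Q(s, a) over dataset actions only.
    v_loss = expectile_loss(q_target_fn(s, a).detach() - v_fn(s), tau)
    # Q uses a backup through V(s'); no actions outside the dataset are ever queried.
    td_target = (r + gamma * (1 - done) * v_fn(s_next)).detach()
    q_loss = ((q_fn(s, a) - td_target) ** 2).mean()
    return v_loss, q_loss

# Toy shape check with placeholder linear critics.
torch.manual_seed(0)
batch = (torch.randn(16, 3), torch.randint(0, 2, (16,)), torch.randn(16),
         torch.randn(16, 3), torch.zeros(16))
print(iql_losses(lambda s, a: s.sum(1) + a, lambda s: s.mean(1), lambda s, a: s.sum(1) + a, batch))
```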
Pessimistic Model Selection for Offline Deep Reinforcement Learning (Poster)
Deep Reinforcement Learning (DRL) has demonstrated great potentials in solving sequential decision making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL. |
Huck Yang · Yifan Cui · Pin-Yu Chen

BATS: Best Action Trajectory Stitching (Poster)
The problem of offline reinforcement learning focuses on learning a good policy from a log of environment interactions. Past efforts for developing algorithms in this area have revolved around introducing constraints to online reinforcement algorithms to ensure the actions of the learned policy are constrained to the logged data. In this work, we explore an alternative approach by planning on the fixed dataset directly. Specifically, we introduce an algorithm which forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. We do this by using learned dynamics models to plan short trajectories between states. Since exact value iteration can be performed on this constructed MDP, it becomes easy to identify which trajectories are advantageous to add to the MDP. Crucially, since most transitions in this MDP come from the logged data, trajectories from the MDP can be rolled out for long periods with confidence. We prove that this property allows one to make upper and lower bounds on the value function up to appropriate distance metrics. Finally, we demonstrate empirically how algorithms that uniformly constrain the learned policy to the entire dataset can result in unwanted behavior, and we show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem. |
Ian Char · Viraj Mehta · Adam Villaflor · John Dolan · Jeff Schneider

Single-Shot Pruning for Offline Reinforcement Learning (Poster)
Deep Reinforcement Learning (RL) is a powerful framework for solving complex real-world problems. Large neural networks employed in the framework are traditionally associated with better generalization capabilities, but their increased size entails the drawbacks of extensive training duration, substantial hardware resources, and longer inference times. One way to tackle this problem is to prune neural networks leaving only the necessary parameters. State-of-the-art concurrent pruning techniques for imposing sparsity perform demonstrably well in applications where data-distributions are fixed. However, they have not yet been substantially explored in the context of RL. We close the gap between RL and single-shot pruning techniques and present a general pruning approach to the Offline RL. We leverage a fixed dataset to prune neural networks before the start of RL training. We then run experiments varying the network sparsity level and evaluating the validity of pruning at initialization techniques in continuous control tasks. Our results show that with 95% of the network weights pruned, Offline-RL algorithms can still retain performance in the majority of our experiments. To the best of our knowledge no prior work utilizing pruning in RL retained performance at such high levels of sparsity. Moreover, pruning at initialization techniques can be easily integrated into any existing Offline-RL algorithms without changing the learning objective. |
Samin Yeasar Arnob · · Sergey Plis · Doina Precup

Offline neural contextual bandits: Pessimism, Optimization and Generalization (Poster)
Offline policy learning (OPL) leverages existing data collected a priori for policy optimization without any active exploration. Despite the prevalence and recent interest in this problem, its theoretical and algorithmic foundations in function approximation settings remain under-developed. In this paper, we consider this problem on the axes of distributional shift, optimization, and generalization in offline contextual bandits with neural networks. In particular, we propose a provably efficient offline contextual bandit with neural network function approximation that does not require any functional assumption on the reward. We show that our method provably generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works. Notably, unlike any other OPL method, our method learns from the offline data in an online manner using stochastic gradient descent, allowing us to leverage the benefits of online learning into an offline setting. Moreover, we show that our method is more computationally efficient and has a better dependence on the effective dimension of the neural network than an online counterpart. Finally, we demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems |
Thanh Nguyen-Tang · Sunil Gupta · A. Tuan Nguyen · Svetha Venkatesh

Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions (Poster)
Reinforcement learning (RL) agents are widely used for solving complex sequential decision-making tasks, but still exhibit difficulty in generalizing to scenarios not seen during training. While prior online approaches demonstrated that using additional signals beyond the reward function can lead to better generalization capabilities in RL agents, i.e. using self-supervised learning (SSL), they struggle in the offline RL setting, i.e. learning from a static dataset. We show that performance of online algorithms for generalization in RL can be hindered in the offline setting due to poor estimation of similarity between observations. We propose a new theoretically-motivated framework called Generalized Similarity Functions (GSF), which uses contrastive learning to train an offline RL agent to aggregate observations based on the similarity of their expected future behavior, where we quantify this similarity using generalized value functions. We show that GSF is general enough to recover existing SSL objectives while also improving zero-shot generalization performance on a complex offline RL benchmark, offline Procgen. |
Bogdan Mazoure · Ilya Kostrikov · Ofir Nachum · Jonathan Tompson

Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning (Poster)
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and would further need to be fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods such as a behaviour cloning loss prevent this to an extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. |
Yi Zhao · Rinu Boney · Alexander Ilin · Juho Kannala · Joni Pajarinen
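One plausible way to read "adaptively weighing the behavior cloning loss" is sketched below on top of a TD3+BC-style actor loss; the specific update heuristic and constants are illustrative assumptions, not the authors' exact scheme.

```python
# Illustrative adaptive weighting of a TD3+BC-style actor loss during online fine-tuning.
import numpy as np

def actor_loss(q_values, pi_actions, data_actions, bc_weight):
    # Larger bc_weight keeps the policy close to the data; smaller lets the RL term dominate.
    bc_term = np.mean((pi_actions - data_actions) ** 2)
    return -np.mean(q_values) + bc_weight * bc_term

def update_bc_weight(bc_weight, recent_returns, lo=0.0, hi=1.0, step=0.05):
    # Assumed heuristic: relax the constraint when returns are improving,
    # tighten it when performance drops (a proxy for training instability).
    improving = len(recent_returns) >= 2 and recent_returns[-1] >= recent_returns[-2]
    bc_weight += -step if improving else step
    return float(np.clip(bc_weight, lo, hi))

# Toy usage: the weight loosens after an improvement and tightens after a drop.
w = 0.5
for returns in ([10.0], [10.0, 12.0], [10.0, 12.0, 9.0]):
    w = update_bc_weight(w, returns)
    print(w)
```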
What Would the Expert $do(\cdot)$?: Causal Imitation Learning (Poster)
We develop algorithms for imitation learning from policy data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance. To break up these spurious correlations, we apply modern variants of the classical instrumental variable regression (IVR) technique, enabling us to recover the causally correct underlying policy without requiring access to an interactive expert. In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline. We discuss, from the perspective of performance, the types of confounding under which it is better to use an IVR-based technique instead of behavioral cloning and vice versa. We find both of our algorithms compare favorably to behavioral cloning on a simulated rocket landing task. |
Gokul Swamy · Sanjiban Choudhury · James Bagnell · Steven Wu

Quantile Filtered Imitation Learning (Poster)
We introduce quantile filtered imitation learning (QFIL), a novel policy improvement operator designed for offline reinforcement learning. QFIL performs policy improvement by running imitation learning on a filtered version of the experience dataset. The filtering process removes s,a pairs whose estimated Q values fall below a given quantile of the pushforward distribution over values induced by sampling actions from the behavior policy. The definitions of both the pushforward Q distribution and resulting value function quantile are key contributions of our method. We prove that QFIL gives us a safe policy improvement step with function approximation and that the choice of quantile provides a natural hyperparameter to trade off bias and variance of the improvement step. Empirically, we perform a synthetic experiment illustrating how QFIL effectively makes a bias-variance tradeoff and we see that QFIL performs well on the D4RL benchmark. |
David Brandfonbrener · Will Whitney · Rajesh Ranganath · Joan Bruna
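The filtering step described in the abstract can be sketched as follows; the behavior-policy sampler, Q-estimate, and quantile level here are placeholders rather than the paper's actual choices.

```python
# Sketch of QFIL-style filtering: keep (s, a) pairs whose Q exceeds a per-state value quantile.
import numpy as np

def qfil_filter(states, actions, q_fn, behavior_sampler, quantile=0.7, n_samples=32):
    keep = []
    for s, a in zip(states, actions):
        # Push the behavior policy's action distribution through Q to get a value distribution at s.
        sampled_actions = behavior_sampler(s, n_samples)
        threshold = np.quantile([q_fn(s, ai) for ai in sampled_actions], quantile)
        if q_fn(s, a) >= threshold:
            keep.append((s, a))
    return keep  # run ordinary imitation learning (e.g., behavior cloning) on this subset

# Toy usage with a quadratic Q-estimate and a Gaussian "behavior policy".
rng = np.random.default_rng(0)
states, actions = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
kept = qfil_filter(states, actions,
                   q_fn=lambda s, a: -np.sum((a - s) ** 2),
                   behavior_sampler=lambda s, n: s + rng.normal(size=(n, s.shape[0])))
print(len(kept))
```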
Benchmarking Sample Selection Strategies for Batch Reinforcement Learning (Poster)
Training sample selection techniques, such as prioritized experience replay (PER), have been recognized as of significant importance for online reinforcement learning algorithms. Efficient sample selection can help further improve the learning efficiency and the final performance. However, the impact of sample selection for batch reinforcement learning (RL) has not been well studied. In this work, we investigate the application of non-uniform sampling techniques in batch RL. In particular, we compare six variants of PER based on various heuristic priority metrics that focus on different aspects of the offline learning setting. These metrics include temporal-difference error, n-step return, self-imitation learning objective, pseudo-count, uncertainty, and likelihood. Through extensive experiments on the standard batch RL datasets, we find that non-uniform sampling is also effective in batch RL settings. Further, there is no single metric that works in all situations. The investigation also shows that it is insufficient to avoid the bootstrapping error in batch reinforcement learning by only changing the sampling scheme.
Yuwei Fu · Di Wu · Benoit Boulet

Dynamic Mirror Descent based Model Predictive Control for Accelerating Robot Learning (Poster)
Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches to get the best from both: asymptotic performance of Mf-RL and high sample efficiency of Mb-RL. Inspired by these works, we propose a hierarchical framework that integrates online learning for the Mb-trajectory optimization with off-policy methods for the Mf-RL. In particular, two loops are proposed, where the Dynamic Mirror Descent based Model Predictive Control (DMD-MPC) is used as the inner loop Mb-RL to obtain an optimal sequence of actions. These actions are in turn used to significantly accelerate the outer loop Mf-RL. We show that our formulation is generic for a broad class of MPC-based policies and objectives, and includes some of the well-known Mb-Mf approaches. We finally introduce a new algorithm: Mirror-Descent Model Predictive RL (M-DeMoRL), which uses Cross-Entropy Method (CEM) with elite fractions for the inner loop. Our experiments show faster convergence of the proposed hierarchical approach on benchmark MuJoCo tasks. We also demonstrate hardware training for trajectory tracking in a 2R leg, and hardware transfer for robust walking in a quadruped. We show that the inner-loop Mb-RL significantly decreases the number of training iterations required in the real system, thereby validating the proposed approach. |
Utkarsh A Mishra · Soumya Samineni · Aditya Varma Sagi · Shalabh Bhatnagar · Shishir N Y

MBAIL: Multi-Batch Best Action Imitation Learning utilizing Sample Transfer and Policy Distillation (Poster)
Most online reinforcement learning (RL) algorithms require a large number of interactions with the environment to learn a reliable control policy. Unfortunately, the assumption of the availability of repeated interactions with the environment does not hold for many real-world applications. Batch RL aims to learn a good control policy from a previously collected dataset without requiring additional interactions with the environment, which are very promising in solving real-world problems. However, in the real world, we may only have a limited amount of data points for certain tasks we are interested in. Also, most of the current batch RL methods are mainly aimed to learn policy over one fixed dataset with which it is hard to learn a policy that can perform well over multiple tasks. In this work, we propose to tackle these challenges with sample transfer and policy distillation. The proposed methods are evaluated on multiple control tasks to showcase their effectiveness. |
Di Wu · · David Meger · Michael Jenkin · Steve Liu · Gregory Dudek

Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Matters (Poster)
Over the recent years, vast progress has been made in Offline Reinforcement Learning (Offline-RL) for various decision-making domains: from finance to robotics. However, comparing and reporting new Offline-RL algorithms has been noted as underdeveloped: (1) use of unlimited online evaluation budget for hyperparameter search (2) sidestepping offline policy selection (3) ad-hoc performance statistics reporting. In this work, we propose an evaluation technique addressing these issues, Expected Online Performance, that provides a performance estimate for a best-found policy given a fixed online evaluation budget. Using our approach, we can estimate the number of online evaluations required to surpass a given behavioral policy performance. Applying it to several Offline-RL baselines, we find that with a limited online evaluation budget, (1) Behavioral Cloning constitutes a strong baseline over various expert levels and data regimes, and (2) offline uniform policy selection is competitive with value-based approaches. We hope the proposed technique will make it into the toolsets of Offline-RL practitioners to help them arrive at informed conclusions when deploying RL in real-world systems. |
Vladislav Kurenkov · Sergey Kolesnikov
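To illustrate the kind of quantity the abstract argues for, the following Monte-Carlo estimate of "expected return of the best policy found after k online evaluations" assumes candidates are picked uniformly at random; that uniform-selection assumption is ours for illustration and not necessarily the paper's exact estimator.

```python
# Monte-Carlo sketch of an expected-online-performance-style curve:
# expected return of the best policy among k uniformly chosen candidates.
import numpy as np

def expected_best_of_k(policy_returns, k, n_draws=10000, rng=None):
    rng = rng or np.random.default_rng(0)
    returns = np.asarray(policy_returns, dtype=float)
    picks = rng.choice(len(returns), size=(n_draws, k), replace=True)  # k policies per budget draw
    return returns[picks].max(axis=1).mean()

# Toy usage: online returns of a few trained candidate policies.
candidate_returns = [12.0, 35.5, 18.2, 40.1, 22.3]
for k in (1, 2, 5):
    print(k, expected_best_of_k(candidate_returns, k))
```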
Offline Reinforcement Learning with Munchausen Regularization (Poster)
Most temporal-difference-based (TD-based) Reinforcement Learning (RL) methods focus on replacing the true value of a transiting state with their current estimate of this value. Munchausen-RL (M-RL) proposes leveraging the current policy to bootstrap RL. The idea of penalizing two consecutive policies that are far from each other is also applicable to offline settings. In our work, we add the Munchausen term to the Q-update step to penalize policies that deviate too far from the previous policy. Our results indicate that this method can be incorporated into various offline Q-learning methods to help improve performance. In addition, we evaluate how prioritized experience replay affects offline RL. Our results show that Munchausen offline RL outperforms the original methods without the regularization term.
Hsin-Yu Liu · Bharathan Balaji · Dezhi Hong
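For concreteness, the Munchausen-augmented target for a discrete-action Q-update looks roughly like the following; the temperature, scaling, and clipping values follow common Munchausen-DQN practice and are assumptions here, not the exact offline variant evaluated above.

```python
# Sketch of a Munchausen-augmented Q-learning target for discrete actions (illustrative).
import torch
import torch.nn.functional as F

def munchausen_target(q_target_net, batch, gamma=0.99, tau=0.03, alpha=0.9, l0=-1.0):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        q_s, q_next = q_target_net(s), q_target_net(s_next)
        log_pi = F.log_softmax(q_s / tau, dim=1)            # soft policy implied by the current Q
        log_pi_next = F.log_softmax(q_next / tau, dim=1)
        # Munchausen bonus: scaled, clipped log-probability of the action actually taken in the data.
        bonus = alpha * torch.clamp(tau * log_pi.gather(1, a.unsqueeze(1)).squeeze(1), min=l0, max=0.0)
        # Soft (entropy-regularized) value of the next state.
        soft_next = (log_pi_next.exp() * (q_next - tau * log_pi_next)).sum(dim=1)
        return r + bonus + gamma * (1 - done) * soft_next

# Toy check with a linear Q-network and random offline transitions.
torch.manual_seed(0)
batch = (torch.randn(8, 4), torch.randint(0, 3, (8,)), torch.randn(8), torch.randn(8, 4), torch.zeros(8))
print(munchausen_target(torch.nn.Linear(4, 3), batch))
```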
Importance of Empirical Sample Complexity Analysis for Offline Reinforcement Learning (Poster)
We hypothesize that empirically studying the sample complexity of offline reinforcement learning (RL) is crucial for the practical applications of RL in the real world. Several recent works have demonstrated the ability to learn policies directly from offline data. In this work, we ask the question of the dependency on the number of samples for learning from offline data. Our objective is to emphasize that studying sample complexity for offline RL is important, and is an indicator of the usefulness of existing offline algorithms. We propose an evaluation approach for sample complexity analysis of offline RL. |
Samin Yeasar Arnob · Riashat Islam · Doina Precup

Discrete Uncertainty Quantification Approach for Offline RL (Poster)
In many Reinforcement Learning tasks, the classical online interaction of the learning agent with the environment is impractical, either because such interaction is expensive or because it is dangerous. In these cases, previously gathered data can be used, giving rise to what is typically called Offline Reinforcement Learning. However, this type of learning faces a large number of challenges, mostly derived from the fact that the exploration/exploitation trade-off is overshadowed. Instead, the historical data is usually biased by the way it was obtained, typically by a sub-optimal controller, producing a distributional shift between the historical data and the data required to learn the optimal policy.
Javier Corrochano · Rubén Majadas · Fernando Fernandez

Pretraining for Language-Conditioned Imitation with Transformers (Poster)
We study reinforcement learning (RL) agents which can utilize language inputs and efficiently learn on downstream tasks. To investigate this, we propose a new multimodal benchmark -- Text-Conditioned Frostbite -- in which an agent must complete tasks specified by text instructions in the Atari Frostbite environment. We curate and release a dataset of 5M text-labelled transitions for training, and to encourage further research in this direction. On this benchmark, we evaluate Text Decision Transformer (TDT), a transformer directly operating on text, state, and action tokens, and find it improves upon baseline architectures. Furthermore, we evaluate the effect of pretraining, finding unsupervised pretraining can yield improved results in low-data settings. |
Aaron Putterman · Kevin Lu · Igor Mordatch · Pieter Abbeel

Stateful Offline Contextual Policy Evaluation and Learning (Poster)
We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals with contexts, which generate unknown individual-level responses to agent actions that induce known transitions. This is a relevant model, for example, for dynamic personalized pricing and other operations management problems in the presence of potentially high-dimensional user types. The individual-level response is not causally affected by the state variable. In this setting, we adapt doubly-robust estimation in the single-timestep setting to the sequential setting so that a state-dependent policy can be learned even from a single timestep's worth of data. We introduce a \textit{marginal MDP} model and study an algorithm for off-policy learning, which can be viewed as fitted value iteration in the marginal MDP. We also provide structural results on when errors in the response model leads to the persistence, rather than attenuation, of error over time. In simulations, we show that the advantages of doubly-robust estimation in the single time-step setting, via unbiased and lower-variance estimation, can directly translate to improved out-of-sample policy performance. This structure-specific analysis sheds light on the underlying structure on a class of problems, operations research/management problems, often heralded as a real-world domain for offline RL, which are in fact qualitatively easier. |
Angela Zhou

Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation (Poster)
We consider offline reinforcement learning, where the goal is to learn a decision making policy from logged data. Offline RL—particularly when coupled with (value) function approximation to allow for generalization in large/continuous state spaces—is becoming increasingly relevant in practice, because it avoids costly and time-consuming online data collection and is well-suited to safety-critical domains. Existing sample complexity guarantees for offline value function approximation methods typically require both (1) distributional assumptions (i.e., good coverage) and (2) representational assumptions (i.e., ability to represent some or all Q-value functions) stronger than what is required for supervised learning. However, the necessity of these conditions and the fundamental limits for offline RL are not well-understood in spite of decades of research. This led Chen and Jiang (2019) to conjecture that concentrability (the most standard notion of coverage) and realizability (the weakest representation condition) alone are not sufficient for sample-efficient offline RL. We resolve this conjecture in the positive by proving (information theoretically) that even if both concentrability and realizability are satisfied, any algorithm requires sample complexity polynomial in the size of the state space to learn a non-trivial policy. Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions beyond what is required in classical supervised learning, and highlight a phenomenon called over-coverage which serves as a fundamental barrier for offline value function approximation methods. |
Dylan Foster · Akshay Krishnamurthy · David Simchi-Levi · Yunzong Xu 🔗 |
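For readers unfamiliar with the coverage condition in the conjecture: the concentrability coefficient bounds how much any policy's state-action occupancy can exceed the data distribution. The snippet below just computes that ratio on a toy tabular example; it is purely illustrative and unrelated to the paper's lower-bound construction.

```python
# Concentrability on a toy tabular problem: C = max over (s, a) of
# d_pi(s, a) / mu(s, a), where mu is the logging distribution and d_pi is the
# occupancy measure of the policy being evaluated. Numbers are made up.
import numpy as np

mu = np.array([[0.20, 0.05],     # logging distribution over 3 states x 2 actions
               [0.25, 0.05],
               [0.40, 0.05]])
d_pi = np.array([[0.10, 0.15],   # occupancy measure of some target policy
                 [0.10, 0.15],
                 [0.10, 0.40]])
concentrability = np.max(d_pi / mu)
print(concentrability)  # 8.0: the target policy visits (s3, a2) far more often than the data does
```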
-
|
Learning Value Functions from Undirected State-only Experience
(
Poster
)
This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e., (s, s', r) tuples). We first theoretically characterize the applicability of Q-learning in this setting. We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space. This theoretical result motivates the design of Latent Action Q-learning (LAQ), an offline RL method that can learn effective value functions from state-only experience by running Q-learning on discrete latent actions obtained through a latent-variable future prediction model. We show that LAQ can recover value functions that have high correlation with value functions learned using ground-truth actions. Value functions learned using LAQ lead to sample-efficient acquisition of goal-directed behavior, can be used with domain-specific low-level controllers, and facilitate transfer across embodiments. Our experiments in 5 environments ranging from 2D grid worlds to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods. |
Matthew Chang · Arjun Gupta · Saurabh Gupta 🔗 |
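A rough tabular sketch of the latent-action idea: assign each logged (s, s') transition a discrete latent action and then run ordinary Q-learning over those latent actions. The crude next-state bucketing below stands in for the paper's latent-variable future-prediction model; everything here is an assumption for illustration.

```python
# Tabular sketch of Latent Action Q-learning: bucket transitions from each
# state by the next state reached (a stand-in "latent action"), then run
# Q-learning over latent actions. Hyperparameters are illustrative.
import numpy as np
from collections import defaultdict

def latent_action_q_learning(transitions, n_states, n_latent=4, gamma=0.99, lr=0.1, epochs=50):
    # transitions: list of (s, s_next, r) tuples with integer states
    by_state = defaultdict(list)
    for i, (s, s_next, _) in enumerate(transitions):
        by_state[s].append((i, s_next))
    latent = np.zeros(len(transitions), dtype=int)
    for s, items in by_state.items():
        next_ids = sorted({sn for _, sn in items})
        for i, sn in items:
            latent[i] = next_ids.index(sn) % n_latent   # crude latent action label
    q = np.zeros((n_states, n_latent))
    for _ in range(epochs):
        for (s, s_next, r), z in zip(transitions, latent):
            target = r + gamma * q[s_next].max()
            q[s, z] += lr * (target - q[s, z])
    return q

# toy chain MDP with reward 1 for reaching state 2
data = [(0, 1, 0.0), (1, 2, 1.0), (0, 0, 0.0), (1, 1, 0.0), (2, 2, 0.0)]
print(latent_action_q_learning(data, n_states=3))
```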
-
|
Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations
(
Poster
)
We study the problem of offline imitation learning (IL), where an agent aims to learn an optimal expert behavior policy without additional online environment interactions. Instead, the agent is provided with a static offline dataset of state-action-next-state transition triples from both optimal and non-optimal expert behaviors. This strictly offline imitation learning problem arises in many real-world settings where environment interactions and expert annotations are costly. Prior works that address the problem either require that expert data occupy the majority of the offline dataset, or need to learn a reward function and perform offline reinforcement learning (RL) based on the learned reward function. In this paper, we propose an imitation learning algorithm that addresses the problem without additional reward learning or offline RL training, for the case where the demonstrations contain a large proportion of suboptimal data. Building upon behavioral cloning (BC), we introduce an additional discriminator to distinguish expert from non-expert data and propose a cooperation strategy to boost the performance of both tasks. This results in a new policy learning objective which, surprisingly, is equivalent to a generalized BC objective in which the outputs of the discriminator serve as the weights of the BC loss function. Experimental results show that the proposed algorithm learns behavior policies that are much closer to the optimal policies than those learned by baseline algorithms. |
Haoran Xu · Xianyuan Zhan · Honglei Yin · 🔗 |
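The weighting idea lends itself to a compact sketch: train a discriminator to separate (assumed) expert transitions from the rest, then use its outputs as per-sample weights on a behavioral-cloning loss. This is a simplified reading of the abstract rather than the authors' released code; in particular, the cooperation strategy between the two objectives is omitted, and the networks and shapes are made up.

```python
# Sketch of discriminator-weighted behavioral cloning. A discriminator d(s, a)
# is trained to score expert-like transitions; its (detached) outputs weight a
# standard BC regression loss. Simplified and illustrative only.
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6
policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                     nn.Linear(256, 1), nn.Sigmoid())
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCELoss()

def update(expert_s, expert_a, mixed_s, mixed_a):
    # 1) discriminator: expert transitions -> 1, unlabeled mixed transitions -> 0
    d_expert = disc(torch.cat([expert_s, expert_a], -1))
    d_mixed = disc(torch.cat([mixed_s, mixed_a], -1))
    d_loss = bce(d_expert, torch.ones_like(d_expert)) + bce(d_mixed, torch.zeros_like(d_mixed))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) weighted BC on the mixed data: the more expert-like, the larger the weight
    with torch.no_grad():
        w = disc(torch.cat([mixed_s, mixed_a], -1))
    bc_loss = (w * ((policy(mixed_s) - mixed_a) ** 2).mean(dim=-1, keepdim=True)).mean()
    opt_pi.zero_grad(); bc_loss.backward(); opt_pi.step()
    return d_loss.item(), bc_loss.item()

print(update(torch.randn(32, 17), torch.randn(32, 6),
             torch.randn(128, 17), torch.randn(128, 6)))
```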
-
|
Model-Based Offline Planning with Trajectory Pruning
(
Poster
)
Offline reinforcement learning (RL) enables learning policies from pre-collected datasets without environment interaction, which provides a promising direction for making RL usable in real-world systems. Although recent offline RL studies have achieved much progress, existing methods still face many practical challenges in real-world system control tasks, such as computational restrictions during agent training and the requirement of extra control flexibility. The model-based planning framework provides an attractive solution for such tasks. However, most model-based planning algorithms are not designed for offline settings. Simply combining the ingredients of offline RL with existing methods either provides over-restrictive planning or leads to inferior performance. We propose a new lightweight model-based offline planning framework, namely MOPP, which tackles the dilemma between the restrictions of offline learning and high-performance planning. MOPP encourages more aggressive trajectory rollouts guided by the behavior policy learned from data, and prunes out problematic trajectories to avoid potential out-of-distribution samples. Experimental results show that MOPP provides competitive performance compared with existing model-based offline planning and RL approaches. |
Xianyuan Zhan · Xiangyu Zhu · Haoran Xu 🔗 |
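A stripped-down sketch of the general recipe the abstract describes, assuming a learned dynamics ensemble, a reward model, and a behavior policy are available as callables: sample action sequences around the behavior policy, roll them out in the models, discard rollouts with high ensemble disagreement, and execute the first action of the best surviving trajectory. This mirrors the overall idea rather than MOPP's exact procedure; all thresholds and names are assumptions.

```python
# Generic model-based offline planning loop with trajectory pruning
# (illustrative, not the MOPP algorithm). Returns None if every candidate
# rollout is pruned for excessive ensemble disagreement.
import numpy as np

def plan(s0, dynamics_ensemble, reward_fn, behavior_policy,
         horizon=10, n_candidates=256, noise=0.1, disagreement_threshold=0.5, rng=None):
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        s, total, max_disagreement, first_action = s0, 0.0, 0.0, None
        for t in range(horizon):
            base = behavior_policy(s)                       # rollout guided by the behavior policy
            a = base + noise * rng.normal(size=np.shape(base))
            if t == 0:
                first_action = a
            preds = np.stack([f(s, a) for f in dynamics_ensemble])
            max_disagreement = max(max_disagreement, preds.std(axis=0).max())
            total += reward_fn(s, a)
            s = preds.mean(axis=0)
        # prune rollouts that wander where the models disagree (likely out-of-distribution)
        if max_disagreement <= disagreement_threshold and total > best_return:
            best_return, best_first_action = total, first_action
    return best_first_action

# toy usage with a hand-written linear "ensemble"
ensemble = [lambda s, a, w=w: s + 0.1 * a + 0.01 * w for w in (-1.0, 0.0, 1.0)]
print(plan(np.zeros(3), ensemble, lambda s, a: -np.sum(s ** 2), lambda s: np.zeros(3)))
```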
-
|
TRAIL: Near-Optimal Imitation Learning with Suboptimal Data
(
Poster
)
The aim in imitation learning is to learn effective policies by utilizing near-optimal expert demonstrations. However, high-quality demonstrations from human experts can be expensive to obtain in large numbers. On the other hand, it is often much easier to obtain large quantities of suboptimal or task-agnostic trajectories, which are not useful for direct imitation, but can nevertheless provide insight into the dynamical structure of the environment, showing what could be done in the environment even if not what should be done. Is it possible to formalize these conceptual benefits and devise algorithms that use offline datasets to yield provable improvements to the sample-efficiency of imitation learning? In this work, we study this question and present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data. To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model contrastively and uses the transition model to reparametrize the action space for sample-efficient imitation learning. We evaluate the practicality of our objective through experiments on a set of navigation and locomotion tasks. Our results verify the benefits suggested by our theory and show that TRAIL is able to recover near-optimal policies with fewer expert trajectories. |
Mengjiao (Sherry) Yang · Sergey Levine · Ofir Nachum 🔗 |
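One way to read the reparametrization step: train an energy-based transition model of the factored form E(s, a, s') = phi(s, a) . psi(s') contrastively, then treat phi(s, a) as a latent action. The sketch below shows only the contrastive transition model under these assumptions; the latent-space imitation and the decoding back to raw actions are omitted, and none of this is the TRAIL implementation.

```python
# Contrastive (InfoNCE-style) training of a factored transition model
# E(s, a, s') = phi(s, a) . psi(s'), whose learned phi can serve as a latent
# action space. A simplified reading of the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, latent_dim = 11, 3, 32
phi = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
psi = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
opt = torch.optim.Adam(list(phi.parameters()) + list(psi.parameters()), lr=1e-3)

def contrastive_step(s, a, s_next):
    z_sa = phi(torch.cat([s, a], dim=-1))     # (B, latent_dim): candidate latent actions
    z_next = psi(s_next)                      # (B, latent_dim)
    logits = z_sa @ z_next.t()                # (B, B): diagonal entries are the true next states
    labels = torch.arange(s.shape[0])
    loss = F.cross_entropy(logits, labels)    # other rows' next states act as negatives
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

batch = [torch.randn(64, d) for d in (state_dim, action_dim, state_dim)]
print(contrastive_step(*batch))
```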
-
|
Offline Meta-Reinforcement Learning for Industrial Insertion
(
Poster
)
Reinforcement learning (RL) can in principle enable robots to automatically adapt to new tasks, but in practice current RL methods require a very large number of trials to accomplish this. In this paper, we tackle rapid adaptation to new tasks through the framework of meta-learning, which utilizes past tasks to learn to adapt, with a specific focus on industrial insertion tasks. We address two specific challenges that arise when applying meta-learning in this setting. First, conventional meta-RL algorithms require lengthy online meta-training phases. We show that this can be replaced with appropriately chosen offline data, resulting in an offline meta-RL method that only requires demonstrations and trials from each of the prior tasks, without the need to run costly meta-RL procedures online. Second, meta-RL methods can fail to generalize to new tasks that are too different from those seen at meta-training time, which poses a particular challenge in industrial applications, where high success rates are critical. We address this by combining contextual meta-learning with direct online finetuning: if the new task is similar to those seen in the prior data, then the contextual meta-learner adapts immediately, and if it is too different, it gradually adapts through finetuning. We show that our approach is able to quickly adapt to a variety of different insertion tasks, learning to perform them with a success rate of 100% using only a fraction of the samples needed to learn the tasks from scratch. Experiment videos and details are available at https://sites.google.com/view/oda-anon. |
Tony Zhao · Jianlan Luo · Oleg Sushkov · Rugile Pevceviciute · Nicolas Heess · Jonathan Scholz · Stefan Schaal · Sergey Levine 🔗 |
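At a high level, the described recipe is: encode a handful of demonstration transitions from the new task into a context vector, condition the policy on it, and fall back to online finetuning when the context-conditioned policy is not good enough. The sketch below shows only the context-conditioned policy piece, with made-up module names and dimensions; the offline training losses and the finetuning fallback are not shown.

```python
# Sketch of a context-conditioned policy for offline meta-RL: a permutation-
# invariant encoder summarizes a few demonstration transitions from the new
# task into a context vector, and the policy conditions on it. Illustrative only.
import torch
import torch.nn as nn

state_dim, action_dim, ctx_dim = 24, 7, 16

class ContextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * state_dim + action_dim, 128),
                                 nn.ReLU(), nn.Linear(128, ctx_dim))
    def forward(self, demo_s, demo_a, demo_s_next):
        # mean-pool per-transition embeddings -> one context vector per task
        x = torch.cat([demo_s, demo_a, demo_s_next], dim=-1)
        return self.net(x).mean(dim=0)

encoder = ContextEncoder()
policy = nn.Sequential(nn.Linear(state_dim + ctx_dim, 256), nn.ReLU(),
                       nn.Linear(256, action_dim), nn.Tanh())

# "adapt" to a new task from 5 demonstration transitions, then act
ctx = encoder(torch.randn(5, state_dim), torch.randn(5, action_dim), torch.randn(5, state_dim))
action = policy(torch.cat([torch.randn(state_dim), ctx], dim=-1))
print(action.shape)  # torch.Size([7])
```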
-
|
Sim-to-Real Interactive Recommendation via Off-Dynamics Reinforcement Learning
(
Poster
)
Interactive recommender systems (IRS) have received growing attention due to their awareness of long-term engagement and dynamic preferences. Although the long-term planning perspective of reinforcement learning (RL) naturally fits the IRS setup, RL methods require a large amount of online user interaction, which is restricted for economic reasons. To train agents with limited interaction data, previous works often rely on building simulators to mimic user behaviors in real systems. This poses potential challenges to the success of sim-to-real transfer. In practice, such transfer easily fails, as user dynamics are highly unpredictable and sensitive to the type of recommendation task. To address this issue, we propose a novel method, S2R-Rec, to bridge the sim-to-real gap via off-dynamics RL. Generally, we expect the policy learned by interacting only with the simulator to perform well in the real environment. To achieve this, we conduct dynamics adaptation to calibrate the difference in state transitions using reward correction. Furthermore, we reduce the representation discrepancy of items across the two domains via representation adaptation. Instead of separating the above into two stages, we propose to jointly adapt the dynamics and representations, leading to a unified learning objective. Experiments on real-world datasets validate the superiority of our approach, which achieves about 33.18% improvements compared to the baselines. |
Junda Wu · Zhihui Xie · Tong Yu · Qizhi Li · Shuai Li 🔗 |
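The dynamics-adaptation step resembles the generic off-dynamics RL recipe in which a reward correction delta_r(s, a, s') = log p_real(s'|s, a) - log p_sim(s'|s, a) is estimated from two domain classifiers. The sketch below computes that correction from classifiers assumed to be already trained; it follows the general off-dynamics idea, not S2R-Rec specifically, and all networks are placeholders.

```python
# Reward correction for training in a simulator whose dynamics differ from the
# real system: delta_r = log p_real(s'|s,a) - log p_sim(s'|s,a), estimated with
# two domain classifiers (over (s,a,s') and over (s,a)) assumed pre-trained to
# distinguish real from simulated transitions. Illustrative sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 10, 4
clf_sas = nn.Sequential(nn.Linear(2 * state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 2))
clf_sa = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 2))
# logit index 0 = "simulated", index 1 = "real"

def reward_correction(s, a, s_next):
    log_p_sas = F.log_softmax(clf_sas(torch.cat([s, a, s_next], dim=-1)), dim=-1)
    log_p_sa = F.log_softmax(clf_sa(torch.cat([s, a], dim=-1)), dim=-1)
    # Bayes-rule cancellation leaves an estimate of log p_real(s'|s,a) - log p_sim(s'|s,a)
    return (log_p_sas[:, 1] - log_p_sas[:, 0]) - (log_p_sa[:, 1] - log_p_sa[:, 0])

delta_r = reward_correction(torch.randn(8, state_dim), torch.randn(8, action_dim),
                            torch.randn(8, state_dim))
print(delta_r.shape)  # torch.Size([8]): added to the simulator reward during policy training
```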
-
|
Why so pessimistic? Estimating uncertainties for offline rl through ensembles, and why their independence matters
(
Poster
)
In offline/batch reinforcement learning (RL), the predominant and most successful class of approaches has been "support constraint" methods, where trained policies are encouraged to remain within the support of the provided offline dataset. However, support constraints correspond to an overly pessimistic assumption that actions outside the provided data may lead to worst-case outcomes. In this work, we aim to relax this assumption by obtaining uncertainty estimates for predicted action values and acting conservatively with respect to a lower-confidence bound (LCB) on these estimates. Motivated by the success of ensembles for uncertainty estimation in supervised learning, we propose MSG, an offline RL method that employs an ensemble of independently updated Q-functions. First, theoretically, by referring to the literature on infinite-width neural networks, we demonstrate the crucial dependence of the quality of the derived uncertainties on the manner in which ensembling is performed, a phenomenon that arises due to the dynamic programming nature of RL and is overlooked by existing offline RL methods. Our theoretical predictions are corroborated by pedagogical examples on toy MDPs, as well as empirical comparisons in benchmark continuous control domains. In the significantly more challenging antmaze domains of the D4RL benchmark, MSG with deep ensembles surpasses highly well-tuned state-of-the-art methods by a wide margin. Consequently, we investigate whether efficient approximations can be similarly effective. We demonstrate that while some very efficient variants also outperform the current state of the art, they do not match the performance and robustness of MSG with deep ensembles. We hope that the significant impact of our less pessimistic approach engenders increased focus on uncertainty estimation techniques directed at RL, and spurs new efforts from the deep network uncertainty estimation community. |
Kamyar Ghasemipour · Shixiang (Shane) Gu · Ofir Nachum 🔗 |
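The core mechanism, independently updated ensemble members and a lower-confidence-bound objective, can be illustrated briefly. The sketch below shows the LCB computation and per-member bootstrapped targets, assuming each Q-network regresses only onto its own target copy (the independence the abstract emphasizes); the full actor-critic training loop is omitted and the hyperparameters are placeholders.

```python
# Sketch of the pessimistic ensemble idea: K independently updated Q-functions,
# each bootstrapping from its OWN target network, and a lower-confidence bound
# (mean - beta * std) used for policy extraction. Illustrative only.
import torch
import torch.nn as nn

state_dim, action_dim, K, beta, gamma = 12, 4, 8, 4.0, 0.99
def make_q():
    return nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))
qs = [make_q() for _ in range(K)]
target_qs = [make_q() for _ in range(K)]

def member_targets(r, s_next, a_next):
    # independence: member i bootstraps from target network i only, no shared min/mean
    x = torch.cat([s_next, a_next], dim=-1)
    with torch.no_grad():
        return [r + gamma * tq(x) for tq in target_qs]

def lcb(s, a):
    x = torch.cat([s, a], dim=-1)
    values = torch.stack([q(x) for q in qs], dim=0)   # (K, B, 1)
    return values.mean(dim=0) - beta * values.std(dim=0)

s, a, r = torch.randn(32, state_dim), torch.randn(32, action_dim), torch.randn(32, 1)
print(lcb(s, a).shape, len(member_targets(r, s, a)))  # torch.Size([32, 1]) 8
```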
-
|
Example-Based Offline Reinforcement Learning without Rewards
(
Poster
)
Offline reinforcement learning (RL) methods, which tackle the problem of learning a policy from a static dataset, have shown promise for deploying RL in real-world scenarios. Offline RL allows the re-use and accumulation of large datasets while mitigating the safety concerns that arise in online exploration. However, prior works require human-defined reward labels to learn from offline datasets. Reward specification remains a major challenge for deep RL algorithms and also poses an issue for offline RL in the real world, since designing reward functions can take considerable manual effort and potentially requires installing extra hardware, such as visual sensors on robots, to detect the completion of a task. In contrast, in many settings it is easier for users to provide examples of a completed task, such as images, than to specify a complex reward function. Based on this observation, we propose an algorithm that can learn behaviors from offline datasets without reward labels, instead using a small number of example images. Our method learns a conservative classifier that directly learns a Q-function from the offline dataset and the successful examples while penalizing the Q-values to prevent distributional shift. Through extensive empirical results, we find that our method outperforms prior imitation learning algorithms and inverse RL methods that directly learn rewards by 53% in vision-based robot manipulation domains. |
Kyle Hatch · Tianhe Yu · Rafael Rafailov · Chelsea Finn 🔗 |
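A heavily simplified sketch of the two ingredients named in the abstract: a classifier-style Q-function whose positives are user-provided success examples, plus a conservative penalty that pushes down values on actions sampled away from the dataset. The exact losses below (including treating dataset transitions as negatives) are assumptions made for illustration, not the paper's method.

```python
# Simplified sketch: Q interpreted as a success classifier, trained against
# success examples, with a CQL-style conservative penalty on random actions.
# All losses and coefficients are assumptions, not the paper's objective.
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 9, 3, 1.0
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                      nn.Linear(256, 1), nn.Sigmoid())   # Q read as P(success)
opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
bce = nn.BCELoss()

def update(success_s, success_a, data_s, data_a):
    q_pos = q_net(torch.cat([success_s, success_a], dim=-1))
    q_data = q_net(torch.cat([data_s, data_a], dim=-1))
    # crude classification loss: success examples positive, dataset transitions negative
    cls_loss = bce(q_pos, torch.ones_like(q_pos)) + bce(q_data, torch.zeros_like(q_data))
    # conservative term: penalize high values on random (likely out-of-distribution) actions
    random_a = 2 * torch.rand_like(data_a) - 1
    conservative = q_net(torch.cat([data_s, random_a], dim=-1)).mean()
    loss = cls_loss + alpha * conservative
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(update(torch.randn(16, state_dim), torch.randn(16, action_dim),
             torch.randn(64, state_dim), torch.randn(64, action_dim)))
```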
-
|
The Reflective Explorer: Online Meta-Exploration from Offline Data in Realistic Robotic Tasks
(
Poster
)
Reinforcement learning is difficult to apply to real-world problems due to high sample complexity, the need to adapt to frequent distribution shifts, and the complexities of learning from high-dimensional inputs such as images. Over the last several years, meta-learning has emerged as a promising approach to tackle these problems by explicitly training an agent to quickly adapt to new tasks. However, such methods still require huge amounts of data during training and are difficult to optimize in high-dimensional domains. One potential solution is offline or batch meta-reinforcement learning (RL): learning from existing datasets without additional environment interactions during training. In this work we develop the first offline model-based meta-RL algorithm that operates from images in tasks with sparse rewards. Our approach has three main components: a novel strategy to construct meta-exploration trajectories from offline data, which allows agents to learn a meaningful meta-test-time task inference strategy; representation learning via variational filtering; and latent conservative model-free policy optimization. We show that our method completely solves a realistic meta-learning task involving robot manipulation, while naive combinations of previous approaches fail. |
Rafael Rafailov · · Tianhe Yu · Avi Singh · Mariano Phielipp · Chelsea Finn 🔗 |
Author Information
Rishabh Agarwal (Google Research, Brain Team)
My research mainly revolves around deep reinforcement learning (RL), often with the goal of making RL methods suitable for real-world problems; this work includes an outstanding paper award at NeurIPS.
Aviral Kumar (UC Berkeley)
George Tucker (Google Brain)
Justin Fu (UC Berkeley)
Nan Jiang (University of Illinois at Urbana-Champaign)
Doina Precup (McGill University / Mila / DeepMind Montreal)
More from the Same Authors
-
2021 : Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning »
Cameron Voloshin · Hoang Le · Nan Jiang · Yisong Yue -
2021 Spotlight: Neural Additive Models: Interpretable Machine Learning with Neural Nets »
Rishabh Agarwal · Levi Melnick · Nicholas Frosst · Xuezhou Zhang · Ben Lengerich · Rich Caruana · Geoffrey Hinton -
2021 : Data Sharing without Rewards in Multi-Task Offline Reinforcement Learning »
Tianhe Yu · Aviral Kumar · Yevgen Chebotar · Chelsea Finn · Sergey Levine · Karol Hausman -
2021 : Should I Run Offline Reinforcement Learning or Behavioral Cloning? »
Aviral Kumar · Joey Hong · Anikait Singh · Sergey Levine -
2021 : DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization »
Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine -
2021 : Offline Policy Selection under Uncertainty »
Mengjiao (Sherry) Yang · Bo Dai · Ofir Nachum · George Tucker · Dale Schuurmans -
2021 : Behavior Predictive Representations for Generalization in Reinforcement Learning »
Siddhant Agarwal · Aaron Courville · Rishabh Agarwal -
2021 : Single-Shot Pruning for Offline Reinforcement Learning »
Samin Yeasar Arnob · · Sergey Plis · Doina Precup -
2021 : Importance of Empirical Sample Complexity Analysis for Offline Reinforcement Learning »
Samin Yeasar Arnob · Riashat Islam · Doina Precup -
2022 Poster: Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret »
Jiawei Huang · Li Zhao · Tao Qin · Wei Chen · Nan Jiang · Tie-Yan Liu -
2022 : A Novel Stochastic Gradient Descent Algorithm for LearningPrincipal Subspaces »
Charline Le Lan · Joshua Greaves · Jesse Farebrother · Mark Rowland · Fabian Pedregosa · Rishabh Agarwal · Marc Bellemare -
2022 : The Paradox of Choice: On the Role of Attention in Hierarchical Reinforcement Learning »
Andrei Nica · Khimya Khetarpal · Doina Precup -
2022 : Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks »
Jesse Farebrother · Joshua Greaves · Rishabh Agarwal · Charline Le Lan · Ross Goroshin · Pablo Samuel Castro · Marc Bellemare -
2022 : Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes »
Aviral Kumar · Rishabh Agarwal · XINYANG GENG · George Tucker · Sergey Levine -
2022 : Pre-Training for Robots: Leveraging Diverse Multitask Data via Offline Reinforcement Learning »
Aviral Kumar · Anikait Singh · Frederik Ebert · Yanlai Yang · Chelsea Finn · Sergey Levine -
2022 : Offline Reinforcement Learning from Heteroskedastic Data Via Support Constraints »
Anikait Singh · Aviral Kumar · Quan Vuong · Yevgen Chebotar · Sergey Levine -
2022 : Multi-Environment Pretraining Enables Transfer to Action Limited Datasets »
David Venuto · Mengjiao (Sherry) Yang · Pieter Abbeel · Doina Precup · Igor Mordatch · Ofir Nachum -
2022 : Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios »
Yiren Lu · Justin Fu · George Tucker · Xinlei Pan · Eli Bronstein · Rebecca Roelofs · Benjamin Sapp · Brandyn White · Aleksandra Faust · Shimon Whiteson · Dragomir Anguelov · Sergey Levine -
2022 : Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks »
Jesse Farebrother · Joshua Greaves · Rishabh Agarwal · Charline Le Lan · Ross Goroshin · Pablo Samuel Castro · Marc Bellemare -
2022 : Confidence-Conditioned Value Functions for Offline Reinforcement Learning »
Joey Hong · Aviral Kumar · Sergey Levine -
2022 : Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting »
Qiyang Li · Aviral Kumar · Ilya Kostrikov · Sergey Levine -
2022 : Revisiting Bellman Errors for Offline Model Selection »
Joshua Zitovsky · Rishabh Agarwal · Daniel de Marchi · Michael Kosorok -
2022 : Bayesian Q-learning With Imperfect Expert Demonstrations »
Fengdi Che · Xiru Zhu · Doina Precup · David Meger · Gregory Dudek -
2022 : Trajectory-based Explainability Framework for Offline RL »
Shripad Deshmukh · Arpan Dasgupta · Chirag Agarwal · Nan Jiang · Balaji Krishnamurthy · Georgios Theocharous · Jayakumar Subramanian -
2022 : AMORE: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data »
Tengyang Xie · Mohak Bhardwaj · Nan Jiang · Ching-An Cheng -
2022 : Complete the Missing Half: Augmenting Aggregation Filtering with Diversification for Graph Convolutional Networks »
Sitao Luan · Mingde Zhao · Chenqing Hua · Xiao-Wen Chang · Doina Precup -
2022 : Revisiting Bellman Errors for Offline Model Selection »
Joshua Zitovsky · Daniel de Marchi · Rishabh Agarwal · Michael Kosorok -
2022 : Confidence-Conditioned Value Functions for Offline Reinforcement Learning »
Joey Hong · Aviral Kumar · Sergey Levine -
2022 : Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks »
Jesse Farebrother · Joshua Greaves · Rishabh Agarwal · Charline Le Lan · Ross Goroshin · Pablo Samuel Castro · Marc Bellemare -
2022 : Bayesian Q-learning With Imperfect Expert Demonstrations »
Fengdi Che · Xiru Zhu · Doina Precup · David Meger · Gregory Dudek -
2022 : Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting »
Qiyang Li · Aviral Kumar · Ilya Kostrikov · Sergey Levine -
2022 : Pre-Training for Robots: Leveraging Diverse Multitask Data via Offline Reinforcement Learning »
Anikait Singh · Aviral Kumar · Frederik Ebert · Yanlai Yang · Chelsea Finn · Sergey Levine -
2022 : Offline Reinforcement Learning from Heteroskedastic Data Via Support Constraints »
Anikait Singh · Aviral Kumar · Quan Vuong · Yevgen Chebotar · Sergey Levine -
2022 : Investigating Multi-task Pretraining and Generalization in Reinforcement Learning »
Adrien Ali Taiga · Rishabh Agarwal · Jesse Farebrother · Aaron Courville · Marc Bellemare -
2023 Poster: For SALE: State-Action Representation Learning for Deep Reinforcement Learning »
Scott Fujimoto · Wei-Di Chang · Edward Smith · Shixiang (Shane) Gu · Doina Precup · David Meger -
2023 Poster: ReDS: Offline RL With Heteroskedastic Datasets via Support Constraints »
Anikait Singh · Aviral Kumar · Quan Vuong · Yevgen Chebotar · Sergey Levine -
2023 Poster: When Do Graph Neural Networks Help with Node Classification: Investigating the Homophily Principle on Node Distinguishability »
Sitao Luan · Chenqing Hua · Minkai Xu · Qincheng Lu · Jiaqi Zhu · Xiao-Wen Chang · Jie Fu · Jure Leskovec · Doina Precup -
2023 Poster: A Definition of Continual Reinforcement Learning »
David Abel · Andre Barreto · Benjamin Van Roy · Doina Precup · Hado van Hasselt · Satinder Singh -
2023 Poster: Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets »
Zhang-Wei Hong · Aviral Kumar · Sathwik Karnik · Abhishek Bhandwaldar · Akash Srivastava · Joni Pajarinen · Romain Laroche · Abhishek Gupta · Pulkit Agrawal -
2023 Poster: Prediction and Control in Continual Reinforcement Learning »
Nishanth Anand · Doina Precup -
2023 Poster: Future-Dependent Value-Based Off-Policy Evaluation in POMDPs »
Masatoshi Uehara · Haruka Kiyohara · Andrew Bennett · Victor Chernozhukov · Nan Jiang · Nathan Kallus · Chengchun Shi · Wen Sun -
2023 Poster: Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning »
Mitsuhiko Nakamoto · Yuexiang Zhai · Anikait Singh · Max Sobol Mark · Yi Ma · Chelsea Finn · Aviral Kumar · Sergey Levine -
2023 Poster: Adversarial Model for Offline Reinforcement Learning »
Mohak Bhardwaj · Tengyang Xie · Byron Boots · Nan Jiang · Ching-An Cheng -
2023 Poster: DriveMax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research »
Cole Gulino · Justin Fu · Wenjie Luo · George Tucker · Eli Bronstein · Yiren Lu · Jean Harb · Xinlei Pan · Yan Wang · Xiangyu Chen · John Co-Reyes · Rishabh Agarwal · Rebecca Roelofs · Yao Lu · Nico Montali · Paul Mougin · Zoey Yang · Brandyn White · Aleksandra Faust · Rowan McAllister · Dragomir Anguelov · Benjamin Sapp -
2022 : Ilya Kostrikov, Aviral Kumar »
Ilya Kostrikov · Aviral Kumar -
2022 : Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes »
Aviral Kumar · Rishabh Agarwal · XINYANG GENG · George Tucker · Sergey Levine -
2022 Spotlight: Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret »
Jiawei Huang · Li Zhao · Tao Qin · Wei Chen · Nan Jiang · Tie-Yan Liu -
2022 Spotlight: Lightning Talks 4A-1 »
Jiawei Huang · Su Jia · Abdurakhmon Sadiev · Ruomin Huang · Yuanyu Wan · Denizalp Goktas · Jiechao Guan · Andrew Li · Wei-Wei Tu · Li Zhao · Amy Greenwald · Jiawei Huang · Dmitry Kovalev · Yong Liu · Wenjie Liu · Peter Richtarik · Lijun Zhang · Zhiwu Lu · R Ravi · Tao Qin · Wei Chen · Hu Ding · Nan Jiang · Tie-Yan Liu -
2022 Spotlight: Lightning Talks 3B-3 »
Sitao Luan · Zhiyuan You · Ruofan Liu · Linhao Qu · Yuwei Fu · Jiaxi Wang · Chunyu Wei · Jian Liang · xiaoyuan luo · Di Wu · Yun Lin · Lei Cui · Ji Wu · Chenqing Hua · Yujun Shen · Qincheng Lu · XIANGLIN YANG · Benoit Boulet · Manning Wang · Di Liu · Lei Huang · Fei Wang · Kai Yang · Jiaqi Zhu · Jin Song Dong · Zhijian Song · Xin Lu · Mingde Zhao · Shuyuan Zhang · Yu Zheng · Xiao-Wen Chang · Xinyi Le · Doina Precup -
2022 Spotlight: Revisiting Heterophily For Graph Neural Networks »
Sitao Luan · Chenqing Hua · Qincheng Lu · Jiaqi Zhu · Mingde Zhao · Shuyuan Zhang · Xiao-Wen Chang · Doina Precup -
2022 : Simulating Human Gaze with Neural Visual Attention »
Leo Schwinn · Doina Precup · Bjoern Eskofier · Dario Zanca -
2022 : Democratizing RL Research by Reusing Prior Computation »
Rishabh Agarwal -
2022 : Simulating Human Gaze with Neural Visual Attention »
Leo Schwinn · Doina Precup · Bjoern Eskofier · Dario Zanca -
2022 Workshop: 3rd Offline Reinforcement Learning Workshop: Offline RL as a "Launchpad" »
Aviral Kumar · Rishabh Agarwal · Aravind Rajeswaran · Wenxuan Zhou · George Tucker · Doina Precup -
2022 Poster: Oracle Inequalities for Model Selection in Offline Reinforcement Learning »
Jonathan N Lee · George Tucker · Ofir Nachum · Bo Dai · Emma Brunskill -
2022 Poster: Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions »
Audrey Huang · Nan Jiang -
2022 Poster: Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress »
Rishabh Agarwal · Max Schwarzer · Pablo Samuel Castro · Aaron Courville · Marc Bellemare -
2022 Poster: Revisiting Heterophily For Graph Neural Networks »
Sitao Luan · Chenqing Hua · Qincheng Lu · Jiaqi Zhu · Mingde Zhao · Shuyuan Zhang · Xiao-Wen Chang · Doina Precup -
2022 Poster: DASCO: Dual-Generator Adversarial Support Constrained Offline Reinforcement Learning »
Quan Vuong · Aviral Kumar · Sergey Levine · Yevgen Chebotar -
2022 Poster: Interaction-Grounded Learning with Action-Inclusive Feedback »
Tengyang Xie · Akanksha Saran · Dylan J Foster · Lekan Molu · Ida Momennejad · Nan Jiang · Paul Mineiro · John Langford -
2022 Poster: A Few Expert Queries Suffices for Sample-Efficient RL with Resets and Linear Value Approximation »
Philip Amortila · Nan Jiang · Dhruv Madeka · Dean Foster -
2022 Poster: On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL »
Jinglin Chen · Aditya Modi · Akshay Krishnamurthy · Nan Jiang · Alekh Agarwal -
2022 Poster: Data-Driven Offline Decision-Making via Invariant Representation Learning »
Han Qi · Yi Su · Aviral Kumar · Sergey Levine -
2022 Poster: Continuous MDP Homomorphisms and Homomorphic Policy Gradient »
Sahand Rezaei-Shoshtari · Rosie Zhao · Prakash Panangaden · David Meger · Doina Precup -
2021 : Speaker Intro »
Aviral Kumar · George Tucker -
2021 : Speaker Intro »
Aviral Kumar · George Tucker -
2021 : Retrospective Panel »
Sergey Levine · Nando de Freitas · Emma Brunskill · Finale Doshi-Velez · Nan Jiang · Rishabh Agarwal -
2021 : Invited Speaker Panel »
Sham Kakade · Minmin Chen · Philip Thomas · Angela Schoellig · Barbara Engelhardt · Doina Precup · George Tucker -
2021 : Speaker Intro »
Rishabh Agarwal · Aviral Kumar -
2021 : Speaker Intro »
Rishabh Agarwal · Aviral Kumar -
2021 : Opening Remarks »
Rishabh Agarwal · Aviral Kumar -
2021 : Behavior Predictive Representations for Generalization in Reinforcement Learning »
Siddhant Agarwal · Aaron Courville · Rishabh Agarwal -
2021 : Data-Driven Offline Optimization for Architecting Hardware Accelerators »
Aviral Kumar · Amir Yazdanbakhsh · Milad Hashemi · Kevin Swersky · Sergey Levine -
2021 : DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization Q&A »
Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine -
2021 : DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization »
Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine -
2021 Poster: Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning »
Siyuan Zhang · Nan Jiang -
2021 Poster: Bellman-consistent Pessimism for Offline Reinforcement Learning »
Tengyang Xie · Ching-An Cheng · Nan Jiang · Paul Mineiro · Alekh Agarwal -
2021 Poster: COMBO: Conservative Offline Model-Based Policy Optimization »
Tianhe Yu · Aviral Kumar · Rafael Rafailov · Aravind Rajeswaran · Sergey Levine · Chelsea Finn -
2021 Poster: Coupled Gradient Estimators for Discrete Latent Variables »
Zhe Dong · Andriy Mnih · George Tucker -
2021 Oral: Deep Reinforcement Learning at the Edge of the Statistical Precipice »
Rishabh Agarwal · Max Schwarzer · Pablo Samuel Castro · Aaron Courville · Marc Bellemare -
2021 Oral: Bellman-consistent Pessimism for Offline Reinforcement Learning »
Tengyang Xie · Ching-An Cheng · Nan Jiang · Paul Mineiro · Alekh Agarwal -
2021 Poster: Neural Additive Models: Interpretable Machine Learning with Neural Nets »
Rishabh Agarwal · Levi Melnick · Nicholas Frosst · Xuezhou Zhang · Ben Lengerich · Rich Caruana · Geoffrey Hinton -
2021 Poster: Conservative Data Sharing for Multi-Task Offline Reinforcement Learning »
Tianhe Yu · Aviral Kumar · Yevgen Chebotar · Karol Hausman · Sergey Levine · Chelsea Finn -
2021 Poster: Deep Reinforcement Learning at the Edge of the Statistical Precipice »
Rishabh Agarwal · Max Schwarzer · Pablo Samuel Castro · Aaron Courville · Marc Bellemare -
2021 Poster: Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning »
Tengyang Xie · Nan Jiang · Huan Wang · Caiming Xiong · Yu Bai -
2021 Poster: Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability »
Dibya Ghosh · Jad Rahme · Aviral Kumar · Amy Zhang · Ryan Adams · Sergey Levine -
2020 : Design-Bench: Benchmarks for Data-Driven Offline Model-Based Optimization »
Brandon Trabucco · Aviral Kumar · XINYANG GENG · Sergey Levine -
2020 : Conservative Objective Models: A Simple Approach to Effective Model-Based Optimization »
Brandon Trabucco · Aviral Kumar · XINYANG GENG · Sergey Levine -
2020 : Closing remarks »
Raymond Chua · Feryal Behbahani · Julie J Lee · Rui Ponte Costa · Doina Precup · Blake Richards · Ida Momennejad -
2020 : Invited Talk #7 QnA - Yael Niv »
Yael Niv · Doina Precup · Raymond Chua · Feryal Behbahani -
2020 : Speaker Introduction: Yael Niv »
Doina Precup · Raymond Chua · Feryal Behbahani -
2020 : Towards Reliable Validation and Evaluation for Offline RL »
Nan Jiang -
2020 : Contributed Talk #3: Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning »
Rishabh Agarwal · Marlos C. Machado · Pablo Samuel Castro · Marc Bellemare -
2020 : Panel »
Emma Brunskill · Nan Jiang · Nando de Freitas · Finale Doshi-Velez · Sergey Levine · John Langford · Lihong Li · George Tucker · Rishabh Agarwal · Aviral Kumar -
2020 Workshop: Offline Reinforcement Learning »
Aviral Kumar · Rishabh Agarwal · George Tucker · Lihong Li · Doina Precup -
2020 : Introduction »
Aviral Kumar · George Tucker · Rishabh Agarwal -
2020 : Panel Discussions »
Grace Lindsay · George Konidaris · Shakir Mohamed · Kimberly Stachenfeld · Peter Dayan · Yael Niv · Doina Precup · Catherine Hartley · Ishita Dasgupta -
2020 Workshop: Biological and Artificial Reinforcement Learning »
Raymond Chua · Feryal Behbahani · Julie J Lee · Sara Zannone · Rui Ponte Costa · Blake Richards · Ida Momennejad · Doina Precup -
2020 : Organizers Opening Remarks »
Raymond Chua · Feryal Behbahani · Julie J Lee · Ida Momennejad · Rui Ponte Costa · Blake Richards · Doina Precup -
2020 : Keynote: Doina Precup »
Doina Precup -
2020 Poster: Model Inversion Networks for Model-Based Optimization »
Aviral Kumar · Sergey Levine -
2020 Poster: Reward Propagation Using Graph Convolutional Networks »
Martin Klissarov · Doina Precup -
2020 Spotlight: Reward Propagation Using Graph Convolutional Networks »
Martin Klissarov · Doina Precup -
2020 Poster: RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning »
Caglar Gulcehre · Ziyu Wang · Alexander Novikov · Thomas Paine · Sergio Gómez · Konrad Zolna · Rishabh Agarwal · Josh Merel · Daniel Mankowitz · Cosmin Paduraru · Gabriel Dulac-Arnold · Jerry Li · Mohammad Norouzi · Matthew Hoffman · Nicolas Heess · Nando de Freitas -
2020 Poster: DisARM: An Antithetic Gradient Estimator for Binary Latent Variables »
Zhe Dong · Andriy Mnih · George Tucker -
2020 Spotlight: DisARM: An Antithetic Gradient Estimator for Binary Latent Variables »
Zhe Dong · Andriy Mnih · George Tucker -
2020 Poster: Conservative Q-Learning for Offline Reinforcement Learning »
Aviral Kumar · Aurick Zhou · George Tucker · Sergey Levine -
2020 Tutorial: (Track3) Offline Reinforcement Learning: From Algorithm Design to Practical Applications Q&A »
Sergey Levine · Aviral Kumar -
2020 Poster: One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL »
Saurabh Kumar · Aviral Kumar · Sergey Levine · Chelsea Finn -
2020 Poster: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay »
Scott Fujimoto · David Meger · Doina Precup -
2020 Poster: Forethought and Hindsight in Credit Assignment »
Veronica Chelu · Doina Precup · Hado van Hasselt -
2020 Poster: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction »
Aviral Kumar · Abhishek Gupta · Sergey Levine -
2020 Spotlight: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction »
Aviral Kumar · Abhishek Gupta · Sergey Levine -
2020 Tutorial: (Track3) Offline Reinforcement Learning: From Algorithm Design to Practical Applications »
Sergey Levine · Aviral Kumar -
2019 : Panel Session: A new hope for neuroscience »
Yoshua Bengio · Blake Richards · Timothy Lillicrap · Ila Fiete · David Sussillo · Doina Precup · Konrad Kording · Surya Ganguli -
2019 : Poster and Coffee Break 2 »
Karol Hausman · Kefan Dong · Ken Goldberg · Lihong Li · Lin Yang · Lingxiao Wang · Lior Shani · Liwei Wang · Loren Amdahl-Culleton · Lucas Cassano · Marc Dymetman · Marc Bellemare · Marcin Tomczak · Margarita Castro · Marius Kloft · Marius-Constantin Dinu · Markus Holzleitner · Martha White · Mengdi Wang · Michael Jordan · Mihailo Jovanovic · Ming Yu · Minshuo Chen · Moonkyung Ryu · Muhammad Zaheer · Naman Agarwal · Nan Jiang · Niao He · Nikolaus Yasui · Nikos Karampatziakis · Nino Vieillard · Ofir Nachum · Olivier Pietquin · Ozan Sener · Pan Xu · Parameswaran Kamalaruban · Paul Mineiro · Paul Rolland · Philip Amortila · Pierre-Luc Bacon · Prakash Panangaden · Qi Cai · Qiang Liu · Quanquan Gu · Raihan Seraj · Richard Sutton · Rick Valenzano · Robert Dadashi · Rodrigo Toro Icarte · Roshan Shariff · Roy Fox · Ruosong Wang · Saeed Ghadimi · Samuel Sokota · Sean Sinclair · Sepp Hochreiter · Sergey Levine · Sergio Valcarcel Macua · Sham Kakade · Shangtong Zhang · Sheila McIlraith · Shie Mannor · Shimon Whiteson · Shuai Li · Shuang Qiu · Wai Lok Li · Siddhartha Banerjee · Sitao Luan · Tamer Basar · Thinh Doan · Tianhe Yu · Tianyi Liu · Tom Zahavy · Toryn Klassen · Tuo Zhao · Vicenç Gómez · Vincent Liu · Volkan Cevher · Wesley Suttle · Xiao-Wen Chang · Xiaohan Wei · Xiaotong Liu · Xingguo Li · Xinyi Chen · Xingyou Song · Yao Liu · YiDing Jiang · Yihao Feng · Yilun Du · Yinlam Chow · Yinyu Ye · Yishay Mansour · · Yonathan Efroni · Yongxin Chen · Yuanhao Wang · Bo Dai · Chen-Yu Wei · Harsh Shrivastava · Hongyang Zhang · Qinqing Zheng · SIDDHARTHA SATPATHI · Xueqing Liu · Andreu Vall -
2019 : Poster Presentations »
Rahul Mehta · Andrew Lampinen · Binghong Chen · Sergio Pascual-Diaz · Jordi Grau-Moya · Aldo Faisal · Jonathan Tompson · Yiren Lu · Khimya Khetarpal · Martin Klissarov · Pierre-Luc Bacon · Doina Precup · Thanard Kurutach · Aviv Tamar · Pieter Abbeel · Jinke He · Maximilian Igl · Shimon Whiteson · Wendelin Boehmer · Raphaël Marinier · Olivier Pietquin · Karol Hausman · Sergey Levine · Chelsea Finn · Tianhe Yu · Lisa Lee · Benjamin Eysenbach · Emilio Parisotto · Eric Xing · Ruslan Salakhutdinov · Hongyu Ren · Anima Anandkumar · Deepak Pathak · Christopher Lu · Trevor Darrell · Alexei Efros · Phillip Isola · Feng Liu · Bo Han · Gang Niu · Masashi Sugiyama · Saurabh Kumar · Janith Petangoda · Johan Ferret · James McClelland · Kara Liu · Animesh Garg · Robert Lange -
2019 : Poster Session »
Matthia Sabatelli · Adam Stooke · Amir Abdi · Paulo Rauber · Leonard Adolphs · Ian Osband · Hardik Meisheri · Karol Kurach · Johannes Ackermann · Matt Benatan · GUO ZHANG · Chen Tessler · Dinghan Shen · Mikayel Samvelyan · Riashat Islam · Murtaza Dalal · Luke Harries · Andrey Kurenkov · Konrad Żołna · Sudeep Dasari · Kristian Hartikainen · Ofir Nachum · Kimin Lee · Markus Holzleitner · Vu Nguyen · Francis Song · Christopher Grimm · Felipe Leno da Silva · Yuping Luo · Yifan Wu · Alex Lee · Thomas Paine · Wei-Yang Qu · Daniel Graves · Yannis Flet-Berliac · Yunhao Tang · Suraj Nair · Matthew Hausknecht · Akhil Bagaria · Simon Schmitt · Bowen Baker · Paavo Parmas · Benjamin Eysenbach · Lisa Lee · Siyu Lin · Daniel Seita · Abhishek Gupta · Riley Simmons-Edler · Yijie Guo · Kevin Corder · Vikash Kumar · Scott Fujimoto · Adam Lerer · Ignasi Clavera Gilaberte · Nicholas Rhinehart · Ashvin Nair · Ge Yang · Lingxiao Wang · Sungryull Sohn · J. Fernando Hernandez-Garcia · Xian Yeow Lee · Rupesh Srivastava · Khimya Khetarpal · Chenjun Xiao · Luckeciano Carvalho Melo · Rishabh Agarwal · Tianhe Yu · Glen Berseth · Devendra Singh Chaplot · Jie Tang · Anirudh Srinivasan · Tharun Kumar Reddy Medini · Aaron Havens · Misha Laskin · Asier Mujika · Rohan Saphal · Joseph Marino · Alex Ray · Joshua Achiam · Ajay Mandlekar · Zhuang Liu · Danijar Hafner · Zhiwen Tang · Ted Xiao · Michael Walton · Jeff Druce · Ferran Alet · Zhang-Wei Hong · Stephanie Chan · Anusha Nagabandi · Hao Liu · Hao Sun · Ge Liu · Dinesh Jayaraman · John Co-Reyes · Sophia Sanborn -
2019 : Contributed Talks »
Rishabh Agarwal · Adam Gleave · Kimin Lee -
2019 : Poster Spotlight 2 »
Aaron Sidford · Mengdi Wang · Lin Yang · Yinyu Ye · Zuyue Fu · Zhuoran Yang · Yongxin Chen · Zhaoran Wang · Ofir Nachum · Bo Dai · Ilya Kostrikov · Dale Schuurmans · Ziyang Tang · Yihao Feng · Lihong Li · Denny Zhou · Qiang Liu · Rodrigo Toro Icarte · Ethan Waldie · Toryn Klassen · Rick Valenzano · Margarita Castro · Simon Du · Sham Kakade · Ruosong Wang · Minshuo Chen · Tianyi Liu · Xingguo Li · Zhaoran Wang · Tuo Zhao · Philip Amortila · Doina Precup · Prakash Panangaden · Marc Bellemare -
2019 : Panel Discussion »
Richard Sutton · Doina Precup -
2019 : Poster and Coffee Break 1 »
Aaron Sidford · Aditya Mahajan · Alejandro Ribeiro · Alex Lewandowski · Ali H Sayed · Ambuj Tewari · Angelika Steger · Anima Anandkumar · Asier Mujika · Hilbert J Kappen · Bolei Zhou · Byron Boots · Chelsea Finn · Chen-Yu Wei · Chi Jin · Ching-An Cheng · Christina Yu · Clement Gehring · Craig Boutilier · Dahua Lin · Daniel McNamee · Daniel Russo · David Brandfonbrener · Denny Zhou · Devesh Jha · Diego Romeres · Doina Precup · Dominik Thalmeier · Eduard Gorbunov · Elad Hazan · Elena Smirnova · Elvis Dohmatob · Emma Brunskill · Enrique Munoz de Cote · Ethan Waldie · Florian Meier · Florian Schaefer · Ge Liu · Gergely Neu · Haim Kaplan · Hao Sun · Hengshuai Yao · Jalaj Bhandari · James A Preiss · Jayakumar Subramanian · Jiajin Li · Jieping Ye · Jimmy Smith · Joan Bas Serrano · Joan Bruna · John Langford · Jonathan Lee · Jose A. Arjona-Medina · Kaiqing Zhang · Karan Singh · Yuping Luo · Zafarali Ahmed · Zaiwei Chen · Zhaoran Wang · Zhizhong Li · Zhuoran Yang · Ziping Xu · Ziyang Tang · Yi Mao · David Brandfonbrener · Shirli Di-Castro · Riashat Islam · Zuyue Fu · Abhishek Naik · Saurabh Kumar · Benjamin Petit · Angeliki Kamoutsi · Simone Totaro · Arvind Raghunathan · Rui Wu · Donghwan Lee · Dongsheng Ding · Alec Koppel · Hao Sun · Christian Tjandraatmadja · Mahdi Karami · Jincheng Mei · Chenjun Xiao · Junfeng Wen · Zichen Zhang · Ross Goroshin · Mohammad Pezeshki · Jiaqi Zhai · Philip Amortila · Shuo Huang · Mariya Vasileva · El houcine Bergou · Adel Ahmadyan · Haoran Sun · Sheng Zhang · Lukas Gruber · Yuanhao Wang · Tetiana Parshakova -
2019 : Invited Talk: Hierarchical Reinforcement Learning: Computational Advances and Neuroscience Connections »
Doina Precup -
2019 : Panel Discussion led by Grace Lindsay »
Grace Lindsay · Blake Richards · Doina Precup · Jacqueline Gottlieb · Jeff Clune · Jane Wang · Richard Sutton · Angela Yu · Ida Momennejad -
2019 : Poster Session »
Ahana Ghosh · Javad Shafiee · Akhilan Boopathy · Alex Tamkin · Theodoros Vasiloudis · Vedant Nanda · Ali Baheri · Paul Fieguth · Andrew Bennett · Guanya Shi · Hao Liu · Arushi Jain · Jacob Tyo · Benjie Wang · Boxiao Chen · Carroll Wainwright · Chandramouli Shama Sastry · Chao Tang · Daniel S. Brown · David Inouye · David Venuto · Dhruv Ramani · Dimitrios Diochnos · Divyam Madaan · Dmitrii Krashenikov · Joel Oren · Doyup Lee · Eleanor Quint · elmira amirloo · Matteo Pirotta · Gavin Hartnett · Geoffroy Dubourg-Felonneau · Gokul Swamy · Pin-Yu Chen · Ilija Bogunovic · Jason Carter · Javier Garcia-Barcos · Jeet Mohapatra · Jesse Zhang · Jian Qian · John Martin · Oliver Richter · Federico Zaiter · Tsui-Wei Weng · Karthik Abinav Sankararaman · Kyriakos Polymenakos · Lan Hoang · mahdieh abbasi · Marco Gallieri · Mathieu Seurin · Matteo Papini · Matteo Turchetta · Matthew Sotoudeh · Mehrdad Hosseinzadeh · Nathan Fulton · Masatoshi Uehara · Niranjani Prasad · Oana-Maria Camburu · Patrik Kolaric · Philipp Renz · Prateek Jaiswal · Reazul Hasan Russel · Riashat Islam · Rishabh Agarwal · Alexander Aldrick · Sachin Vernekar · Sahin Lale · Sai Kiran Narayanaswami · Samuel Daulton · Sanjam Garg · Sebastian East · Shun Zhang · Soheil Dsidbari · Justin Goodwin · Victoria Krakovna · Wenhao Luo · Wesley Chung · Yuanyuan Shi · Yuh-Shyang Wang · Hongwei Jin · Ziping Xu -
2019 : Opening Remarks »
Raymond Chua · Feryal Behbahani · Sara Zannone · Rui Ponte Costa · Claudia Clopath · Doina Precup · Blake Richards -
2019 Workshop: Biological and Artificial Reinforcement Learning »
Raymond Chua · Sara Zannone · Feryal Behbahani · Rui Ponte Costa · Claudia Clopath · Blake Richards · Doina Precup -
2019 Poster: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction »
Aviral Kumar · Justin Fu · George Tucker · Sergey Levine -
2019 Poster: Graph Normalizing Flows »
Jenny Liu · Aviral Kumar · Jimmy Ba · Jamie Kiros · Kevin Swersky -
2019 Poster: Energy-Inspired Models: Learning with Sampler-Induced Distributions »
Dieterich Lawson · George Tucker · Bo Dai · Rajesh Ranganath -
2019 Poster: Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse »
James Lucas · George Tucker · Roger Grosse · Mohammad Norouzi -
2019 Poster: Break the Ceiling: Stronger Multi-scale Deep Graph Convolutional Networks »
Sitao Luan · Mingde Zhao · Xiao-Wen Chang · Doina Precup -
2019 Poster: Provably Efficient Q-Learning with Low Switching Cost »
Yu Bai · Tengyang Xie · Nan Jiang · Yu-Xiang Wang -
2018 : Spotlights »
Guangneng Hu · Ke Li · Aviral Kumar · Phi Vu Tran · Samuel G. Fadel · Rita Kuznetsova · Bong-Nam Kang · Behrouz Haji Soleimani · Jinwon An · Nathan de Lara · Anjishnu Kumar · Tillman Weyde · Melanie Weber · Kristen Altenburger · Saeed Amizadeh · Xiaoran Xu · Yatin Nandwani · Yang Guo · Maria Pacheco · William Fedus · Guillaume Jaume · Yuka Yoneda · Yunpu Ma · Yunsheng Bai · Berk Kapicioglu · Maximilian Nickel · Fragkiskos Malliaros · Beier Zhu · Aleksandar Bojchevski · Joshua Joseph · Gemma Roig · Esma Balkir · Xander Steenbrugge -
2018 Poster: Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion »
Jacob Buckman · Danijar Hafner · George Tucker · Eugene Brevdo · Honglak Lee -
2018 Oral: Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion »
Jacob Buckman · Danijar Hafner · George Tucker · Eugene Brevdo · Honglak Lee -
2018 Poster: Temporal Regularization for Markov Decision Process »
Pierre Thodoroff · Audrey Durand · Joelle Pineau · Doina Precup -
2018 Poster: Learning Safe Policies with Expert Guidance »
Jessie Huang · Fa Wu · Doina Precup · Yang Cai -
2017 : Panel Discussion »
Matt Botvinick · Emma Brunskill · Marcos Campos · Jan Peters · Doina Precup · David Silver · Josh Tenenbaum · Roy Fox -
2017 : Progress on Deep Reinforcement Learning with Temporal Abstraction (Doina Precup) »
Doina Precup -
2017 : Doina Precup »
Doina Precup -
2017 Workshop: Hierarchical Reinforcement Learning »
Andrew G Barto · Doina Precup · Shie Mannor · Tom Schaul · Roy Fox · Carlos Florensa -
2017 Poster: EX2: Exploration with Exemplar Models for Deep Reinforcement Learning »
Justin Fu · John Co-Reyes · Sergey Levine -
2017 Poster: REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models »
George Tucker · Andriy Mnih · Chris J Maddison · John Lawson · Jascha Sohl-Dickstein -
2017 Spotlight: EX2: Exploration with Exemplar Models for Deep Reinforcement Learning »
Justin Fu · John Co-Reyes · Sergey Levine -
2017 Oral: REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models »
George Tucker · Andriy Mnih · Chris J Maddison · John Lawson · Jascha Sohl-Dickstein -
2017 Poster: Filtering Variational Objectives »
Chris Maddison · John Lawson · George Tucker · Nicolas Heess · Mohammad Norouzi · Andriy Mnih · Arnaud Doucet · Yee Teh -
2016 Workshop: The Future of Interactive Machine Learning »
Kory Mathewson @korymath · Kaushik Subramanian · Mark Ho · Robert Loftin · Joseph L Austerweil · Anna Harutyunyan · Doina Precup · Layla El Asri · Matthew Gombolay · Jerry Zhu · Sonia Chernova · Charles Isbell · Patrick M Pilarski · Weng-Keen Wong · Manuela Veloso · Julie A Shah · Matthew Taylor · Brenna Argall · Michael Littman -
2015 Poster: Data Generation as Sequential Decision Making »
Philip Bachman · Doina Precup -
2015 Spotlight: Data Generation as Sequential Decision Making »
Philip Bachman · Doina Precup -
2015 Poster: Basis refinement strategies for linear value function approximation in MDPs »
Gheorghe Comanici · Doina Precup · Prakash Panangaden -
2014 Workshop: From Bad Models to Good Policies (Sequential Decision Making under Uncertainty) »
Odalric-Ambrym Maillard · Timothy A Mann · Shie Mannor · Jeremie Mary · Laurent Orseau · Thomas Dietterich · Ronald Ortner · Peter Grünwald · Joelle Pineau · Raphael Fonteneau · Georgios Theocharous · Esteban D Arcaute · Christos Dimitrakakis · Nan Jiang · Doina Precup · Pierre-Luc Bacon · Marek Petrik · Aviv Tamar -
2014 Poster: Optimizing Energy Production Using Policy Search and Predictive State Representations »
Yuri Grinberg · Doina Precup · Michel Gendreau -
2014 Poster: Learning with Pseudo-Ensembles »
Philip Bachman · Ouais Alsharif · Doina Precup -
2014 Spotlight: Optimizing Energy Production Using Policy Search and Predictive State Representations »
Yuri Grinberg · Doina Precup · Michel Gendreau -
2013 Poster: Learning from Limited Demonstrations »
Beomjoon Kim · Amir-massoud Farahmand · Joelle Pineau · Doina Precup -
2013 Poster: Bellman Error Based Feature Generation using Random Projections on Sparse Spaces »
Mahdi Milani Fard · Yuri Grinberg · Amir-massoud Farahmand · Joelle Pineau · Doina Precup -
2013 Spotlight: Learning from Limited Demonstrations »
Beomjoon Kim · Amir-massoud Farahmand · Joelle Pineau · Doina Precup -
2012 Poster: Value Pursuit Iteration »
Amir-massoud Farahmand · Doina Precup -
2012 Poster: On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization »
Andre S Barreto · Doina Precup · Joelle Pineau -
2011 Poster: Reinforcement Learning using Kernel-Based Stochastic Factorization »
Andre S Barreto · Doina Precup · Joelle Pineau -
2009 Poster: Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation »
Hamid R Maei · Csaba Szepesvari · Shalabh Batnaghar · Doina Precup · David Silver · Richard Sutton -
2009 Spotlight: Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation »
Hamid R Maei · Csaba Szepesvari · Shalabh Batnaghar · Doina Precup · David Silver · Richard Sutton -
2008 Poster: Bounding Performance Loss in Approximate MDP Homomorphisms »
Doina Precup · Jonathan Taylor Taylor · Prakash Panangaden