While offline RL focuses on learning solely from fixed datasets, one of the main lessons from the previous edition of the offline RL workshop was that large-scale RL applications typically want to use offline RL as part of a bigger system rather than as an end goal in itself. Thus, we propose to shift the focus from algorithm design and offline RL applications to how offline RL can be a launchpad, i.e., a tool or a starting point, for solving challenges in sequential decision-making such as exploration, generalization, transfer, safety, and adaptation. In particular, we are interested in studying and discussing methods for learning expressive models, policies, skills, and value functions from data that can help us make progress towards efficiently tackling these challenges, which are otherwise often intractable.
Submission site: https://openreview.net/group?id=NeurIPS.cc/2022/Workshop/Offline_RL. The submission deadline is September 25, 2022 (Anywhere on Earth). Please refer to the submission page for more details.
Fri 6:20 a.m. - 6:30 a.m.

Opening Remarks
Fri 6:30 a.m. - 7:00 a.m.

Offline RL in the context of "Collect and Infer" (Martin Riedmiller)
(Invited Talk)
Fri 7:00 a.m. - 7:10 a.m.

Efficient Planning in a Compact Latent Action Space
(Contributed Talk)
Fri 7:10 a.m. - 7:20 a.m.

Control Graph as Unified IO for Morphology-Task Generalization
(Contributed Talk)
Fri 7:20 a.m. - 7:30 a.m.

Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
(Contributed Talk)
Fri 7:35 a.m. - 8:05 a.m.

AV2.0: Learning to Drive at a Global Scale (Alex Kendall)
(Invited Talk)
Fri 8:05 a.m. - 9:10 a.m.

Poster Session 1
(Poster Session)
Fri 9:10 a.m. - 9:40 a.m.

Learning from Suboptimal Demonstrations with No Rewards (Dorsa Sadigh)
(Invited Talk)
Fri 9:40 a.m. - 10:30 a.m.

Break
Fri 10:45 a.m. - 11:30 a.m.

Panel Discussion 1 - Applications
(Panel Discussion)
Kee-Eung Kim (remote), Vijay Badrinarayanan (remote), Taylor Killian (in-person), Tony Jebara (in-person)
Fri 11:30 a.m. - 11:40 a.m.

Choreographer: Learning and Adapting Skills in Imagination
(Contributed Talk)
Fri 11:40 a.m. - 11:50 a.m.

Provable Benefits of Representational Transfer in Reinforcement Learning
(Contributed Talk)
Fri 11:50 a.m. - 12:00 p.m.

Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning
(Contributed Talk)
Fri 12:00 p.m. - 1:00 p.m.

Poster Session 2
(Poster Session)
Fri 1:00 p.m. - 1:30 p.m.

Reinforcement Learning and LTV at Spotify (Tony Jebara)
(Invited Talk)
Fri 1:30 p.m. - 2:00 p.m.

Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient (Wen Sun)
(Invited Talk)
Fri 2:00 p.m. - 3:00 p.m.

Panel Discussion 2 - Research
(Panel Discussion)
Martha White (remote), Chelsea Finn (in-person), Wen Sun (in-person), Vincent Vanhoucke (remote)
Fri 3:00 p.m. - 3:30 p.m.

Identification of Dead-ends in Safety-Critical Offline RL (Taylor Killian)
(Invited Talk)


Agent-Controller Representations: Principled Offline RL with Rich Exogenous Information
(Poster)
Learning to control an agent from data collected offline in a rich pixel-based visual observation space is vital for real-world applications of reinforcement learning (RL). A major challenge in this setting is the presence of input information that is hard to model and irrelevant to controlling the agent. This problem has been approached by the theoretical RL community through the lens of exogenous information, i.e., any control-irrelevant information contained in observations. For example, a robot navigating in busy streets needs to ignore irrelevant information, such as other people walking in the background, textures of objects, or birds in the sky. In this paper, we focus on the setting with visually detailed exogenous information, and introduce new offline RL benchmarks offering the ability to study this problem. We find that contemporary representation learning techniques can fail on datasets where the noise is a complex and time-dependent process, which is prevalent in practical applications. To address this, we propose to use multi-step inverse models, which have seen a great deal of interest in the RL theory community, to learn Agent-Controller Representations for Offline-RL (ACRO). Despite being simple and requiring no reward, we show theoretically and empirically that the representation created by this objective greatly outperforms baselines.
Riashat Islam · Manan Tomar · Alex Lamb · Hongyu Zang · Yonathan Efroni · Dipendra Misra · Aniket Didolkar · Xin Li · Harm Van Seijen · Remi Tachet des Combes · John Langford



Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks
(Poster)
Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning; accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function.
Jesse Farebrother · Joshua Greaves · Rishabh Agarwal · Charline Le Lan · Ross Goroshin · Pablo Samuel Castro · Marc Bellemare


Confidence-Conditioned Value Functions for Offline Reinforcement Learning
(Poster)
Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This can be alleviated if we are instead able to learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
Joey Hong · Aviral Kumar · Sergey Levine


Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting
(Poster)
Deep reinforcement learning algorithms that learn policies by trial-and-error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained elusive. Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses such as non-stationarity, excessive action distribution shift, and overfitting. We perform a thorough empirical analysis on state-based DeepMind control suite (DMC) tasks in a controlled and systematic way to show that statistical overfitting on the temporal-difference (TD) error is the main culprit that severely affects the performance of deep RL algorithms, and that prior methods that lead to good performance do, in fact, control the amount of statistical overfitting. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on a notion of validation temporal-difference error by utilizing any form of regularization technique from supervised learning. We show that a simple online model selection method that targets the statistical overfitting issue is effective across state-based DMC and Gym tasks.
Qiyang Li · Aviral Kumar · Ilya Kostrikov · Sergey Levine


Domain Generalization for Robust Model-Based Offline RL
(Poster)
Existing offline reinforcement learning (RL) algorithms typically assume that training data is either: 1) generated by a known policy, or 2) of entirely unknown origin. We consider multi-demonstrator offline RL, a middle ground where we know which demonstrators generated each dataset, but make no assumptions about the underlying policies of the demonstrators. This is the most natural setting when collecting data from multiple human operators, yet remains unexplored. Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain. Specifically, we propose Domain-Invariant Model-based Offline RL (DIMORL), where we apply Risk Extrapolation (REx) (Krueger et al., 2020) to the process of learning dynamics and reward models. Our results show that models trained with REx exhibit improved domain generalization performance when compared with the natural baseline of pooling all demonstrators' data. We observe that the resulting models frequently enable the learning of superior policies in the offline model-based RL setting, can improve the stability of the policy learning process, and potentially increase exploration.
Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger


Squeezing more value out of your historical data: data-augmented behavioural cloning as launchpad for reinforcement learning
(Poster)
In many real-world applications, collecting large, high-quality datasets may be too costly or impractical. Offline reinforcement learning (RL) aims to infer an optimal decision-making policy from a fixed set of data. Getting the most information from this dataset is then vital for good performance. We propose a model-based data augmentation strategy, Trajectory Stitching (TS), to improve the quality of suboptimal trajectories. TS introduces unseen actions joining previously disconnected states: using a probabilistic notion of state reachability, it effectively 'stitches' together parts of the historical demonstrations to generate new, higher-quality ones. A stitching event consists of a transition between a pair of observed states through a synthetic and highly probable action. New actions are introduced only when they are expected to be beneficial, according to an estimated state-value function. We show that using supervised learning, namely behavioural cloning (BC), to extract a decision-making policy from the new TS dataset leads to improvements over the behaviour-cloned policy from the original dataset. Improving over the BC policy could then be used as a launchpad for online RL through planning and demonstration-guided RL.
Charles Hepburn · Giovanni Montana


Keep Calm and Carry Offline: Policy refinement in offline reinforcement learning
(Poster)
The ability to discover optimal behaviour from fixed data sets has the potential to transfer the successes of reinforcement learning (RL) to domains where data collection is acutely problematic. In this offline setting, a key challenge is overcoming overestimation bias for actions not present in the data, which, without the ability to correct it via interaction with the environment, can propagate and compound during training, leading to highly suboptimal policies. One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC), which encourages agents to pick actions closer to the source data. By finding the right balance between RL and BC, such approaches have been shown to be surprisingly effective while requiring minimal changes to the underlying algorithms they are based on. To date, this balance has been held constant, but in this work we explore the idea of tipping this balance towards RL following initial training. Using TD3+BC, we demonstrate that by continuing to train a policy offline while reducing the influence of the BC component, we can produce refined policies that outperform the original baseline, as well as match or exceed the performance of more complex alternative approaches. Furthermore, we show these refined policies can be fine-tuned online while largely mitigating severe performance drops.
Alex Beeson · Giovanni Montana
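
The balance between the RL and BC terms described in this abstract can be illustrated with a minimal sketch. It uses the TD3+BC actor objective (a normalized Q term plus a squared-error cloning term) on scalar actions; the `bc_weight` multiplier and the linear schedule `linear_bc_weight` are hypothetical stand-ins for "reducing the influence of the BC component", not the authors' procedure.

```python
def td3bc_actor_loss(q_values, policy_actions, data_actions, alpha=2.5, bc_weight=1.0):
    """Actor loss to minimize: -lambda * mean(Q) + bc_weight * mean((pi(s) - a)^2).

    q_values are Q(s, pi(s)) for a batch; actions are scalars for simplicity.
    bc_weight is a hypothetical knob: 1.0 gives the usual TD3+BC balance,
    smaller values tip the balance towards the RL term.
    """
    n = len(q_values)
    mean_abs_q = sum(abs(q) for q in q_values) / n
    lam = alpha / max(mean_abs_q, 1e-8)  # normalizes the scale of the Q term
    rl_term = -lam * sum(q_values) / n
    bc_term = sum((p - a) ** 2 for p, a in zip(policy_actions, data_actions)) / n
    return rl_term + bc_weight * bc_term


def linear_bc_weight(step, total_steps, start=1.0, end=0.0):
    """Hypothetical schedule: move the BC weight linearly from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


# Example: the same batch evaluated early and late in a refinement phase.
q = [2.0, 2.0]
pi_a = [0.5, 0.5]
data_a = [0.0, 0.0]
early = td3bc_actor_loss(q, pi_a, data_a, bc_weight=linear_bc_weight(0, 100))
late = td3bc_actor_loss(q, pi_a, data_a, bc_weight=linear_bc_weight(100, 100))
```

With the weight at 1.0 the cloning penalty contributes to the loss; once it reaches 0.0 only the normalized Q term remains.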


Guiding Offline Reinforcement Learning Using a Safety Expert
(Poster)
Offline reinforcement learning is used to train policies in situations where it is expensive or infeasible to access the environment during training. An agent trained under such a scenario does not get corrective feedback once the learned policy starts diverging and may fall prey to the overestimation bias commonly seen in this setting. This increases the chances of the agent choosing unsafe or risky actions, especially in states with sparse to no representation in the training dataset. In this paper, we propose to leverage a safety expert to discourage the offline RL agent from choosing unsafe actions in states that are underrepresented in the dataset. The proposed framework transfers the safety expert's knowledge in an offline setting for states with high uncertainty to prevent catastrophic failures from occurring in safety-critical domains. We use a simple but effective approach to quantify state uncertainty based on how frequently states appear in the training dataset. In states with high uncertainty, the offline RL agent mimics the safety expert while maximizing the long-term reward. We modify TD3+BC, an existing offline RL algorithm, as part of the proposed approach. We demonstrate empirically that our approach performs better than TD3+BC on some control tasks and comparably on others across two sets of benchmark datasets, while reducing the chance of taking unsafe actions in sparse regions of the state space.
Richa Verma · Kartik Bharadwaj · Harshad Khadilkar · Balaraman Ravindran


Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning
(Poster)
The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent's preferences over the objectives may not be known a priori, and hence, we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are twofold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds on and extends Decision Transformers via a novel preference-and-return-conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto front with appropriate conditioning, as measured by the hypervolume and sparsity metrics.
Baiting Zhu · Meihua Dang · Aditya Grover


Revisiting Bellman Errors for Offline Model Selection
(Poster)
Applying offline reinforcement learning in real-world settings necessitates the ability to tune hyperparameters offline, a task known as $\textit{offline model selection}$. It is well known that the empirical Bellman errors are poor predictors of value function estimation accuracy and policy performance. This has led researchers to abandon model selection procedures based on Bellman errors and instead focus on evaluating the expected return under policies of interest. The problem with this approach is that it can be very difficult to use an offline dataset generated by one policy to estimate the expected returns of a different policy. In contrast, we argue that Bellman errors can be useful for offline model selection, and that the discouraging results in past literature have been due to estimating and utilizing them incorrectly. We propose a new algorithm, $\textit{Supervised Bellman Validation}$, that estimates the expected squared Bellman error better than the empirical Bellman errors. We demonstrate the relative merits of our method over competing methods through both theoretical results and empirical results on datasets from the Atari benchmark. We hope that our results will challenge current attitudes and spur future research into Bellman errors and their utility in offline model selection.
Joshua Zitovsky · Rishabh Agarwal · Daniel de Marchi · Michael Kosorok


Boosting Offline Reinforcement Learning via Data Resampling
(Poster)
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. To address this problem, existing works mainly focus on designing sophisticated algorithms to explicitly or implicitly constrain the learned policy to be close to the behavior policy. The constraint applies not only to well-performing actions but also to inferior ones, which limits the upper bound of the learned policy. Instead of aligning the densities of two distributions, aligning the supports gives a relaxed constraint while still being able to avoid out-of-distribution actions. Therefore, we propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged. More specifically, we construct a better behavior policy by resampling each transition in an old dataset according to its episodic return. We dub our method Return-based Data Rebalance; it can be implemented with fewer than 10 lines of code change and adds negligible running time. Extensive experiments demonstrate that the method is effective at boosting offline RL performance and is orthogonal to decoupling strategies in long-tailed classification. New state-of-the-art results are achieved on the D4RL benchmark.
Yang Yue · Bingyi Kang · Xiao Ma · Zhongwen Xu · Gao Huang · Shuicheng Yan
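
Resampling each transition according to its episodic return, as this abstract describes, might look roughly like the sketch below. The min-max normalization, the additive `offset`, and the function names are assumptions made for illustration; the abstract does not specify the exact weighting. A strictly positive offset keeps every transition sampleable, so the support of the dataset is unchanged.

```python
import random


def return_based_probs(episodes, offset=0.1):
    """Per-transition sampling probabilities from episodic returns.

    episodes: list of episodes, each a list of (state, action, reward) tuples.
    offset: hypothetical positive constant so low-return data keeps nonzero mass.
    """
    if offset <= 0:
        raise ValueError("offset must be positive to preserve the data support")
    returns = [sum(r for (_, _, r) in ep) for ep in episodes]
    lo, hi = min(returns), max(returns)
    span = (hi - lo) or 1.0  # avoid division by zero when all returns are equal
    weights = []
    for ep, g in zip(episodes, returns):
        w = (g - lo) / span + offset
        weights.extend([w] * len(ep))  # every transition inherits its episode's weight
    total = sum(weights)
    return [w / total for w in weights]


def resample(episodes, n, offset=0.1, seed=0):
    """Draw n transitions with replacement using the return-based probabilities."""
    transitions = [t for ep in episodes for t in ep]
    probs = return_based_probs(episodes, offset)
    rng = random.Random(seed)
    return rng.choices(transitions, weights=probs, k=n)


# Example: one low-return episode (two steps) and one high-return episode (one step).
episodes = [[("s0", "a0", 0.0), ("s1", "a1", 0.0)], [("s2", "a2", 10.0)]]
probs = return_based_probs(episodes)
batch = resample(episodes, n=50)
```

The resampled batch can then be fed to any offline RL algorithm in place of uniform sampling from the original dataset.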


General policy mapping: online continual reinforcement learning inspired on the insect brain
(Poster)
We have developed a model for online continual reinforcement learning (RL) inspired by the insect brain. Our model leverages the offline training of a feature extraction layer and a common general policy layer to enable the convergence of RL algorithms in online settings. Sharing a common policy layer across tasks leads to positive backward transfer, where the agent continuously improves in older tasks sharing the same underlying general policy. Biologically inspired restrictions on the agent's network are key for the convergence of RL algorithms. This provides a pathway towards efficient online RL in resource-constrained scenarios.
Angel Yanguas-Gil · Sandeep Madireddy


Offline Reinforcement Learning with Closed-Form Policy Improvement Operators
(Poster)
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement (CFPI) operators. We make a novel observation that the behavior constraint naturally motivates the use of a first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian mixture and overcome the induced optimization difficulties by leveraging the LogSumExp's lower bound and Jensen's inequality, giving rise to our CFPI operators. We instantiate an offline RL algorithm with our novel policy improvement operator and empirically demonstrate its effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
Jiachen Li · Edwin Zhang · Ming Yin · Qinxun Bai · Yu-Xiang Wang · William Yang Wang


On- and Offline Multi-agent Reinforcement Learning for Disease Mitigation using Human Mobility Data
(Poster)
The COVID-19 pandemic generates new real-world data-driven problems such as predicting case surges, managing resource depletion, or modeling geospatial infection spreading. Though reinforcement learning (RL) has been previously proposed to optimize regional lockdowns, the availability of mobility tracking data with offline RL allows us to push decision making from the top-down perspective (i.e., driven by governments) to the bottom-up perspective (i.e., driven by individuals). Rather than predicting the outcome of the outbreak, we utilize offline RL as a tool, along with epidemic modeling, to empower collaborative decision-making at the individual level. In our investigations, we ask whether we can train the population of a city to become more resilient against infectious diseases. To investigate, we deploy a 'city' of 10,000 agents loaded with real visits to Points of Interest (POIs) (e.g., restaurants, gyms, parks) throughout a target metropolitan area during the COVID-19 pandemic (July 2020). Using a standard disease compartmental model, we find that the city of trained agents can reduce disease transmissions by 60%. This opens a new direction in using offline RL as a springboard to further the research at the intersection of artificial intelligence and disease mitigation.
Sofia Hurtado · Radu Marculescu


Contrastive Example-Based Control
(Poster)
While there are many real-world problems that might benefit from reinforcement learning, these problems rarely fit into the MDP mold: interacting with the environment is often prohibitively expensive and specifying reward functions is challenging. Motivated by these challenges, prior work has developed data-driven approaches that learn entirely from samples from the transition dynamics and examples of high-return states. These methods typically learn a reward function from the high-return states, use that reward function to label the transitions, and then apply an offline RL algorithm to these transitions. While these methods can achieve good results on many tasks, they can be complex, requiring careful regularization of the reward function and the use of temporal difference updates. In this paper, we propose a simple and scalable approach to offline example-based control. Unlike prior approaches (e.g., ORIL, VICE, PURL) that learn a reward function, our method learns an implicit model of multi-step transitions. We show that this implicit model can represent the Q-values for the example-based control problem. Thus, whereas a learned reward function must be combined with an RL algorithm to determine good actions, our model can directly be used to determine these good actions. Across a range of state-based and image-based offline control tasks, we find that our method outperforms baselines that use learned reward functions.
Kyle Hatch · Sarthak J Shetty · Benjamin Eysenbach · Tianhe Yu · Rafael Rafailov · Russ Salakhutdinov · Sergey Levine · Chelsea Finn


Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data
(Poster)
Offline RL is an important step towards making data-hungry RL algorithms more widely usable in the real world, but conventional assumptions on the distribution of logging data do not apply in some key real-world scenarios. In particular, it is unrealistic to assume that RL practitioners will have access to sets of trajectories that are simultaneously mutually independent and explore well. We propose two natural ways to relax these assumptions: by allowing the data to be distributed according to different logging policies independently, and by allowing logging policies to depend on past trajectories. We discuss Offline Policy Evaluation (OPE) in these settings, analyzing the performance of a model-based OPE estimator when the MDP is tabular.
Sunil Madhow · Dan Qiao · Yu-Xiang Wang


Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies
(Poster)
Reinforcement learning (RL) has shown great promise with algorithms learning in environments with large state and action spaces purely from scalar reward signals. A crucial challenge for current deep RL algorithms is that they require a tremendous amount of environment interactions for learning. This can be infeasible in situations where such interactions are expensive, such as in robotics. Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data without needing to interact with the environment from the very beginning. While online RL algorithms are typically evaluated as a function of the number of environment interactions, there exists no single established protocol for evaluating offline RL methods. In this paper, we propose a sequential approach to evaluate offline RL algorithms as a function of the training set size and thus by their data efficiency. Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset while also harmonizing the visualization of the offline and online learning phases. Our approach is generally applicable and easy to implement. We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.
Shivakanth Sujit · Pedro Braga · Jörg Bornschein · Samira Ebrahimi Kahou


Offline Policy Comparison with Confidence: Benchmarks and Baselines
(Poster)
Decision makers often wish to use offline historical data to compare sequential-action policies at various world states. Importantly, computational tools should produce confidence values for such offline policy comparison (OPC) to account for statistical variance and limited data coverage. Nevertheless, there is little work that directly evaluates the quality of confidence values for OPC. In this work, we address this issue by creating benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning. In addition, we present an empirical evaluation of the risk versus coverage trade-off for a class of model-based baselines. In particular, the baselines learn ensembles of dynamics models, which are used in various ways to produce simulations for answering queries with confidence values. While our results suggest advantages for certain baseline variations, there appears to be significant room for improvement in future work.
Anurag Koul · Mariano Phielipp · Alan Fern


Residual Model-Based Reinforcement Learning for Physical Dynamics
(Poster)
Dynamic control problems are a prevalent topic in robotics. Deep neural networks have been shown to accurately learn many complex dynamics, but these approaches remain data-inefficient or intractable in some tasks. Rather than learning to reproduce the environment dynamics, traditional control approaches use some physical knowledge to describe the environment's evolution. These approaches do not need many samples to be tuned, but they suffer from approximations and do not adapt to strong modifications of the environment. In this paper, we introduce a method to learn the parameters of a physical model, i.e., the parameters of an Ordinary Differential Equation (ODE), to best approximate the observed transitions. This model is completed with a residual data-driven term that reduces the reality gap between simple physical priors and complex environments. We also show that this approach can be naturally extended to the case of fine-tuning an implicit physical model trained on simple simulations.
Zakariae El Asri · Clément Rambour · Vincent Le Guen · Nicolas Thome


Raisin: Residual Algorithms for Versatile Offline Reinforcement Learning
(Poster)
The residual gradient algorithm (RG), gradient descent on the Mean Squared Bellman Error, brings robust convergence guarantees to bootstrapped value estimation. Meanwhile, the far more common semi-gradient algorithm (SG) suffers from well-known instabilities and divergence. Unfortunately, RG often converges slowly in practice. Baird (1995) proposed residual algorithms (RA), a weighted averaging of RG and SG, to combine RG's robust convergence and SG's speed. RA works moderately well in the online setting. We find, however, that RA works disproportionately well in the offline setting. Concretely, we find that merely adding a variable residual component to SAC increases its score on D4RL gym tasks by a median factor of 54. We further show that using the minimum of ten critics lets our algorithm match SAC-$N$'s state-of-the-art returns using 50$\times$ less compute and no additional hyperparameters. In contrast, TD3+BC with the same minimum-of-ten-critics trick does not match SAC-$N$'s returns on a handful of environments.
Braham Snyder · Yuke Zhu
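
The weighted averaging of RG and SG mentioned in this abstract can be sketched for a linear value function v(s) = w · phi(s). The mixing weight `eta`, the feature vectors, and the function name are illustrative assumptions; `eta = 0` recovers the semi-gradient update and `eta = 1` the residual gradient update. This is a toy value-estimation example, not the SAC-based algorithm from the paper.

```python
def residual_update(w, phi_s, phi_next, reward, gamma=0.99, eta=0.5, lr=0.1):
    """One residual-algorithm step for a linear value function.

    delta = r + gamma * v(s') - v(s)
    SG direction: phi(s)
    RG direction: phi(s) - gamma * phi(s')
    RA direction: (1 - eta) * SG + eta * RG = phi(s) - eta * gamma * phi(s')
    """
    v = sum(wi * p for wi, p in zip(w, phi_s))
    v_next = sum(wi * p for wi, p in zip(w, phi_next))
    delta = reward + gamma * v_next - v
    direction = [ps - eta * gamma * pn for ps, pn in zip(phi_s, phi_next)]
    return [wi + lr * delta * d for wi, d in zip(w, direction)]


# Example: one transition with one-hot features for s and s'.
w0 = [0.0, 0.0]
phi_s, phi_next = [1.0, 0.0], [0.0, 1.0]
w_sg = residual_update(w0, phi_s, phi_next, reward=1.0, gamma=0.5, eta=0.0)
w_rg = residual_update(w0, phi_s, phi_next, reward=1.0, gamma=0.5, eta=1.0)
w_ra = residual_update(w0, phi_s, phi_next, reward=1.0, gamma=0.5, eta=0.5)
```

In this example the semi-gradient step only changes the weight for s, while the residual gradient step also moves the weight for s' in the opposite direction; the mixed update lies between the two.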


Collaborative symmetricity exploitation for offline learning of hardware design solver
(Poster)
This paper proposes the collaborative symmetricity exploitation framework to train a solver for the decoupling capacitor placement problem (DPP), one of the significant hardware design problems. Due to the sequentially coupled multi-level property of the hardware design process, the design condition of DPP changes depending on the design of higher-level problems. Also, the online evaluation of real-world electrical performance through simulation is extremely costly. Thus, we propose a framework that allows data-efficient offline learning of a DPP solver (i.e., a contextualized policy) with high generalization capability over changing task conditions. Leveraging symmetricity for offline learning of a hardware design solver increases data efficiency by reducing the solution space and improves generalization capability by capturing the invariant nature present regardless of changing conditions. Extensive experiments verified that the proposed framework with zero-shot inference outperforms the neural baselines and iterative conventional design methods on the DPP benchmark. Furthermore, the proposed framework greatly outperformed the expert method used to generate the offline data for training.
Haeyeon Kim · Minsu Kim · Joungho Kim · Jinkyoo Park


SPRINT: Scalable Semantic Policy Pretraining via Language Instruction Relabeling
(Poster)
link »
SlidesLive Video » We propose SPRINT, an approach for scalable offline policy pretraining based on natural language instructions. SPRINT pretrains an agent’s policy to execute a diverse set of semantically meaningful skills that it can leverage to learn new tasks faster. Prior work on offline pretraining required tedious manual definition of pretraining tasks or learned semantically meaningless skills via random goalreaching. Instead, our approach SPRINT (Scalable Pretraining via Relabeling Language INsTructions) leverages natural language instruction labels on offline agent experience, collected at scale (e.g., via crowdsourcing), to define a rich set of tasks with minimal human effort. Furthermore, by using natural language to define tasks, SPRINT can use pretrained large language models to automatically expand the initial task set. By relabeling and aggregating task instructions, even across multiple training trajectories, we can learn a large set of new skills during pretraining. In experiments using a realistic household simulator, we show that agents pretrained with SPRINT learn new longhorizon household tasks substantially faster than with previous pretraining approaches. 
Jesse Zhang · Karl Pertsch · Jiahui Zhang · Taewook Nam · Sung Ju Hwang · Xiang Ren · Joseph Lim 🔗 


Bayesian Q-learning With Imperfect Expert Demonstrations
(Poster)
link »
SlidesLive Video » Guided exploration with expert demonstrations improves data efficiency for reinforcement learning, but current algorithms often overuse expert information. We propose a novel algorithm to speed up Q-learning with the help of a limited amount of imperfect expert demonstrations. The algorithm avoids excessive reliance on expert data by relaxing the optimal expert assumption and gradually reducing the usage of uninformative expert data. Experimentally, we evaluate our approach on a sparse-reward chain environment and six more complicated Atari games with delayed rewards. With the proposed method, we achieve better results than Deep Q-learning from Demonstrations (Hester et al., 2017) in most environments.
Fengdi Che · Xiru Zhu · Doina Precup · David Meger · Gregory Dudek 🔗 


Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning?
(Poster)
link »
Causal confusion is a phenomenon where an agent learns a policy that reflects imperfect spurious correlations in the data. The resulting causally confused behaviors may appear desirable during training but may fail at deployment. This problem gets exacerbated in domains such as robotics with potentially large gaps between open and closedloop performance of an agent. In such cases, a causally confused model may appear to perform well according to openloop metrics but fail catastrophically when deployed in the real world. In this paper, we conduct the first study of causal confusion in offline reinforcement learning and hypothesize that selectively sampling data points that may help disambiguate the underlying causal mechanism of the environment may alleviate causal confusion. To investigate this hypothesis, we consider a set of simulated setups to study causal confusion and the ability of active sampling schemes to reduce its effects. We provide empirical evidence that random and active sampling schemes are able to consistently reduce causal confusion as training progresses and that active sampling is able to do so more efficiently than random sampling. 
Gunshi Gupta · Tim G. J. Rudner · Rowan McAllister · Adrien Gaidon · Yarin Gal 🔗 


Trajectory-based Explainability Framework for Offline RL
(Poster)
link »
Explanation is a key component for the adoption of reinforcement learning (RL) in many realworld decisionmaking problems. In the literature, the explanation is often provided by saliency attribution to the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories encountered by it during training. To do so, we encode trajectories in offline training data individually as well as collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. Further, we demonstrate the effectiveness of the proposed approach in terms of quality of attributions as well as practical scalability in diverse environments that involve both discrete and continuous state and action spaces such as gridworlds, video games (Atari) and continuous control (MuJoCo). 
Shripad Deshmukh · Arpan Dasgupta · Chirag Agarwal · Nan Jiang · Balaji Krishnamurthy · Georgios Theocharous · Jayakumar Subramanian 🔗 


AMORE: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data
(Poster)
link »
We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (AMORE), which can robustly learn policies to improve upon an arbitrary baseline policy regardless of data coverage. Based on the concept of relative pessimism, AMORE is designed to optimize for the worst-case relative performance when facing uncertainty. In theory, we prove that the learned policy of AMORE never degrades the performance of the baseline policy with any admissible hyperparameter, and can learn to compete with the best policy within data coverage when the hyperparameter is well tuned and the baseline policy is supported by the data. Such a robust policy improvement property makes AMORE especially suitable for building real-world learning systems, because in practice ensuring no performance degradation is imperative before considering any benefit learning can bring.
Tengyang Xie · Mohak Bhardwaj · Nan Jiang · ChingAn Cheng 🔗 


Balanced Off-Policy Evaluation for Personalized Pricing
(Poster)
link »
We consider a feature-based pricing problem, where we have data consisting of feature information, historical pricing decisions, and binary realized demand. We wish to evaluate a new personalized pricing policy that maps features to prices. This problem is known as off-policy evaluation, and there is extensive literature on estimating the expected performance of the new policy. However, existing methods perform poorly when the logging policy has little exploration, which is common in pricing. We propose a novel method that exploits the special structure of pricing problems and incorporates downstream optimization problems when evaluating the new policy. We establish theoretical convergence guarantees, and we empirically demonstrate the advantage of our method using a real-world pricing dataset.
Adam N. Elmachtoub · Vishal Gupta · YUNFAN ZHAO 🔗 


ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation Learning
(Poster)
link »
Given a dataset of interactions with an environment of interest, a viable method to extract an agent policy is to estimate the maximum likelihood policy indicated by this data. This approach is commonly referred to as behavioral cloning (BC). In this work, we describe a key disadvantage of BC that arises due to the maximum likelihood objective function; namely, that BC is mean-seeking with respect to the state-conditional expert action distribution when the learner's policy is represented with a Gaussian. To address this issue, we develop a modified version of BC, Adversarial Behavioral Cloning (ABC), that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training. We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms BC by being mode-seeking in nature.
Eddy Hudson · Ishan Durugkar · Garrett Warnell · Peter Stone 🔗 
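
The mean-seeking behavior named in the abstract above can be seen in a small toy example. The sketch below is an illustration under assumed data (a hypothetical bimodal expert with actions near -1 and +1 in a single state); it is not taken from the paper's experiments.

```python
import numpy as np

def fit_gaussian_mle(actions):
    """Maximum-likelihood Gaussian fit, i.e. what BC with a Gaussian policy recovers."""
    mu = actions.mean()
    sigma = actions.std()  # the MLE uses the biased (ddof=0) estimator
    return mu, sigma

# Hypothetical expert: in the same state it picks an action near -1 or near +1.
rng = np.random.default_rng(0)
modes = rng.choice([-1.0, 1.0], size=10_000)
actions = modes + 0.05 * rng.standard_normal(10_000)

mu, sigma = fit_gaussian_mle(actions)
# mu lands between the two modes, on an action the expert never takes,
# and sigma inflates to cover both modes.
```

A mode-seeking learner would instead commit to one of the two clusters, which is the behavior the abstract attributes to ABC.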


Dynamics-Augmented Decision Transformer for Offline Dynamics Generalization
(Poster)
link »
Recent progress in offline reinforcement learning (RL) has shown that it is often possible to train strong agents without potentially unsafe or impractical online interaction. However, in real-world settings, agents may encounter unseen environments with different dynamics, and generalization ability is required. This work presents Dynamics-Augmented Decision Transformer (DADT), a simple yet efficient method to train generalizable agents from offline datasets; on top of a return-conditioned policy using the transformer architecture, we improve generalization capabilities by using representation learning based on next-state prediction. Our experimental results demonstrate that DADT outperforms prior state-of-the-art methods for offline dynamics generalization. Intriguingly, DADT without fine-tuning even outperforms fine-tuned baselines.
Changyeon Kim · Junsu Kim · Younggyo Seo · Kimin Lee · Honglak Lee · Jinwoo Shin 🔗 


Offline Reinforcement Learning on Real Robot with Realistic Data Sources
(Poster)
link »
SlidesLive Video » Offline Reinforcement Learning (ORL) provides a framework to train control policies from fixed suboptimal datasets, making it suitable for safetycritical applications like robotics. Despite significant algorithmic advances and benchmarking in simulation, the evaluation of ORL algorithms on realworld robot learning tasks has been limited. Since real robots are sensitive to details like sensor noises, reset conditions, demonstration sources, and test time distribution, it remains a question whether ORL is a competitive solution to real robotic challenges and what would characterize such tasks. We aim to address this deficiency through an empirical study of representative ORL algorithms on four tabletop manipulation tasks using a FrankaPanda robot arm. Our evaluation finds that for scenarios with sufficient indomain data of high quality, specialized ORL algorithms can be competitive with the behavior cloning approach. However, for scenarios that require outofdistribution generalization or task transfer, ORL algorithms can learn and generalize from offline heterogeneous datasets and outperform behavior cloning. Project URL: https://sites.google.com/view/realorlanon 
Gaoyue Zhou · Liyiming Ke · Siddhartha Srinivasa · Abhinav Gupta · Aravind Rajeswaran · Vikash Kumar 🔗 


Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows
(Poster)
link »
SlidesLive Video » Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well covered by the training data and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism, i.e., keeping the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. This Normalizing Flows action encoder is pretrained in a supervised manner on the offline dataset, and then an additional policy model (a controller in the latent space) is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets.
Dmitry Akimov · Alexander Nikulin · Vladislav Kurenkov · Denis Tarasov · Sergey Kolesnikov 🔗 


Matrix Estimation for Offline Evaluation in Reinforcement Learning with Low-Rank Structure
(Poster)
link »
We consider offline Reinforcement Learning (RL), where the agent does not interact with the environment and must rely on offline data collected using a behavior policy. Previous works provide policy evaluation guarantees when the target policy to be evaluated is covered by the behavior policy, that is, state-action pairs visited by the target policy must also be visited by the behavior policy. We show that when the MDP has a latent low-rank structure, this coverage condition can be relaxed. Building on the connection to weighted matrix completion with non-uniform observations, we propose an offline policy evaluation algorithm that leverages the low-rank structure to estimate the values of uncovered state-action pairs. Our algorithm does not require a known feature representation, and our finite-sample error bound involves a novel discrepancy measure quantifying the discrepancy between the behavior and target policies in the spectral space. We provide concrete examples where our algorithm achieves accurate estimation while existing coverage conditions are not satisfied.
Xumei Xi · Christina Yu · Yudong Chen 🔗 


Train Offline, Test Online: A Real Robot Learning Benchmark
(Poster)
link »
SlidesLive Video » Three challenges limit the progress of robot learning research: robots are expensive (few labs can participate), everyone uses different robots (findings do not generalize across labs), and we lack internetscale robotics data. We take on these challenges via a new benchmark: Train Offline, Test Online (TOTO). TOTO provides remote users with access to shared robots for evaluating methods on common tasks and an opensource dataset of these tasks for offline training. Its manipulation task suite requires challenging generalization to unseen objects, positions, and lighting. We present initial results on TOTO comparing five pretrained visual representations and four offline policy learning baselines, remotely contributed by five institutions. The real promise of TOTO, however, lies in the future: we release the benchmark for additional submissions from any user, enabling easy, direct comparison to several methods without the need to obtain hardware or collect data. 
Gaoyue Zhou · Victoria Dean · Mohan Kumar Srirama · Aravind Rajeswaran · Jyothish Pari · Kyle Hatch · Aryan Jain · Tianhe Yu · Pieter Abbeel · Lerrel Pinto · Chelsea Finn · Abhinav Gupta



Hybrid RL: Using both offline and online data can make RL efficient
(Poster)
link »
SlidesLive Video » We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via realworld online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q learning/iteration algorithm to the hybrid setting, which we call Hybrid QLearning or HyQ. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a highquality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that HyQ with neural network function approximation outperforms stateoftheart online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma’s Revenge. 
Yuda Song · Yifei Zhou · Ayush Sekhari · J. Bagnell · Akshay Krishnamurthy · Wen Sun 🔗 


Choreographer: Learning and Adapting Skills in Imagination
(Poster)
link »
SlidesLive Video » We present Choreographer, a modelbased agent that exploits its world model to learn and adapt skills in imagination. Choreographer is able to learn skills from offline unlabeled data and leverage them for effectively adapting to downstream tasks and for exploring the environment thoroughly, to find sparse rewards. Our method decouples the exploration and skill learning processes, being able to discover skills in the latent state space of the model. For adapting to downstream tasks, the agent uses a metacontroller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Project website: https://doubleblindrepos.github.io/ 
Pietro Mazzaglia · Tim Verbelen · Bart Dhoedt · Alexandre Lacoste · Sai Rajeswar Mudumba 🔗 


CORL: Research-oriented Deep Offline Reinforcement Learning Library
(Poster)
link »
SlidesLive Video » CORL is an open-source library that provides single-file implementations of Deep Offline Reinforcement Learning algorithms. It emphasizes a simple development experience with a straightforward codebase and a modern analysis tracking tool. In CORL, we isolate each method's implementation into a distinct single file, making performance-relevant details easier to recognise. Additionally, an experiment tracking feature is available to help log metrics, hyperparameters, dependencies, and more to the cloud. Finally, we have ensured the reliability of the implementations by evaluating them on the commonly employed D4RL benchmark.
Denis Tarasov · Alexander Nikulin · Dmitry Akimov · Vladislav Kurenkov · Sergey Kolesnikov 🔗 


Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size
(Poster)
link »
SlidesLive Video » Training large neural networks is known to be time-consuming, with the learning duration taking days or even weeks. To address this problem, large-batch optimization was introduced. This approach demonstrated that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude. While long training time was not typically a major issue for model-free deep offline RL algorithms, recently introduced Q-ensemble methods achieving state-of-the-art performance made this issue more relevant, notably extending the training duration. In this work, we demonstrate how this class of methods can benefit from large-batch optimization, which is commonly overlooked by the deep offline RL community. We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time, effectively shortening training duration by a factor of 2.5 on average.
Alexander Nikulin · Vladislav Kurenkov · Denis Tarasov · Dmitry Akimov · Sergey Kolesnikov 🔗 


Offline Reinforcement Learning for Customizable Visual Navigation
(Poster)
link »
Robotic navigation often requires not only reaching a distant goal, but also satisfying intermediate user preferences on the path, such as obeying the rules of the road or preferring some surfaces over others. Our goal in this paper is to devise a robotic navigation system that can utilize previously collected data to learn navigational strategies that are responsive to user-specified utility functions, such as preferring specific surfaces or staying in sunlight (e.g., to maintain solar power). To this end, we show how offline reinforcement learning can be used to learn reward-specific value functions for long-horizon navigation that can then be composed with planning methods to reach distant goals, while still remaining responsive to user-specified navigational preferences. This approach can utilize large amounts of previously collected data, which is relabeled with the task reward. This makes it possible to incorporate diverse data sources and enable effective generalization in the real world, without any simulation, task-specific data collection, or demonstrations. We evaluate our system, ReViND, using a large navigational dataset from prior work, without any data collection specifically for the reward functions that we test. We demonstrate that our system can control a real-world ground robot to navigate to distant goals using only offline training from this dataset, and exhibit behaviors that qualitatively differ based on the user-specified reward function.
Dhruv Shah · Arjun Bhorkar · Hrishit Leen · Ilya Kostrikov · Nicholas Rhinehart · Sergey Levine 🔗 


Efficient Planning in a Compact Latent Action Space
(Poster)
link »
Planning-based reinforcement learning has shown strong performance in tasks in discrete and low-dimensional continuous action spaces. However, planning usually brings significant computational overhead for decision making, so scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes from offline data. A VQ-VAE learns these codes, and its decoder serves as a novel dynamics model that takes latent actions and current state as input and reconstructs long-horizon trajectories. During inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency that is insensitive to the growing raw action dimensionality. For Adroit robotic hand manipulation tasks with high-dimensional continuous action space, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.
zhengyao Jiang · Tianjun Zhang · Michael Janner · Yueying (Lisa) Li · Tim Rocktäschel · Edward Grefenstette · Yuandong Tian 🔗 


User-Interactive Offline Reinforcement Learning
(Poster)
link »
SlidesLive Video » Offline reinforcement learning algorithms are still not fully trusted by practitioners due to the risk that the learned policy performs worse than the original policy that generated the dataset or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their arguably most important hyperparameter: the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above-mentioned issues simultaneously.
Phillip Swazinna · Steffen Udluft · Thomas Runkler 🔗 


Does Zero-Shot Reinforcement Learning Exist?
(Poster)
link »
A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards controllable agents that can follow arbitrary instructions in an environment. Current RL agents can solve families of related tasks at best, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) (Borsa et al., 2018) or forward-backward (FB) representations (Touati & Ollivier, 2021), but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and test the viability of zero-shot RL schemes systematically on tasks from the Unsupervised RL benchmark (Laskin et al., 2021). To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse dynamics, transition models, low-rank transition matrix, contrastive learning, or diversity (APS) perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.

Ahmed Touati · Jérémy Rapin · Yann Ollivier 🔗 
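
For readers unfamiliar with successor features, the zero-shot step in the abstract above can be sketched as follows. The sketch assumes the standard SF setup in which rewards are approximately linear in the elementary features, so a new task is identified by regressing rewards on features and then acting greedily on the dot product of successor features with the task vector. The function names and array shapes are hypothetical, and the learning of the features themselves (the part the paper studies) is not shown.

```python
import numpy as np

def infer_task_vector(features, rewards):
    """Least-squares fit of w such that rewards ~= features @ w.

    features: (num_samples, d) elementary state features (assumed given).
    rewards:  (num_samples,) rewards observed for the new task.
    """
    w, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    return w

def greedy_action(successor_features, w):
    """Pick the action maximizing Q(s, a) = psi(s, a) @ w.

    successor_features: (num_actions, d) array of psi(s, a) for the current
    state, assumed to come from a pretrained, task-conditioned SF model.
    """
    return int(np.argmax(successor_features @ w))
```

No planning or gradient step is needed at test time in this scheme: a small labeled reward sample is enough to pick `w`, which is what makes the approach zero-shot in the sense used above.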


State Advantage Weighting for Offline RL
(Poster)
link »
We present state advantage weighting for offline reinforcement learning (RL). In contrast to the action advantage $A(s,a)$ that we commonly adopt in QSA learning, we leverage the state advantage $A(s,s^\prime)$ and QSS learning for offline RL, hence decoupling the action from values. The agent is expected to reach a high-reward state, and the action is determined by how the agent can get to that state. Experiments on D4RL datasets show that our proposed method can achieve remarkable performance against the common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online.

Jiafei Lyu · aicheng Gong · Le Wan · Zongqing Lu · Xiu Li 🔗 


Optimal Transport for Offline Imitation Learning
(Poster)
link »
SlidesLive Video » With the advent of large datasets, offline reinforcement learning is a promising framework for learning good decision-making policies without the need to interact with the real environment. However, offline RL requires the dataset to be reward-annotated, which presents practical challenges when reward engineering is difficult or when obtaining reward annotations is labor-intensive. In this paper, we introduce Optimal Transport Reward labeling (OTR), an algorithm that can assign rewards to offline trajectories, with a few high-quality demonstrations. OTR's key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration to obtain a similarity measure that can be interpreted as a reward, which can then be used by an offline RL algorithm to learn the policy. OTR is easy to implement and computationally efficient. On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards.
Yicheng Luo · zhengyao Jiang · Samuel Cohen · Edward Grefenstette · Marc Deisenroth 🔗 
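
The alignment-as-reward idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the OTR implementation: it assumes a Euclidean cost between states, uniform weights over time steps, and entropy-regularized optimal transport solved with plain Sinkhorn iterations as a stand-in solver. The function name and all parameters are hypothetical.

```python
import numpy as np

def ot_rewards(agent_states, expert_states, eps=0.05, n_iters=200):
    """Label each step of an unlabeled trajectory with an OT-based reward.

    agent_states:  (n, d) states of the unlabeled trajectory.
    expert_states: (m, d) states of one expert demonstration.
    Returns an (n,) array; values closer to 0 mean a better match.
    """
    agent_states = np.asarray(agent_states, dtype=float)
    expert_states = np.asarray(expert_states, dtype=float)
    n, m = len(agent_states), len(expert_states)
    # Pairwise Euclidean cost (an assumption; other costs are possible).
    diff = agent_states[:, None, :] - expert_states[None, :, :]
    cost = np.linalg.norm(diff, axis=-1)
    # Scale the regularizer by the largest cost to keep exp() well behaved.
    scale = cost.max() if cost.max() > 0 else 1.0
    kernel = np.exp(-cost / (eps * scale))
    a = np.full(n, 1.0 / n)  # uniform mass over agent steps
    b = np.full(m, 1.0 / m)  # uniform mass over expert steps
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):  # Sinkhorn scaling iterations
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    # Reward for step t is the negative transport cost charged to that step.
    return -(plan * cost).sum(axis=1)
```

The resulting per-step rewards can be attached to the trajectory and handed to any offline RL algorithm, which is the role the abstract assigns to the labeled rewards.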


Control Graph as Unified IO for Morphology-Task Generalization
(Poster)
link »
SlidesLive Video » The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align the input-output (IO) interface among multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce the control graph, which treats observations, actions and goals/tasks in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and a hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find that a control graph representation coupled with a Transformer architecture improves multi-task performance compared to other baselines including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests that large diverse offline datasets, unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.
Hiroki Furuta · Yusuke Iwasawa · Yutaka Matsuo · Shixiang (Shane) Gu 🔗 


Mutual Information Regularized Offline Reinforcement Learning
(Poster)
link »
Offline reinforcement learning (RL) aims at learning an effective policy from offline datasets without active interactions with the environment. The major challenge of offline RL is the distribution shift that appears when outofdistribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy for deviating from the behavior policy during policy improvement or making conservative updates for value functions during policy evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. Intuitively, mutual information measures the mutual dependence of actions and states, which reflects how a behavior agent reacts to certain environment states during data collection. To effectively utilize this information to facilitate policy learning, MISA constructs lower bounds of mutual information parameterized by the policy and Qvalues. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a onestep improved policy on the offline dataset. In this way, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding a mutual information regularization. MISA is a general offline RL framework that unifies conservative Qlearning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. Our experiments show that MISA performs significantly better than existing methods and achieves new stateoftheart on various tasks of the D4RL benchmark. 
Xiao Ma · Bingyi Kang · Zhongwen Xu · Min Lin · Shuicheng Yan 🔗 


Uncertainty-Driven Pessimistic Q-Ensemble for Offline-to-Online Reinforcement Learning
(Poster)
link »
Reusing existing offline reinforcement learning (RL) agents is an emerging topic for reducing the dominant computational cost for exploration in many settings. To effectively fine-tune the pretrained offline policies, both offline samples and online interactions may be leveraged. In this paper, we propose the idea of incorporating a pessimistic Q-ensemble and an uncertainty quantification technique to effectively fine-tune offline agents. To stabilize online Q-function estimates during fine-tuning, the proposed method uses uncertainty estimation as a penalization for a replay buffer with a mixture of online interactions from the ensemble agent and offline samples from the behavioral policies. In various robotic tasks on the D4RL benchmark, we show that our method outperforms the state-of-the-art algorithms in terms of the average return and the sample efficiency.
Ingook Jang · Seonghyun Kim 🔗 
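
One common way to turn ensemble disagreement into a penalty, shown here only as an illustration of the idea in the abstract above, is to subtract a multiple of the ensemble standard deviation from the ensemble mean. The exact penalty used by the authors is not stated in the abstract, so the form below, the function name, and `beta` are assumptions.

```python
import numpy as np

def pessimistic_q(q_values, beta=1.0):
    """Uncertainty-penalized Q estimate from an ensemble.

    q_values: (num_critics, batch) array of Q estimates for the same
    state-action pairs. beta is a hypothetical penalty coefficient.
    """
    q_values = np.asarray(q_values, dtype=float)
    mean = q_values.mean(axis=0)
    std = q_values.std(axis=0)  # disagreement across critics as uncertainty
    return mean - beta * std
```

Where the critics agree the estimate is left nearly untouched, and where they disagree (typically on poorly covered state-action pairs) the estimate is lowered.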


Offline Robot Reinforcement Learning with Uncertainty-Guided Human Expert Sampling
(Poster)
link »
Recent advances in batch (offline) reinforcement learning have shown promising results towards learning from available offline data and proved offline RL to be an essential toolkit in learning control policies in a modelfree setting. An offline reinforcement learning algorithm applied to a dataset collected by a suboptimal nonlearningbased algorithm can result in a policy that outperforms the behavior agent used to collect the data. Such a scenario is frequent in robotics, where existing automation is collecting operational data. Although offline learning techniques can learn from data generated by a suboptimal behavior agent, there is still an opportunity to improve the sample complexity of existing offline RL algorithms by strategically introducing human demonstration data into the training process. To this end, we propose a novel approach that uses uncertainty estimation to trigger the injection of human demonstration data and guide policy training towards optimal behavior while reducing overall sample complexity. Our experiments show that this approach is more sample efficient when compared to a naive way of combining expert data with data collected from a suboptimal agent. We augmented an existing offline reinforcement learning algorithm Conservative QLearning (CQL) with our approach and performed experiments on data collected from MuJoCo and OffWorld Gym learning environments. 
Ashish Kumar · Ilya Kuzovkin 🔗 


Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation
(Poster)
link »
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the \emph{reward-free} exploration setting. This is a well-motivated problem because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even if the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest.

Dan Qiao · Yu-Xiang Wang 🔗 


Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
(Poster)
link »
SlidesLive Video »
Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories.

Jason Yecheng Ma · Shagun Sodhani · Dinesh Jayaraman · Osbert Bastani · Vikash Kumar · Amy Zhang 🔗 


Imitation from Observation With Bootstrapped Contrastive Learning
(Poster)
link »
Imitation from observation is a paradigm that consists of training agents using observations of expert demonstrations without direct access to the actions. Depending on the problem configuration, these demonstrations can be sequences of states or raw visual observations. One of the most common procedures adopted to solve this problem is to train a reward function from the demonstrations, but this task still remains a significant challenge. We approach this problem with a method of agent behavior representation in a latent space using demonstration videos. Our approach exploits recent algorithms of contrastive learning of image and video and uses a bootstrapping method to progressively train a trajectory encoding function with respect to the variation of the agent policy. This function is then used to compute the rewards provided to a standard Reinforcement Learning (RL) algorithm. Our method uses only a limited number of videos produced by an expert and we do not have access to the expert policy function. Our experiments show promising results on a set of continuous control tasks and demonstrate that learning a behavior encoder from videos allows building an efficient reward function for the agent.
Medric Sonwa · Johanna Hansen · Eugene Belilovsky 🔗 


Provable Benefits of Representational Transfer in Reinforcement Learning
(Poster)
link »
We study the problem of representational transfer in RL, where an agent first pretrains offline in a number of source tasks to discover a shared representation, which is subsequently used to learn a good policy online in a target task. We propose a new notion of task relatedness between source and target tasks and develop a novel approach for representational transfer under this assumption. Concretely, we show that given generative access to a set of source tasks, we can discover a representation, using which subsequent linear RL techniques quickly converge to a near-optimal policy, with only online access to the target task. The sample complexity is close to knowing the ground truth features in the target task and comparable to prior representation learning results in the source tasks. We complement our positive results with lower bounds without generative access and validate our findings with empirical evaluation on rich observation MDPs that require deep exploration.
Alekh Agarwal · Yuda Song · Kaiwen Wang · Mengdi Wang · Wen Sun · Xuezhou Zhang 🔗 


A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning
(Poster)
link »
As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting. One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage-weighted regression and conditional behavioral cloning, truncate policy iteration after just one step. This "early stopping" makes one-step RL simple and stable, but can limit its asymptotic performance. Critic regularization typically requires more compute but has appealing lower-bound guarantees. In this paper, we draw a close connection between these methods: applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL. While practical implementations violate our assumptions and critic regularization is typically applied with smaller regularization coefficients, our experiments nevertheless show that our analysis makes accurate, testable predictions about practical offline RL methods (CQL and one-step RL) with commonly used hyperparameters. Our results do not imply that every problem can be solved with a single step of policy improvement, but rather that one-step RL might be competitive with critic regularization on RL problems that demand strong regularization.
Benjamin Eysenbach · Matthieu Geist · Sergey Levine · Russ Salakhutdinov 🔗 


Offline evaluation in RL: soft stability weighting to combine fitted Q-learning and model-based methods
(Poster)
link »
The goal of offline policy evaluation (OPE) is to evaluate target policies based on logged data under a different distribution. Because no one method is uniformly best, model selection is important, but difficult without online exploration. We propose soft stability weighting (SSW) for adaptively combining offline estimates from ensembles of fitted-Q-evaluation (FQE) and model-based evaluation methods generated by different random initializations of neural networks. Soft stability weighting computes a state-action-conditional weighted average of the median FQE and model-based prediction by normalizing the state-action-conditional standard deviation of ensembles of both methods relative to the average standard deviation of each method. Therefore it compares the relative stability of predictions in the ensemble to the perturbations from random initializations, drawn from a truncated normal distribution scaled by the input feature size.
Briton Park · Xian Wu · Bin Yu · Angela Zhou 🔗 


Using Confounded Data in Offline RL
(Poster)
link »
In this work we consider the problem of confounding in offline RL, also called the delusion problem. While it is known that learning from purely offline data is a hazardous endeavor in the presence of confounding, in this paper we show that offline, confounded data can be safely combined with online, non-confounded data to improve the sample-efficiency of model-based RL. We import ideas from the well-established framework of $do$-calculus to express model-based RL as a causal inference problem, thus bridging the fields of RL and causality. We propose a latent-based method which we prove is correct and efficient, in the sense that it attains better generalization guarantees thanks to the offline, confounded data (in the asymptotic case), regardless of the expert's behavior. We illustrate the effectiveness of our method on a series of synthetic experiments.

Maxime Gasse · Damien GRASSET · Guillaume Gaudron · Pierre-Yves Oudeyer 🔗 


Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement
(Poster)
link »
SlidesLive Video »
Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of underlying entities that take the value of object states. Worse, these entities are often unknown and must be inferred from sensory percepts. We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured inputs. By constructing a factorized transition graph over clusters of object representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, which outperforms current offline deep RL methods when evaluated on a set of simulated rearrangement and stacking tasks.
Michael Chang · Alyssa L Dayan · Franziska Meier · Tom Griffiths · Sergey Levine · Amy Zhang 🔗 


Visual Backtracking Teleoperation: A Data Collection Protocol for Offline Image-Based RL
(Poster)
link »
SlidesLive Video »
We consider how to most efficiently leverage teleoperator time to collect data for learning robust image-based value functions and policies for sparse reward robotic tasks. To accomplish this goal, we modify the process of data collection to include more than just successful demonstrations of the desired task. Instead we develop a novel protocol that we call Visual Backtracking Teleoperation (VBT), which deliberately collects a dataset of visually similar failures, recoveries, and successes. VBT data collection is particularly useful for efficiently learning accurate value functions from small datasets of image-based observations. We demonstrate VBT on a real robot to perform continuous control from image observations for the deformable manipulation task of T-shirt grasping. We find that by adjusting the data collection process we improve the quality of both the learned value functions and policies over a variety of baseline methods for data collection. Specifically, we find that offline reinforcement learning on VBT data outperforms standard behavior cloning on successful demonstration data by 13% when both methods are given equal-sized datasets of 60 minutes of data from the real robot.
David Brandfonbrener · Stephen Tu · Avi Singh · Stefan Welker · Chad Boodoo · Nikolai Matni · Jake Varley 🔗 


Towards Data-Driven Offline Simulations for Online Reinforcement Learning
(Poster)
link »
Modern decision-making systems, from robots to web recommendation engines, are expected to adapt: to user preferences, changing circumstances or even new tasks. Yet, it is still uncommon to deploy a dynamically learning agent (rather than a fixed policy) to a production system, as it's perceived as unsafe. Using historical data to reason about learning algorithms, similar to offline policy evaluation (OPE) applied to fixed policies, could help practitioners evaluate and ultimately deploy such adaptive agents to production. In this work, we formalize offline learner simulation (OLS) for reinforcement learning (RL) and propose a novel evaluation protocol that measures both fidelity and efficiency. For environments with complex high-dimensional observations, we propose a semiparametric approach that leverages recent advances in latent state discovery. In preliminary experiments, we show the advantage of our approach compared to fully nonparametric baselines.
Shengpu Tang · Felipe Vieira Frujeri · Dipendra Misra · Alex Lamb · John Langford · Paul Mineiro · Sebastian Kochman 🔗 


Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction
(Poster)
link »
We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of an evaluation policy, $\pi_e$, using a fixed dataset, $\mathcal{D}$, collected by one or more policies that may be different from $\pi_e$. Current OPE algorithms may produce poor OPE estimates under policy distribution shift, i.e., when the probability of a particular state-action pair occurring under $\pi_e$ is very different from the probability of that same pair occurring in $\mathcal{D}$ (Voloshin et al. 2021, Fu et al. 2021). In this work, we propose to improve the accuracy of OPE estimation by projecting the ground state-space into a lower-dimensional state-space using concepts from the state abstraction literature in RL. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms, which compute distribution correction ratios to produce their OPE estimate. In the original state-space, these ratios may have high variance, which may lead to high-variance OPE. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance, resulting in lower-variance OPE. We then present a minimax optimization problem that incorporates the state abstraction. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and be more robust to hyperparameter tuning than the ground ratios.

Brahma Pavse · Josiah Hanna 🔗 


Benchmarking Offline Reinforcement Learning Algorithms for E-Commerce Order Fraud Evaluation
(Poster)
link »
Amazon and other e-commerce sites must employ mechanisms to protect their millions of customers from fraud, such as unauthorized use of credit cards. One such mechanism is order fraud evaluation, where systems evaluate orders for fraud risk, and either “pass” the order, or take an action to mitigate high risk. Order fraud evaluation systems typically use binary classification models that distinguish fraudulent and legitimate orders, to assess risk and take action. We seek to devise a system that considers both financial losses of fraud and long-term customer satisfaction, which may be impaired when incorrect actions are applied to legitimate customers. We propose that taking actions to optimize long-term impact can be formulated as a Reinforcement Learning (RL) problem. Standard RL methods require online interaction with an environment to learn, but this is not desirable in high-stakes applications like order fraud evaluation. Offline RL algorithms learn from logged data collected from the environment, without the need for online interaction, making them suitable for our use case. We show that offline RL methods outperform traditional binary classification solutions in SimStore, a simplified e-commerce simulation that incorporates order fraud risk. We also propose a novel approach to training offline RL policies that adds a new loss term during training, to better align policy exploration with taking correct actions.
Soysal Degirmenci · Christopher S Jones 🔗 


Sparse Q-Learning: Offline Reinforcement Learning with Implicit Value Regularization
(Poster)
link »
Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy, as computing Q-values using out-of-distribution actions will suffer from errors due to distributional shift. The recently proposed \textit{In-sample Learning} paradigm (e.g., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the \textit{Implicit Value Regularization} (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose a practical algorithm, which uses the same value regularization as CQL, but in a complete in-sample manner. Compared with IQL, we find that our algorithm introduces sparsity in learning the value function; we thus dub our method Sparse Q-learning (SQL). We verify the effectiveness of SQL on D4RL benchmark datasets. We also show the benefits of sparsity by comparing SQL with IQL in noisy data regimes and show the robustness of in-sample learning by comparing SQL with CQL in small data regimes. Under all settings, SQL achieves better results and converges faster than the other baselines.
Haoran Xu · Li Jiang · Li Jianxiong · Zhuoran Yang · Zhaoran Wang · Xianyuan Zhan 🔗 
Author Information
Aviral Kumar (UC Berkeley)
Rishabh Agarwal (Google Research, Brain Team)
My research work mainly revolves around deep reinforcement learning (RL), often with the goal of making RL methods suitable for real-world problems, and includes an outstanding paper award at NeurIPS.
Aravind Rajeswaran (FAIR)
Wenxuan Zhou (CMU)
George Tucker (Google Brain)
Doina Precup (McGill University / Mila / DeepMind Montreal)
More from the Same Authors

2021 Spotlight: Neural Additive Models: Interpretable Machine Learning with Neural Nets »
Rishabh Agarwal · Levi Melnick · Nicholas Frosst · Xuezhou Zhang · Ben Lengerich · Rich Caruana · Geoffrey Hinton 
2021 : Data Sharing without Rewards in Multi-Task Offline Reinforcement Learning »
Tianhe Yu · Aviral Kumar · Yevgen Chebotar · Chelsea Finn · Sergey Levine · Karol Hausman 
2021 : Should I Run Offline Reinforcement Learning or Behavioral Cloning? »
Aviral Kumar · Joey Hong · Anikait Singh · Sergey Levine 
2021 : DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization »
Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine 
2021 : Offline Policy Selection under Uncertainty »
Mengjiao (Sherry) Yang · Bo Dai · Ofir Nachum · George Tucker · Dale Schuurmans 
2021 : CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery »
Misha Laskin · Hao Liu · Xue Bin Peng · Denis Yarats · Aravind Rajeswaran · Pieter Abbeel 
2021 : Behavior Predictive Representations for Generalization in Reinforcement Learning »
Siddhant Agarwal · Aaron Courville · Rishabh Agarwal 
2021 : Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL »
Catherine Cang · Aravind Rajeswaran · Pieter Abbeel · Misha Laskin 
2021 : Single-Shot Pruning for Offline Reinforcement Learning »
Samin Yeasar Arnob · · Sergey Plis · Doina Precup 
2021 : Importance of Empirical Sample Complexity Analysis for Offline Reinforcement Learning »
Samin Yeasar Arnob · Riashat Islam · Doina Precup 
2022 : A Novel Stochastic Gradient Descent Algorithm for Learning Principal Subspaces »
Charline Le Lan · Joshua Greaves · Jesse Farebrother · Mark Rowland · Fabian Pedregosa · Rishabh Agarwal · Marc Bellemare 
2022 : The Paradox of Choice: On the Role of Attention in Hierarchical Reinforcement Learning »
Andrei Nica · Khimya Khetarpal · Doina Precup 
2022 : Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks »
Jesse Farebrother · Joshua Greaves · Rishabh Agarwal · Charline Le Lan · Ross Goroshin · Pablo Samuel Castro · Marc Bellemare 
2022 : Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes »
Aviral Kumar · Rishabh Agarwal · XINYANG GENG · George Tucker · Sergey Levine 
2022 : Pre-Training for Robots: Leveraging Diverse Multitask Data via Offline Reinforcement Learning »
Aviral Kumar · Anikait Singh · Frederik Ebert · Yanlai Yang · Chelsea Finn · Sergey Levine 
2022 : Offline Reinforcement Learning from Heteroskedastic Data Via Support Constraints »
Anikait Singh · Aviral Kumar · Quan Vuong · Yevgen Chebotar · Sergey Levine 
2022 : Multi-Environment Pretraining Enables Transfer to Action Limited Datasets »
David Venuto · Mengjiao (Sherry) Yang · Pieter Abbeel · Doina Precup · Igor Mordatch · Ofir Nachum 
2022 : Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios »
Yiren Lu · Justin Fu · George Tucker · Xinlei Pan · Eli Bronstein · Rebecca Roelofs · Benjamin Sapp · Brandyn White · Aleksandra Faust · Shimon Whiteson · Dragomir Anguelov · Sergey Levine 
2022 : Train Offline, Test Online: A Real Robot Learning Benchmark »
Gaoyue Zhou · Victoria Dean · Mohan Kumar Srirama · Aravind Rajeswaran · Jyothish Pari · Kyle Hatch · Aryan Jain · Tianhe Yu · Pieter Abbeel · Lerrel Pinto · Chelsea Finn · Abhinav Gupta 
2022 : Real World Offline Reinforcement Learning with Realistic Data Source »
Gaoyue Zhou · Liyiming Ke · Siddhartha Srinivasa · Abhinav Gupta · Aravind Rajeswaran · Vikash Kumar 
2022 : Confidence-Conditioned Value Functions for Offline Reinforcement Learning »
Joey Hong · Aviral Kumar · Sergey Levine 
2022 : Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting »
Qiyang Li · Aviral Kumar · Ilya Kostrikov · Sergey Levine 
2022 : Revisiting Bellman Errors for Offline Model Selection »
Joshua Zitovsky · Rishabh Agarwal · Daniel de Marchi · Michael Kosorok 
2022 : Bayesian Q-learning With Imperfect Expert Demonstrations »
Fengdi Che · Xiru Zhu · Doina Precup · David Meger · Gregory Dudek 
2022 : Offline Reinforcement Learning on Real Robot with Realistic Data Sources »
Gaoyue Zhou · Liyiming Ke · Siddhartha Srinivasa · Abhinav Gupta · Aravind Rajeswaran · Vikash Kumar 
2022 : Complete the Missing Half: Augmenting Aggregation Filtering with Diversification for Graph Convolutional Networks »
Sitao Luan · Mingde Zhao · Chenqing Hua · Xiao-Wen Chang · Doina Precup 
2022 : Revisiting Bellman Errors for Offline Model Selection »
Joshua Zitovsky · Daniel de Marchi · Rishabh Agarwal · Michael Kosorok 
2022 : Pre-Training for Robots: Leveraging Diverse Multitask Data via Offline Reinforcement Learning »
Anikait Singh · Aviral Kumar · Frederik Ebert · Yanlai Yang · Chelsea Finn · Sergey Levine 
2022 : Policy Architectures for Compositional Generalization in Control »
Allan Zhou · Vikash Kumar · Chelsea Finn · Aravind Rajeswaran 
2022 : MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations »
Nicklas Hansen · Yixin Lin · Hao Su · Xiaolong Wang · Vikash Kumar · Aravind Rajeswaran 
2022 : Investigating Multitask Pretraining and Generalization in Reinforcement Learning »
Adrien Ali Taiga · Rishabh Agarwal · Jesse Farebrother · Aaron Courville · Marc Bellemare 
2022 : Ilya Kostrikov, Aviral Kumar »
Ilya Kostrikov · Aviral Kumar 
2022 Spotlight: Lightning Talks 3B-3 »
Sitao Luan · Zhiyuan You · Ruofan Liu · Linhao Qu · Yuwei Fu · Jiaxi Wang · Chunyu Wei · Jian Liang · xiaoyuan luo · Di Wu · Yun Lin · Lei Cui · Ji Wu · Chenqing Hua · Yujun Shen · Qincheng Lu · XIANGLIN YANG · Benoit Boulet · Manning Wang · Di Liu · Lei Huang · Fei Wang · Kai Yang · Jiaqi Zhu · Jin Song Dong · Zhijian Song · Xin Lu · Mingde Zhao · Shuyuan Zhang · Yu Zheng · Xiao-Wen Chang · Xinyi Le · Doina Precup 
2022 Spotlight: Revisiting Heterophily For Graph Neural Networks »
Sitao Luan · Chenqing Hua · Qincheng Lu · Jiaqi Zhu · Mingde Zhao · Shuyuan Zhang · Xiao-Wen Chang · Doina Precup 
2022 : Simulating Human Gaze with Neural Visual Attention »
Leo Schwinn · Doina Precup · Bjoern Eskofier · Dario Zanca 
2022 : Democratizing RL Research by Reusing Prior Computation »
Rishabh Agarwal 
2022 Poster: Oracle Inequalities for Model Selection in Offline Reinforcement Learning »
Jonathan N Lee · George Tucker · Ofir Nachum · Bo Dai · Emma Brunskill 
2022 Poster: Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress »
Rishabh Agarwal · Max Schwarzer · Pablo Samuel Castro · Aaron Courville · Marc Bellemare 
2022 Poster: Revisiting Heterophily For Graph Neural Networks »
Sitao Luan · Chenqing Hua · Qincheng Lu · Jiaqi Zhu · Mingde Zhao · Shuyuan Zhang · Xiao-Wen Chang · Doina Precup 
2022 Poster: DASCO: Dual-Generator Adversarial Support Constrained Offline Reinforcement Learning »
Quan Vuong · Aviral Kumar · Sergey Levine · Yevgen Chebotar 
2022 Poster: Unsupervised Reinforcement Learning with Contrastive Intrinsic Control »
Michael Laskin · Hao Liu · Xue Bin Peng · Denis Yarats · Aravind Rajeswaran · Pieter Abbeel 
2022 Poster: Data-Driven Model-Based Optimization via Invariant Representation Learning »
Han Qi · Yi Su · Aviral Kumar · Sergey Levine 
2022 Poster: Continuous MDP Homomorphisms and Homomorphic Policy Gradient »
Sahand Rezaei-Shoshtari · Rosie Zhao · Prakash Panangaden · David Meger · Doina Precup 
2021 : Speaker Intro »
Aviral Kumar · George Tucker 
2021 : Retrospective Panel »
Sergey Levine · Nando de Freitas · Emma Brunskill · Finale Doshi-Velez · Nan Jiang · Rishabh Agarwal 
2021 : Invited Speaker Panel »
Sham Kakade · Minmin Chen · Philip Thomas · Angela Schoellig · Barbara Engelhardt · Doina Precup · George Tucker 
2021 : Speaker Intro »
Rishabh Agarwal · Aviral Kumar 
2021 Workshop: Offline Reinforcement Learning »
Rishabh Agarwal · Aviral Kumar · George Tucker · Justin Fu · Nan Jiang · Doina Precup 
2021 : Opening Remarks »
Rishabh Agarwal · Aviral Kumar 
2021 : Data-Driven Offline Optimization for Architecting Hardware Accelerators »
Aviral Kumar · Amir Yazdanbakhsh · Milad Hashemi · Kevin Swersky · Sergey Levine 
2021 : DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization Q&A »
Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine 
2021 Poster: Visual Adversarial Imitation Learning using Variational Models »
Rafael Rafailov · Tianhe Yu · Aravind Rajeswaran · Chelsea Finn 
2021 Poster: COMBO: Conservative Offline Model-Based Policy Optimization »
Tianhe Yu · Aviral Kumar · Rafael Rafailov · Aravind Rajeswaran · Sergey Levine · Chelsea Finn 
2021 Poster: Decision Transformer: Reinforcement Learning via Sequence Modeling »
Lili Chen · Kevin Lu · Aravind Rajeswaran · Kimin Lee · Aditya Grover · Misha Laskin · Pieter Abbeel · Aravind Srinivas · Igor Mordatch 
2021 Poster: Coupled Gradient Estimators for Discrete Latent Variables »
Zhe Dong · Andriy Mnih · George Tucker 
2021 Oral: Deep Reinforcement Learning at the Edge of the Statistical Precipice »
Rishabh Agarwal · Max Schwarzer · Pablo Samuel Castro · Aaron Courville · Marc Bellemare 
2021 Poster: Neural Additive Models: Interpretable Machine Learning with Neural Nets »
Rishabh Agarwal · Levi Melnick · Nicholas Frosst · Xuezhou Zhang · Ben Lengerich · Rich Caruana · Geoffrey Hinton 
2021 Poster: Conservative Data Sharing for Multi-Task Offline Reinforcement Learning »
Tianhe Yu · Aviral Kumar · Yevgen Chebotar · Karol Hausman · Sergey Levine · Chelsea Finn 
2021 Poster: Reinforcement Learning with Latent Flow »
Wenling Shang · Xiaofei Wang · Aravind Srinivas · Aravind Rajeswaran · Yang Gao · Pieter Abbeel · Misha Laskin 
2021 Poster: Deep Reinforcement Learning at the Edge of the Statistical Precipice »
Rishabh Agarwal · Max Schwarzer · Pablo Samuel Castro · Aaron Courville · Marc Bellemare 
2021 Poster: Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability »
Dibya Ghosh · Jad Rahme · Aviral Kumar · Amy Zhang · Ryan Adams · Sergey Levine 
2020 : Design-Bench: Benchmarks for Data-Driven Offline Model-Based Optimization »
Brandon Trabucco · Aviral Kumar · Xinyang Geng · Sergey Levine
2020 : Conservative Objective Models: A Simple Approach to Effective Model-Based Optimization »
Brandon Trabucco · Aviral Kumar · Xinyang Geng · Sergey Levine
2020 : Closing remarks »
Raymond Chua · Feryal Behbahani · Julie J Lee · Rui Ponte Costa · Doina Precup · Blake Richards · Ida Momennejad 
2020 : Invited Talk #7 QnA - Yael Niv »
Yael Niv · Doina Precup · Raymond Chua · Feryal Behbahani 
2020 : Contributed Talk 5: Latent Action Space for Offline Reinforcement Learning »
Wenxuan Zhou 
2020 : Speaker Introduction: Yael Niv »
Doina Precup · Raymond Chua · Feryal Behbahani 
2020 : Contributed Talk #3: Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning »
Rishabh Agarwal · Marlos C. Machado · Pablo Samuel Castro · Marc Bellemare 
2020 : Panel »
Emma Brunskill · Nan Jiang · Nando de Freitas · Finale Doshi-Velez · Sergey Levine · John Langford · Lihong Li · George Tucker · Rishabh Agarwal · Aviral Kumar
2020 Workshop: Offline Reinforcement Learning »
Aviral Kumar · Rishabh Agarwal · George Tucker · Lihong Li · Doina Precup
2020 : Introduction »
Aviral Kumar · George Tucker · Rishabh Agarwal 
2020 : Panel Discussions »
Grace Lindsay · George Konidaris · Shakir Mohamed · Kimberly Stachenfeld · Peter Dayan · Yael Niv · Doina Precup · Catherine Hartley · Ishita Dasgupta 
2020 Workshop: Biological and Artificial Reinforcement Learning »
Raymond Chua · Feryal Behbahani · Julie J Lee · Sara Zannone · Rui Ponte Costa · Blake Richards · Ida Momennejad · Doina Precup 
2020 : Organizers Opening Remarks »
Raymond Chua · Feryal Behbahani · Julie J Lee · Ida Momennejad · Rui Ponte Costa · Blake Richards · Doina Precup 
2020 : Keynote: Doina Precup »
Doina Precup 
2020 Workshop: Object Representations for Learning and Reasoning »
William Agnew · Rim Assouel · Michael Chang · Antonia Creswell · Eliza Kosoy · Aravind Rajeswaran · Sjoerd van Steenkiste 
2020 Poster: Model Inversion Networks for Model-Based Optimization »
Aviral Kumar · Sergey Levine 
2020 Poster: Reward Propagation Using Graph Convolutional Networks »
Martin Klissarov · Doina Precup 
2020 Spotlight: Reward Propagation Using Graph Convolutional Networks »
Martin Klissarov · Doina Precup 
2020 Poster: RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning »
Caglar Gulcehre · Ziyu Wang · Alexander Novikov · Thomas Paine · Sergio Gómez · Konrad Zolna · Rishabh Agarwal · Josh Merel · Daniel Mankowitz · Cosmin Paduraru · Gabriel Dulac-Arnold · Jerry Li · Mohammad Norouzi · Matthew Hoffman · Nicolas Heess · Nando de Freitas
2020 Poster: DisARM: An Antithetic Gradient Estimator for Binary Latent Variables »
Zhe Dong · Andriy Mnih · George Tucker 
2020 Spotlight: DisARM: An Antithetic Gradient Estimator for Binary Latent Variables »
Zhe Dong · Andriy Mnih · George Tucker 
2020 Poster: Conservative Q-Learning for Offline Reinforcement Learning »
Aviral Kumar · Aurick Zhou · George Tucker · Sergey Levine 
2020 Tutorial: (Track3) Offline Reinforcement Learning: From Algorithm Design to Practical Applications Q&A »
Sergey Levine · Aviral Kumar 
2020 Poster: One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL »
Saurabh Kumar · Aviral Kumar · Sergey Levine · Chelsea Finn 
2020 Poster: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay »
Scott Fujimoto · David Meger · Doina Precup 
2020 Poster: Forethought and Hindsight in Credit Assignment »
Veronica Chelu · Doina Precup · Hado van Hasselt 
2020 Poster: MOReL: Model-Based Offline Reinforcement Learning »
Rahul Kidambi · Aravind Rajeswaran · Praneeth Netrapalli · Thorsten Joachims 
2020 Poster: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction »
Aviral Kumar · Abhishek Gupta · Sergey Levine 
2020 Spotlight: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction »
Aviral Kumar · Abhishek Gupta · Sergey Levine 
2020 Tutorial: (Track3) Offline Reinforcement Learning: From Algorithm Design to Practical Applications »
Sergey Levine · Aviral Kumar 
2019 : Panel Session: A new hope for neuroscience »
Yoshua Bengio · Blake Richards · Timothy Lillicrap · Ila Fiete · David Sussillo · Doina Precup · Konrad Kording · Surya Ganguli 
2019 : Poster Presentations »
Rahul Mehta · Andrew Lampinen · Binghong Chen · Sergio Pascual-Diaz · Jordi Grau-Moya · Aldo Faisal · Jonathan Tompson · Yiren Lu · Khimya Khetarpal · Martin Klissarov · Pierre-Luc Bacon · Doina Precup · Thanard Kurutach · Aviv Tamar · Pieter Abbeel · Jinke He · Maximilian Igl · Shimon Whiteson · Wendelin Boehmer · Raphaël Marinier · Olivier Pietquin · Karol Hausman · Sergey Levine · Chelsea Finn · Tianhe Yu · Lisa Lee · Benjamin Eysenbach · Emilio Parisotto · Eric Xing · Ruslan Salakhutdinov · Hongyu Ren · Anima Anandkumar · Deepak Pathak · Christopher Lu · Trevor Darrell · Alexei Efros · Phillip Isola · Feng Liu · Bo Han · Gang Niu · Masashi Sugiyama · Saurabh Kumar · Janith Petangoda · Johan Ferret · James McClelland · Kara Liu · Animesh Garg · Robert Lange
2019 : Poster Session »
Matthia Sabatelli · Adam Stooke · Amir Abdi · Paulo Rauber · Leonard Adolphs · Ian Osband · Hardik Meisheri · Karol Kurach · Johannes Ackermann · Matt Benatan · Guo Zhang · Chen Tessler · Dinghan Shen · Mikayel Samvelyan · Riashat Islam · Murtaza Dalal · Luke Harries · Andrey Kurenkov · Konrad Żołna · Sudeep Dasari · Kristian Hartikainen · Ofir Nachum · Kimin Lee · Markus Holzleitner · Vu Nguyen · Francis Song · Christopher Grimm · Felipe Leno da Silva · Yuping Luo · Yifan Wu · Alex Lee · Thomas Paine · Wei-Yang Qu · Daniel Graves · Yannis Flet-Berliac · Yunhao Tang · Suraj Nair · Matthew Hausknecht · Akhil Bagaria · Simon Schmitt · Bowen Baker · Paavo Parmas · Benjamin Eysenbach · Lisa Lee · Siyu Lin · Daniel Seita · Abhishek Gupta · Riley Simmons-Edler · Yijie Guo · Kevin Corder · Vikash Kumar · Scott Fujimoto · Adam Lerer · Ignasi Clavera Gilaberte · Nicholas Rhinehart · Ashvin Nair · Ge Yang · Lingxiao Wang · Sungryull Sohn · J. Fernando Hernandez-Garcia · Xian Yeow Lee · Rupesh Srivastava · Khimya Khetarpal · Chenjun Xiao · Luckeciano Carvalho Melo · Rishabh Agarwal · Tianhe Yu · Glen Berseth · Devendra Singh Chaplot · Jie Tang · Anirudh Srinivasan · Tharun Kumar Reddy Medini · Aaron Havens · Misha Laskin · Asier Mujika · Rohan Saphal · Joseph Marino · Alex Ray · Joshua Achiam · Ajay Mandlekar · Zhuang Liu · Danijar Hafner · Zhiwen Tang · Ted Xiao · Michael Walton · Jeff Druce · Ferran Alet · Zhang-Wei Hong · Stephanie Chan · Anusha Nagabandi · Hao Liu · Hao Sun · Ge Liu · Dinesh Jayaraman · John Co-Reyes · Sophia Sanborn
2019 : Contributed Talks »
Rishabh Agarwal · Adam Gleave · Kimin Lee 
2019 : Poster Spotlight 2 »
Aaron Sidford · Mengdi Wang · Lin Yang · Yinyu Ye · Zuyue Fu · Zhuoran Yang · Yongxin Chen · Zhaoran Wang · Ofir Nachum · Bo Dai · Ilya Kostrikov · Dale Schuurmans · Ziyang Tang · Yihao Feng · Lihong Li · Denny Zhou · Qiang Liu · Rodrigo Toro Icarte · Ethan Waldie · Toryn Klassen · Rick Valenzano · Margarita Castro · Simon Du · Sham Kakade · Ruosong Wang · Minshuo Chen · Tianyi Liu · Xingguo Li · Zhaoran Wang · Tuo Zhao · Philip Amortila · Doina Precup · Prakash Panangaden · Marc Bellemare 
2019 : Panel Discussion »
Richard Sutton · Doina Precup 
2019 : Poster and Coffee Break 1 »
Aaron Sidford · Aditya Mahajan · Alejandro Ribeiro · Alex Lewandowski · Ali H Sayed · Ambuj Tewari · Angelika Steger · Anima Anandkumar · Asier Mujika · Hilbert J Kappen · Bolei Zhou · Byron Boots · Chelsea Finn · Chen-Yu Wei · Chi Jin · Ching-An Cheng · Christina Yu · Clement Gehring · Craig Boutilier · Dahua Lin · Daniel McNamee · Daniel Russo · David Brandfonbrener · Denny Zhou · Devesh Jha · Diego Romeres · Doina Precup · Dominik Thalmeier · Eduard Gorbunov · Elad Hazan · Elena Smirnova · Elvis Dohmatob · Emma Brunskill · Enrique Munoz de Cote · Ethan Waldie · Florian Meier · Florian Schaefer · Ge Liu · Gergely Neu · Haim Kaplan · Hao Sun · Hengshuai Yao · Jalaj Bhandari · James A Preiss · Jayakumar Subramanian · Jiajin Li · Jieping Ye · Jimmy Smith · Joan Bas Serrano · Joan Bruna · John Langford · Jonathan Lee · Jose A. Arjona-Medina · Kaiqing Zhang · Karan Singh · Yuping Luo · Zafarali Ahmed · Zaiwei Chen · Zhaoran Wang · Zhizhong Li · Zhuoran Yang · Ziping Xu · Ziyang Tang · Yi Mao · David Brandfonbrener · Shirli Di-Castro · Riashat Islam · Zuyue Fu · Abhishek Naik · Saurabh Kumar · Benjamin Petit · Angeliki Kamoutsi · Simone Totaro · Arvind Raghunathan · Rui Wu · Donghwan Lee · Dongsheng Ding · Alec Koppel · Hao Sun · Christian Tjandraatmadja · Mahdi Karami · Jincheng Mei · Chenjun Xiao · Junfeng Wen · Zichen Zhang · Ross Goroshin · Mohammad Pezeshki · Jiaqi Zhai · Philip Amortila · Shuo Huang · Mariya Vasileva · El houcine Bergou · Adel Ahmadyan · Haoran Sun · Sheng Zhang · Lukas Gruber · Yuanhao Wang · Tetiana Parshakova
2019 : Invited Talk: Hierarchical Reinforcement Learning: Computational Advances and Neuroscience Connections »
Doina Precup 
2019 : Panel Discussion led by Grace Lindsay »
Grace Lindsay · Blake Richards · Doina Precup · Jacqueline Gottlieb · Jeff Clune · Jane Wang · Richard Sutton · Angela Yu · Ida Momennejad 
2019 : Poster Session »
Ahana Ghosh · Javad Shafiee · Akhilan Boopathy · Alex Tamkin · Theodoros Vasiloudis · Vedant Nanda · Ali Baheri · Paul Fieguth · Andrew Bennett · Guanya Shi · Hao Liu · Arushi Jain · Jacob Tyo · Benjie Wang · Boxiao Chen · Carroll Wainwright · Chandramouli Shama Sastry · Chao Tang · Daniel S. Brown · David Inouye · David Venuto · Dhruv Ramani · Dimitrios Diochnos · Divyam Madaan · Dmitrii Krashenikov · Joel Oren · Doyup Lee · Eleanor Quint · Elmira Amirloo · Matteo Pirotta · Gavin Hartnett · Geoffroy Dubourg-Felonneau · Gokul Swamy · Pin-Yu Chen · Ilija Bogunovic · Jason Carter · Javier Garcia-Barcos · Jeet Mohapatra · Jesse Zhang · Jian Qian · John Martin · Oliver Richter · Federico Zaiter · Tsui-Wei Weng · Karthik Abinav Sankararaman · Kyriakos Polymenakos · Lan Hoang · Mahdieh Abbasi · Marco Gallieri · Mathieu Seurin · Matteo Papini · Matteo Turchetta · Matthew Sotoudeh · Mehrdad Hosseinzadeh · Nathan Fulton · Masatoshi Uehara · Niranjani Prasad · Oana-Maria Camburu · Patrik Kolaric · Philipp Renz · Prateek Jaiswal · Reazul Hasan Russel · Riashat Islam · Rishabh Agarwal · Alexander Aldrick · Sachin Vernekar · Sahin Lale · Sai Kiran Narayanaswami · Samuel Daulton · Sanjam Garg · Sebastian East · Shun Zhang · Soheil Dsidbari · Justin Goodwin · Victoria Krakovna · Wenhao Luo · Wesley Chung · Yuanyuan Shi · Yuh-Shyang Wang · Hongwei Jin · Ziping Xu
2019 : Opening Remarks »
Raymond Chua · Feryal Behbahani · Sara Zannone · Rui Ponte Costa · Claudia Clopath · Doina Precup · Blake Richards 
2019 Workshop: Biological and Artificial Reinforcement Learning »
Raymond Chua · Sara Zannone · Feryal Behbahani · Rui Ponte Costa · Claudia Clopath · Blake Richards · Doina Precup 
2019 Poster: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction »
Aviral Kumar · Justin Fu · George Tucker · Sergey Levine 
2019 Poster: Graph Normalizing Flows »
Jenny Liu · Aviral Kumar · Jimmy Ba · Jamie Kiros · Kevin Swersky 
2019 Poster: Energy-Inspired Models: Learning with Sampler-Induced Distributions »
Dieterich Lawson · George Tucker · Bo Dai · Rajesh Ranganath 
2019 Poster: Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse »
James Lucas · George Tucker · Roger Grosse · Mohammad Norouzi 
2019 Poster: Break the Ceiling: Stronger Multiscale Deep Graph Convolutional Networks »
Sitao Luan · Mingde Zhao · Xiao-Wen Chang · Doina Precup
2019 Poster: Meta-Learning with Implicit Gradients »
Aravind Rajeswaran · Chelsea Finn · Sham Kakade · Sergey Levine 
2018 : Spotlights »
Guangneng Hu · Ke Li · Aviral Kumar · Phi Vu Tran · Samuel G. Fadel · Rita Kuznetsova · Bong-Nam Kang · Behrouz Haji Soleimani · Jinwon An · Nathan de Lara · Anjishnu Kumar · Tillman Weyde · Melanie Weber · Kristen Altenburger · Saeed Amizadeh · Xiaoran Xu · Yatin Nandwani · Yang Guo · Maria Pacheco · William Fedus · Guillaume Jaume · Yuka Yoneda · Yunpu Ma · Yunsheng Bai · Berk Kapicioglu · Maximilian Nickel · Fragkiskos Malliaros · Beier Zhu · Aleksandar Bojchevski · Joshua Joseph · Gemma Roig · Esma Balkir · Xander Steenbrugge
2018 Poster: Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion »
Jacob Buckman · Danijar Hafner · George Tucker · Eugene Brevdo · Honglak Lee 
2018 Oral: Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion »
Jacob Buckman · Danijar Hafner · George Tucker · Eugene Brevdo · Honglak Lee 
2018 Poster: Temporal Regularization for Markov Decision Process »
Pierre Thodoroff · Audrey Durand · Joelle Pineau · Doina Precup 
2018 Poster: Learning Safe Policies with Expert Guidance »
Jessie Huang · Fa Wu · Doina Precup · Yang Cai 
2017 : Panel Discussion »
Matt Botvinick · Emma Brunskill · Marcos Campos · Jan Peters · Doina Precup · David Silver · Josh Tenenbaum · Roy Fox 
2017 : Progress on Deep Reinforcement Learning with Temporal Abstraction (Doina Precup) »
Doina Precup 
2017 : Doina Precup »
Doina Precup 
2017 Workshop: Hierarchical Reinforcement Learning »
Andrew G Barto · Doina Precup · Shie Mannor · Tom Schaul · Roy Fox · Carlos Florensa 
2017 Poster: REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models »
George Tucker · Andriy Mnih · Chris J Maddison · John Lawson · Jascha Sohl-Dickstein
2017 Oral: REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models »
George Tucker · Andriy Mnih · Chris J Maddison · John Lawson · Jascha Sohl-Dickstein
2017 Poster: Filtering Variational Objectives »
Chris Maddison · John Lawson · George Tucker · Nicolas Heess · Mohammad Norouzi · Andriy Mnih · Arnaud Doucet · Yee Teh 
2017 Poster: Towards Generalization and Simplicity in Continuous Control »
Aravind Rajeswaran · Kendall Lowrey · Emanuel Todorov · Sham Kakade 
2016 Workshop: The Future of Interactive Machine Learning »
Kory Mathewson · Kaushik Subramanian · Mark Ho · Robert Loftin · Joseph L Austerweil · Anna Harutyunyan · Doina Precup · Layla El Asri · Matthew Gombolay · Jerry Zhu · Sonia Chernova · Charles Isbell · Patrick M Pilarski · Weng-Keen Wong · Manuela Veloso · Julie A Shah · Matthew Taylor · Brenna Argall · Michael Littman
2015 Poster: Data Generation as Sequential Decision Making »
Philip Bachman · Doina Precup 
2015 Spotlight: Data Generation as Sequential Decision Making »
Philip Bachman · Doina Precup 
2015 Poster: Basis refinement strategies for linear value function approximation in MDPs »
Gheorghe Comanici · Doina Precup · Prakash Panangaden 
2014 Workshop: From Bad Models to Good Policies (Sequential Decision Making under Uncertainty) »
Odalric-Ambrym Maillard · Timothy A Mann · Shie Mannor · Jeremie Mary · Laurent Orseau · Thomas Dietterich · Ronald Ortner · Peter Grünwald · Joelle Pineau · Raphael Fonteneau · Georgios Theocharous · Esteban D Arcaute · Christos Dimitrakakis · Nan Jiang · Doina Precup · Pierre-Luc Bacon · Marek Petrik · Aviv Tamar
2014 Poster: Optimizing Energy Production Using Policy Search and Predictive State Representations »
Yuri Grinberg · Doina Precup · Michel Gendreau 
2014 Poster: Learning with Pseudo-Ensembles »
Philip Bachman · Ouais Alsharif · Doina Precup 
2014 Spotlight: Optimizing Energy Production Using Policy Search and Predictive State Representations »
Yuri Grinberg · Doina Precup · Michel Gendreau 
2013 Poster: Learning from Limited Demonstrations »
Beomjoon Kim · Amirmassoud Farahmand · Joelle Pineau · Doina Precup 
2013 Poster: Bellman Error Based Feature Generation using Random Projections on Sparse Spaces »
Mahdi Milani Fard · Yuri Grinberg · Amirmassoud Farahmand · Joelle Pineau · Doina Precup 
2013 Spotlight: Learning from Limited Demonstrations »
Beomjoon Kim · Amirmassoud Farahmand · Joelle Pineau · Doina Precup 
2012 Poster: Value Pursuit Iteration »
Amirmassoud Farahmand · Doina Precup 
2012 Poster: Online Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization »
Andre S Barreto · Doina Precup · Joelle Pineau 
2011 Poster: Reinforcement Learning using Kernel-Based Stochastic Factorization »
Andre S Barreto · Doina Precup · Joelle Pineau 
2009 Poster: Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation »
Hamid R Maei · Csaba Szepesvari · Shalabh Bhatnagar · Doina Precup · David Silver · Richard Sutton
2009 Spotlight: Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation »
Hamid R Maei · Csaba Szepesvari · Shalabh Bhatnagar · Doina Precup · David Silver · Richard Sutton
2008 Poster: Bounding Performance Loss in Approximate MDP Homomorphisms »
Doina Precup · Jonathan Taylor · Prakash Panangaden