Workshop
Offline Reinforcement Learning
Rishabh Agarwal · Aviral Kumar · George Tucker · Justin Fu · Nan Jiang · Doina Precup

Tue Dec 14 09:00 AM -- 06:20 PM (PST)

Offline reinforcement learning (RL) is a re-emerging area of study that aims to learn behaviors using only logged data, such as data from previous experiments or human demonstrations, without further environment interaction. It has the potential to make tremendous progress in a number of real-world decision-making problems where active data collection is expensive (e.g., in robotics, drug discovery, dialogue generation, recommendation systems) or unsafe/dangerous (e.g., healthcare, autonomous driving, or education). Such a paradigm promises to resolve a key challenge to bringing reinforcement learning algorithms out of constrained lab settings to the real world. The first edition of the offline RL workshop, held at NeurIPS 2020, focused on and led to algorithmic development in offline RL. This year we propose to shift the focus from algorithm design to bridging the gap between offline RL research and real-world offline RL. Our aim is to create a space for discussion between researchers and practitioners on topics of importance for enabling offline RL methods in the real world. To that end, we have revised the topics and themes of the workshop, invited new speakers working on application-focused areas, and, building on last year's lively panel discussion, invited last year's panelists to participate in a retrospective panel on their changing perspectives.

For details on submission please visit: https://offline-rl-neurips.github.io/2021 (Submission deadline: October 6, Anywhere on Earth)

Speakers:
Aviv Tamar (Technion - Israel Inst. of Technology)
Angela Schoellig (University of Toronto)
Barbara Engelhardt (Princeton University)
Philip S. Thomas (UMass Amherst)

Tue 9:00 a.m. - 9:10 a.m. Opening Remarks Rishabh Agarwal · Aviral Kumar
Tue 9:10 a.m. - 9:40 a.m. Learning to Explore From Data (Talk) Aviv Tamar
Tue 9:40 a.m. - 9:45 a.m. Q&A for Aviv Tamar (Q&A) Aviv Tamar
Tue 9:45 a.m. - 9:55 a.m. Contributed Talk 1: What Matters in Learning from Offline Human Demonstrations for Robot Manipulation (Talk) Ajay Mandlekar
Tue 10:00 a.m. - 10:10 a.m. Contributed Talk 2: What Would the Expert do?: Causal Imitation Learning (Talk) Gokul Swamy
Tue 10:15 a.m. - 10:25 a.m. Contributed Talk 3: Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation (Talk) Yunzong Xu · Akshay Krishnamurthy · David Simchi-Levi
Tue 10:30 a.m. - 10:40 a.m. Contributed Talk 4: PulseRL: Enabling Offline Reinforcement Learning for Digital Marketing Systems via Conservative Q-Learning (Talk) Luckeciano Carvalho Melo
Tue 10:40 a.m. - 11:45 a.m. Poster Session 1 (Poster Session) https://eventhosts.gather.town/app/cIcBclk1rC3IuihY/neurips-template-8
Tue 11:45 a.m. - 11:46 a.m. Speaker Intro (Speaker Introduction) Rishabh Agarwal · Aviral Kumar
Tue 11:46 a.m. - 12:16 p.m. Offline RL for Robotics (Talk) Angela Schoellig
Tue 12:16 p.m. - 12:21 p.m. Q&A for Angela Schoellig (Q&A)
Tue 12:21 p.m. - 12:22 p.m. Speaker Intro (Live short intro) Rishabh Agarwal · Aviral Kumar
Tue 12:22 p.m. - 12:52 p.m. Generalization theory in Offline RL (Talk) Sham Kakade
Tue 12:52 p.m. - 12:57 p.m. Q&A for Sham Kakade (Q&A) Sham Kakade
Tue 1:00 p.m. - 2:00 p.m. Invited Speaker Panel (Discussion Panel) Sham Kakade · Minmin Chen · Philip Thomas · Angela Schoellig · Barbara Engelhardt · Doina Precup · George Tucker
Tue 2:00 p.m. - 3:00 p.m. Retrospective Panel (Discussion Panel) Sergey Levine · Nando de Freitas · Emma Brunskill · Finale Doshi-Velez · Nan Jiang · Rishabh Agarwal
Tue 3:00 p.m. - 3:01 p.m. Speaker Intro Aviral Kumar · George Tucker
Tue 3:01 p.m. - 3:31 p.m. Offline RL for recommendation systems (Talk) Minmin Chen
Tue 3:31 p.m. - 3:36 p.m. Q&A for Minmin Chen (Q&A) Minmin Chen
Tue 4:06 p.m. - 4:07 p.m. Speaker Intro Aviral Kumar · George Tucker
Tue 4:07 p.m. - 4:37 p.m. Offline Reinforcement Learning for Hospital Patients When Every Patient is Different (Talk) Barbara Engelhardt
Tue 4:37 p.m. - 4:42 p.m. Q&A for Barbara Engelhardt (Q&A)
Tue 4:42 p.m. - 4:43 p.m. Speaker Intro (Introduction)
Tue 4:43 p.m. - 5:13 p.m. Advances in (High-Confidence) Off-Policy Evaluation (Talk) Philip Thomas
Tue 5:13 p.m. - 5:19 p.m. Q&A for Philip Thomas (Q&A) Philip Thomas
Tue 5:19 p.m. - 5:20 p.m. Closing Remarks & Poster Session (Closing Remarks)
Tue 5:20 p.m. - 6:20 p.m. Poster Session 2 (Poster Session) https://eventhosts.gather.town/app/cIcBclk1rC3IuihY/neurips-template-8

- Offline Reinforcement Learning with Soft Behavior Regularization (Poster)
Most prior approaches to offline reinforcement learning (RL) utilize behavior regularization, typically augmenting existing off-policy actor critic algorithms with a penalty measuring divergence between the policy and the offline data. However, these approaches lack guaranteed performance improvement over the behavior policy. In this work, starting from the performance difference between the learned policy and the behavior policy, we derive a new policy learning objective that can be used in the offline setting, which corresponds to the advantage function value of the behavior policy multiplied by a state-marginal density ratio. We propose a practical way to compute the density ratio and demonstrate its equivalence to a state-dependent behavior regularization. Unlike state-independent regularization used in prior approaches, this soft regularization allows more freedom of policy deviation at high confidence states, leading to better performance and stability. We thus term our resulting algorithm Soft Behavior-regularized Actor Critic (SBAC).
Our experimental results show that SBAC matches or outperforms the state-of-the-art on a set of continuous control locomotion and manipulation tasks.
Haoran Xu · Xianyuan Zhan · Li Jianxiong · Honglei Yin

- Instance-dependent Offline Reinforcement Learning: From tabular RL to linear MDPs (Poster)
We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs. Prior works derive the information-theoretical lower bounds based on different data-coverage assumptions, and their upper bounds are expressed by covering coefficients, which lack an explicit characterization of system quantities. In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches equation (1). We also prove an information-theoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the intrinsic offline reinforcement learning bound since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the assumption-free regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
Ming Yin · Yu-Xiang Wang

- DCUR: Data Curriculum for Teaching via Samples with Reinforcement Learning (Poster)
Deep reinforcement learning (RL) has shown great empirical successes, but suffers from brittleness and sample inefficiency. A potential remedy is to use a previously-trained policy as a source of supervision. In this work, we refer to these policies as teachers and study how to transfer their expertise to new student policies by focusing on data usage. We propose a framework, Data CUrriculum for Reinforcement learning (DCUR), which first trains teachers using online deep RL, and stores the logged environment interaction history. Then, students learn either by running offline RL or by using teacher data in combination with a small amount of self-generated data. DCUR’s central idea involves defining a class of data curricula which, as a function of training time, limits the student to sampling from a fixed subset of the full teacher data. We test teachers and students using state-of-the-art deep RL algorithms across a variety of data curricula. Results suggest that the choice of data curricula significantly impacts student learning, and that it is beneficial to limit the data during early training stages while gradually letting the data availability grow over time. We identify when the student can learn offline and match teacher performance without relying on specialized offline RL algorithms. Furthermore, we show that collecting a small fraction of online data provides complementary benefits with the data curriculum. Supplementary material is available at https://sites.google.com/view/anon-dcur/.
Daniel Seita · Abhinav Gopal · Mandi Zhao · John Canny

- What Matters in Learning from Offline Human Demonstrations for Robot Manipulation (Poster)
Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities.
While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods makes assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation. Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available. Upon acceptance, we will open-source our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Additional results and videos at https://sites.google.com/view/offline-demo-study
Ajay Mandlekar · Danfei Xu · Josiah Wong · Chen Wang · Li Fei-Fei · Silvio Savarese · Yuke Zhu · Roberto Martín-Martín

- TiKick: Toward Playing Multi-agent Football Full Games from Single-agent Demonstrations (Poster)
Deep reinforcement learning (DRL) has achieved super-human performance on complex video games (e.g., StarCraft II and Dota II). However, current DRL systems still suffer from challenges of multi-agent coordination, sparse rewards, stochastic environments, etc.
In seeking to address these challenges, we employ a football video game, namely Google Research Football (GRF), as our testbed and develop an end-to-end learning-based AI system (denoted as TiKick) to complete this challenging task. In this work, we first generated a large replay dataset from the self-play of single-agent experts, which are obtained from league training. We then developed a new offline algorithm to learn a powerful multi-agent AI from the fixed single-agent dataset. To the best of our knowledge, TiKick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or experiment on toy academic scenarios. Extensive experiments further show that our pre-trained model can accelerate the training process of the modern multi-agent algorithm and our method achieves state-of-the-art performance on various academic scenarios.
Shiyu Huang · Wenze Chen · Longfei Zhang · Shizhen Xu · Ziyang Li · Fengming Zhu · Deheng Ye · Ting Chen · Jun Zhu

- d3rlpy: An Offline Deep Reinforcement Learning Library (Poster)
In this paper, we introduce d3rlpy, an open-sourced offline deep reinforcement learning (RL) library for Python. d3rlpy supports a number of offline deep RL algorithms as well as online algorithms via a user-friendly API. To assist deep RL research and development projects, d3rlpy provides practical and unique features such as data collection, exporting policies for deployment, preprocessing and postprocessing, distributional Q-functions, multi-step learning and a convenient command-line interface. Furthermore, d3rlpy provides a novel graphical interface that enables users to train offline RL algorithms without writing code. Lastly, the implemented algorithms are benchmarked with D4RL datasets to ensure the implementation quality.
Takuma Seno · Michita Imai

- PulseRL: Enabling Offline Reinforcement Learning for Digital Marketing Systems via Conservative Q-Learning (Poster)
Digital Marketing Systems (DMS) are the primary point of contact between a digital business and its customers. In this context, the communication channel optimization problem poses a valuable and still open challenge for DMS. Due to its interactive nature, Reinforcement Learning (RL) appears as a promising formulation for this problem. However, the standard RL setting learns from interacting with the environment, which is costly and dangerous for production systems. Furthermore, it also fails to learn from historical interactions due to the distributional shift between the collection and learning policies. To address this, we present PulseRL, an offline RL-based production system for communication channel optimization built upon the Conservative Q-Learning (CQL) Framework. The PulseRL architecture comprises the whole engineering pipeline (data processing, training, deployment, and monitoring), scaling to handle millions of users. Using CQL, PulseRL learns from historical logs, and its learning objective reduces the shift problem by mitigating the overestimation bias from out-of-distribution actions. We conducted experiments in a real-world DMS. Results show that PulseRL surpasses RL baselines by a significant margin in the online evaluation. They also validate the theoretical properties of CQL in a complex scenario with high sampling error and non-linear function approximation.
Luckeciano Carvalho Melo · Luana G B Martins · Bryan Lincoln de Oliveira · Bruno Brandão · Douglas Winston Soares · Telma Lima

- Latent Geodesics of Model Dynamics for Offline Reinforcement Learning (Poster)
Model-based offline reinforcement learning approaches generally rely on bounds of model error. While contemporary methods achieve such bounds through an ensemble of models, we propose to estimate them using a data-driven latent metric.
In particular, we build upon recent advances in Riemannian geometry of generative models to construct a latent metric of an encoder-decoder based forward model. Our proposed metric measures both the quality of out-of-distribution samples and the discrepancy of examples in the data. We show that our metric can be viewed as a combination of two metrics, one relating to proximity and the other to epistemic uncertainty. Finally, we leverage our metric in a pessimistic model-based framework, showing a significant improvement upon contemporary model-based offline reinforcement learning benchmarks.
Guy Tennenholtz · Nir Baram · Shie Mannor

- Domain Knowledge Guided Offline Q Learning (Poster)
Offline reinforcement learning (RL) is a promising method for applications where direct exploration is not possible but a decent initial model is expected for the online stage. In practice, offline RL can underperform because of overestimation attributed to distributional shift between the training data and the learned policy. A common approach to mitigating this issue is to constrain the learned policies so that they remain close to the fixed batch of interactions. This method is typically used without considering the application context. However, domain knowledge is available in many real-world cases and may be utilized to effectively handle the issue of out-of-distribution actions. Incorporating domain knowledge in training avoids additional function approximation to estimate the behavior policy and results in easy-to-interpret policies. To encourage the adoption of offline RL in practical applications, we propose the Domain Knowledge guided Q learning (DKQ). We show that DKQ is a conservative approach, where the unique fixed point still exists and is upper bounded by the standard optimal Q function. DKQ also leads to a lower chance of overestimation.
In addition, we demonstrate the benefit of DKQ empirically via a novel, real-world case study - guided family tree building, which appears to be the first application of offline RL in genealogy. The results show that, guided by proper domain knowledge, DKQ can achieve offline performance similar to standard Q learning and is better aligned with the behavior policy revealed from the data, indicating a lower risk of overestimation on unseen actions. Further, we demonstrate the efficiency and flexibility of DKQ with a classical control problem.
Xiaoxuan Zhang · Sijia Zhang · Yen-Yun Yu

- Understanding the Effects of Dataset Characteristics on Offline Reinforcement Learning (Poster)
In the real world, acting in the environment with a weak policy can be expensive or very risky, which hampers real-world applications of reinforcement learning. Offline Reinforcement Learning (RL) can learn policies from a given dataset without interacting with the environment. However, the dataset is the only source of information for an Offline RL algorithm and determines the performance of the learned policy. We still lack studies on how dataset characteristics influence different Offline RL algorithms. Therefore, we conducted a comprehensive empirical analysis of how dataset characteristics affect the performance of Offline RL algorithms for discrete action environments. A dataset is characterized by two metrics: (1) the Trajectory Quality (TQ) measured by the average dataset return and (2) the State-Action Coverage (SACo) measured by the number of unique state-action pairs. We found that variants of the off-policy Deep Q-Network family require datasets with high SACo to perform well. Algorithms that constrain the learned policy towards the given dataset perform well for datasets with high TQ or SACo. For datasets with high TQ, Behavior Cloning outperforms or performs similarly to the best Offline RL algorithms.
Kajetan Schweighofer · Markus Hofmarcher · Marius-Constantin Dinu · Philipp Renz · Angela Bitto · Vihang Patil · Sepp Hochreiter

- Unsupervised Learning of Temporal Abstractions using Slot-based Transformers (Poster)
The discovery of reusable sub-routines simplifies decision-making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in a purely unsupervised fashion through observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about sub-routine boundary points in light of new incoming information. In this work we propose SloTTAr, a fully parallel approach that integrates sequence processing Transformers with a Slot Attention module for learning about sub-routines in an unsupervised fashion. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, while being up to 30x faster on existing benchmarks.
Anand Gopalakrishnan · Kazuki Irie · Jürgen Schmidhuber · Sjoerd van Steenkiste

- Counter-Strike Deathmatch with Large-Scale Behavioural Cloning (Poster)
This paper describes an AI agent that plays the modern first-person-shooter (FPS) video game 'Counter-Strike: Global Offensive' (CSGO) from pixel input. The agent, a deep neural network, matches the performance of the medium difficulty built-in AI on the deathmatch game mode whilst adopting a humanlike play style. Previous research has mostly focused on games with convenient APIs and low-resolution graphics, allowing them to be run cheaply at scale. This is not the case for CSGO, with system requirements 100$\times$ that of previously studied FPS games. This limits the quantity of on-policy data that can be generated, precluding many reinforcement learning algorithms.
Our solution uses behavioural cloning: training on a large noisy dataset scraped from human play on online servers (5.5 million frames or 95 hours), and smaller datasets of clean expert demonstrations. This scale is an order of magnitude larger than prior work on imitation learning in FPS games. To introduce this challenging environment to the AI community, we open source code and datasets.
Tim Pearce · Jun Zhu

- Modern Hopfield Networks for Return Decomposition for Delayed Rewards (Poster)
Delayed rewards, which are separated from their causative actions by irrelevant actions, hamper learning in reinforcement learning (RL). Real-world problems in particular often contain such delayed and sparse rewards. Recently, return decomposition for delayed rewards (RUDDER) employed pattern recognition to remove or reduce delay in rewards, which dramatically simplifies the learning task of the underlying RL method. RUDDER was realized using a long short-term memory (LSTM). The LSTM was trained to identify important state-action pair patterns, responsible for the return. Reward was then redistributed to these important state-action pairs. However, training the LSTM is often difficult and requires a large number of episodes. In this work, we replace the LSTM with the recently proposed continuous modern Hopfield networks (MHN) and introduce Hopfield-RUDDER. MHN are powerful trainable associative memories with large storage capacity. They require only a few training samples and excel at identifying and recognizing patterns. We use this property of MHN to identify important state-action pairs that are associated with low or high return episodes and directly redistribute reward to them. However, in partially observable environments, Hopfield-RUDDER requires additional information about the history of state-action pairs.
Therefore, we evaluate several methods for compressing history and introduce reset-max history, a lightweight history compression using the max-operator in combination with a reset gate. We experimentally show that Hopfield-RUDDER is able to outperform LSTM-based RUDDER on various 1D environments with small numbers of episodes. Finally, we show in preliminary experiments that Hopfield-RUDDER scales to highly complex environments with the Minecraft ObtainDiamond task from the MineRL NeurIPS challenge.
Michael Widrich · Markus Hofmarcher · Vihang Patil · Angela Bitto · Sepp Hochreiter

- Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage (Poster)
We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint over the models to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where the additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDP with representation learning where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP where the partial coverage condition is defined using density-ratio based concentrability coefficients associated with individual factors.
Masatoshi Uehara · Wen Sun

- Importance of Representation Learning for Off-Policy Fitted Q-Evaluation (Poster)
The goal of offline policy evaluation (OPE) is to evaluate target policies based on logged data with a possibly much different distribution. One of the most popular empirical approaches to OPE is fitted Q-evaluation (FQE). With linear function approximation, several works have found that FQE (and other OPE methods) exhibit exponential error amplification in the problem horizon, except under very strong assumptions. Given the empirical success of deep FQE, in this work we examine the effect of implicit regularization through deep architectures and loss functions on the divergence and performance of FQE. We find that divergence does occur with simple feed-forward architectures, but can be mitigated using various architectures and algorithmic techniques, such as ResNet architectures, learning a shared representation between multiple target policies, and hypermodels. Our results suggest interesting directions for future work, including analyzing the effect of architecture on stability of fixed-point updates which are ubiquitous in modern reinforcement learning.
Carrie Wu · Nevena Lazic · Dong Yin · Cosmin Paduraru

- Offline Contextual Bandits for Wireless Network Optimization (Poster)
The explosion in mobile data traffic together with the ever-increasing expectations for higher quality of service call for the development of new AI algorithms for wireless network optimization. In this paper, we investigate how to learn policies that can automatically adjust the configuration parameters of every cell in the network in response to the changes in the user demand. Our solution combines existing methods for offline learning and adapts them in a principled way to overcome crucial challenges arising in this context.
Empirical results suggest that our proposed method will achieve important performance gains when deployed in the real network while satisfying practical constraints on computational efficiency.
Miguel Suau de Castro

- Robust On-Policy Data Collection for Data-Efficient Policy Evaluation (Poster)
This paper considers how to complement offline reinforcement learning (RL) data with additional data collection for the task of policy evaluation. In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest. Prior work on offline policy evaluation typically only considers a static dataset. We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset. We show that simply running the evaluation policy (on-policy data collection) is sub-optimal for this setting. We then introduce two new data collection strategies for policy evaluation, both of which consider previously collected data when collecting future data so as to reduce distribution shift (or sampling error) in the entire dataset collected. Our empirical results show that compared to on-policy sampling, our strategies produce data with lower sampling error and generally lead to lower mean-squared error in policy evaluation for any total dataset size. We also show that these strategies can start from initial off-policy data, collect additional data, and then use both the initial and new data to produce low mean-squared error policy evaluation without using off-policy corrections.
Rujie Zhong · Josiah Hanna · Lukas Schäfer · Stefano Albrecht

- Doubly Pessimistic Algorithms for Strictly Safe Off-Policy Optimization (Poster)
Safety in reinforcement learning (RL) has become increasingly important in recent years. Yet, many existing solutions fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems.
In this paper, we study offline RL in the presence of safety requirements: from a dataset collected a priori and without direct access to the true environment, learn an optimal policy that is guaranteed to respect the safety constraints. We first address this problem by modeling the safety requirement as an unknown cost function of states and actions, whose expected value with respect to the policy must fall below a certain threshold. We then present an algorithm in the context of finite-horizon Markov decision processes (MDPs), termed Safe-DPVI, that is doubly pessimistic: 1) when it constructs a conservative set of safe policies; and 2) when it selects a good policy from that conservative set. Without assuming sufficient coverage of the dataset or any structure for the underlying MDPs, we establish a data-dependent upper bound on the suboptimality gap of the safe policy Safe-DPVI returns. We then specialize our results to linear MDPs with appropriate assumptions on the dataset being well-explored. Both data-dependent and specialized upper bounds nearly match those of state-of-the-art unsafe offline RL algorithms, with an additional multiplicative factor $\frac{\sum_{h=1}^H\alpha_{h}}{H}$, where $\alpha_h$ characterizes the safety constraint at time-step $h$. We further present numerical simulations that corroborate our theoretical findings.
Sanae Amani · Lin Yang

- OFFLINE RL WITH RESOURCE CONSTRAINED ONLINE DEPLOYMENT (Poster)
Offline reinforcement learning is used to train policies in scenarios where real-time access to the environment is expensive or impossible. As a natural consequence of these harsh conditions, an agent may lack the resources to fully observe the online environment before taking an action. We dub this situation the resource-constrained setting.
This leads to situations where the offline dataset (available for training) can contain fully processed features (using powerful language models, image models, complex sensors, etc.) which are not available when actions are actually taken online. This disconnect leads to an interesting and unexplored problem in offline RL: Is it possible to use a richly processed offline dataset to train a policy which has access to fewer features in the online environment? In this work, we introduce and formalize this novel resource-constrained problem setting. We highlight the performance gap between policies trained using the full offline dataset and policies trained using limited features. We address this performance gap with a policy transfer algorithm which first trains a teacher agent using the offline dataset where features are fully available, and then transfers this knowledge to a student agent that only uses the resource-constrained features. To better capture the challenge of this setting, we propose a data collection procedure: Resource Constrained-Datasets for RL (RC-D4RL). We evaluate our transfer algorithm on RC-D4RL and the popular D4RL benchmarks and observe consistent improvement over the baseline (TD3+BC without transfer).
Jayanth Reddy Regatti · Aniket Anand Deshmukh · Young Jung · Abhishek Gupta · Urun Dogan

- Personalization for Web-based Services using Offline Reinforcement Learning (Poster)
Large-scale Web-based services present opportunities for improving UI policies based on observed user interactions. We investigate both the sequential and non-sequential formulations, highlighting their benefits and drawbacks. In the sequential setting, we address challenges of learning such policies through model-free offline Reinforcement Learning (RL) with off-policy training. Deployed in a production system for user authentication in a major social network, it significantly improves long-term objectives.
We articulate practical challenges, compare several ML techniques, provide insights on training and evaluation of RL models, and discuss generalizations. Pavlos A Apostolopoulos · Zehui Wang · Hanson Wang · Chad Zhou · Kittipat Virochsiri · Norm Zhou · Igor Markov 🔗 - Offline Reinforcement Learning with Implicit Q-Learning (Poster) Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This tradeoff is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function, without any explicit policy. 
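The upper expectile mentioned in the IQL abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it only assumes the standard asymmetric squared loss, with weight `tau` on positive errors and `1 - tau` on negative ones, and fits a single scalar value to a handful of made-up Q-values. The names `expectile_loss`, `fit_expectile`, and `q_values` are hypothetical.

```python
import numpy as np


def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss: weight tau on positive errors, (1 - tau) on the rest."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2


def fit_expectile(samples, tau=0.7, lr=0.1, steps=2000):
    """Plain gradient descent on the mean expectile loss over a scalar v."""
    v = float(np.mean(samples))
    for _ in range(steps):
        diff = samples - v
        weight = np.where(diff > 0, tau, 1.0 - tau)
        grad = -2.0 * np.mean(weight * diff)  # d/dv of mean(weight * diff**2)
        v -= lr * grad
    return v


# Hypothetical Q(s, a) values for the dataset actions observed in one state.
q_values = np.array([0.0, 1.0, 2.0, 10.0])
v_mean = fit_expectile(q_values, tau=0.5)   # tau = 0.5 gives the ordinary mean
v_upper = fit_expectile(q_values, tau=0.9)  # larger tau moves toward the maximum
```

With `tau = 0.5` the fitted value is the plain average of the Q-values, while a larger `tau` pulls it toward the best action seen in the data without ever evaluating an action outside the dataset.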
Then, we extract the policy via advantage-weighted behavioral cloning, which also avoids querying out-of-sample actions. We dub our method implicit Q-learning (IQL). IQL is easy to implement, computationally efficient, and only requires fitting an additional critic with an asymmetric L2 loss. Ilya Kostrikov · Ashvin Nair · Sergey Levine 🔗 - Pessimistic Model Selection for Offline Deep Reinforcement Learning (Poster) Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data, model selection is a challenging task as there is no ground truth available for performance demonstration, in contrast with the online setting with simulated environments. In this work, we propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a tuning-free framework for finding the best policy among a set of candidate models. Two refined approaches are also proposed to address the potential bias of the DRL model in identifying the optimal policy. Numerical studies demonstrated the superior performance of our approach over existing methods. Huck Yang · Yifan Cui · Pin-Yu Chen 🔗 - BATS: Best Action Trajectory Stitching (Poster) The problem of offline reinforcement learning focuses on learning a good policy from a log of environment interactions. Past efforts for developing algorithms in this area have revolved around introducing constraints to online reinforcement learning algorithms to ensure the actions of the learned policy are constrained to the logged data. In this work, we explore an alternative approach by planning on the fixed dataset directly. 
Specifically, we introduce an algorithm which forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. We do this by using learned dynamics models to plan short trajectories between states. Since exact value iteration can be performed on this constructed MDP, it becomes easy to identify which trajectories are advantageous to add to the MDP. Crucially, since most transitions in this MDP come from the logged data, trajectories from the MDP can be rolled out for long periods with confidence. We prove that this property allows one to make upper and lower bounds on the value function up to appropriate distance metrics. Finally, we demonstrate empirically how algorithms that uniformly constrain the learned policy to the entire dataset can result in unwanted behavior, and we show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem. Ian Char · Viraj Mehta · Adam Villaflor · John Dolan · Jeff Schneider 🔗 - Single-Shot Pruning for Offline Reinforcement Learning (Poster) Deep Reinforcement Learning (RL) is a powerful framework for solving complex real-world problems. Large neural networks employed in the framework are traditionally associated with better generalization capabilities, but their increased size entails the drawbacks of extensive training duration, substantial hardware resources, and longer inference times. One way to tackle this problem is to prune neural networks, leaving only the necessary parameters. State-of-the-art concurrent pruning techniques for imposing sparsity perform demonstrably well in applications where data-distributions are fixed. However, they have not yet been substantially explored in the context of RL. We close the gap between RL and single-shot pruning techniques and present a general pruning approach to offline RL. We leverage a fixed dataset to prune neural networks before the start of RL training. 
We then run experiments varying the network sparsity level and evaluating the validity of pruning-at-initialization techniques in continuous control tasks. Our results show that with 95% of the network weights pruned, Offline-RL algorithms can still retain performance in the majority of our experiments. To the best of our knowledge, no prior work utilizing pruning in RL retained performance at such high levels of sparsity. Moreover, pruning-at-initialization techniques can be easily integrated into any existing Offline-RL algorithms without changing the learning objective. Samin Yeasar Arnob · Riyasat Ohib · Sergey Plis · Doina Precup 🔗 - Offline neural contextual bandits: Pessimism, Optimization and Generalization (Poster) Offline policy learning (OPL) leverages existing data collected a priori for policy optimization without any active exploration. Despite the prevalence and recent interest in this problem, its theoretical and algorithmic foundations in function approximation settings remain under-developed. In this paper, we consider this problem on the axes of distributional shift, optimization, and generalization in offline contextual bandits with neural networks. In particular, we propose a provably efficient offline contextual bandit with neural network function approximation that does not require any functional assumption on the reward. We show that our method provably generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works. Notably, unlike any other OPL method, our method learns from the offline data in an online manner using stochastic gradient descent, allowing us to leverage the benefits of online learning into an offline setting. Moreover, we show that our method is more computationally efficient and has a better dependence on the effective dimension of the neural network than an online counterpart. 
Finally, we demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems. Thanh Nguyen-Tang · Sunil Gupta · A. Tuan Nguyen · Svetha Venkatesh 🔗 - Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions (Poster) Reinforcement learning (RL) agents are widely used for solving complex sequential decision-making tasks, but still exhibit difficulty in generalizing to scenarios not seen during training. While prior online approaches demonstrated that using additional signals beyond the reward function can lead to better generalization capabilities in RL agents, e.g., using self-supervised learning (SSL), they struggle in the offline RL setting, i.e., learning from a static dataset. We show that performance of online algorithms for generalization in RL can be hindered in the offline setting due to poor estimation of similarity between observations. We propose a new theoretically-motivated framework called Generalized Similarity Functions (GSF), which uses contrastive learning to train an offline RL agent to aggregate observations based on the similarity of their expected future behavior, where we quantify this similarity using generalized value functions. We show that GSF is general enough to recover existing SSL objectives while also improving zero-shot generalization performance on a complex offline RL benchmark, offline Procgen. Bogdan Mazoure · Ilya Kostrikov · Ofir Nachum · Jonathan Tompson 🔗 - Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning (Poster) Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and would further need to be fine-tuned online by interacting with the environment. 
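As a rough illustration of pruning once from a fixed dataset before training, the sketch below scores each weight of a randomly initialised linear policy by |w * dL/dw| on one batch and keeps only the top 5%. The saliency rule is an assumption in the style of common pruning-at-initialization criteria; the abstract does not say which criterion the authors use. The behavior-cloning loss, the synthetic data, and the names `saliency_mask`, `states`, `actions`, and `W` are likewise hypothetical.

```python
import numpy as np


def saliency_mask(weights, grads, sparsity=0.95):
    """Keep the (1 - sparsity) fraction of weights with the largest |w * dL/dw|."""
    scores = np.abs(weights * grads).ravel()
    n_keep = max(1, int(round(scores.size * (1.0 - sparsity))))
    keep_idx = np.argpartition(scores, -n_keep)[-n_keep:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[keep_idx] = True
    return mask.reshape(weights.shape)


rng = np.random.default_rng(0)
states = rng.normal(size=(256, 20))   # hypothetical batch of logged states
actions = rng.normal(size=(256, 5))   # hypothetical logged actions
W = rng.normal(size=(20, 5)) * 0.1    # randomly initialised linear policy

# Gradient of the batch-mean squared error between predicted and logged actions.
residual = states @ W - actions
grad_W = 2.0 * states.T @ residual / len(states)

mask = saliency_mask(W, grad_W, sparsity=0.95)
W_pruned = W * mask                   # pruned once, before any RL training
```

Any offline RL algorithm could then be trained with the mask held fixed, which is why this kind of pruning leaves the learning objective unchanged.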
During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods such as a behaviour cloning loss prevent this to an extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Yi Zhao · Rinu Boney · Alexander Ilin · Juho Kannala · Joni Pajarinen 🔗 - What Would the Expert $do(\cdot)$?: Causal Imitation Learning (Poster) We develop algorithms for imitation learning from policy data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance. To break up these spurious correlations, we apply modern variants of the classical instrumental variable regression (IVR) technique, enabling us to recover the causally correct underlying policy without requiring access to an interactive expert. In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline. 
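For readers unfamiliar with instrumental variable regression, the following is a minimal sketch of classical two-stage least squares on synthetic data. It is not DoubIL or ResiduIL, and the choice of instrument, the linear data-generating process, and all variable names are assumptions made purely for illustration. An unobserved confounder `u` affects both the input `x` and the target `y`, so ordinary regression overestimates the slope, while the instrument `z` recovers it.

```python
import numpy as np


def two_stage_least_squares(z, x, y):
    """Classical IV regression for scalar x: regress x on z, then y on fitted x."""
    Z = np.column_stack([np.ones_like(z), z])
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]        # stage 1
    X_hat = np.column_stack([np.ones_like(x_hat), x_hat])
    return np.linalg.lstsq(X_hat, y, rcond=None)[0][1]      # stage 2 slope


def ordinary_least_squares(x, y):
    """Slope of y regressed directly on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]


rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=n)                  # instrument: moves x, independent of u
u = rng.normal(size=n)                  # unobserved confounder
x = z + u + 0.1 * rng.normal(size=n)    # observed input
y = 2.0 * x + 3.0 * u + 0.1 * rng.normal(size=n)  # true causal slope is 2.0

slope_ols = ordinary_least_squares(x, y)     # biased upward by the confounder
slope_iv = two_stage_least_squares(z, x, y)  # close to the true slope
```

The spurious correlation here plays the role of the state-action correlations described in the abstract: regressing directly on the observed input latches onto the confounder, whereas the instrumented fit does not.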
We discuss, from the perspective of performance, the types of confounding under which it is better to use an IVR-based technique instead of behavioral cloning and vice versa. We find both of our algorithms compare favorably to behavioral cloning on a simulated rocket landing task. Gokul Swamy · Sanjiban Choudhury · James Bagnell · Steven Wu 🔗 - Quantile Filtered Imitation Learning (Poster) We introduce quantile filtered imitation learning (QFIL), a novel policy improvement operator designed for offline reinforcement learning. QFIL performs policy improvement by running imitation learning on a filtered version of the experience dataset. The filtering process removes (s, a) pairs whose estimated Q values fall below a given quantile of the pushforward distribution over values induced by sampling actions from the behavior policy. The definitions of both the pushforward Q distribution and resulting value function quantile are key contributions of our method. We prove that QFIL gives us a safe policy improvement step with function approximation and that the choice of quantile provides a natural hyperparameter to trade off bias and variance of the improvement step. Empirically, we perform a synthetic experiment illustrating how QFIL effectively makes a bias-variance tradeoff and we see that QFIL performs well on the D4RL benchmark. David Brandfonbrener · Will Whitney · Rajesh Ranganath · Joan Bruna 🔗 - Benchmarking Sample Selection Strategies for Batch Reinforcement Learning (Poster) Training sample selection techniques, such as prioritized experience replay (PER), have been recognized as significantly important for online reinforcement learning algorithms. Efficient sample selection can help further improve the learning efficiency and the final performance. However, the impact of sample selection for batch reinforcement learning (RL) has not been well studied. In this work, we investigate the application of non-uniform sampling techniques in batch RL. 
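Non-uniform sampling from a fixed dataset can be sketched in the style of proportional prioritized experience replay. The exponents `alpha` and `beta`, the use of absolute TD error as the priority, and the toy numbers are assumptions for illustration only, not the configuration benchmarked in the paper.

```python
import numpy as np


def priority_probabilities(priorities, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|priority| + eps) ** alpha."""
    scaled = (np.abs(priorities) + eps) ** alpha
    return scaled / scaled.sum()


def importance_weights(probs, beta=0.4):
    """Weights that correct for non-uniform sampling, scaled so the max is 1."""
    weights = (len(probs) * probs) ** (-beta)
    return weights / weights.max()


rng = np.random.default_rng(0)
td_errors = np.array([0.1, 0.5, 2.0, 0.0, 1.0])  # hypothetical per-transition priorities
probs = priority_probabilities(td_errors)
weights = importance_weights(probs)
batch_idx = rng.choice(len(td_errors), size=3, p=probs)  # indices into the fixed dataset
```

Swapping `td_errors` for another per-transition score (for example an n-step return or an uncertainty estimate) gives a different priority metric with the same sampling machinery.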
In particular, we compare six variants of PER based on various heuristic priority metrics that focus on different aspects of the offline learning setting. These metrics include temporal-difference error, n-step return, self-imitation learning objective, pseudo-count, uncertainty, and likelihood. Through extensive experiments on the standard batch RL datasets, we find that non-uniform sampling is also effective in batch RL settings. Further, there is no single metric that works in all situations. The investigation also shows that it is insufficient to avoid the bootstrapping error in batch reinforcement learning by only changing the sampling scheme. Yuwei Fu · Di Wu · Benoit Boulet 🔗 - Dynamic Mirror Descent based Model Predictive Control for Accelerating Robot Learning (Poster) Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches to get the best from both: asymptotic performance of Mf-RL and high sample efficiency of Mb-RL. Inspired by these works, we propose a hierarchical framework that integrates online learning for the Mb-trajectory optimization with off-policy methods for the Mf-RL. In particular, two loops are proposed, where the Dynamic Mirror Descent based Model Predictive Control (DMD-MPC) is used as the inner loop Mb-RL to obtain an optimal sequence of actions. These actions are in turn used to significantly accelerate the outer loop Mf-RL. We show that our formulation is generic for a broad class of MPC-based policies and objectives, and includes some of the well-known Mb-Mf approaches. We finally introduce a new algorithm: Mirror-Descent Model Predictive RL (M-DeMoRL), which uses Cross-Entropy Method (CEM) with elite fractions for the inner loop. Our experiments show faster convergence of the proposed hierarchical approach on benchmark MuJoCo tasks. We also demonstrate hardware training for trajectory tracking in a 2R leg, and hardware transfer for robust walking in a quadruped. 
We show that the inner-loop Mb-RL significantly decreases the number of training iterations required in the real system, thereby validating the proposed approach. Utkarsh A Mishra · Soumya Samineni · Aditya Varma Sagi · Shalabh Bhatnagar · Shishir N Y 🔗 - MBAIL: Multi-Batch Best Action Imitation Learning utilizing Sample Transfer and Policy Distillation (Poster) Most online reinforcement learning (RL) algorithms require a large number of interactions with the environment to learn a reliable control policy. Unfortunately, the assumption of the availability of repeated interactions with the environment does not hold for many real-world applications. Batch RL aims to learn a good control policy from a previously collected dataset without requiring additional interactions with the environment, which is very promising for solving real-world problems. However, in the real world, we may only have a limited amount of data points for certain tasks we are interested in. Also, most current batch RL methods mainly aim to learn a policy over one fixed dataset, with which it is hard to learn a policy that can perform well over multiple tasks. In this work, we propose to tackle these challenges with sample transfer and policy distillation. The proposed methods are evaluated on multiple control tasks to showcase their effectiveness. Di Wu · Tianyu Li · David Meger · Michael Jenkin · Steve Liu · Gregory Dudek 🔗 - Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Matters (Poster) In recent years, vast progress has been made in Offline Reinforcement Learning (Offline-RL) for various decision-making domains: from finance to robotics. However, comparing and reporting new Offline-RL algorithms has been noted as underdeveloped: (1) use of unlimited online evaluation budget for hyperparameter search, (2) sidestepping offline policy selection, and (3) ad-hoc performance statistics reporting. 
In this work, we propose an evaluation technique addressing these issues, Expected Online Performance, that provides a performance estimate for a best-found policy given a fixed online evaluation budget. Using our approach, we can estimate the number of online evaluations required to surpass a given behavioral policy performance. Applying it to several Offline-RL baselines, we find that with a limited online evaluation budget, (1) Behavioral Cloning constitutes a strong baseline over various expert levels and data regimes, and (2) offline uniform policy selection is competitive with value-based approaches. We hope the proposed technique will make it into the toolsets of Offline-RL practitioners to help them arrive at informed conclusions when deploying RL in real-world systems. Vladislav Kurenkov · Sergey Kolesnikov 🔗 - Offline Reinforcement Learning with Munchausen Regularization (Poster) Most temporal-difference-based (TD-based) Reinforcement Learning (RL) methods focus on replacing the true value of a transiting state by their current estimate of this value. Munchausen-RL (M-RL) proposes leveraging the current policy to bootstrap RL. The concept of penalizing two consecutive policies that are far from each other is also applicable to offline settings. In our work, we add the Munchausen term in the Q-update step to penalize policies that deviate too far from the previous policy. Our results indicate that this method could be implemented in various offline Q-learning methods to help improve the performance. In addition, we evaluate how prioritized experience replay affects offline RL. Our results show that Munchausen Offline RL outperforms the original methods without the regularization term. 
Hsin-Yu Liu · Bharathan Balaji · Dezhi Hong 🔗 - Importance of Empirical Sample Complexity Analysis for Offline Reinforcement Learning (Poster) We hypothesize that empirically studying the sample complexity of offline reinforcement learning (RL) is crucial for the practical applications of RL in the real world. Several recent works have demonstrated the ability to learn policies directly from offline data. In this work, we ask how learning from offline data depends on the number of samples. Our objective is to emphasize that studying sample complexity for offline RL is important, and is an indicator of the usefulness of existing offline algorithms. We propose an evaluation approach for sample complexity analysis of offline RL. Samin Yeasar Arnob · Riashat Islam · Doina Precup 🔗 - Discrete Uncertainty Quantification Approach for Offline RL (Poster) In many Reinforcement Learning tasks, the classical online interaction of the learning agent with the environment is impractical, because such interaction is either expensive or dangerous. In these cases, previously gathered data can be used, giving rise to what is typically called Offline Reinforcement Learning. However, this type of learning faces a large number of challenges, mostly derived from the fact that the exploration/exploitation trade-off is overshadowed. Instead, the historical data is usually biased by the way it was obtained, typically by a sub-optimal controller, producing a distributional shift between the historical data and the data required to learn the optimal policy. Javier Corrochano · Rubén Majadas · Fernando Fernandez 🔗 - Pretraining for Language-Conditioned Imitation with Transformers (Poster) We study reinforcement learning (RL) agents which can utilize language inputs and efficiently learn on downstream tasks. 
To investigate this, we propose a new multimodal benchmark -- Text-Conditioned Frostbite -- in which an agent must complete tasks specified by text instructions in the Atari Frostbite environment. We curate and release a dataset of 5M text-labelled transitions for training, and to encourage further research in this direction. On this benchmark, we evaluate Text Decision Transformer (TDT), a transformer directly operating on text, state, and action tokens, and find it improves upon baseline architectures. Furthermore, we evaluate the effect of pretraining, finding unsupervised pretraining can yield improved results in low-data settings. Aaron Putterman · Kevin Lu · Igor Mordatch · Pieter Abbeel 🔗 - Stateful Offline Contextual Policy Evaluation and Learning (Poster) We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals with contexts, which generate unknown individual-level responses to agent actions that induce known transitions. This is a relevant model, for example, for dynamic personalized pricing and other operations management problems in the presence of potentially high-dimensional user types. The individual-level response is not causally affected by the state variable. In this setting, we adapt doubly-robust estimation in the single-timestep setting to the sequential setting so that a state-dependent policy can be learned even from a single timestep's worth of data. We introduce a marginal MDP model and study an algorithm for off-policy learning, which can be viewed as fitted value iteration in the marginal MDP. We also provide structural results on when errors in the response model lead to the persistence, rather than attenuation, of error over time. 
In simulations, we show that the advantages of doubly-robust estimation in the single time-step setting, via unbiased and lower-variance estimation, can directly translate to improved out-of-sample policy performance. This structure-specific analysis sheds light on the underlying structure on a class of problems, operations research/management problems, often heralded as a real-world domain for offline RL, which are in fact qualitatively easier. Angela Zhou 🔗 - Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation (Poster) We consider offline reinforcement learning, where the goal is to learn a decision making policy from logged data. Offline RL—particularly when coupled with (value) function approximation to allow for generalization in large/continuous state spaces—is becoming increasingly relevant in practice, because it avoids costly and time-consuming online data collection and is well-suited to safety-critical domains. Existing sample complexity guarantees for offline value function approximation methods typically require both (1) distributional assumptions (i.e., good coverage) and (2) representational assumptions (i.e., ability to represent some or all Q-value functions) stronger than what is required for supervised learning. However, the necessity of these conditions and the fundamental limits for offline RL are not well-understood in spite of decades of research. This led Chen and Jiang (2019) to conjecture that concentrability (the most standard notion of coverage) and realizability (the weakest representation condition) alone are not sufficient for sample-efficient offline RL. We resolve this conjecture in the positive by proving (information theoretically) that even if both concentrability and realizability are satisfied, any algorithm requires sample complexity polynomial in the size of the state space to learn a non-trivial policy. 
Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions beyond what is required in classical supervised learning, and highlight a phenomenon called over-coverage which serves as a fundamental barrier for offline value function approximation methods. Dylan Foster · Akshay Krishnamurthy · David Simchi-Levi · Yunzong Xu 🔗 - Learning Value Functions from Undirected State-only Experience (Poster) This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels i.e. (s,s',r) tuples). We first theoretically characterize the applicability of Q-learning in this setting. We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space. This theoretical result motivates the design of Latent Action Q-learning or LAQ, an offline RL method that can learn effective value functions from state-only experience. Latent Action Q-learning (LAQ) learns value functions using Q-learning on discrete latent actions obtained through a latent-variable future prediction model. We show that LAQ can recover value functions that have high correlation with value functions learned using ground truth actions. Value functions learned using LAQ lead to sample efficient acquisition of goal-directed behavior, can be used with domain-specific low-level controllers, and facilitate transfer across embodiments. Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods. 
Matthew Chang · Arjun Gupta · Saurabh Gupta 🔗 - Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations (Poster) We study the problem of offline Imitation Learning (IL) where an agent aims to learn an optimal expert behavior policy without additional online environment interactions. Instead, the agent is provided with a static offline dataset of state-action-next state transition triples from both optimal and non-optimal expert behaviors. This strictly offline imitation learning problem arises in many real-world problems, where environment interactions and expert annotations are costly. Prior works that address the problem either require that expert data occupies the majority of the offline dataset, or need to learn a reward function and perform offline reinforcement learning (RL) based on the learned reward function. In this paper, we propose an imitation learning algorithm that addresses the problem without additional steps of reward learning and offline RL training for the case when demonstrations contain a large proportion of suboptimal data. Building upon behavioral cloning (BC), we introduce an additional discriminator to distinguish expert and non-expert data, and we propose a cooperation strategy to boost the performance of both tasks. This results in a new policy learning objective and, surprisingly, we find that it is equivalent to a generalized BC objective, where the outputs of the discriminator serve as the weights of the BC loss function. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than policies learned by baseline algorithms. 
Haoran Xu · Xianyuan Zhan · Honglei Yin · 🔗 - Model-Based Offline Planning with Trajectory Pruning (Poster) Offline reinforcement learning (RL) enables learning policies using pre-collected datasets without environment interaction, which provides a promising direction to make RL usable in real-world systems. 
Although recent offline RL studies have achieved much progress, existing methods still face many practical challenges in real-world system control tasks, such as computational restrictions during agent training and the requirement of extra control flexibility. A model-based planning framework provides an attractive solution for such tasks. However, most model-based planning algorithms are not designed for offline settings. Simply combining the ingredients of offline RL with existing methods either provides over-restrictive planning or leads to inferior performance. We propose a new lightweight model-based offline planning framework, namely MOPP, which tackles the dilemma between the restrictions of offline learning and high-performance planning. MOPP encourages more aggressive trajectory rollout guided by the behavior policy learned from data, and prunes out problematic trajectories to avoid potential out-of-distribution samples. Experimental results show that MOPP provides competitive performance compared with existing model-based offline planning and RL approaches. Xianyuan Zhan · Xiangyu Zhu · Haoran Xu 🔗 - TRAIL: Near-Optimal Imitation Learning with Suboptimal Data (Poster) The aim in imitation learning is to learn effective policies by utilizing near-optimal expert demonstrations. However, high-quality demonstrations from human experts can be expensive to obtain in large numbers. On the other hand, it is often much easier to obtain large quantities of suboptimal or task-agnostic trajectories, which are not useful for direct imitation, but can nevertheless provide insight into the dynamical structure of the environment, showing what could be done in the environment even if not what should be done. Is it possible to formalize these conceptual benefits and devise algorithms to use offline datasets to yield provable improvements to the sample-efficiency of imitation learning? 
In this work, we study this question and present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data. To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model contrastively, and uses the transition model to reparametrize the action space for sample-efficient imitation learning. We evaluate the practicality of our objective through experiments on a set of navigation and locomotion tasks. Our results verify the benefits suggested by our theory and show that TRAIL is able to recover near-optimal policies with fewer expert trajectories. Sherry Yang · Sergey Levine · Ofir Nachum 🔗 - Offline Meta-Reinforcement Learning for Industrial Insertion (Poster) Reinforcement learning (RL) can in principle make it possible for robots to automatically adapt to new tasks, but in practice current RL methods require a very large number of trials to accomplish this. In this paper, we tackle rapid adaptation to new tasks through the framework of meta-learning, which utilizes past tasks to learn to adapt, with a specific focus on industrial insertion tasks. We address two specific challenges by applying meta-learning in this setting. First, conventional meta-RL algorithms require lengthy online meta-training phases. We show that this can be replaced with appropriately chosen offline data, resulting in an offline meta-RL method that only requires demonstrations and trials from each of the prior tasks, without the need to run costly meta-RL procedures online. 
Second, meta-RL methods can fail to generalize to new tasks that are too different from those seen at meta-training time, which poses a particular challenge in industrial applications, where high success rates are critical. We address this by combining contextual meta-learning with direct online finetuning: if the new task is similar to those seen in the prior data, then the contextual meta-learner adapts immediately, and if it is too different, it gradually adapts through finetuning. We show that our approach is able to quickly adapt to a variety of different insertion tasks, learning how to perform them with a success rate of 100% using only a fraction of the samples needed for learning the tasks from scratch. Experiment videos and details are available at https://sites.google.com/view/oda-anon. Tony Zhao · Jianlan Luo · Oleg Sushkov · Rugile Pevceviciute · Nicolas Heess · Jonathan Scholz · Stefan Schaal · Sergey Levine 🔗
- Sim-to-Real Interactive Recommendation via Off-Dynamics Reinforcement Learning (Poster) Interactive recommender systems (IRS) have received growing attention due to their awareness of long-term engagement and dynamic preference. Although the long-term planning perspective of reinforcement learning (RL) naturally fits the IRS setup, RL methods require a large amount of online user interaction, which is restricted due to economic considerations. To train agents with limited interaction data, previous works often rely on building simulators to mimic user behaviors in real systems. This poses potential challenges to the success of sim-to-real transfer. In practice, such transfer easily fails because user dynamics are highly unpredictable and sensitive to the type of recommendation task. To address this issue, we propose a novel method, S2R-Rec, to bridge the sim-to-real gap via off-dynamics RL. Generally, we expect a policy learned only by interacting with the simulator to perform well in the real environment.
To achieve this, we conduct dynamics adaptation to calibrate the difference in state transitions using reward correction. Furthermore, we reduce the representation discrepancy of items through representation adaptation. Instead of separating the above into two stages, we propose to jointly adapt the dynamics and representations, leading to a unified learning objective. Experiments on real-world datasets validate the superiority of our approach, which achieves an improvement of about 33.18% over the baselines. Junda Wu · Zhihui Xie · Tong Yu · Qizhi Li · Shuai Li 🔗
- Why so pessimistic? Estimating uncertainties for offline rl through ensembles, and why their independence matters (Poster) In offline/batch reinforcement learning (RL), the predominant and most successful class of approaches has been "support constraint" methods, where trained policies are encouraged to remain within the support of the provided offline dataset. However, support constraints correspond to an overly pessimistic assumption that actions outside the provided data may lead to worst-case outcomes. In this work, we aim to relax this assumption by obtaining uncertainty estimates for predicted action values, and acting conservatively with respect to a lower-confidence bound (LCB) on these estimates. Motivated by the success of ensembles for uncertainty estimation in supervised learning, we propose MSG, an offline RL method that employs an ensemble of independently updated Q-functions. First, theoretically, by referring to the literature on infinite-width neural networks, we demonstrate the crucial dependence of the quality of derived uncertainties on the manner in which ensembling is performed, a phenomenon that arises due to the dynamic programming nature of RL and is overlooked by existing offline RL methods. Our theoretical predictions are corroborated by pedagogical examples on toy MDPs, as well as empirical comparisons in benchmark continuous control domains.
In the significantly more challenging antmaze domains of the D4RL benchmark, MSG with deep ensembles surpasses well-tuned state-of-the-art methods by a wide margin. Consequently, we investigate whether efficient approximations can be similarly effective. We demonstrate that while some very efficient variants also outperform the current state of the art, they do not match the performance and robustness of MSG with deep ensembles. We hope that the significant impact of our less pessimistic approach encourages increased focus on uncertainty estimation techniques directed at RL, as well as new efforts from the community of deep network uncertainty estimation researchers. Seyed Kamyar Seyed Ghasemipour · Shixiang (Shane) Gu · Ofir Nachum 🔗
- Example-Based Offline Reinforcement Learning without Rewards (Poster) Offline reinforcement learning (RL) methods, which tackle the problem of learning a policy from a static dataset, have shown promise in deploying RL in real-world scenarios. Offline RL allows the re-use and accumulation of large datasets while mitigating safety concerns that arise in online exploration. However, prior works require human-defined reward labels to learn from offline datasets. Reward specification remains a major challenge for deep RL algorithms and also poses an issue for offline RL in the real world, since designing reward functions could take considerable manual effort and could also require installing extra hardware, such as visual sensors on robots, to detect the completion of a task. In contrast, in many settings, it is easier for users to provide examples of a completed task, such as images, than to specify a complex reward function. Based on this observation, we propose an algorithm that can learn behaviors from offline datasets without reward labels, instead using a small number of example images.
Our method learns a conservative classifier that directly learns a Q-function from the offline dataset and the successful examples while penalizing the Q-values to prevent distributional shift. Through extensive empirical results, we find that our method outperforms prior imitation learning algorithms and inverse RL methods that directly learn rewards by 53% in vision-based robot manipulation domains. Kyle Hatch · Tianhe Yu · Rafael Rafailov · Chelsea Finn 🔗
- The Reflective Explorer: Online Meta-Exploration from Offline Data in Realistic Robotic Tasks (Poster) Reinforcement learning is difficult to apply to real-world problems due to high sample complexity, the need to adapt to frequent distribution shifts, and the complexities of learning from high-dimensional inputs, such as images. Over the last several years, meta-learning has emerged as a promising approach to tackle these problems by explicitly training an agent to quickly adapt to new tasks. However, such methods still require huge amounts of data during training and are difficult to optimize in high-dimensional domains. One potential solution is to consider offline or batch meta-reinforcement learning (RL): learning from existing datasets without additional environment interactions during training. In this work we develop the first offline model-based meta-RL algorithm that operates from images in tasks with sparse rewards. Our approach has three main components: a novel strategy to construct meta-exploration trajectories from offline data, which allows agents to learn a meaningful meta-test-time task inference strategy; representation learning via variational filtering; and latent conservative model-free policy optimization. We show that our method completely solves a realistic meta-learning task involving robot manipulation, while naive combinations of previous approaches fail. Rafael Rafailov · · Tianhe Yu · Avi Singh · Mariano Phielipp · Chelsea Finn 🔗

#### Author Information

##### Rishabh Agarwal (Google Research, Brain Team)

My research mainly revolves around deep reinforcement learning (RL), often with the goal of making RL methods suitable for real-world problems, and includes work that received an outstanding paper award at NeurIPS.