Discover how to improve the adoption of RL in practice by discussing key research problems, the state of the art, and success stories, insights, and lessons learned about practical RL algorithms, practical issues, and applications with leading experts from both academia and industry at the NeurIPS 2022 RL4RealLife workshop.
Sat 5:30 a.m. - 6:25 a.m. | posters (for early birds, optional)
Sat 6:25 a.m. - 6:30 a.m. | opening remarks | SlidesLive Video »
Sat 6:31 a.m. - 7:00 a.m. | Invited talk: Outracing Champion Gran Turismo Drivers with Deep Reinforcement Learning (talk) | link »
SlidesLive Video » Many potential applications of artificial intelligence involve making real-time decisions in physical systems while interacting with humans. Automobile racing represents an extreme example of these conditions; drivers must execute complex tactical manoeuvres to pass or block opponents while operating their vehicles at their traction limits. Racing simulations, such as the PlayStation game Gran Turismo, faithfully reproduce the non-linear control challenges of real race cars while also encapsulating the complex multi-agent interactions. Here we describe how we trained agents for Gran Turismo that can compete with the world's best e-sports drivers. We combine state-of-the-art, model-free, deep reinforcement learning algorithms with mixed-scenario training to learn an integrated control policy that combines exceptional speed with impressive tactics. In addition, we construct a reward function that enables the agent to be competitive while adhering to racing's important, but under-specified, sportsmanship rules. We demonstrate the capabilities of our agent, Gran Turismo Sophy, by winning a head-to-head competition against four of the world's best Gran Turismo drivers. By describing how we trained championship-level racers, we demonstrate the possibilities and challenges of using these techniques to control complex dynamical systems in domains where agents must respect imprecisely defined human norms. Bio: Dr. Peter Stone holds the Truchard Foundation Chair in Computer Science at the University of Texas at Austin. He is Associate Chair of the Computer Science Department, as well as Director of Texas Robotics. In 2013 he was awarded the University of Texas System Regents' Outstanding Teaching Award and in 2014 he was inducted into the UT Austin Academy of Distinguished Teachers, earning him the title of University Distinguished Teaching Professor. Professor Stone's research interests in Artificial Intelligence include machine learning (especially reinforcement learning), multiagent systems, and robotics. Professor Stone received his Ph.D in Computer Science in 1998 from Carnegie Mellon University. From 1999 to 2002 he was a Senior Technical Staff Member in the Artificial Intelligence Principles Research Department at AT&T Labs - Research. He is an Alfred P. Sloan Research Fellow, Guggenheim Fellow, AAAI Fellow, IEEE Fellow, AAAS Fellow, ACM Fellow, Fulbright Scholar, and 2004 ONR Young Investigator. In 2007 he received the prestigious IJCAI Computers and Thought Award, given biannually to the top AI researcher under the age of 35, and in 2016 he was awarded the ACM/SIGAI Autonomous Agents Research Award. Professor Stone co-founded Cogitai, Inc., a startup company focused on continual learning, in 2015, and currently serves as Executive Director of Sony AI America. |
Peter Stone 🔗 |
Sat 7:01 a.m. - 7:30 a.m. | Invited talk: Scaling reinforcement learning in the real world, from gaming to finance to manufacturing (talk)
SlidesLive Video » Reinforcement learning is transforming industries from gaming to robotics to manufacturing. This talk showcases how a variety of industries are adopting reinforcement learning to overhaul their businesses, from changing the nature of game development to designing the boat that won the America's Cup. These industries leverage Ray, a distributed framework for scaling Python applications and machine learning applications. Ray is used by companies across the board from Uber to OpenAI to Shopify to Amazon to scale their machine learning training, inference, data ingest, and reinforcement learning workloads. Bio: Robert Nishihara is one of the creators of Ray, a distributed framework for scaling Python applications and machine learning applications. Ray is used by companies across the board from Uber to OpenAI to Shopify to Amazon to scale their machine learning training, inference, data ingest, and reinforcement learning workloads. He is one of the co-founders and CEO of Anyscale, which is the company behind Ray. He did his PhD in machine learning and distributed systems in the computer science department at UC Berkeley. Before that, he majored in math at Harvard. |
Robert Nishihara 🔗 |
Sat 7:30 a.m. - 7:31 a.m. | Intro speaker (In-person Intro)
Sat 7:31 a.m. - 8:00 a.m. | Invited talk: Deep Reinforcement Learning for Real-World Inventory Management (talk)
SlidesLive Video » We present a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, we show that several policy learning approaches are competitive with or outperform classical baseline approaches. In order to train these algorithms, we develop novel techniques to convert historical data into a simulator and present a collection of results that motivate this approach. We also present a model-based reinforcement learning procedure (Direct Backprop) to solve the dynamic periodic review inventory control problem by constructing a differentiable simulator. Under a variety of metrics Direct Backprop outperforms model-free RL and newsvendor baselines, in both simulations and real-world deployments. Bio: Dhruv Madeka is a Principal Machine Learning Scientist at Amazon. His current research focuses on applying Deep Reinforcement Learning to supply chain problems. Dhruv has also worked on developing generative and supervised deep learning models for probabilistic time series forecasting. In the past - Dhruv worked in the Quantitative Research team at Bloomberg LP, developing open source tools for the Jupyter ecosystem and conducting advanced mathematical research in derivatives pricing, quantitative finance and election forecasting. |
Dhruv Madeka 🔗 |
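A rough sketch of the "Direct Backprop" idea from the talk above — training an ordering policy by backpropagating through a differentiable inventory simulator — is given below in PyTorch. The dynamics, costs, and network sizes are invented for illustration and are not the system described in the talk.

    import torch
    import torch.nn as nn

    # Toy single-SKU, lost-sales inventory problem with a differentiable rollout.
    policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Softplus())
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def episode_cost(demand, holding_cost=0.1, stockout_cost=1.0):
        batch, horizon = demand.shape
        inventory = torch.zeros(batch)
        prev_demand = torch.zeros(batch)
        total = torch.zeros(batch)
        for t in range(horizon):
            obs = torch.stack([inventory, prev_demand], dim=-1)
            order = policy(obs).squeeze(-1)               # differentiable order quantity
            inventory = inventory + order - demand[:, t]
            total = total + holding_cost * torch.relu(inventory) \
                          + stockout_cost * torch.relu(-inventory)
            inventory = torch.relu(inventory)             # lost sales: unmet demand disappears
            prev_demand = demand[:, t]
        return total.mean()

    for step in range(200):
        demand = torch.rand(64, 52) * 10.0                # synthetic demand, standing in for a data-driven simulator
        loss = episode_cost(demand)
        opt.zero_grad()
        loss.backward()                                   # gradient flows through the simulator dynamics
        opt.step()

Because every step of the rollout is differentiable, the gradient of the average cost with respect to the policy parameters is obtained by ordinary backpropagation rather than by a policy-gradient estimator.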
Sat 8:00 a.m. - 8:20 a.m. | Coffee break
Sat 8:20 a.m. - 9:10 a.m. | Panel: RL Implementation (Panel) | SlidesLive Video »
Xiaolin Ge · Alborz Geramifard · Kence Anderson · Craig Buhr · Robert Nishihara · Yuandong Tian 🔗 |
Sat 9:10 a.m. - 10:00 a.m. | Panel: RL Benchmarks (Panel) | SlidesLive Video »
Minmin Chen · Pablo Samuel Castro · Caglar Gulcehre · Tony Jebara · Peter Stone 🔗 |
Sat 10:00 a.m. - 11:30 a.m. | Lunch Break / Posters (Poster/Break)
Sat 11:31 a.m. - 12:00 p.m. | Invited talk: AlphaTensor: Discovering faster matrix multiplication algorithms with RL (talk)
SlidesLive Video » Improving the efficiency of algorithms for fundamental computational tasks such as matrix multiplication can have widespread impact, as it affects the overall speed of a large amount of computations. Automatic discovery of algorithms using ML offers the prospect of reaching beyond human intuition and outperforming the current best human-designed algorithms. In this talk I’ll present AlphaTensor, our RL agent based on AlphaZero for discovering efficient and provably correct algorithms for the multiplication of arbitrary matrices. AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time since its discovery 50 years ago. I’ll present our problem formulation as a single-player game, the key ingredients that enable tackling such difficult mathematical problems using RL, and the flexibility of the AlphaTensor framework. Bio: Matej Balog is a Senior Research Scientist at DeepMind, working in the Science team on applications of AI to Maths and Computation. Prior to joining DeepMind he worked on program synthesis and understanding, and was a PhD student at the University of Cambridge with Zoubin Ghahramani, working on general machine learning methodology, in particular on conversions between fundamental computational tasks such as integration, sampling, optimization, and search. |
Matej Balog 🔗 |
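As background to the game formulation described in the talk above: an algorithm that multiplies two n × n matrices with R scalar multiplications corresponds to a rank-R decomposition of the matrix multiplication tensor (the notation below is generic, not taken from the talk),

\[
\mathcal{T}_n \;=\; \sum_{r=1}^{R} u^{(r)} \otimes v^{(r)} \otimes w^{(r)} .
\]

In the single-player game, each move subtracts one rank-one term from the residual tensor; reaching the zero tensor in R moves certifies a provably correct R-multiplication algorithm. Strassen's classic construction corresponds to R = 7 for 2 × 2 matrices, versus the naive R = 8.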
Sat 12:00 p.m. - 12:55 p.m. | Panel: RL Theory-Practice Gap (Panel) | SlidesLive Video »
Peter Stone · Matej Balog · Jonas Buchli · Jason Gauci · Dhruv Madeka 🔗 |
Sat 12:55 p.m. - 1:00 p.m. | closing remarks
Sat 1:00 p.m. - 1:30 p.m. | Coffee break / Posters
Sat 1:30 p.m. - 3:00 p.m. | Posters
-
|
An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning
(
Poster
)
We study a posterior sampling approach to efficient exploration in constrained reinforcement learning. In contrast to existing algorithms, we propose two simple algorithms that are statistically more efficient, simpler to implement, and computationally cheaper. The first algorithm is based on a linear formulation of the CMDP, and the second algorithm leverages the saddle-point formulation of the CMDP. Our empirical results demonstrate that, despite its simplicity, posterior sampling achieves state-of-the-art performance and, in some cases, significantly outperforms optimistic algorithms. |
Danil Provodin · Pratik Gajane · Mykola Pechenizkiy · Maurits Kaptein 🔗 |
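For context on the two formulations named in the abstract above, the standard occupancy-measure linear program for a discounted CMDP and its Lagrangian saddle point can be written as follows (generic background notation, not taken from the paper):

\[
\max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{s,a} d(s,a)\, c(s,a) \le \tau,
\qquad
\sum_{a'} d(s',a') = (1-\gamma)\,\mu(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, d(s,a) \;\; \forall s',
\]

\[
\min_{\lambda \ge 0} \, \max_{d} \; \sum_{s,a} d(s,a)\, r(s,a) \;-\; \lambda \Big( \sum_{s,a} d(s,a)\, c(s,a) - \tau \Big),
\]

where $d$ is the discounted state-action occupancy measure, $\mu$ the initial state distribution, $c$ the constraint cost, and $\tau$ the constraint budget. Posterior sampling plans against a model drawn from the current posterior rather than an optimistically chosen one.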
-
|
An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning
(
Spotlight
)
SlidesLive Video » We study a posterior sampling approach to efficient exploration in constrained reinforcement learning. In contrast to existing algorithms, we propose two simple algorithms that are statistically more efficient, simpler to implement, and computationally cheaper. The first algorithm is based on a linear formulation of the CMDP, and the second algorithm leverages the saddle-point formulation of the CMDP. Our empirical results demonstrate that, despite its simplicity, posterior sampling achieves state-of-the-art performance and, in some cases, significantly outperforms optimistic algorithms. |
Danil Provodin · Pratik Gajane · Mykola Pechenizkiy · Maurits Kaptein 🔗 |
-
|
MARLIM: Multi-Agent Reinforcement Learning for Inventory Management
(
Poster
)
SlidesLive Video » Maintaining a balance between the supply and demand of products by optimizing replenishment decisions is one of the most important challenges in the supply chain industry. This paper presents a novel reinforcement learning framework called MARLIM to address the inventory management problem for a single-echelon multi-product supply chain with stochastic demands and lead times. Within this context, controllers are developed through single or multiple agents in a cooperative setting. Numerical experiments on real data demonstrate the benefits of reinforcement learning methods over traditional baselines. |
Rémi Leluc · Elie Kadoche · Antoine Bertoncello · Sébastien Gourvénec 🔗 |
-
|
MARLIM: Multi-Agent Reinforcement Learning for Inventory Management
(
Spotlight
)
Maintaining a balance between the supply and demand of products by optimizing replenishment decisions is one of the most important challenges in the supply chain industry. This paper presents a novel reinforcement learning framework called MARLIM to address the inventory management problem for a single-echelon multi-product supply chain with stochastic demands and lead times. Within this context, controllers are developed through single or multiple agents in a cooperative setting. Numerical experiments on real data demonstrate the benefits of reinforcement learning methods over traditional baselines. |
Rémi Leluc · Elie Kadoche · Antoine Bertoncello · Sébastien Gourvénec 🔗 |
-
|
A Versatile and Efficient Reinforcement Learning Approach for Autonomous Driving
(
Poster
)
Heated debates continue over the best solution for autonomous driving. The classic modular pipeline is widely adopted in the industry owing to its great interpretability and stability, whereas the fully end-to-end paradigm has demonstrated considerable simplicity and learnability along with the rise of deep learning. As a way of marrying the advantages of both approaches, learning a semantically meaningful representation and then using it in the downstream driving policy learning tasks provides a viable and attractive solution. However, several key challenges remain to be addressed, including identifying the most effective representation, alleviating the sim-to-real generalization issue as well as balancing model training cost. In this study, we propose a versatile and efficient reinforcement learning approach and build a fully functional autonomous vehicle for real-world validation. Our method shows great generalizability to various complicated real-world scenarios and superior training efficiency against the competing baselines. |
Guan Wang · Haoyi Niu · desheng zhu · Jianming HU · Xianyuan Zhan · Guyue Zhou 🔗 |
-
|
A Versatile and Efficient Reinforcement Learning Approach for Autonomous Driving
(
Spotlight
)
SlidesLive Video » Heated debates continue over the best solution for autonomous driving. The classic modular pipeline is widely adopted in the industry owing to its great interpretability and stability, whereas the fully end-to-end paradigm has demonstrated considerable simplicity and learnability along with the rise of deep learning. As a way of marrying the advantages of both approaches, learning a semantically meaningful representation and then using it in the downstream driving policy learning tasks provides a viable and attractive solution. However, several key challenges remain to be addressed, including identifying the most effective representation, alleviating the sim-to-real generalization issue as well as balancing model training cost. In this study, we propose a versatile and efficient reinforcement learning approach and build a fully functional autonomous vehicle for real-world validation. Our method shows great generalizability to various complicated real-world scenarios and superior training efficiency against the competing baselines. |
Guan Wang · Haoyi Niu · desheng zhu · Jianming HU · Xianyuan Zhan · Guyue Zhou 🔗 |
-
|
Semi-analytical Industrial Cooling System Model for Reinforcement Learning
(
Poster
)
SlidesLive Video » We present a hybrid industrial cooling system model that embeds analytical solutions within a multiphysics simulation. This model is designed for reinforcement learning (RL) applications and balances simplicity with simulation fidelity and interpretability. The model’s fidelity is evaluated against real-world data from a large-scale cooling system. This is followed by a case study illustrating how the model can be used for RL research. For this, we develop an industrial task suite that allows specifying different problem settings and levels of complexity, and use it to evaluate the performance of different RL algorithms. |
Yuri Chervonyi · Praneet Dutta 🔗 |
-
|
Semi-analytical Industrial Cooling System Model for Reinforcement Learning
(
Spotlight
)
We present a hybrid industrial cooling system model that embeds analytical solutions within a multiphysics simulation. This model is designed for reinforcement learning (RL) applications and balances simplicity with simulation fidelity and interpretability. The model’s fidelity is evaluated against real-world data from a large-scale cooling system. This is followed by a case study illustrating how the model can be used for RL research. For this, we develop an industrial task suite that allows specifying different problem settings and levels of complexity, and use it to evaluate the performance of different RL algorithms. |
Yuri Chervonyi · Praneet Dutta 🔗 |
-
|
Structured Q-learning For Antibody Design
(
Poster
)
SlidesLive Video »
Optimizing combinatorial structures is core to many real-world problems, such as those encountered in life sciences. For example, one of the crucial steps involved in antibody design is to find an arrangement of amino acids in a protein sequence that improves its binding with a pathogen. Combinatorial optimization of antibodies is difficult due to extremely large search spaces and non-linear objectives. Even for modest antibody design problems, where proteins have a sequence length of eleven, we are faced with searching over $2.05 \times 10^{14}$ structures. Applying traditional Reinforcement Learning algorithms such as Q-learning to combinatorial optimization results in poor performance. We propose Structured Q-learning (SQL), an extension of Q-learning that incorporates structural priors for combinatorial optimization. Using a molecular docking simulator, we demonstrate that SQL finds high binding energy sequences and performs favourably against baselines on eight challenging antibody design tasks, including designing antibodies for SARS-COV.
|
Alexander Cowen-Rivers · Philip John Gorinski · aivar sootla · Asif Khan · Jun WANG · Jan Peters · Haitham Bou Ammar 🔗 |
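The search-space size quoted in the abstract above follows from the standard alphabet of 20 natural amino acids over 11 sequence positions:

\[
20^{11} \;=\; 2^{11} \times 10^{11} \;=\; 2048 \times 10^{11} \;\approx\; 2.05 \times 10^{14},
\]

matching the figure of $2.05 \times 10^{14}$ structures given in the abstract.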
-
|
Structured Q-learning For Antibody Design
(
Spotlight
)
Optimizing combinatorial structures is core to many real-world problems, such as those encountered in life sciences. For example, one of the crucial steps involved in antibody design is to find an arrangement of amino acids in a protein sequence that improves its binding with a pathogen. Combinatorial optimization of antibodies is difficult due to extremely large search spaces and non-linear objectives. Even for modest antibody design problems, where proteins have a sequence length of eleven, we are faced with searching over $2.05 \times 10^{14}$ structures. Applying traditional Reinforcement Learning algorithms such as Q-learning to combinatorial optimization results in poor performance. We propose Structured Q-learning (SQL), an extension of Q-learning that incorporates structural priors for combinatorial optimization. Using a molecular docking simulator, we demonstrate that SQL finds high binding energy sequences and performs favourably against baselines on eight challenging antibody design tasks, including designing antibodies for SARS-COV.
|
Alexander Cowen-Rivers · Philip John Gorinski · aivar sootla · Asif Khan · Jun WANG · Jan Peters · Haitham Bou Ammar 🔗 |
-
|
Hierarchical Reinforcement Learning for Furniture Layout in Virtual Indoor Scenes
(
Poster
)
SlidesLive Video » In real life, the decoration of 3D indoor scenes through designing furniture layout provides a rich experience for people. In this paper, we explore the furniture layout task as a Markov decision process (MDP) in virtual reality, which is solved by hierarchical reinforcement learning (HRL). The goal is to produce a proper two-furniture layout in the virtual reality of the indoor scenes. In particular, we first design a simulation environment and introduce the HRL formulation for a two-furniture layout. We then apply a hierarchical actor-critic algorithm with curriculum learning to solve the MDP. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts as compared with the state-of-the-art models. |
Xinhan Di · Pengqian Yu 🔗 |
-
|
Hierarchical Reinforcement Learning for Furniture Layout in Virtual Indoor Scenes
(
Spotlight
)
In real life, the decoration of 3D indoor scenes through designing furniture layout provides a rich experience for people. In this paper, we explore the furniture layout task as a Markov decision process (MDP) in virtual reality, which is solved by hierarchical reinforcement learning (HRL). The goal is to produce a proper two-furniture layout in the virtual reality of the indoor scenes. In particular, we first design a simulation environment and introduce the HRL formulation for a two-furniture layout. We then apply a hierarchical actor-critic algorithm with curriculum learning to solve the MDP. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts as compared with the state-of-the-art models. |
Xinhan Di · Pengqian Yu 🔗 |
-
|
Learning an Adaptive Forwarding Strategy for Mobile Wireless Networks: Resource Usage vs. Latency
(
Poster
)
SlidesLive Video » Mobile wireless networks present several challenges for any learning system, due to uncertain and variable device movement, a decentralized network architecture, and constraints on network resources. In this work, we use deep reinforcement learning (DRL) to learn a scalable and generalizable forwarding strategy for such networks. We make the following contributions: i) we use hierarchical RL to design DRL packet agents rather than device agents, to capture the packet forwarding decisions that are made over time and improve training efficiency; ii) we use relational features to ensure generalizability of the learned forwarding strategy to a wide range of network dynamics and enable offline training; and iii) we incorporate both the packet forwarding goals and the network's resource considerations into packet decision-making by designing a weighted DRL reward function. Our results show that our DRL agent often achieves a similar delay per packet delivered as the optimal forwarding strategy and outperforms all other strategies, including state-of-the-art strategies, even on scenarios on which the DRL agent was not trained. |
Victoria Manfredi · Alicia Wolfe · Xiaolan Zhang · Bing Wang 🔗 |
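The weighted DRL reward function in contribution (iii) above can be read schematically as

\[
r_t \;=\; \alpha\, r_t^{\text{delivery}} \;-\; (1-\alpha)\, c_t^{\text{resource}}, \qquad \alpha \in [0,1],
\]

where $\alpha$ trades off packet-delivery progress (latency) against network resource usage; the exact terms used in the paper may differ.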
-
|
Learning an Adaptive Forwarding Strategy for Mobile Wireless Networks: Resource Usage vs. Latency
(
Spotlight
)
SlidesLive Video » Mobile wireless networks present several challenges for any learning system, due to uncertain and variable device movement, a decentralized network architecture, and constraints on network resources. In this work, we use deep reinforcement learning (DRL) to learn a scalable and generalizable forwarding strategy for such networks. We make the following contributions: i) we use hierarchical RL to design DRL packet agents rather than device agents, to capture the packet forwarding decisions that are made over time and improve training efficiency; ii) we use relational features to ensure generalizability of the learned forwarding strategy to a wide range of network dynamics and enable offline training; and iii) we incorporate both the packet forwarding goals and the network's resource considerations into packet decision-making by designing a weighted DRL reward function. Our results show that our DRL agent often achieves a similar delay per packet delivered as the optimal forwarding strategy and outperforms all other strategies, including state-of-the-art strategies, even on scenarios on which the DRL agent was not trained. |
Victoria Manfredi · Alicia Wolfe · Xiaolan Zhang · Bing Wang 🔗 |
-
|
Safe Reinforcement Learning for Automatic Insulin Delivery in Type I Diabetes
(
Poster
)
SlidesLive Video » Despite promising performance, reinforcement learning (RL) is only rarely applied when a high level of risk is involved. Glycemia control in type I diabetes is one such example: a variety of RL agents have been shown to accurately regulate insulin delivery, and yet no real-life application can be seen. For such applications, managing risk is the key. In this paper, we use the evolution strategies algorithm to train a policy network for glycemia control: it achieves state-of-the-art results and recovers, without any a priori knowledge, the basics of insulin therapy and blood sugar management. We propose a way to equip the policy network with an epistemic uncertainty measure which requires no further model training. We illustrate how this epistemic uncertainty estimate can be used to improve the safety of the device, paving the way for real-life clinical trials. |
Maxime Louis · Hector Romero Ugalde · Pierre Gauthier · Alice Adenis · Yousra Tourki · Erik Huneker 🔗 |
-
|
Safe Reinforcement Learning for Automatic Insulin Delivery in Type I Diabetes
(
Spotlight
)
Despite promising performance, reinforcement learning (RL) is only rarely applied when a high level of risk is involved. Glycemia control in type I diabetes is one such example: a variety of RL agents have been shown to accurately regulate insulin delivery, and yet no real-life application can be seen. For such applications, managing risk is the key. In this paper, we use the evolution strategies algorithm to train a policy network for glycemia control: it achieves state-of-the-art results and recovers, without any a priori knowledge, the basics of insulin therapy and blood sugar management. We propose a way to equip the policy network with an epistemic uncertainty measure which requires no further model training. We illustrate how this epistemic uncertainty estimate can be used to improve the safety of the device, paving the way for real-life clinical trials. |
Maxime Louis · Hector Romero Ugalde · Pierre Gauthier · Alice Adenis · Yousra Tourki · Erik Huneker 🔗 |
-
|
Power Grid Congestion Management via Topology Optimization with AlphaZero
(
Poster
)
The energy sector is facing rapid changes in the transition towards clean renewable sources. However, the growing share of volatile, fluctuating renewable generation such as wind or solar energy has already led to an increase in power grid congestion and network security concerns. Grid operators mitigate these by modifying either generation or demand (redispatching, curtailment, flexible loads). Unfortunately, redispatching of fossil generators leads to excessive grid operation costs and higher emissions, which is in direct opposition to the decarbonization of the energy sector. In this paper, we propose an AlphaZero-based grid topology optimization agent as a non-costly, carbon-free congestion management alternative. Our experimental evaluation confirms the potential of topology optimization for power grid operation, achieves a reduction of the average amount of required redispatching by 60% and shows the interoperability with traditional congestion management methods. Based on our findings, we identify and discuss open research problems as well as technical challenges for a productive system on a real power grid. |
Matthias Dorfer · Anton R. Fuxjaeger · Kristián Kozák · Patrick Blies · Marcel Wasserer 🔗 |
-
|
Power Grid Congestion Management via Topology Optimization with AlphaZero
(
Spotlight
)
SlidesLive Video » The energy sector is facing rapid changes in the transition towards clean renewable sources. However, the growing share of volatile, fluctuating renewable generation such as wind or solar energy has already led to an increase in power grid congestion and network security concerns. Grid operators mitigate these by modifying either generation or demand (redispatching, curtailment, flexible loads). Unfortunately, redispatching of fossil generators leads to excessive grid operation costs and higher emissions, which is in direct opposition to the decarbonization of the energy sector. In this paper, we propose an AlphaZero-based grid topology optimization agent as a non-costly, carbon-free congestion management alternative. Our experimental evaluation confirms the potential of topology optimization for power grid operation, achieves a reduction of the average amount of required redispatching by 60% and shows the interoperability with traditional congestion management methods. Based on our findings, we identify and discuss open research problems as well as technical challenges for a productive system on a real power grid. |
Matthias Dorfer · Anton R. Fuxjaeger · Kristián Kozák · Patrick Blies · Marcel Wasserer 🔗 |
-
|
Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management
(
Poster
)
SlidesLive Video » In this paper, we consider the inventory management (IM) problem where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on the shared resources (such as the inventory capacity) couples the otherwise independent control for each SKU. We formulate the problem with this structure as Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure compared with standard MARL algorithms. |
Yuandong Ding · Mingxiao Feng · Guozi Liu · Wei Jiang · Chuheng Zhang · Li Zhao · Lei Song · Houqiang Li · Yan Jin · Jiang Bian 🔗 |
-
|
Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management
(
Spotlight
)
SlidesLive Video » In this paper, we consider the inventory management (IM) problem where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on the shared resources (such as the inventory capacity) couples the otherwise independent control for each SKU. We formulate the problem with this structure as Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure compared with standard MARL algorithms. |
Yuandong Ding · Mingxiao Feng · Guozi Liu · Wei Jiang · Chuheng Zhang · Li Zhao · Lei Song · Houqiang Li · Yan Jin · Jiang Bian 🔗 |
-
|
LibSignal: An Open Library for Traffic Signal Control
(
Poster
)
SlidesLive Video » This paper introduces a library for cross-simulator comparison of reinforcement learning models in traffic signal control tasks. This library is developed to implement recent state-of-the-art reinforcement learning models with extensible interfaces and unified cross-simulator evaluation metrics. It supports commonly-used simulators in traffic signal control tasks, including Simulation of Urban Mobility (SUMO) and CityFlow, and multiple benchmark datasets for fair comparisons. We conducted experiments to validate our implementation of the models and to calibrate the simulators so that the experiments from one simulator could be referential to the other. Based on the validated models and calibrated environments, this paper compares and reports the performance of current state-of-the-art RL algorithms across different datasets and simulators. This is the first time that these methods have been compared fairly under the same datasets with different simulators. |
Hao Mei · Xiaoliang Lei · Longchao Da · Bin Shi · Hua Wei 🔗 |
-
|
LibSignal: An Open Library for Traffic Signal Control
(
Spotlight
)
This paper introduces a library for cross-simulator comparison of reinforcement learning models in traffic signal control tasks. This library is developed to implement recent state-of-the-art reinforcement learning models with extensible interfaces and unified cross-simulator evaluation metrics. It supports commonly-used simulators in traffic signal control tasks, including Simulation of Urban Mobility (SUMO) and CityFlow, and multiple benchmark datasets for fair comparisons. We conducted experiments to validate our implementation of the models and to calibrate the simulators so that the experiments from one simulator could be referential to the other. Based on the validated models and calibrated environments, this paper compares and reports the performance of current state-of-the-art RL algorithms across different datasets and simulators. This is the first time that these methods have been compared fairly under the same datasets with different simulators. |
Hao Mei · Xiaoliang Lei · Longchao Da · Bin Shi · Hua Wei 🔗 |
-
|
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs
(
Poster
)
SlidesLive Video »
Cloud datacenters are exponentially growing both in numbers and size. This increase results in a network activity surge that warrants better congestion avoidance. The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter; but, at the same time (ii) run on low-level hardware with the required low latency of effective Congestion Control (CC). In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the desired $\mu$sec decision latency required for real-time inference. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and drops.
|
Benjamin Fuhrer · Yuval Shpigelman · Chen Tessler · Shie Mannor · Gal Chechik · Eitan Zahavi · Gal Dalal 🔗 |
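A generic sketch of the policy-distillation step described in the abstract above — fitting a small decision tree to imitate a trained policy so that inference reduces to a handful of threshold comparisons — is given below. The teacher policy and state sampler are hypothetical stand-ins, not NVIDIA's implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def distill_policy(teacher, sample_states, n_samples=100_000, max_depth=10):
        X = sample_states(n_samples)                  # states drawn from the deployment distribution
        y = np.array([teacher(x) for x in X])         # teacher's (discretized) actions
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
        print("agreement with teacher on training states:", tree.score(X, y))
        return tree

    # Toy usage with an invented rule-based teacher; in practice the teacher would be
    # the trained RL policy and states would come from logged congestion-control traces.
    rng = np.random.default_rng(0)
    teacher = lambda x: int(x[0] > 0.5)
    tree = distill_policy(teacher, lambda n: rng.random((n, 8)), n_samples=10_000)

A fitted tree of modest depth evaluates in a few comparisons per decision, which is what makes microsecond-scale inference on a NIC plausible where a neural network forward pass would be too slow.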
-
|
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs
(
Spotlight
)
Cloud datacenters are exponentially growing both in numbers and size. This increase results in a network activity surge that warrants better congestion avoidance. The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter; but, at the same time (ii) run on low-level hardware with the required low latency of effective Congestion Control (CC). In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the desired $\mu$sec decision latency required for real-time inference. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and drops.
|
Benjamin Fuhrer · Yuval Shpigelman · Chen Tessler · Shie Mannor · Gal Chechik · Eitan Zahavi · Gal Dalal 🔗 |
-
|
Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization
(
Poster
)
Online influence maximization aims to maximize the influence spread of a piece of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem, where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic and real-world networks demonstrate the efficiency of our algorithm.
|
Kaixuan Huang · Yu Wu · Xuezhou Zhang · Shenyinying Tu · Qingyun Wu · Mengdi Wang · Huazheng Wang 🔗 |
-
|
Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization
(
Spotlight
)
SlidesLive Video »
Online influence maximization aims to maximize the influence spread of a piece of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem, where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic and real-world networks demonstrate the efficiency of our algorithm.
|
Kaixuan Huang · Yu Wu · Xuezhou Zhang · Shenyinying Tu · Qingyun Wu · Mengdi Wang · Huazheng Wang 🔗 |
-
|
Pareto-Optimal Diagnostic Policy Learning in Clinical Applications via Semi-Model-Based Deep Reinforcement Learning
(
Poster
)
SlidesLive Video » Dynamic diagnosis is desirable when medical tests are costly or time-consuming. In this work, we use reinforcement learning (RL) to find a dynamic policy that selects lab test panels sequentially based on previous observations, ensuring accurate testing at a low cost. Clinical diagnostic data are often highly imbalanced; therefore, we aim to maximize the F1 score directly instead of the error rate. However, the F1 score cannot be written as a cumulative sum of rewards, which invalidates standard RL methods. To remedy this issue, we develop a reward-shaping approach, leveraging properties of the F1 score and duality of policy optimization, to provably find the set of all Pareto-optimal policies for budget-constrained F1 score maximization. To handle the combinatorially complex state space, we propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) framework that is compatible with end-to-end training and online learning. SM-DDPO is tested on clinical tasks: ferritin prediction, sepsis prevention, and acute kidney injury diagnosis. Experiments with real-world data validate that SM-DDPO trains efficiently and identifies all Pareto-front solutions. Across all three tasks, SM-DDPO is able to achieve state-of-the-art diagnosis accuracy (in some cases higher than conventional methods) with up to 85% reduction in testing cost. |
zheng Yu · Yikuan Li · Joseph Kim · Kaixuan Huang · Yuan Luo · Mengdi Wang 🔗 |
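To see why the abstract above says the F1 score cannot be written as a cumulative sum of rewards, recall its definition in terms of true positives, false positives, and false negatives:

\[
F_1 \;=\; \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} .
\]

It is a ratio of counts aggregated over the whole evaluation set, so it does not decompose as a per-step sum $\sum_t r_t$; this non-additivity is what the reward-shaping construction works around.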
-
|
Pareto-Optimal Diagnostic Policy Learning in Clinical Applications via Semi-Model-Based Deep Reinforcement Learning
(
Spotlight
)
Dynamic diagnosis is desirable when medical tests are costly or time-consuming. In this work, we use reinforcement learning (RL) to find a dynamic policy that selects lab test panels sequentially based on previous observations, ensuring accurate testing at a low cost. Clinical diagnostic data are often highly imbalanced; therefore, we aim to maximize the F1 score directly instead of the error rate. However, the F1 score cannot be written as a cumulative sum of rewards, which invalidates standard RL methods. To remedy this issue, we develop a reward-shaping approach, leveraging properties of the F1 score and duality of policy optimization, to provably find the set of all Pareto-optimal policies for budget-constrained F1 score maximization. To handle the combinatorially complex state space, we propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) framework that is compatible with end-to-end training and online learning. SM-DDPO is tested on clinical tasks: ferritin prediction, sepsis prevention, and acute kidney injury diagnosis. Experiments with real-world data validate that SM-DDPO trains efficiently and identifies all Pareto-front solutions. Across all three tasks, SM-DDPO is able to achieve state-of-the-art diagnosis accuracy (in some cases higher than conventional methods) with up to 85% reduction in testing cost. |
zheng Yu · Yikuan Li · Joseph Kim · Kaixuan Huang · Yuan Luo · Mengdi Wang 🔗 |
-
|
tinyMAN: Lightweight Energy Manager using Reinforcement Learning for Energy Harvesting Wearable IoT Devices
(
Poster
)
SlidesLive Video » Advances in low-power electronics and machine learning techniques lead to many novel wearable IoT devices. These devices have limited battery capacity and computational power. Thus, energy harvesting from ambient sources is a promising solution to power these low-energy wearable devices. They need to manage the harvested energy optimally to achieve energy-neutral operation, which eliminates recharging requirements. Optimal energy management is a challenging task due to the dynamic nature of the harvested energy and the battery energy constraints of the target device. To address this challenge, we present a reinforcement learning based energy management framework, tinyMAN, for resource-constrained wearable IoT devices. The framework maximizes the utilization of the target device under dynamic energy harvesting patterns and battery constraints. Moreover, tinyMAN does not rely on forecasts of the harvested energy which makes it a prediction-free approach. We deployed tinyMAN on a wearable device prototype using TensorFlow Lite for Micro thanks to its small memory footprint of less than 100 KB. Our evaluations show that tinyMAN achieves less than 2.36 ms and 27.75 uJ while maintaining up to 45% higher utility compared to prior approaches. |
Toygun Basaklar · Yigit Tuncel · Umit Ogras 🔗 |
-
|
tinyMAN: Lightweight Energy Manager using Reinforcement Learning for Energy Harvesting Wearable IoT Devices
(
Spotlight
)
Advances in low-power electronics and machine learning techniques lead to many novel wearable IoT devices. These devices have limited battery capacity and computational power. Thus, energy harvesting from ambient sources is a promising solution to power these low-energy wearable devices. They need to manage the harvested energy optimally to achieve energy-neutral operation, which eliminates recharging requirements. Optimal energy management is a challenging task due to the dynamic nature of the harvested energy and the battery energy constraints of the target device. To address this challenge, we present a reinforcement learning based energy management framework, tinyMAN, for resource-constrained wearable IoT devices. The framework maximizes the utilization of the target device under dynamic energy harvesting patterns and battery constraints. Moreover, tinyMAN does not rely on forecasts of the harvested energy which makes it a prediction-free approach. We deployed tinyMAN on a wearable device prototype using TensorFlow Lite for Micro thanks to its small memory footprint of less than 100 KB. Our evaluations show that tinyMAN achieves less than 2.36 ms and 27.75 uJ while maintaining up to 45% higher utility compared to prior approaches. |
Toygun Basaklar · Yigit Tuncel · Umit Ogras 🔗 |
-
|
Optimizing Audio Recommendations for the Long-Term
(
Poster
)
SlidesLive Video » We study the problem of optimizing recommender systems for outcomes that realize over several weeks or months. Successfully addressing this problem requires overcoming difficult statistical and organizational challenges. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationship with a recommender system. We then identify a few key assumptions that lead to simple, testable recommender system prototypes that explicitly optimize for the long-term. We apply our approach to a podcast recommender system at a large online audio streaming service, and we demonstrate that purposefully optimizing for long-term outcomes can lead to substantial performance gains over approaches optimizing for short-term proxies. |
Lucas Maystre · Daniel Russo · Yu Zhao 🔗 |
-
|
Optimizing Audio Recommendations for the Long-Term
(
Spotlight
)
We study the problem of optimizing recommender systems for outcomes that realize over several weeks or months. Successfully addressing this problem requires overcoming difficult statistical and organizational challenges. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationship with a recommender system. We then identify a few key assumptions that lead to simple, testable recommender system prototypes that explicitly optimize for the long-term. We apply our approach to a podcast recommender system at a large online audio streaming service, and we demonstrate that purposefully optimizing for long-term outcomes can lead to substantial performance gains over approaches optimizing for short-term proxies. |
Lucas Maystre · Daniel Russo · Yu Zhao 🔗 |
-
|
Controlling Commercial Cooling Systems Using Reinforcement Learning
(
Poster
)
This paper is a technical overview of our recent work on reinforcement learning for controlling commercial cooling systems. Building on previous work on cooling data centers more efficiently, we recently conducted two live experiments in partnership with a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites. |
Jerry Luo · Cosmin Paduraru · Octavian Voicu · Yuri Chervonyi · Scott Munns · Jerry Li · Crystal Qian · Praneet Dutta · Daniel Mankowitz · Jared Quincy Davis · Ningjia Wu · Xingwei Yang · Chu-Ming Chang · Ted Li · Rob Rose · Mingyan Fan · Hootan Nakhost · Tinglin Liu · Deeni Fatiha · Neil Satra · Juliet Rothenberg · Molly Carlin · Satish Tallapaka · Sims Witherspoon · David Parish · Peter Dolan · Chenyu Zhao
|
-
|
Controlling Commercial Cooling Systems Using Reinforcement Learning
(
Spotlight
)
SlidesLive Video » This paper is a technical overview of our recent work on reinforcement learning for controlling commercial cooling systems. Building on previous work on cooling data centers more efficiently, we recently conducted two live experiments in partnership with a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites. |
Jerry Luo · Cosmin Paduraru · Octavian Voicu · Yuri Chervonyi · Scott Munns · Jerry Li · Crystal Qian · Praneet Dutta · Daniel Mankowitz · Jared Quincy Davis · Ningjia Wu · Xingwei Yang · Chu-Ming Chang · Ted Li · Rob Rose · Mingyan Fan · Hootan Nakhost · Tinglin Liu · Deeni Fatiha · Neil Satra · Juliet Rothenberg · Molly Carlin · Satish Tallapaka · Sims Witherspoon · David Parish · Peter Dolan · Chenyu Zhao
|
-
|
Multi-Agent Reinforcement Learning for Fast-Timescale Demand Response
(
Poster
)
SlidesLive Video » To integrate high amounts of renewable energy resources, power grids must be able to cope with high amplitude, fast timescale variations in power generation. Frequency regulation through demand response has the potential to coordinate temporally flexible loads, such as air conditioners, to counteract these variations. Existing approaches for discrete control with dynamic constraints struggle to provide satisfactory performance for fast timescale action selection with hundreds of agents. We propose a decentralized agent trained by multi-agent proximal policy optimization with localized communication. We show that the resulting policy leads to good and robust performance for frequency regulation and scales seamlessly to arbitrary numbers of houses for constant processing times, where classical methods fail. |
Vincent Mai · Philippe Maisonneuve · Tianyu Zhang · Jorge Montalvo Arvizu · Liam Paull · Antoine Lesage-Landry 🔗 |
-
|
Multi-Agent Reinforcement Learning for Fast-Timescale Demand Response
(
Spotlight
)
To integrate high amounts of renewable energy resources, power grids must be able to cope with high amplitude, fast timescale variations in power generation. Frequency regulation through demand response has the potential to coordinate temporally flexible loads, such as air conditioners, to counteract these variations. Existing approaches for discrete control with dynamic constraints struggle to provide satisfactory performance for fast timescale action selection with hundreds of agents. We propose a decentralized agent trained by multi-agent proximal policy optimization with localized communication. We show that the resulting policy leads to good and robust performance for frequency regulation and scales seamlessly to arbitrary numbers of houses for constant processing times, where classical methods fail. |
Vincent Mai · Philippe Maisonneuve · Tianyu Zhang · Jorge Montalvo Arvizu · Liam Paull · Antoine Lesage-Landry 🔗 |
-
|
Identifying Disparities in Sepsis Treatment by Learning the Expert Policy
(
Poster
)
SlidesLive Video » Sepsis is a life-threatening condition defined by end-organ dysfunction due to a dysregulated host response to infection. Sepsis has been the focus of intense research in the field of machine learning with the primary aim being the ability to predict the onset of disease and to identify the optimal treatment policies for this complex condition. Here, we apply a number of reinforcement learning techniques including behavioral cloning, imitation learning, and inverse reinforcement learning, to learn the optimal policy in the management of septic patients using expert demonstrations. Then we estimate the counterfactual optimal policies by applying the model to another subset of unseen medical populations and identify the difference in cure by comparing it to the real policy. Our data comes from the sepsis cohort of MIMIC-IV and the clinical data warehouses of the Mass General Brigham healthcare system. The ultimate objective of this work is to use the optimal reward function to estimate the counterfactual treatment policy and identify deviations across sub-populations of interest. We hope this approach would help us identify any disparities in care and also changes in cure in response to the publication of national sepsis treatment guidelines. |
Hyewon Jeong · Siddharth Nayak · Taylor Killian · Sanjat Kanjilal · Marzyeh Ghassemi 🔗 |
-
|
Identifying Disparities in Sepsis Treatment by Learning the Expert Policy
(
Spotlight
)
Sepsis is a life-threatening condition defined by end-organ dysfunction due to a dysregulated host response to infection. Sepsis has been the focus of intense research in the field of machine learning with the primary aim being the ability to predict the onset of disease and to identify the optimal treatment policies for this complex condition. Here, we apply a number of reinforcement learning techniques including behavioral cloning, imitation learning, and inverse reinforcement learning, to learn the optimal policy in the management of septic patients using expert demonstrations. Then we estimate the counterfactual optimal policies by applying the model to another subset of unseen medical populations and identify the difference in cure by comparing it to the real policy. Our data comes from the sepsis cohort of MIMIC-IV and the clinical data warehouses of the Mass General Brigham healthcare system. The ultimate objective of this work is to use the optimal reward function to estimate the counterfactual treatment policy and identify deviations across sub-populations of interest. We hope this approach would help us identify any disparities in care and also changes in cure in response to the publication of national sepsis treatment guidelines. |
Hyewon Jeong · Siddharth Nayak · Taylor Killian · Sanjat Kanjilal · Marzyeh Ghassemi 🔗 |
-
|
Bandits for Online Calibration: An Application to Content Moderation on Social Media Platforms
(
Poster
)
We describe the current content moderation strategy employed by Meta to remove policy-violating content from its platforms. Meta relies on both handcrafted and learned risk models to flag potentially violating content for human review. Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models. A key challenge is that violation trends change over time, affecting which risk models are most reliable. Our system additionally handles production challenges such as changing risk models and novel risk models. We use a contextual bandit to update the calibration in response to such trends. Our approach increases Meta's top-line metric for measuring the effectiveness of its content moderation strategy by 13%. |
Vashist Avadhanula · Omar Abdul Baki · Hamsa Bastani · Osbert Bastani · Caner Gocmen · Daniel Haimovich · Darren Hwang · Dmytro Karamshuk · Thomas Leeper · Jiayuan Ma · Gregory macnamara · Jake Mullet · Christopher Palow · Sung Park · Varun S Rajagopal · Kevin Schaeffer · Parikshit Shah · Deeksha Sinha · Nicolas Stier-Moses · Ben Xu
|
-
|
Bandits for Online Calibration: An Application to Content Moderation on Social Media Platforms
(
Spotlight
)
SlidesLive Video » We describe the current content moderation strategy employed by Meta to remove policy-violating content from its platforms. Meta relies on both handcrafted and learned risk models to flag potentially violating content for human review. Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models. A key challenge is that violation trends change over time, affecting which risk models are most reliable. Our system additionally handles production challenges such as changing risk models and novel risk models. We use a contextual bandit to update the calibration in response to such trends. Our approach increases Meta's top-line metric for measuring the effectiveness of its content moderation strategy by 13%. |
Vashist Avadhanula · Omar Abdul Baki · Hamsa Bastani · Osbert Bastani · Caner Gocmen · Daniel Haimovich · Darren Hwang · Dmytro Karamshuk · Thomas Leeper · Jiayuan Ma · Gregory macnamara · Jake Mullet · Christopher Palow · Sung Park · Varun S Rajagopal · Kevin Schaeffer · Parikshit Shah · Deeksha Sinha · Nicolas Stier-Moses · Ben Xu
|
-
|
Beyond CAGE: Investigating Generalization of Learned Autonomous Network Defense Policies
(
Poster
)
Advancements in reinforcement learning (RL) have inspired new directions in intelligent automation of network defense. However, many of these advancements have either outpaced their application to network security or have not considered the challenges associated with implementing them in the real world. To understand these problems, this work evaluates several RL approaches implemented in the CAGE Challenge 2, a public competition to build an autonomous network defender agent in a high-fidelity network simulator. Our approaches all build on the Proximal Policy Optimization (PPO) family of algorithms, and include hierarchical RL, action masking, custom training, and ensemble RL. We find that the ensemble RL technique performs strongest, outperforming our other models and taking second place in the competition. To understand applicability to real environments we evaluate each method's ability to generalize to unseen networks and against an unknown attack strategy. In unseen environments, all of our approaches perform worse, with degradation varying with the type of environmental change. Against an unknown attacker strategy, we found that our models had reduced overall performance even though the new strategy was in fact less efficient than the ones our models trained on. Taken together, these results highlight promising research directions towards autonomous network defense in the real world. |
Melody Wolk · Andy Applebaum · Camron Dennler · Patrick Dwyer · Marina Moskowitz · Harold Nguyen · Nicole Nichols · Nicole Park · Paul Rachwalski · Frank Rau · Adrian Webster
|
-
|
Beyond CAGE: Investigating Generalization of Learned Autonomous Network Defense Policies
(
Spotlight
)
SlidesLive Video » Advancements in reinforcement learning (RL) have inspired new directions in intelligent automation of network defense. However, many of these advancements have either outpaced their application to network security or have not considered the challenges associated with implementing them in the real world. To understand these problems, this work evaluates several RL approaches implemented in the CAGE Challenge 2, a public competition to build an autonomous network defender agent in a high-fidelity network simulator. Our approaches all build on the Proximal Policy Optimization (PPO) family of algorithms, and include hierarchical RL, action masking, custom training, and ensemble RL. We find that the ensemble RL technique performs strongest, outperforming our other models and taking second place in the competition. To understand applicability to real environments we evaluate each method's ability to generalize to unseen networks and against an unknown attack strategy. In unseen environments, all of our approaches perform worse, with degradation varying with the type of environmental change. Against an unknown attacker strategy, we found that our models had reduced overall performance even though the new strategy was in fact less efficient than the ones our models trained on. Taken together, these results highlight promising research directions towards autonomous network defense in the real world. |
Melody Wolk · Andy Applebaum · Camron Dennler · Patrick Dwyer · Marina Moskowitz · Harold Nguyen · Nicole Nichols · Nicole Park · Paul Rachwalski · Frank Rau · Adrian Webster
|
-
|
Optimizing Industrial HVAC Systems with Hierarchical Reinforcement Learning
(
Poster
)
SlidesLive Video » Reinforcement learning (RL) techniques have been developed to optimize industrial cooling systems, offering substantial energy savings compared to traditional heuristic policies. A major challenge in industrial control is learning behaviors that remain feasible in the real world given machinery constraints. For example, certain actions can only be executed every few hours, while other actions can be taken more frequently. Without extensive reward engineering and experimentation, an RL agent may not learn realistic operation of machinery. To address this, we use hierarchical reinforcement learning with multiple agents that control subsets of actions according to their operation time scales. Our hierarchical approach achieves energy savings over existing baselines while maintaining constraints, such as operating chillers within safe bounds, in a simulated HVAC control environment. |
William Wong · Praneet Dutta · Octavian Voicu · Yuri Chervonyi · Cosmin Paduraru · Jerry Luo 🔗 |
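The hierarchical time-scale idea above can be pictured as a control loop in which a "slow" agent may only change its action every few environment steps while a "fast" agent acts at every step. The `env`, `fast_agent`, and `slow_agent` interfaces below are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed interfaces): two agents control disjoint action subsets
# on different time scales, e.g. a slow agent for machinery-level actions that may
# only change every few hours and a fast agent for frequent setpoint adjustments.
def run_episode(env, fast_agent, slow_agent, slow_period=12):
    obs = env.reset()
    slow_action = slow_agent.act(obs)            # held fixed between slow decisions
    done, t = False, 0
    while not done:
        if t % slow_period == 0:
            slow_action = slow_agent.act(obs)    # infrequent, constrained action
        fast_action = fast_agent.act(obs)        # frequent, fine-grained action
        obs, reward, done, info = env.step({"slow": slow_action, "fast": fast_action})
        t += 1
```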
-
|
Optimizing Industrial HVAC Systems with Hierarchical Reinforcement Learning
(
Spotlight
)
Reinforcement learning (RL) techniques have been developed to optimize industrial cooling systems, offering substantial energy savings compared to traditional heuristic policies. A major challenge in industrial control is learning behaviors that remain feasible in the real world given machinery constraints. For example, certain actions can only be executed every few hours, while other actions can be taken more frequently. Without extensive reward engineering and experimentation, an RL agent may not learn realistic operation of machinery. To address this, we use hierarchical reinforcement learning with multiple agents that control subsets of actions according to their operation time scales. Our hierarchical approach achieves energy savings over existing baselines while maintaining constraints, such as operating chillers within safe bounds, in a simulated HVAC control environment. |
William Wong · Praneet Dutta · Octavian Voicu · Yuri Chervonyi · Cosmin Paduraru · Jerry Luo 🔗 |
-
|
Reinforcement Learning Approaches for Traffic Signal Control under Missing Data
(
Poster
)
SlidesLive Video » Traffic signal control is critical for improving transportation efficiency and alleviating traffic congestion. In recent years, deep reinforcement learning (RL) methods have achieved better performance than conventional rule-based approaches in traffic signal control tasks. Most RL approaches require observation of the environment for the agent to decide which action is optimal for the long-term reward. However, in real-world urban scenarios, observations of traffic states may frequently be missing due to a lack of sensors, which makes existing RL methods inapplicable to road networks with missing observations. In this work, we aim to control traffic signals under the real-world setting in which some of the intersections in the road network are not equipped with sensors and thus have no direct observations around them. Specifically, we propose and investigate two types of approaches: the first imputes the traffic states to enable adaptive control, while the second imputes both states and rewards to enable not only adaptive control but also the training of RL agents. Through extensive experiments on both synthetic and real-world road network traffic, we reveal that imputation can help apply RL methods to intersections without observations, while the position of the unobserved intersections can largely influence the performance of the RL agents. |
Hao Mei · Junxian Li · Bin Shi · Hua Wei 🔗 |
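One simple way to picture the state-imputation approach above is to fill in an unobserved intersection's state from its observed neighbours before the RL agent acts. Neighbour averaging is an assumption made here purely for illustration; the paper studies imputation more generally, and its actual imputation models may differ.

```python
# Minimal sketch (illustrative assumption): impute an unobserved intersection's
# traffic state as the mean of its observed neighbours' states, then hand the
# completed state dictionary to the RL controller.
import numpy as np

def impute_states(states, observed, adjacency):
    """states: dict node_id -> feature vector (ignored if unobserved);
    observed: set of node_ids with sensors; adjacency: dict node_id -> neighbour ids."""
    imputed = dict(states)
    default = np.zeros_like(states[next(iter(observed))])  # fallback if no observed neighbour
    for node, neighbours in adjacency.items():
        if node in observed:
            continue
        obs_neigh = [states[n] for n in neighbours if n in observed]
        imputed[node] = np.mean(obs_neigh, axis=0) if obs_neigh else default
    return imputed
```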
-
|
Reinforcement Learning Approaches for Traffic Signal Control under Missing Data
(
Spotlight
)
SlidesLive Video » Traffic signal control is critical for improving transportation efficiency and alleviating traffic congestion. In recent years, deep reinforcement learning (RL) methods have achieved better performance than conventional rule-based approaches in traffic signal control tasks. Most RL approaches require observation of the environment for the agent to decide which action is optimal for the long-term reward. However, in real-world urban scenarios, observations of traffic states may frequently be missing due to a lack of sensors, which makes existing RL methods inapplicable to road networks with missing observations. In this work, we aim to control traffic signals under the real-world setting in which some of the intersections in the road network are not equipped with sensors and thus have no direct observations around them. Specifically, we propose and investigate two types of approaches: the first imputes the traffic states to enable adaptive control, while the second imputes both states and rewards to enable not only adaptive control but also the training of RL agents. Through extensive experiments on both synthetic and real-world road network traffic, we reveal that imputation can help apply RL methods to intersections without observations, while the position of the unobserved intersections can largely influence the performance of the RL agents. |
Hao Mei · Junxian Li · Bin Shi · Hua Wei 🔗 |
-
|
Reinforcement Learning-Based Air Traffic Deconfliction
(
Poster
)
Remain Well Clear, keeping the aircraft away from hazards by the appropriate separation distance, is an essential technology for the safe operation of uncrewed aerial vehicles in congested airspace. This work focuses on automating the horizontal separation of two aircraft and presents the obstacle avoidance problem as a 2D surrogate optimization task. By our design, the surrogate task is made more conservative to guarantee the execution of the solution in the primary domain. Using Reinforcement Learning (RL), we optimize the avoidance policy and model the dynamics, interactions, and decision-making. By recursively sampling the resulting policy and the surrogate transitions, the system translates the avoidance policy into a complete avoidance trajectory. Then, the solver publishes the trajectory as a set of waypoints for the airplane to follow using the Robot Operating System (ROS) interface. The proposed system generates a quick and achievable avoidance trajectory that satisfies the safety requirements. Evaluation of our system is completed in a high-fidelity simulation and a full-scale airplane demonstration. Moreover, the paper concludes by describing an enormous integration effort that has enabled a real-life demonstration of the RL-based system. |
Denis Osipychev · Dragos Margineantu 🔗 |
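The rollout described above, which recursively samples the policy and the surrogate transitions to build a trajectory, might look roughly like the sketch below. The `policy`, `surrogate_model`, and state attributes are assumed interfaces introduced only for illustration; publishing the resulting waypoints over ROS is omitted.

```python
# Minimal sketch (assumed interfaces, not the authors' system): roll out the learned
# avoidance policy on the conservative 2D surrogate model to turn the policy into a
# list of horizontal-plane waypoints for the autopilot to follow.
def policy_to_waypoints(policy, surrogate_model, start_state, horizon=50):
    state, waypoints = start_state, []
    for _ in range(horizon):
        action = policy.act(state)                    # avoidance manoeuvre in 2D
        state = surrogate_model.step(state, action)   # surrogate transition
        waypoints.append((state.x, state.y))          # record waypoint
        if surrogate_model.is_clear(state):           # conflict resolved
            break
    return waypoints
```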
-
|
Reinforcement Learning-Based Air Traffic Deconfliction
(
Spotlight
)
SlidesLive Video » Remain Well Clear, keeping the aircraft away from hazards by the appropriate separation distance, is an essential technology for the safe operation of uncrewed aerial vehicles in congested airspace. This work focuses on automating the horizontal separation of two aircraft and presents the obstacle avoidance problem as a 2D surrogate optimization task. By our design, the surrogate task is made more conservative to guarantee the execution of the solution in the primary domain. Using Reinforcement Learning (RL), we optimize the avoidance policy and model the dynamics, interactions, and decision-making. By recursively sampling the resulting policy and the surrogate transitions, the system translates the avoidance policy into a complete avoidance trajectory. Then, the solver publishes the trajectory as a set of waypoints for the airplane to follow using the Robot Operating System (ROS) interface. The proposed system generates a quick and achievable avoidance trajectory that satisfies the safety requirements. Evaluation of our system is completed in a high-fidelity simulation and a full-scale airplane demonstration. Moreover, the paper concludes by describing an enormous integration effort that has enabled a real-life demonstration of the RL-based system. |
Denis Osipychev · Dragos Margineantu 🔗 |
-
|
Automatic Evaluation of Excavator Operators using Learned Reward Functions
(
Poster
)
SlidesLive Video » Training novice users to operate an excavator and learn different skills requires the presence of expert teachers. Considering the complexity of the problem, it is comparatively expensive to find skilled experts, as the process is time-consuming and requires precise focus. Moreover, since humans tend to be biased, the evaluation process is noisy and leads to high variance in the final scores of operators with similar skills. In this work, we address these issues and propose a novel strategy for the automatic evaluation of excavator operators. We take into account the internal dynamics of the excavator and the safety criterion at every time step to evaluate performance. To further validate our approach, we use this score-prediction model as a source of reward for a reinforcement learning agent learning the task of maneuvering an excavator in a simulated environment that closely replicates real-world dynamics. For a policy learned using these external reward-prediction models, our results demonstrate safer solutions that follow the required dynamic constraints compared to a policy trained with goal-based reward functions only, making it one step closer to real-life adoption. |
Pranav Agarwal · Marek Teichmann · Sheldon Andrews · Samira Ebrahimi Kahou 🔗 |
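Using a learned score-prediction model as the RL reward, as described above, can be sketched as an environment wrapper that replaces the native reward with the model's prediction. The wrapper and the `reward_model.predict` interface below are hypothetical, shown only to make the idea concrete.

```python
# Minimal sketch (hypothetical wrapper, not the authors' code): drive a simulated
# excavator environment with a learned operator-quality score as the reward signal,
# in place of (or in addition to) a purely goal-based reward.
class LearnedRewardWrapper:
    def __init__(self, env, reward_model):
        self.env = env
        self.reward_model = reward_model   # predicts an operator-quality score

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        reward = float(self.reward_model.predict(obs, action))  # learned reward
        return obs, reward, done, info
```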
-
|
Automatic Evaluation of Excavator Operators using Learned Reward Functions
(
Spotlight
)
Training novice users to operate an excavator and learn different skills requires the presence of expert teachers. Considering the complexity of the problem, it is comparatively expensive to find skilled experts, as the process is time-consuming and requires precise focus. Moreover, since humans tend to be biased, the evaluation process is noisy and leads to high variance in the final scores of operators with similar skills. In this work, we address these issues and propose a novel strategy for the automatic evaluation of excavator operators. We take into account the internal dynamics of the excavator and the safety criterion at every time step to evaluate performance. To further validate our approach, we use this score-prediction model as a source of reward for a reinforcement learning agent learning the task of maneuvering an excavator in a simulated environment that closely replicates real-world dynamics. For a policy learned using these external reward-prediction models, our results demonstrate safer solutions that follow the required dynamic constraints compared to a policy trained with goal-based reward functions only, making it one step closer to real-life adoption. |
Pranav Agarwal · Marek Teichmann · Sheldon Andrews · Samira Ebrahimi Kahou 🔗 |
-
|
Function Approximations for Reinforcement Learning Controller for Wave Energy Converters
(
Poster
)
SlidesLive Video » Industrial Wave Energy Converters (WECs) have evolved into complex multi-generator designs, but a lack of effective control has limited their potential for higher energy-capture efficiency. A Multi-Agent Reinforcement Learning (MARL) controller can handle these complexities and support the multiple objectives of energy-capture efficiency, reduction of structural stress, and proactive protection against high waves. However, even with well-trained agent algorithms like Proximal Policy Optimization (PPO), MARL performance is limited. In this paper, we explore different function approximations for the policy and critic networks to model the sequential nature of the system dynamics, and find that they are key to better performance. We investigated fully connected neural networks (FCNs), LSTMs, and Transformer variants with varying depths. We propose a novel transformer architecture, Skip Transformer-XL (STrXL), with gated residual connections around the multi-head attention, multi-layer perceptron, and transformer block. Our results suggest that STrXL performed best, beating the state-of-the-art GTrXL with faster training convergence. STrXL boosts energy efficiency by an average of 25% to 28% over the existing spring damper (SD) controller across the entire wave spectrum for waves at different angles. Furthermore, unlike the default SD controller, the transformer controller almost eliminated the mechanical stress from the rotational yaw motion. |
Soumyendu Sarkar · Vineet Gundecha · Alexander Shmakov · Sahand Ghorbanpour · Ashwin Ramesh Babu · Alexandre Pichard · Mathieu Cocho 🔗 |
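A gated-residual transformer block in the spirit of GTrXL/STrXL, with GRU-style gating in place of plain residual additions around the attention and MLP sub-layers, could be sketched as below. The exact gating and additional skip connections of STrXL are defined in the paper; this PyTorch snippet is only an illustrative approximation.

```python
# Minimal sketch (approximation, not the paper's exact architecture): a transformer
# block whose residual connections around attention and MLP are replaced by GRU-style
# gates, in the spirit of GTrXL and the STrXL variant described above.
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate1 = nn.GRUCell(d_model, d_model)   # gated residual instead of x + f(x)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.gate2 = nn.GRUCell(d_model, d_model)

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, s, d = x.shape
        h, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))
        x = self.gate1(h.reshape(-1, d), x.reshape(-1, d)).reshape(b, s, d)
        h = self.mlp(self.norm2(x))
        x = self.gate2(h.reshape(-1, d), x.reshape(-1, d)).reshape(b, s, d)
        return x
```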
-
|
Function Approximations for Reinforcement Learning Controller for Wave Energy Converters
(
Spotlight
)
SlidesLive Video » Industrial Wave Energy Converters (WECs) have evolved into complex multi-generator designs, but a lack of effective control has limited their potential for higher energy-capture efficiency. A Multi-Agent Reinforcement Learning (MARL) controller can handle these complexities and support the multiple objectives of energy-capture efficiency, reduction of structural stress, and proactive protection against high waves. However, even with well-trained agent algorithms like Proximal Policy Optimization (PPO), MARL performance is limited. In this paper, we explore different function approximations for the policy and critic networks to model the sequential nature of the system dynamics, and find that they are key to better performance. We investigated fully connected neural networks (FCNs), LSTMs, and Transformer variants with varying depths. We propose a novel transformer architecture, Skip Transformer-XL (STrXL), with gated residual connections around the multi-head attention, multi-layer perceptron, and transformer block. Our results suggest that STrXL performed best, beating the state-of-the-art GTrXL with faster training convergence. STrXL boosts energy efficiency by an average of 25% to 28% over the existing spring damper (SD) controller across the entire wave spectrum for waves at different angles. Furthermore, unlike the default SD controller, the transformer controller almost eliminated the mechanical stress from the rotational yaw motion. |
Soumyendu Sarkar · Vineet Gundecha · Alexander Shmakov · Sahand Ghorbanpour · Ashwin Ramesh Babu · Alexandre Pichard · Mathieu Cocho 🔗 |
Author Information
Yuxi Li (attain.ai)
Emma Brunskill (Stanford University)
MINMIN CHEN (Google)
Omer Gottesman
Lihong Li (Amazon)
Yao Liu (Amazon)
Zhiwei Tony Qin (Columbia University)
Matthew Taylor (U. of Alberta)
More from the Same Authors
-
2021 : Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation »
Ramtin Keramati · Omer Gottesman · Leo Celi · Finale Doshi-Velez · Emma Brunskill -
2021 : Safe Evaluation For Offline Learning: Are We Ready To Deploy? »
Hager Radi · Josiah Hanna · Peter Stone · Matthew Taylor -
2022 Poster: Multiagent Q-learning with Sub-Team Coordination »
Wenhan Huang · Kai Li · Kun Shao · Tianze Zhou · Matthew Taylor · Jun Luo · Dongge Wang · Hangyu Mao · Jianye Hao · Jun Wang · Xiaotie Deng -
2022 : Towards Companion Recommendation Systems »
Konstantina Christakopoulou · Yuyan Wang · Ed Chi · MINMIN CHEN -
2022 : Fifteen-minute Competition Overview Video »
Tianpei Yang · Iuliia Kotseruba · Montgomery Alban · Amir Rasouli · Soheil Mohamad Alizadeh Shabestary · Randolph Goebel · Matthew Taylor · Liam Paull · Florian Shkurti -
2022 : Do As You Teach: A Multi-Teacher Approach to Self-Play in Deep Reinforcement Learning »
Chaitanya Kharyal · Tanmay Sinha · Vijaya Sai Krishna Gottipati · Srijita Das · Matthew Taylor -
2023 Poster: Ignorance is Bliss: Robust Control via Information Gating »
Manan Tomar · Riashat Islam · Matthew Taylor · Sergey Levine · Philip Bachman -
2023 Poster: In-Context Decision-Making from Supervised Pretraining »
Jonathan N Lee · Annie Xie · Aldo Pacchiano · Yash Chandak · Chelsea Finn · Ofir Nachum · Emma Brunskill -
2023 Poster: Experiment Planning with Function Approximation »
Aldo Pacchiano · Jonathan N Lee · Emma Brunskill -
2023 Poster: TD Convergence: An Optimization Perspective »
Kavosh Asadi · Shoham Sabach · Yao Liu · Omer Gottesman · Rasool Fakoor -
2023 Poster: Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization »
Sanath Kumar Krishnamurthy · Ruohan Zhan · Susan Athey · Emma Brunskill -
2023 Poster: Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets »
Anirudhan Badrinath · Yannis Flet-Berliac · Allen Nie · Emma Brunskill -
2023 Poster: Effectively Learning Initiation Sets in Hierarchical Reinforcement Learning »
Akhil Bagaria · Ben Abbatematteo · Omer Gottesman · Matt Corsaro · Sreehari Rammohan · George Konidaris -
2023 Poster: Budgeting Counterfactual for Offline RL »
Yao Liu · Pratik Chaudhari · Rasool Fakoor -
2022 Workshop: Deep Reinforcement Learning Workshop »
Karol Hausman · Qi Zhang · Matthew Taylor · Martha White · Suraj Nair · Manan Tomar · Risto Vuorio · Ted Xiao · Zeyu Zheng · Manan Tomar -
2022 Spotlight: Lightning Talks 5A-3 »
Minting Pan · Xiang Chen · Wenhan Huang · Can Chang · Zhecheng Yuan · Jianzhun Shao · Yushi Cao · Peihao Chen · Ke Xue · Zhengrong Xue · Zhiqiang Lou · Xiangming Zhu · Lei Li · Zhiming Li · Kai Li · Jiacheng Xu · Dongyu Ji · Ni Mu · Kun Shao · Tianpei Yang · Kunyang Lin · Ningyu Zhang · Yunbo Wang · Lei Yuan · Bo Yuan · Hongchang Zhang · Jiajun Wu · Tianze Zhou · Xueqian Wang · Ling Pan · Yuhang Jiang · Xiaokang Yang · Xiaozhuan Liang · Hao Zhang · Weiwen Hu · Miqing Li · YAN ZHENG · Matthew Taylor · Huazhe Xu · Shumin Deng · Chao Qian · YI WU · Shuncheng He · Wenbing Huang · Chuanqi Tan · Zongzhang Zhang · Yang Gao · Jun Luo · Yi Li · Xiangyang Ji · Thomas Li · Mingkui Tan · Fei Huang · Yang Yu · Huazhe Xu · Dongge Wang · Jianye Hao · Chuang Gan · Yang Liu · Luo Si · Hangyu Mao · Huajun Chen · Jianye Hao · Jun Wang · Xiaotie Deng -
2022 Spotlight: Multiagent Q-learning with Sub-Team Coordination »
Wenhan Huang · Kai Li · Kun Shao · Tianze Zhou · Matthew Taylor · Jun Luo · Dongge Wang · Hangyu Mao · Jianye Hao · Jun Wang · Xiaotie Deng -
2022 Competition: Driving SMARTS »
Amir Rasouli · Matthew Taylor · Iuliia Kotseruba · Tianpei Yang · Randolph Goebel · Soheil Mohamad Alizadeh Shabestary · Montgomery Alban · Florian Shkurti · Liam Paull -
2022 Poster: Oracle Inequalities for Model Selection in Offline Reinforcement Learning »
Jonathan N Lee · George Tucker · Ofir Nachum · Bo Dai · Emma Brunskill -
2022 Poster: Factored DRO: Factored Distributionally Robust Policies for Contextual Bandits »
Tong Mu · Yash Chandak · Tatsunori Hashimoto · Emma Brunskill -
2022 Poster: Faster Deep Reinforcement Learning with Slower Online Network »
Kavosh Asadi · Rasool Fakoor · Omer Gottesman · Taesup Kim · Michael Littman · Alexander Smola -
2022 Poster: Off-Policy Evaluation for Action-Dependent Non-stationary Environments »
Yash Chandak · Shiv Shankar · Nathaniel Bastian · Bruno da Silva · Emma Brunskill · Philip Thomas -
2022 Social: RL Social »
Yuxi Li · Omer Gottesman · Niranjani Prasad -
2022 Poster: Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data »
Allen Nie · Yannis Flet-Berliac · Deon Jordan · William Steenbergen · Emma Brunskill -
2022 Poster: Giving Feedback on Interactive Student Programs with Meta-Exploration »
Evan Liu · Moritz Stephan · Allen Nie · Chris Piech · Emma Brunskill · Chelsea Finn -
2022 Poster: Provably sample-efficient RL with side information about latent dynamics »
Yao Liu · Dipendra Misra · Miro Dudik · Robert Schapire -
2021 : Retrospective Panel »
Sergey Levine · Nando de Freitas · Emma Brunskill · Finale Doshi-Velez · Nan Jiang · Rishabh Agarwal -
2021 : Learning Representations for Pixel-based Control: What Matters and Why? »
Manan Tomar · Utkarsh A Mishra · Amy Zhang · Matthew Taylor -
2021 : Safe RL Debate »
Sylvia Herbert · Animesh Garg · Emma Brunskill · Aleksandra Faust · Dylan Hadfield-Menell -
2021 Workshop: Deep Reinforcement Learning »
Pieter Abbeel · Chelsea Finn · David Silver · Matthew Taylor · Martha White · Srijita Das · Yuqing Du · Andrew Patterson · Manan Tomar · Olivia Watkins -
2021 Poster: Play to Grade: Testing Coding Games as Classifying Markov Decision Process »
Allen Nie · Emma Brunskill · Chris Piech -
2021 Poster: Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes »
HyunJi Alex Nam · Scott Fleming · Emma Brunskill -
2021 Poster: Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning »
Andrea Zanette · Martin J Wainwright · Emma Brunskill -
2021 Poster: Universal Off-Policy Evaluation »
Yash Chandak · Scott Niekum · Bruno da Silva · Erik Learned-Miller · Emma Brunskill · Philip Thomas -
2021 Poster: Design of Experiments for Stochastic Contextual Linear Bandits »
Andrea Zanette · Kefan Dong · Jonathan N Lee · Emma Brunskill -
2020 : Counterfactuals and Offline RL »
Emma Brunskill -
2020 : Q & A and Panel Session with Dan Weld, Kristen Grauman, Scott Yih, Emma Brunskill, and Alex Ratner »
Kristen Grauman · Wen-tau Yih · Alexander Ratner · Emma Brunskill · Douwe Kiela · Daniel S. Weld -
2020 : Panel »
Emma Brunskill · Nan Jiang · Nando de Freitas · Finale Doshi-Velez · Sergey Levine · John Langford · Lihong Li · George Tucker · Rishabh Agarwal · Aviral Kumar -
2020 : Mini-panel discussion 1 - Bridging the gap between theory and practice »
Aviv Tamar · Emma Brunskill · Jost Tobias Springenberg · Omer Gottesman · Daniel Mankowitz -
2020 : Keynote: Emma Brunskill »
Emma Brunskill -
2020 : Panel discussion on minimizing bias in machine learning in education »
Neil Heffernan · Osonde A. Osoba · Emma Brunskill · Kathi Fisler -
2020 : Contributed Talk: Maximum Reward Formulation In Reinforcement Learning »
Vijaya Sai Krishna Gottipati · Yashaswi Pathak · Rohan Nuttall · Sahir . · Raviteja Chunduru · Ahmed Touati · Sriram Ganapathi · Matthew Taylor · Sarath Chandar -
2020 Poster: Off-policy Policy Evaluation For Sequential Decisions Under Unobserved Confounding »
Hongseok Namkoong · Ramtin Keramati · Steve Yadlowsky · Emma Brunskill -
2020 Poster: Escaping the Gravitational Pull of Softmax »
Jincheng Mei · Chenjun Xiao · Bo Dai · Lihong Li · Csaba Szepesvari · Dale Schuurmans -
2020 Poster: Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration »
Andrea Zanette · Alessandro Lazaric · Mykel J Kochenderfer · Emma Brunskill -
2020 Oral: Escaping the Gravitational Pull of Softmax »
Jincheng Mei · Chenjun Xiao · Bo Dai · Lihong Li · Csaba Szepesvari · Dale Schuurmans -
2020 Poster: CoinDICE: Off-Policy Confidence Interval Estimation »
Bo Dai · Ofir Nachum · Yinlam Chow · Lihong Li · Csaba Szepesvari · Dale Schuurmans -
2020 Poster: Off-Policy Evaluation via the Regularized Lagrangian »
Mengjiao (Sherry) Yang · Ofir Nachum · Bo Dai · Lihong Li · Dale Schuurmans -
2020 Poster: Provably Good Batch Reinforcement Learning Without Great Exploration »
Yao Liu · Adith Swaminathan · Alekh Agarwal · Emma Brunskill -
2020 Spotlight: CoinDICE: Off-Policy Confidence Interval Estimation »
Bo Dai · Ofir Nachum · Yinlam Chow · Lihong Li · Csaba Szepesvari · Dale Schuurmans -
2019 : Closing Remarks »
Bo Dai · Niao He · Nicolas Le Roux · Lihong Li · Dale Schuurmans · Martha White -
2019 : Emma Brünskill, "Some Theory RL Challenges Inspired by Education" »
Emma Brunskill -
2019 : Poster and Coffee Break 2 »
Karol Hausman · Kefan Dong · Ken Goldberg · Lihong Li · Lin Yang · Lingxiao Wang · Lior Shani · Liwei Wang · Loren Amdahl-Culleton · Lucas Cassano · Marc Dymetman · Marc Bellemare · Marcin Tomczak · Margarita Castro · Marius Kloft · Marius-Constantin Dinu · Markus Holzleitner · Martha White · Mengdi Wang · Michael Jordan · Mihailo Jovanovic · Ming Yu · Minshuo Chen · Moonkyung Ryu · Muhammad Zaheer · Naman Agarwal · Nan Jiang · Niao He · Nikolaus Yasui · Nikos Karampatziakis · Nino Vieillard · Ofir Nachum · Olivier Pietquin · Ozan Sener · Pan Xu · Parameswaran Kamalaruban · Paul Mineiro · Paul Rolland · Philip Amortila · Pierre-Luc Bacon · Prakash Panangaden · Qi Cai · Qiang Liu · Quanquan Gu · Raihan Seraj · Richard Sutton · Rick Valenzano · Robert Dadashi · Rodrigo Toro Icarte · Roshan Shariff · Roy Fox · Ruosong Wang · Saeed Ghadimi · Samuel Sokota · Sean Sinclair · Sepp Hochreiter · Sergey Levine · Sergio Valcarcel Macua · Sham Kakade · Shangtong Zhang · Sheila McIlraith · Shie Mannor · Shimon Whiteson · Shuai Li · Shuang Qiu · Wai Lok Li · Siddhartha Banerjee · Sitao Luan · Tamer Basar · Thinh Doan · Tianhe Yu · Tianyi Liu · Tom Zahavy · Toryn Klassen · Tuo Zhao · Vicenç Gómez · Vincent Liu · Volkan Cevher · Wesley Suttle · Xiao-Wen Chang · Xiaohan Wei · Xiaotong Liu · Xingguo Li · Xinyi Chen · Xingyou Song · Yao Liu · YiDing Jiang · Yihao Feng · Yilun Du · Yinlam Chow · Yinyu Ye · Yishay Mansour · · Yonathan Efroni · Yongxin Chen · Yuanhao Wang · Bo Dai · Chen-Yu Wei · Harsh Shrivastava · Hongyang Zhang · Qinqing Zheng · SIDDHARTHA SATPATHI · Xueqing Liu · Andreu Vall -
2019 : Invited Talk »
Emma Brunskill -
2019 : Poster Spotlight 2 »
Aaron Sidford · Mengdi Wang · Lin Yang · Yinyu Ye · Zuyue Fu · Zhuoran Yang · Yongxin Chen · Zhaoran Wang · Ofir Nachum · Bo Dai · Ilya Kostrikov · Dale Schuurmans · Ziyang Tang · Yihao Feng · Lihong Li · Denny Zhou · Qiang Liu · Rodrigo Toro Icarte · Ethan Waldie · Toryn Klassen · Rick Valenzano · Margarita Castro · Simon Du · Sham Kakade · Ruosong Wang · Minshuo Chen · Tianyi Liu · Xingguo Li · Zhaoran Wang · Tuo Zhao · Philip Amortila · Doina Precup · Prakash Panangaden · Marc Bellemare -
2019 : Poster and Coffee Break 1 »
Aaron Sidford · Aditya Mahajan · Alejandro Ribeiro · Alex Lewandowski · Ali H Sayed · Ambuj Tewari · Angelika Steger · Anima Anandkumar · Asier Mujika · Hilbert J Kappen · Bolei Zhou · Byron Boots · Chelsea Finn · Chen-Yu Wei · Chi Jin · Ching-An Cheng · Christina Yu · Clement Gehring · Craig Boutilier · Dahua Lin · Daniel McNamee · Daniel Russo · David Brandfonbrener · Denny Zhou · Devesh Jha · Diego Romeres · Doina Precup · Dominik Thalmeier · Eduard Gorbunov · Elad Hazan · Elena Smirnova · Elvis Dohmatob · Emma Brunskill · Enrique Munoz de Cote · Ethan Waldie · Florian Meier · Florian Schaefer · Ge Liu · Gergely Neu · Haim Kaplan · Hao Sun · Hengshuai Yao · Jalaj Bhandari · James A Preiss · Jayakumar Subramanian · Jiajin Li · Jieping Ye · Jimmy Smith · Joan Bas Serrano · Joan Bruna · John Langford · Jonathan Lee · Jose A. Arjona-Medina · Kaiqing Zhang · Karan Singh · Yuping Luo · Zafarali Ahmed · Zaiwei Chen · Zhaoran Wang · Zhizhong Li · Zhuoran Yang · Ziping Xu · Ziyang Tang · Yi Mao · David Brandfonbrener · Shirli Di-Castro · Riashat Islam · Zuyue Fu · Abhishek Naik · Saurabh Kumar · Benjamin Petit · Angeliki Kamoutsi · Simone Totaro · Arvind Raghunathan · Rui Wu · Donghwan Lee · Dongsheng Ding · Alec Koppel · Hao Sun · Christian Tjandraatmadja · Mahdi Karami · Jincheng Mei · Chenjun Xiao · Junfeng Wen · Zichen Zhang · Ross Goroshin · Mohammad Pezeshki · Jiaqi Zhai · Philip Amortila · Shuo Huang · Mariya Vasileva · El houcine Bergou · Adel Ahmadyan · Haoran Sun · Sheng Zhang · Lukas Gruber · Yuanhao Wang · Tetiana Parshakova -
2019 Workshop: The Optimization Foundations of Reinforcement Learning »
Bo Dai · Niao He · Nicolas Le Roux · Lihong Li · Dale Schuurmans · Martha White -
2019 : Opening Remarks »
Bo Dai · Niao He · Nicolas Le Roux · Lihong Li · Dale Schuurmans · Martha White -
2019 Poster: Offline Contextual Bandits with High Probability Fairness Guarantees »
Blossom Metevier · Stephen Giguere · Sarah Brockman · Ari Kobren · Yuriy Brun · Emma Brunskill · Philip Thomas -
2019 Poster: A Kernel Loss for Solving the Bellman Equation »
Yihao Feng · Lihong Li · Qiang Liu -
2019 Poster: DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections »
Ofir Nachum · Yinlam Chow · Bo Dai · Lihong Li -
2019 Spotlight: DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections »
Ofir Nachum · Yinlam Chow · Bo Dai · Lihong Li -
2019 Poster: Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model »
Andrea Zanette · Mykel J Kochenderfer · Emma Brunskill -
2019 Poster: Limiting Extrapolation in Linear Approximate Value Iteration »
Andrea Zanette · Alessandro Lazaric · Mykel J Kochenderfer · Emma Brunskill -
2018 : Hierarchical reinforcement learning for composite-task dialogues »
Lihong Li -
2018 Poster: Representation Balancing MDPs for Off-policy Policy Evaluation »
Yao Liu · Omer Gottesman · Aniruddh Raghu · Matthieu Komorowski · Aldo Faisal · Finale Doshi-Velez · Emma Brunskill -
2018 Poster: Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation »
Qiang Liu · Lihong Li · Ziyang Tang · Denny Zhou -
2018 Spotlight: Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation »
Qiang Liu · Lihong Li · Ziyang Tang · Denny Zhou -
2018 Demonstration: Automatic Curriculum Generation Applied to Teaching Novices a Short Bach Piano Segment »
Emma Brunskill · Tong Mu · Karan Goel · Jonathan Bragg -
2018 Poster: Adversarial Attacks on Stochastic Bandits »
Kwang-Sung Jun · Lihong Li · Yuzhe Ma · Jerry Zhu -
2017 : Panel Discussion »
Matt Botvinick · Emma Brunskill · Marcos Campos · Jan Peters · Doina Precup · David Silver · Josh Tenenbaum · Roy Fox -
2017 : Sample efficiency and off policy hierarchical RL (Emma Brunskill) »
Emma Brunskill -
2017 : Emma Brunskill (Stanford) »
Emma Brunskill -
2017 : Invited Talk »
Emma Brunskill -
2017 Workshop: From 'What If?' To 'What Next?' : Causal Inference and Machine Learning for Intelligent Decision Making »
Ricardo Silva · Panagiotis Toulis · John Shawe-Taylor · Alexander Volfovsky · Thorsten Joachims · Lihong Li · Nathan Kallus · Adith Swaminathan -
2017 Poster: Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation »
Zhaohan Guo · Philip S. Thomas · Emma Brunskill -
2017 Poster: Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning »
Christoph Dann · Tor Lattimore · Emma Brunskill -
2017 Spotlight: Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning »
Christoph Dann · Tor Lattimore · Emma Brunskill -
2017 Poster: Q-LDA: Uncovering Latent Patterns in Text-based Sequential Decision Processes »
Jianshu Chen · Chong Wang · Lin Xiao · Ji He · Lihong Li · Li Deng -
2017 Tutorial: Reinforcement Learning with People »
Emma Brunskill -
2016 Poster: Active Learning with Oracle Epiphany »
Tzu-Kuo Huang · Lihong Li · Ara Vartanian · Saleema Amershi · Jerry Zhu -
2011 Poster: An Empirical Evaluation of Thompson Sampling »
Olivier Chapelle · Lihong Li -
2010 Spotlight: Learning from Logged Implicit Exploration Data »
Alex Strehl · Lihong Li · John Langford · Sham M Kakade -
2010 Poster: Learning from Logged Implicit Exploration Data »
Alexander L Strehl · John Langford · Lihong Li · Sham M Kakade -
2010 Poster: Parallelized Stochastic Gradient Descent »
Martin A Zinkevich · Markus Weimer · Alexander Smola · Lihong Li -
2008 Poster: Sparse Online Learning via Truncated Gradient »
John Langford · Lihong Li · Tong Zhang -
2008 Spotlight: Sparse Online Learning via Truncated Gradient »
John Langford · Lihong Li · Tong Zhang