The idea of conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from pre-collected datasets. However, how to extend offline RL to the more practical multi-agent setting remains an open question, as many real-world scenarios involve interaction among multiple agents. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect offline RL algorithms to transfer directly as well. Surprisingly, when conservatism-based algorithms are applied to the multi-agent setting, performance degrades significantly as the number of agents increases. Towards mitigating this degradation, we identify a key issue: the landscape of the value function can be non-concave, so policy gradient improvements are prone to local optima. Multiple agents exacerbate the problem, since a suboptimal policy from any single agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, \underline{O}ffline \underline{M}ulti-Agent RL with \underline{A}ctor \underline{R}ectification (OMAR), which tackles this challenge by combining first-order policy gradients with zeroth-order optimization so that the actor better optimizes the conservative value function. Despite its simplicity, OMAR significantly outperforms strong baselines and achieves state-of-the-art performance on multi-agent continuous control benchmarks.
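The actor rectification idea can be sketched compactly. The snippet below is a minimal, illustrative PyTorch sketch, not the authors' released implementation: alongside the standard first-order policy-gradient term through the conservative critic, it draws perturbed candidate actions, scores them with the critic as a zeroth-order search, and regularizes the actor toward the best candidate. The function name `rectified_actor_loss`, the small `Critic` class, and the hyperparameters `num_samples`, `sigma`, and `tau` are hypothetical choices for illustration; the paper's actual zeroth-order optimizer and coefficients may differ.

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Toy state-action value network standing in for a conservative critic."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, action):
        return self.net(torch.cat([obs, action], dim=-1))


def rectified_actor_loss(actor, critic, obs, num_samples=20, sigma=0.2, tau=0.7):
    # First-order term: maximize Q(s, pi(s)), as in deterministic policy gradient.
    pi_action = actor(obs)                                   # (batch, act_dim)
    pg_loss = -critic(obs, pi_action).mean()

    # Zeroth-order term: perturb the actor's action, score candidates with the
    # critic, and keep the best candidate per state (no gradients needed here).
    with torch.no_grad():
        noise = sigma * torch.randn(num_samples, *pi_action.shape)
        candidates = (pi_action.unsqueeze(0) + noise).clamp(-1.0, 1.0)
        scores = torch.stack([critic(obs, a) for a in candidates])   # (S, batch, 1)
        best_idx = scores.squeeze(-1).argmax(dim=0)                  # (batch,)
        best_action = candidates[best_idx, torch.arange(obs.shape[0])]

    # Rectification: pull the actor toward the best sampled action so it can
    # escape poor local optima of the (possibly non-concave) value landscape.
    rectification = ((pi_action - best_action) ** 2).mean()
    return tau * pg_loss + (1.0 - tau) * rectification


if __name__ == "__main__":
    obs_dim, act_dim, batch = 10, 2, 32
    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
    critic = Critic(obs_dim, act_dim)
    loss = rectified_actor_loss(actor, critic, torch.randn(batch, obs_dim))
    loss.backward()  # in practice, only the actor's parameters would be stepped
```

The coefficient `tau` trades off following the critic's gradient against imitating the best sampled action; in the multi-agent setting, such a loss would be applied per agent with its own conservative critic.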
Author Information
Ling Pan (Tsinghua University)
Longbo Huang (IIIS, Tsinghua University)
Tengyu Ma (Stanford University)
Huazhe Xu
More from the Same Authors
- 2021 Spotlight: Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning »
  Colin Wei · Sang Michael Xie · Tengyu Ma
- 2021 : Calibrated Ensembles: A Simple Way to Mitigate ID-OOD Accuracy Tradeoffs »
  Ananya Kumar · Aditi Raghunathan · Tengyu Ma · Percy Liang
- 2021 : Self-supervised Learning is More Robust to Dataset Imbalance »
  Hong Liu · Jeff Z. HaoChen · Adrien Gaidon · Tengyu Ma
- 2021 : Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers »
  Ruihan Yang · Minghao Zhang · Nicklas Hansen · Huazhe Xu · Xiaolong Wang
- 2021 : DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization »
  Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine
- 2022 Poster: Provable Generalization of Overparameterized Meta-learning Trained with SGD »
  Yu Huang · Yingbin Liang · Longbo Huang
- 2022 : Why (and When) does Local SGD Generalize Better than SGD? »
  Xinran Gu · Kaifeng Lyu · Longbo Huang · Sanjeev Arora
- 2022 : Online Min-max Optimization: Nonconvexity, Nonstationarity, and Dynamic Regret »
  Yu Huang · Yuan Cheng · Yingbin Liang · Longbo Huang
- 2022 Spotlight: Lightning Talks 3B-2 »
  Yu Huang · Tero Karras · Maxim Kodryan · Shiau Hong Lim · Shudong Huang · Ziyu Wang · Siqiao Xue · ILYAS MALIK · Ekaterina Lobacheva · Miika Aittala · Hongjie Wu · Yuhao Zhou · Yingbin Liang · Xiaoming Shi · Jun Zhu · Maksim Nakhodnov · Timo Aila · Yazhou Ren · James Zhang · Longbo Huang · Dmitry Vetrov · Ivor Tsang · Hongyuan Mei · Samuli Laine · Zenglin Xu · Wentao Feng · Jiancheng Lv
- 2022 Spotlight: Provable Generalization of Overparameterized Meta-learning Trained with SGD »
  Yu Huang · Yingbin Liang · Longbo Huang
- 2021 : DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization Q&A »
  Aviral Kumar · Rishabh Agarwal · Tengyu Ma · Aaron Courville · George Tucker · Sergey Levine
- 2021 Poster: The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition »
  Tiancheng Jin · Longbo Huang · Haipeng Luo
- 2021 Poster: Multi-Agent Reinforcement Learning in Stochastic Networked Systems »
  Yiheng Lin · Guannan Qu · Longbo Huang · Adam Wierman
- 2021 Poster: Label Noise SGD Provably Prefers Flat Global Minimizers »
  Alex Damian · Tengyu Ma · Jason Lee
- 2021 Poster: Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations »
  Yuping Luo · Tengyu Ma
- 2021 Poster: Continuous Mean-Covariance Bandits »
  Yihan Du · Siwei Wang · Zhixuan Fang · Longbo Huang
- 2021 Poster: Calibrating Predictions to Decisions: A Novel Approach to Multi-Class Calibration »
  Shengjia Zhao · Michael Kim · Roshni Sahoo · Tengyu Ma · Stefano Ermon
- 2021 Poster: Fast Federated Learning in the Presence of Arbitrary Device Unavailability »
  Xinran Gu · Kaixuan Huang · Jingzhao Zhang · Longbo Huang
- 2021 Poster: What Makes Multi-Modal Learning Better than Single (Provably) »
  Yu Huang · Chenzhuang Du · Zihui Xue · Xuanyao Chen · Hang Zhao · Longbo Huang
- 2021 Poster: Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning »
  Colin Wei · Sang Michael Xie · Tengyu Ma
- 2021 Oral: Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss »
  Jeff Z. HaoChen · Colin Wei · Adrien Gaidon · Tengyu Ma
- 2021 Poster: Regularized Softmax Deep Multi-Agent Q-Learning »
  Ling Pan · Tabish Rashid · Bei Peng · Longbo Huang · Shimon Whiteson
- 2021 Poster: Safe Reinforcement Learning by Imagining the Near Future »
  Garrett Thomas · Yuping Luo · Tengyu Ma
- 2021 Poster: Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss »
  Jeff Z. HaoChen · Colin Wei · Adrien Gaidon · Tengyu Ma
- 2021 Poster: Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature »
  Kefan Dong · Jiaqi Yang · Tengyu Ma
- 2021 Oral: The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition »
  Tiancheng Jin · Longbo Huang · Haipeng Luo
- 2020 Poster: Softmax Deep Double Deterministic Policy Gradients »
  Ling Pan · Qingpeng Cai · Longbo Huang
- 2020 Poster: Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits »
  Siwei Wang · Longbo Huang · John C. S. Lui
- 2019 : Tengyu Ma, "Designing Explicit Regularizers for Deep Models" »
  Tengyu Ma
- 2019 Poster: Double Quantization for Communication-Efficient Distributed Optimization »
  Yue Yu · Jiaxiang Wu · Longbo Huang
- 2018 : Poster Session »
  Sujay Sanghavi · Vatsal Shah · Yanyao Shen · Tianchen Zhao · Yuandong Tian · Tomer Galanti · Mufan Li · Gilad Cohen · Daniel Rothchild · Aristide Baratin · Devansh Arpit · Vagelis Papalexakis · Michael Perlmutter · Ashok Vardhan Makkuva · Pim de Haan · Yingyan Lin · Wanmo Kang · Cheolhyoung Lee · Hao Shen · Sho Yaida · Dan Roberts · Nadav Cohen · Philippe Casgrain · Dejiao Zhang · Tengyu Ma · Avinash Ravichandran · Julian Emilio Salazar · Bo Li · Davis Liang · Christopher Wong · Glen Bigan Mbeng · Animesh Garg
- 2018 Poster: Multi-armed Bandits with Compensation »
  Siwei Wang · Longbo Huang
- 2017 Poster: On the Optimization Landscape of Tensor Decompositions »
  Rong Ge · Tengyu Ma
- 2017 Oral: On the Optimization Landscape of Tensor Decompositions »
  Rong Ge · Tengyu Ma
- 2016 Oral: Matrix Completion has No Spurious Local Minimum »
  Rong Ge · Jason Lee · Tengyu Ma
- 2016 Poster: Matrix Completion has No Spurious Local Minimum »
  Rong Ge · Jason Lee · Tengyu Ma
- 2016 Poster: A Non-generative Framework and Convex Relaxations for Unsupervised Learning »
  Elad Hazan · Tengyu Ma
- 2015 Poster: Sum-of-Squares Lower Bounds for Sparse PCA »
  Tengyu Ma · Avi Wigderson
- 2014 Poster: On Communication Cost of Distributed Statistical Estimation and Dimensionality »
  Ankit Garg · Tengyu Ma · Huy Nguyen
- 2014 Oral: On Communication Cost of Distributed Statistical Estimation and Dimensionality »
  Ankit Garg · Tengyu Ma · Huy Nguyen