Policy gradient (PG) methods are popular reinforcement learning (RL) methods in which a baseline is often applied to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of gradient estimates increases rapidly with the number of agents. In this paper, we offer a rigorous analysis of MAPG methods by first quantifying the contributions of the number of agents and of the agents' exploration to the variance of MAPG estimators. Based on this analysis, we derive the optimal baseline (OB) that achieves the minimal variance. In comparison to the OB, we measure the excess variance of existing MARL algorithms such as vanilla MAPG and COMA. For settings with deep neural networks, we also propose a surrogate version of OB, which can be seamlessly plugged into any existing PG method in MARL. On the Multi-Agent MuJoCo and StarCraft benchmarks, our OB technique effectively stabilises training and improves the performance of multi-agent PPO and COMA by a significant margin. Code is released at https://github.com/morning9393/Optimal-Baseline-for-Multi-agent-Policy-Gradients.
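As a rough illustration of the idea behind a variance-minimising baseline (not the paper's multi-agent derivation, which is given in the paper itself), the classical single-agent result weights the action value by the squared norm of the score function: b* = E[||∇ log π||² Q] / E[||∇ log π||²]. The sketch below, assuming NumPy and using illustrative function names, estimates this baseline from sampled score vectors and Q-value estimates and subtracts it in a Monte-Carlo policy-gradient estimate; the per-agent OB in the paper is the multi-agent counterpart of this construction.

```python
import numpy as np

def optimal_scalar_baseline(grad_log_probs, q_values):
    """Classical variance-minimising scalar baseline for a score-function estimator.

    grad_log_probs: array of shape (N, d), score vectors grad log pi(a|s) for N samples.
    q_values:       array of shape (N,), corresponding action-value estimates.

    Returns an estimate of b* = E[||g||^2 Q] / E[||g||^2].
    """
    sq_norms = np.sum(grad_log_probs ** 2, axis=-1)           # ||grad log pi||^2 per sample
    return np.sum(sq_norms * q_values) / (np.sum(sq_norms) + 1e-8)  # small constant avoids 0/0

def baselined_pg_estimate(grad_log_probs, q_values, baseline):
    # Monte-Carlo policy-gradient estimate with the baseline subtracted from Q.
    return np.mean(grad_log_probs * (q_values - baseline)[:, None], axis=0)

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
g = rng.normal(size=(128, 8))    # score vectors for 128 samples
q = rng.normal(size=128)         # corresponding Q estimates
b = optimal_scalar_baseline(g, q)
grad = baselined_pg_estimate(g, q, b)
```

Subtracting any constant baseline leaves the estimator unbiased; the choice above is the one that minimises its variance among scalar baselines under this classical setting.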
Author Information
Jakub Grudzien Kuba (Huawei Technologies Ltd.)
PhD student at BAIR, UC Berkeley, working on deep reinforcement learning.
Muning Wen (Shanghai Jiao Tong University)
Linghui Meng (Institute of Automation, Chinese Academy of Sciences)
Shangding Gu (Technical University of Munich)
Haifeng Zhang (Institute of Automation, Chinese Academy of Sciences)
David Mguni (PROWLER.io)
Senior Machine Learning Researcher, PROWLER.io (Mar 2017 –)
Jun Wang (University College London)
Yaodong Yang (University College London)
More from the Same Authors
- 2022: Contextual Transformer for Offline Meta Reinforcement Learning »
  Runji Lin · Ye Li · Xidong Feng · Zhaowei Zhang · Xian Hong Wu Fung · Haifeng Zhang · Jun Wang · Yali Du · Yaodong Yang
- 2023 Poster: An Efficient End-to-End Training Approach for Zero-Shot Human-AI Coordination »
  Xue Yan · Jiaxian Guo · Xingzhou Lou · Jun Wang · Haifeng Zhang · Yali Du
- 2022 Poster: Multi-Agent Reinforcement Learning is a Sequence Modeling Problem »
  Muning Wen · Jakub Kuba · Runji Lin · Weinan Zhang · Ying Wen · Jun Wang · Yaodong Yang
- 2022 Poster: Discovered Policy Optimisation »
  Chris Lu · Jakub Kuba · Alistair Letcher · Luke Metz · Christian Schroeder de Witt · Jakob Foerster
- 2022 Poster: A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning »
  Bo Liu · Xidong Feng · Jie Ren · Luo Mai · Rui Zhu · Haifeng Zhang · Jun Wang · Yaodong Yang
- 2021: Performance-Guaranteed ODE Solvers with Complexity-Informed Neural Networks »
  Feng Zhao · Xiang Chen · Jun Wang · Zuoqiang Shi · Shao-Lun Huang
- 2021 Poster: Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games »
  Xiangyu Liu · Hangtian Jia · Ying Wen · Yujing Hu · Yingfeng Chen · Changjie Fan · Zhipeng Hu · Yaodong Yang
- 2021 Poster: Neural Auto-Curricula in Two-Player Zero-Sum Games »
  Xidong Feng · Oliver Slumbers · Ziyu Wan · Bo Liu · Stephen McAleer · Ying Wen · Jun Wang · Yaodong Yang
- 2020 Poster: Replica-Exchange Nosé-Hoover Dynamics for Bayesian Learning on Large Datasets »
  Rui Luo · Qiang Zhang · Yaodong Yang · Jun Wang
- 2018: Poster Session »
  Zihan Ding · David Mguni · Yuzheng Zhuang · Edouard Leurent · Takuma Oda · Yulia Tachibana · Paweł Gora · Neema Davis · Nemanja Djuric · Fang-Chieh Chou · Elmira Amirloo