Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity
Kaiqing Zhang, Sham Kakade, Tamer Basar, Lin Yang
Spotlight presentation: Orals & Spotlights Track 04: Reinforcement Learning
on 2020-12-07T20:10:00-08:00 - 2020-12-07T20:20:00-08:00
on 2020-12-07T20:10:00-08:00 - 2020-12-07T20:20:00-08:00
Poster Session 1 (more posters)
on 2020-12-07T21:00:00-08:00 - 2020-12-07T23:00:00-08:00
GatherTown: Reinforcement learning and planning ( Town D0 - Spot D1 )
on 2020-12-07T21:00:00-08:00 - 2020-12-07T23:00:00-08:00
GatherTown: Reinforcement learning and planning ( Town D0 - Spot D1 )
Join GatherTown
Only iff poster is crowded, join Zoom . Authors have to start the Zoom call from their Profile page / Presentation History.
Only iff poster is crowded, join Zoom . Authors have to start the Zoom call from their Profile page / Presentation History.
Toggle Abstract Paper (in Proceedings / .pdf)
Abstract: Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely-used, the sample complexity of model-based MARL algorithms has been investigated relatively much less often. In this paper, we aim to address the fundamental open question about the sample complexity of model-based MARL. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model of state transition. We show that model-based MARL achieves a sample complexity of $\tilde \cO(|\cS||\cA||\cB|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) \emph{value} up to some $\epsilon$ error, and the $\epsilon$-NE \emph{policies}, where $\gamma$ is the discount factor, and $\cS,\cA,\cB$ denote the state space, and the action spaces for the two agents. We also show that this method is near-minimax optimal with a tight dependence on $1-\gamma$ and $|\cS|$ by providing a lower bound of $\Omega(|\cS|(|\cA|+|\cB|)(1-\gamma)^{-3}\epsilon^{-2})$. Our results justify the efficiency of this simple model-based approach in the multi-agent RL setting.