ALPs: Adaptive Lookahead Policy Gradients for Multi-Agent Reinforcement Learning
Abstract
We study the dynamics of two-player policy gradient (PG) methods through the lens of continuous-time analysis. Motivated by recent advances in higher-order regularized differential equations (HRDE) for lookahead optimization [25], and by connections to competitive game optimization [28], we derive the HRDE associated with the Lookahead method (with gradient ascent as the base optimizer) in the setting of two interacting agents. Our analysis reveals key differences between two-player RL and classical game optimization, arising from discounting and the strategy-dependent expectation structure of policy gradients. Using Laplace transform techniques, we establish conditions for stability and convergence, and develop principled hyperparameter selection rules for the lookahead depth k, the interpolation parameter α, and the base step size ε. Building on these conditions, we introduce an adaptive variant, Adaptive Lookahead Policy Gradients (ALPs), which dynamically tunes these hyperparameters to improve learning stability. Our results bridge optimization and RL perspectives on competitive learning, offering both an improved understanding of their fundamental differences and practical guidance for stable training in multi-agent policy gradient methods.
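For concreteness, the minimal sketch below shows one outer Lookahead step with simultaneous gradient ascent as the base optimizer for two agents, exposing the hyperparameters k, α, and ε discussed above. The function names and the toy bilinear game are illustrative assumptions for this sketch, not the paper's ALPs implementation (which additionally adapts these hyperparameters during training).

```python
import numpy as np

def lookahead_two_player(theta1, theta2, grad1, grad2, k, alpha, eps):
    """One outer Lookahead step for two agents, with simultaneous
    gradient ascent as the base (inner) optimizer.

    k     -- lookahead depth (number of inner base-optimizer steps)
    alpha -- interpolation parameter for the slow-weight update
    eps   -- base step size of the inner gradient-ascent steps
    """
    # Fast weights start at the current slow weights.
    phi1, phi2 = theta1.copy(), theta2.copy()

    # k inner simultaneous gradient-ascent steps on the fast weights.
    for _ in range(k):
        g1, g2 = grad1(phi1, phi2), grad2(phi1, phi2)
        phi1 = phi1 + eps * g1
        phi2 = phi2 + eps * g2

    # Slow weights interpolate toward the fast weights.
    theta1 = theta1 + alpha * (phi1 - theta1)
    theta2 = theta2 + alpha * (phi2 - theta2)
    return theta1, theta2


# Toy zero-sum bilinear game f(x, y) = x.y: agent 1 ascends f, agent 2 ascends -f.
# Plain simultaneous gradient ascent diverges here, whereas Lookahead with a
# suitable (k, alpha, eps) contracts toward the equilibrium at the origin.
x, y = np.ones(2), np.ones(2)
for _ in range(500):
    x, y = lookahead_two_player(
        x, y,
        grad1=lambda t1, t2: t2,    # d f / d x = y
        grad2=lambda t1, t2: -t1,   # d (-f) / d y = -x
        k=5, alpha=0.5, eps=0.05,
    )
print(np.linalg.norm(x), np.linalg.norm(y))  # both norms shrink toward 0
```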