Workshop: OPT 2021: Optimization for Machine Learning

Policy Mirror Descent for Regularized RL: A Generalized Framework with Linear Convergence

Wenhao Zhan · Shicong Cen · Baihe Huang · Yuxin Chen · Jason Lee · Yuejie Chi


Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization, lies at the heart of modern reinforcement learning (RL). In addition to value maximization, other practical considerations commonly arise, including the need to encourage exploration and to ensure certain structural properties due to safety, resource, and operational constraints. These considerations can often be accounted for by resorting to regularized RL. Focusing on an infinite-horizon discounted Markov decision process, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent [Lan, 2021], the proposed algorithm accommodates a general class of convex regularizers as well as a broad family of Bregman divergences cognizant of the regularizer in use. We prove that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution, even when the regularizer lacks strong convexity and smoothness. In addition, this fast convergence is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to confirm the applicability of GPMD.
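The abstract does not spell out the update rule, so the following is only an illustrative sketch of one familiar special case, not the authors' implementation: a tabular MDP with a negative-entropy regularizer and the KL divergence as the Bregman divergence. Under those assumptions, the per-state mirror descent step, which maximizes eta * (<Q, p> + tau * H(p)) - KL(p || pi) over the simplex, has the closed form p ∝ pi^(1/(1+eta*tau)) * exp(eta * Q / (1+eta*tau)). The function names (`regularized_q`, `pmd_step`, `run_pmd`), the array layout, and the default parameter values are all hypothetical choices made for illustration.

```python
import numpy as np


def regularized_q(P, r, log_pi, gamma, tau):
    """Exact entropy-regularized Q-function of a tabular policy.

    P: transitions, shape (S, A, S); r: rewards, shape (S, A);
    log_pi: log-probabilities of the policy, shape (S, A).
    """
    pi = np.exp(log_pi)
    # Per-state reward under pi, including the entropy bonus -tau * log pi.
    r_pi = np.sum(pi * (r - tau * log_pi), axis=1)
    # State-to-state transition matrix under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve the linear Bellman equation V = r_pi + gamma * P_pi V.
    V = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
    return r + gamma * P @ V


def pmd_step(log_pi, Q, eta, tau):
    """One mirror descent step (assumed entropy regularizer + KL divergence).

    Closed-form maximizer of eta * (<Q, p> + tau * H(p)) - KL(p || pi),
    computed in log space for numerical stability.
    """
    logits = (log_pi + eta * Q) / (1.0 + eta * tau)
    logits -= logits.max(axis=1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))


def run_pmd(P, r, gamma=0.9, tau=0.5, eta=10.0, steps=500):
    """Iterate exact policy evaluation and the mirror descent update."""
    S, A = r.shape
    log_pi = np.full((S, A), -np.log(A))  # start from the uniform policy
    for _ in range(steps):
        Q = regularized_q(P, r, log_pi, gamma, tau)
        log_pi = pmd_step(log_pi, Q, eta, tau)
    return np.exp(log_pi)
```

In this sketch the fixed point of the update satisfies pi ∝ exp(Q / tau), the softmax policy of the entropy-regularized problem, and letting `eta` grow recovers regularized policy iteration. Policy evaluation here is exact; the inexact-evaluation setting mentioned in the abstract could be imitated by perturbing `Q` before each step.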