NeurIPS Poster Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Poster

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Jiafan He · Dongruo Zhou · Quanquan Gu

Keywords: [ Reinforcement Learning and Planning ]

[ Abstract ]

[ OpenReview]

Abstract: We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-

$\gamma$ , which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-

$\gamma$ achieves an

$\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where

$S$ is the number of states,

$A$ is the number of actions,

$\gamma$ is the discount factor and

$T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least

$\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ . Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-

$\gamma$ is nearly minimax optimal for discounted MDPs.

Chat is not available.