

Poster

Exploration Bonus for Regret Minimization in Discrete and Continuous Average Reward MDPs

Jian QIAN · Ronan Fruit · Matteo Pirotta · Alessandro Lazaric

East Exhibition Hall B, C #182

Keywords: [ Markov Decision Processes ] [ Reinforcement Learning and Planning ] [ Exploration ]


Abstract: The exploration bonus is an effective approach to manage the exploration-exploitation trade-off in Markov Decision Processes (MDPs). While it has been analyzed in infinite-horizon discounted and finite-horizon problems, we focus on designing and analyzing the exploration bonus in the more challenging infinite-horizon undiscounted setting. We first introduce SCAL+, a variant of SCAL (Fruit et al., 2018), that uses a suitable exploration bonus to solve any discrete unknown weakly-communicating MDP for which an upper bound $c$ on the span of the optimal bias function is known. We prove that SCAL+ enjoys the same regret guarantees as SCAL, which relies on the less efficient extended value iteration approach. Furthermore, we leverage the flexibility provided by the exploration bonus scheme to generalize SCAL+ to smooth MDPs with continuous state space and discrete actions. We show that the resulting algorithm (SCCAL+) achieves the same regret bound as UCCRL (Ortner and Ryabko, 2012) while being the first implementable algorithm for this setting.
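To make the exploration-bonus idea concrete, the sketch below shows a generic optimistic value iteration on a discrete MDP: a Hoeffding-style bonus is added to the empirical rewards, and the value function is truncated so that its span stays below the known bound $c$. This is a minimal illustration of the general scheme, not the paper's actual SCAL+ algorithm; the bonus form, constants, and all function names (`bonus`, `span_truncated_value_iteration`) are assumptions made for this example.

```python
import numpy as np

def bonus(n_visits, delta=0.05):
    """Hypothetical Hoeffding-style exploration bonus: shrinks as a
    state-action pair is visited more often (placeholder constants)."""
    return np.sqrt(np.log(1.0 / delta) / np.maximum(n_visits, 1))

def span_truncated_value_iteration(p_hat, r_hat, n_visits, c,
                                   n_iter=200, tol=1e-6):
    """Illustrative optimistic value iteration with an exploration bonus
    and a span constraint (simplified; not the paper's operator).

    p_hat    : (S, A, S) empirical transition probabilities
    r_hat    : (S, A)    empirical mean rewards
    n_visits : (S, A)    visit counts
    c        : upper bound on the span of the optimal bias function
    """
    S, A, _ = p_hat.shape
    v = np.zeros(S)
    for _ in range(n_iter):
        # Optimistic state-action values: empirical reward plus bonus
        # plus expected future value under the empirical transitions.
        q = r_hat + bonus(n_visits) + np.einsum('sat,t->sa', p_hat, v)
        v_new = q.max(axis=1)
        # Span truncation: clip values so that max(v) - min(v) <= c.
        v_new = np.minimum(v_new, v_new.min() + c)
        # Stop when the update is (nearly) constant across states,
        # i.e. the span of the difference is below the tolerance.
        diff = v_new - v
        v = v_new
        if diff.max() - diff.min() < tol:
            break
    policy = (r_hat + bonus(n_visits)
              + np.einsum('sat,t->sa', p_hat, v)).argmax(axis=1)
    return v, policy
```

The greedy policy returned by this sketch would then be executed for an episode before re-estimating the model and recomputing the bonus, in the usual optimism-in-the-face-of-uncertainty loop.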
