Poster
Finding good policies in average-reward Markov Decision Processes without prior knowledge
Adrienne Tuynman · Rémy Degenne · Emilie Kaufmann
West Ballroom A-D #6607
Abstract:
We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H^*$, which satisfy $H^* \le D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H^*$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH^*/\varepsilon^2$ has been proposed, but it requires the knowledge of $H^*$. We first show that the sample complexity required to estimate $H^*$ is not bounded by any function of $S$, $A$ and $H^*$, ruling out the possibility of easily making the previous algorithm agnostic to $H^*$. By relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification that does not need any form of prior knowledge on the MDP. Its sample complexity scales in $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in $H^*$ cannot be achieved in this setting. Then, we propose an online algorithm with a sample complexity in $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stopping rule that we believe is promising for further reducing this bound.
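For quick reference, the sample-complexity bounds mentioned in the abstract can be restated compactly as below. This is a paraphrase of the abstract only, not a verbatim statement of the paper's theorems; constant and logarithmic factors are hidden in the $\Omega(\cdot)$ and $O(\cdot)$ notation.

```latex
% Compact restatement of the bounds discussed in the abstract (paraphrase;
% constants and logarithmic factors are hidden in the asymptotic notation).
\begin{align*}
  &\text{Complexity measures:} && H^* \le D \\
  &\text{Generative model, lower bound:} && \Omega\!\left(\tfrac{SAD}{\varepsilon^2}\right)
    \quad \text{for some MDP with } D \simeq H^* \\
  &\text{Generative model, prior algorithm (requires } H^*\text{):} && O\!\left(\tfrac{SAH^*}{\varepsilon^2}\right) \\
  &\text{Generative model, this work (no prior knowledge):} && O\!\left(\tfrac{SAD}{\varepsilon^2}\right)
    \quad \text{in the small-}\varepsilon\text{ regime} \\
  &\text{Online setting, this work:} && O\!\left(\tfrac{SAD^2}{\varepsilon^2}\right)
\end{align*}
```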