Regret Minimization in MDPs with Options without Prior Knowledge
Ronan Fruit · Matteo Pirotta · Alessandro Lazaric · Emma Brunskill

Wed Dec 06 05:35 PM -- 05:40 PM (PST) @ Hall A

The option framework integrates temporal abstraction into the reinforcement learning model through the introduction of macro-actions (i.e., options). Recent works leveraged the mapping of Markov decision processes (MDPs) with options to semi-MDPs (SMDPs) and introduced SMDP versions of exploration-exploitation algorithms (e.g., R-max-SMDP and UCRL-SMDP) to analyze the impact of options on the learning performance. Nonetheless, the PAC-SMDP sample complexity of R-max-SMDP can hardly be translated into equivalent PAC-MDP theoretical guarantees, while UCRL-SMDP requires prior knowledge of the parameters characterizing the distributions of the cumulative reward and duration of each option, which are hardly available in practice. In this paper we remove this limitation by combining the SMDP view with the inner Markov structure of options into a novel algorithm whose regret performance matches UCRL-SMDP's up to an additive regret term. We show scenarios where this term is negligible and the advantage of temporal abstraction is preserved. We also report preliminary empirical results supporting the theoretical findings.
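To make the abstract's setting concrete, here is a minimal sketch of an option as a temporally extended action, using the standard (initiation set, inner policy, termination condition) definition. All names and the toy chain MDP below are illustrative assumptions, not taken from the paper; the point is only to show why executing an option yields a random duration and cumulative reward, which is what makes the MDP look like an SMDP at the level of option choices.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

# Illustrative option: where it can start, how it acts, when it stops.
@dataclass
class Option:
    initiation_set: Set[int]              # states where the option may be invoked
    policy: Callable[[int], int]          # inner Markov policy: state -> primitive action
    termination: Callable[[int], float]   # beta(s): probability of terminating in state s

def execute_option(option, state, step, rng):
    """Run the option until termination.

    Returns (next_state, cumulative_reward, duration). The random duration
    and cumulative reward are exactly the SMDP-level quantities whose
    distributions UCRL-SMDP assumes are partially known in advance.
    """
    total_reward, duration = 0.0, 0
    while True:
        action = option.policy(state)
        state, reward = step(state, action)
        total_reward += reward
        duration += 1
        if rng.random() < option.termination(state):
            return state, total_reward, duration

# Toy deterministic chain MDP (hypothetical): action 1 moves right,
# reward 1 is collected on reaching state 3.
def step(state, action):
    next_state = min(state + action, 3)
    return next_state, 1.0 if next_state == 3 else 0.0

rng = random.Random(0)
go_right = Option(initiation_set={0, 1, 2},
                  policy=lambda s: 1,
                  termination=lambda s: 1.0 if s == 3 else 0.0)
s, r, d = execute_option(go_right, 0, step, rng)
print(s, r, d)  # reaches state 3 after 3 primitive steps with cumulative reward 1.0
```

The paper's contribution can be read against this sketch: instead of treating `execute_option` as a black box with unknown reward/duration distributions, the proposed algorithm also exploits the inner Markov transitions visited inside the loop.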

Author Information

Ronan Fruit (Inria Lille)
Matteo Pirotta (Facebook AI Research)
Alessandro Lazaric (Facebook Artificial Intelligence Research)
Emma Brunskill (CMU)
