NeurIPS Poster Multi-Step Generalized Policy Improvement by Leveraging Approximate Models

Lucas N. Alegre · Ana Bazzan · Ann Nowe · Bruno C. da Silva

Great Hall & Hall B1+B2 (level 1) #1426

Abstract: We introduce a principled method for performing zero-shot transfer in reinforcement learning (RL) by exploiting approximate models of the environment. Zero-shot transfer in RL has been investigated by leveraging methods rooted in generalized policy improvement (GPI) and successor features (SFs). Although computationally efficient, these methods are model-free: they analyze a library of policies---each solving a particular task---and identify which action the agent should take. We investigate the more general setting where, in addition to a library of policies, the agent has access to an approximate environment model. Even though model-based RL algorithms can identify near-optimal policies, they are typically computationally intensive. We introduce $h$-GPI, a multi-step extension of GPI that interpolates between these extremes---standard model-free GPI and fully model-based planning---as a function of a parameter, $h$, regulating the amount of time the agent has to reason. We prove that $h$-GPI's performance lower bound is strictly better than GPI's, and show that $h$-GPI generally outperforms GPI as $h$ increases. Furthermore, we prove that as $h$ increases, $h$-GPI's performance becomes arbitrarily less susceptible to sub-optimality in the agent's policy library. Finally, we introduce novel bounds characterizing the gains achievable by $h$-GPI as a function of approximation errors in both the agent's policy library and its (possibly learned) model. These bounds strictly generalize those known in the literature. We evaluate $h$-GPI on challenging tabular and continuous-state problems under value function approximation and show that it consistently outperforms GPI and state-of-the-art competing methods under various levels of approximation errors.
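
To make the interpolation between model-free GPI and model-based planning concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes one natural reading of the abstract: start from the GPI value (the maximum over the library's action-value functions) and apply $h$ Bellman optimality backups under the approximate model. The names `h_gpi_q`, `h_gpi_action`, `P`, `R`, and `q_library`, as well as the three-state chain, are hypothetical and chosen only for illustration.

```python
import numpy as np

def h_gpi_q(P, R, q_library, gamma, h):
    """Sketch of h-step GPI action values on a tabular model (assumed form).

    P: (S, A, S) transition probabilities of the (possibly approximate) model.
    R: (S, A) expected rewards of the model.
    q_library: (n_policies, S, A) action-value estimates, one per library policy.
    """
    q = q_library.max(axis=0)       # h = 0 recovers standard model-free GPI
    for _ in range(h):
        v = q.max(axis=1)           # greedy state value, shape (S,)
        q = R + gamma * (P @ v)     # one Bellman optimality backup, shape (S, A)
    return q

def h_gpi_action(P, R, q_library, gamma, h, state):
    """Greedy action with respect to the h-step GPI values."""
    return int(np.argmax(h_gpi_q(P, R, q_library, gamma, h)[state]))

# Hypothetical 3-state chain: action 0 stays, action 1 moves right.
# State 2 is absorbing; the only reward is for moving right from state 1.
P = np.zeros((3, 2, 3))
for s in range(3):
    P[s, 0, s] = 1.0
    P[s, 1, min(s + 1, 2)] = 1.0
R = np.zeros((3, 2))
R[1, 1] = 1.0

# A deliberately uninformative library: a single all-zero value estimate.
q_library = np.zeros((1, 3, 2))

for h in (0, 1, 2):
    print(h, h_gpi_action(P, R, q_library, gamma=0.9, h=h, state=0))
```

In this toy setting the library carries no useful information, so with $h = 0$ or $h = 1$ the agent in state 0 sees only ties and defaults to staying, while with $h = 2$ the lookahead reaches the reward and the agent moves right. This mirrors the abstract's claim that larger $h$ reduces sensitivity to a sub-optimal library, at the cost of more computation per decision.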