

Poster

Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Jean Tarbouriech · Matteo Pirotta · Michal Valko · Alessandro Lazaric

Poster Session 1 #192

Abstract: We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we introduce a novel model-based approach that interleaves discovering new states from $s_0$ and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as $\tilde{O}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$, where $A$ is the number of actions, $S_{L+\epsilon}$ is the number of states that are incrementally reachable from $s_0$ in $L+\epsilon$ steps, and $\Gamma_{L+\epsilon}$ is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both $\epsilon$ and $L$ at the cost of an extra $\Gamma_{L+\epsilon}$ factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem defined on the $L$-reachable states with minimum cost $c_{\min}$. Finally, we report preliminary empirical results confirming our theoretical findings.
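The abstract describes an interleaving of two phases: discovering new states reachable from $s_0$ and refining a transition-model estimate used to compute goal-conditioned policies. The snippet below is a minimal, illustrative Python sketch of that interleaving loop, not the authors' DisCo implementation: it assumes generative-model access to a small tabular chain MDP (DisCo instead collects samples by executing goal-conditioned policies), and all names and constants (S, A, L, EPS, shortest_path_values) are hypothetical choices for the example.

import numpy as np

# Toy tabular chain MDP used only to illustrate the interleaving loop from the
# abstract (discover new states from s0, refit the model, recompute
# goal-conditioned policies). Constants are illustrative, not from the paper.
S, A, L, EPS = 6, 2, 8, 0.5
rng = np.random.default_rng(0)

# True dynamics P[s, a] (unknown to the agent): action 0 steps left,
# action 1 steps right with probability 0.7 and stays put otherwise.
P = np.zeros((S, A, S))
for s in range(S):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, S - 1)] = 0.7
    P[s, 1, s] += 0.3

def shortest_path_values(P_hat, goal):
    """Value iteration for the expected number of steps to reach `goal`
    under the estimated model P_hat, treating the goal as absorbing."""
    V = np.zeros(S)
    for _ in range(500):
        Q = 1.0 + P_hat @ V                          # (S, A): one step plus expected cost-to-go
        V_new = np.minimum(Q.min(axis=1), 2 * L)     # truncate so hard-to-reach states stay finite
        V_new[goal] = 0.0
        if np.max(np.abs(V_new - V)) < 1e-6:
            break
        V = V_new
    return V

# Transition counts used to estimate the model from samples.
counts = np.zeros((S, A, S))

def sample_transitions(n_per_pair):
    """Collect n_per_pair samples from every (s, a); a simplified stand-in for
    the policy-driven sample-collection phase described in the abstract."""
    for s in range(S):
        for a in range(A):
            for s2 in rng.choice(S, size=n_per_pair, p=P[s, a]):
                counts[s, a, s2] += 1

# Interleaving loop: K holds the states currently believed to be incrementally
# reachable from s0 = 0 within L (+ EPS) expected steps; it grows as the
# estimated model becomes more accurate.
K = {0}
for _ in range(10):
    sample_transitions(n_per_pair=50)
    P_hat = counts / np.maximum(counts.sum(axis=-1, keepdims=True), 1)
    P_hat[counts.sum(axis=-1) == 0] = 1.0 / S        # uniform guess where unvisited
    newly_found = set()
    for g in range(S):
        if g in K:
            continue
        V = shortest_path_values(P_hat, goal=g)
        if V[0] <= L + EPS:                           # estimated cost from s0 within L + eps
            newly_found.add(g)
    if not newly_found:
        break
    K |= newly_found

print("States deemed (L + eps)-reachable from s0:", sorted(K))

In this sketch the reachability check plays the role of the accuracy-driven discovery step: a candidate goal enters the discovered set only once the estimated model certifies a policy reaching it from $s_0$ in at most $L + \epsilon$ expected steps.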
