Poster

Near-optimal Reinforcement Learning in Factored MDPs

Ian Osband · Benjamin Van Roy

Level 2, room 210D

Abstract: Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer Ω(√(SAT)) regret on some MDP, where T is the elapsed time and S and A are the cardinalities of the state and action spaces. This implies T = Ω(SA) time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, S and A can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than S or A. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
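For readers unfamiliar with the posterior-sampling idea behind PSRL, the following is a minimal sketch on an ordinary tabular MDP, not the paper's factored setting: at the start of each episode, sample an MDP from the posterior over models, act optimally under the sample, then update the posterior with the observed transitions. All sizes and names here (S, A, H, the Dirichlet prior, solve_finite_horizon, true_P) are illustrative assumptions rather than the paper's construction.

```python
# Illustrative PSRL sketch: tabular MDP, known rewards, Dirichlet posterior
# over transitions. Assumed setup, not the factored-MDP algorithm of the paper.
import numpy as np

S, A, H = 5, 2, 10          # states, actions, episode length (assumed sizes)
rng = np.random.default_rng(0)

# Hypothetical ground-truth MDP used only to simulate experience.
true_P = rng.dirichlet(np.ones(S), size=(S, A))   # true transition kernel
R = rng.uniform(0.0, 1.0, size=(S, A))            # known reward function

# Dirichlet(1) prior over next-state distributions for each (s, a).
dirichlet_counts = np.ones((S, A, S))

def solve_finite_horizon(P, R, H):
    """Backward induction: greedy policy for each step of an H-step episode."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                  # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for episode in range(200):
    # 1. Sample a plausible MDP from the posterior over transition kernels.
    sampled_P = np.array([[rng.dirichlet(dirichlet_counts[s, a])
                           for a in range(A)] for s in range(S)])
    # 2. Compute the optimal policy for the sampled MDP.
    policy = solve_finite_horizon(sampled_P, R, H)
    # 3. Follow that policy for one episode, recording transitions.
    s = rng.integers(S)
    for h in range(H):
        a = policy[h, s]
        s_next = rng.choice(S, p=true_P[s, a])
        dirichlet_counts[s, a, s_next] += 1   # 4. Update the posterior.
        s = s_next
```

The factored-MDP algorithms in the paper follow the same episodic template but exploit the factored structure so that the posterior (for PSRL) or the confidence sets (for UCRL-Factored) are maintained per factor, which is what yields regret polynomial in the number of encoding parameters rather than in S and A.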