Poster
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
Shipra Agrawal · Randy Jia
Pacific Ballroom #1
Keywords: [ Bandit Algorithms ] [ Reinforcement Learning ]
Abstract:
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions, and diameter $D$, when $T \ge S^5 A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy, in time horizon $T$. This result improves over the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
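To illustrate the posterior-sampling idea at the heart of the abstract, the sketch below samples an MDP's transition probabilities from Dirichlet posteriors over observed transition counts and then plans in the sampled model. This is a simplified, hypothetical illustration, not the paper's algorithm: the actual method draws multiple optimistic posterior samples per epoch and analyzes the undiscounted average-reward criterion, whereas here we use a single sample and discounted value iteration as a planning proxy; all function names are ours.

```python
import numpy as np

def sample_mdp(counts, seed=None):
    """Draw one plausible transition model from Dirichlet posteriors.

    counts[s, a, s'] = observed transition counts plus prior pseudo-counts
    (all entries must be positive). Returns P with P[s, a] a distribution
    over next states.
    """
    S, A, _ = counts.shape
    rng = np.random.default_rng(seed)
    P = np.zeros_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

def value_iteration(P, R, gamma=0.95, iters=500):
    """Plan in the sampled MDP (discounted proxy for average reward).

    P: (S, A, S) transitions, R: (S, A) rewards.
    Returns optimal values V and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # (S, A): one-step lookahead values
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

# Usage: a toy 2-state, 2-action MDP with uniform Dirichlet(1) prior
# plus a few fictitious observations (for illustration only).
counts = np.ones((2, 2, 2))
counts[0, 1, 1] += 5.0   # action 1 in state 0 was seen to reach state 1
R = np.array([[0.0, 0.1],
              [1.0, 0.0]])  # state 1 is rewarding under action 0
P = sample_mdp(counts, seed=0)
V, policy = value_iteration(P, R)
```

In the full algorithm, re-sampling the model at the start of each epoch is what drives exploration: a state-action pair with few observations has a diffuse Dirichlet posterior, so some sample will make it look attractive, and the anti-concentration results the abstract mentions quantify exactly how often that happens.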