Poster
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
Shipra Agrawal · Randy Jia
Pacific Ballroom #1
Keywords: [ Bandit Algorithms ] [ Reinforcement Learning ]
Abstract:
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high-probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions, and diameter D, when T ≥ S⁵A. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy, over time horizon T. This result improves over the best previously known upper bound of Õ(DS√(AT)) achieved by any algorithm in this setting, and matches the dependence on S in the established lower bound of Ω(√(DSAT)) for this problem. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
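The abstract describes the algorithmic template without pseudocode. Below is a minimal, hedged sketch of the generic posterior-sampling (PSRL-style) loop that such algorithms build on, written in Python with NumPy. It is an illustration under simplifying assumptions (episodic resampling, known rewards, a finite-horizon value-iteration planner standing in for average-reward planning), not the paper's specific optimistic variant; the names psrl, sample_mdp, and greedy_policy are hypothetical.

```python
import numpy as np

def sample_mdp(trans_counts, rng):
    """Draw one transition model from the Dirichlet posterior, row by row."""
    S, A, _ = trans_counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(trans_counts[s, a])
    return P

def greedy_policy(P, R, horizon):
    """Finite-horizon value iteration on the sampled MDP; returns a greedy
    stationary policy (a simple stand-in for average-reward planning)."""
    V = np.zeros(P.shape[0])
    for _ in range(horizon):
        Q = R + P @ V          # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def psrl(true_P, R, episodes=50, horizon=20, seed=0):
    """Generic posterior-sampling loop (a sketch, not the paper's algorithm):
    sample an MDP from the posterior, act optimally in the sample for one
    episode, then update the posterior from the observed transitions."""
    rng = np.random.default_rng(seed)
    S, A, _ = true_P.shape
    counts = np.ones((S, A, S))   # Dirichlet(1, ..., 1) prior on each row P(.|s, a)
    total_reward, s = 0.0, 0
    for _ in range(episodes):
        P_hat = sample_mdp(counts, rng)
        pi = greedy_policy(P_hat, R, horizon)
        for _ in range(horizon):
            a = pi[s]
            total_reward += R[s, a]
            s_next = rng.choice(S, p=true_P[s, a])
            counts[s, a, s_next] += 1      # conjugate posterior update
            s = s_next
    return total_reward

# Toy communicating MDP with 3 states, 2 actions, and known rewards.
rng = np.random.default_rng(1)
S, A = 3, 2
true_P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
R = rng.uniform(size=(S, A))                      # known mean rewards in [0, 1]
print(psrl(true_P, R))
```

The Dirichlet prior is conjugate to the categorical transition model, which is why the posterior update reduces to a count increment; it is also the distribution whose anti-concentration properties the abstract highlights as the paper's key technical contribution.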