Poster
Variational Bayesian Reinforcement Learning with Regret Bounds
Brendan O'Donoghue
We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent endowed with an exponential epistemic-risk-seeking utility function explores efficiently, as measured by regret. The state-action values induced by the exponential utility satisfy a Bellman recursion, so we can use dynamic programming to compute them. We call the resulting algorithm K-learning (for knowledge), and the risk-seeking utility ensures that the associated state-action values (K-values) are optimistic for the expected optimal Q-values under the posterior. The exponential utility function induces a Boltzmann exploration policy whose 'temperature' parameter equals the risk-seeking parameter and is carefully controlled to yield a Bayes regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed timesteps. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
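The abstract describes the K-value Bellman recursion and the induced Boltzmann policy only in words. The sketch below illustrates, under simplifying assumptions, how such a backward-induction computation might look in a tabular, finite-horizon setting; it is a minimal illustration rather than the paper's implementation. The inputs `mu_r`, `sigma_r`, and `P_hat` (assumed posterior reward statistics and transition estimates), the variance-based bonus term, and the fixed temperature `tau` are all illustrative choices, not the paper's exact quantities or API.

```python
import numpy as np

def k_learning_sketch(mu_r, sigma_r, P_hat, tau, horizon):
    """Sketch of a K-learning-style backward induction (illustrative only).

    mu_r, sigma_r : (S, A) assumed posterior mean / std of rewards
    P_hat         : (S, A, S) assumed transition-probability estimates
    tau           : Boltzmann temperature (the risk-seeking parameter)
    horizon       : episode length L
    Returns K-values and the induced Boltzmann policy, each of shape (L, S, A).
    """
    S, A = mu_r.shape
    K = np.zeros((horizon, S, A))
    policy = np.zeros((horizon, S, A))
    for l in range(horizon - 1, -1, -1):
        if l == horizon - 1:
            future = 0.0  # no continuation value at the final step
        else:
            # soft-max (log-sum-exp at temperature tau) of next-step K-values
            v_next = tau * np.log(np.exp(K[l + 1] / tau).sum(axis=1))  # (S,)
            future = P_hat @ v_next                                    # (S, A)
        # exponential-utility Bellman recursion with a variance-based bonus
        K[l] = mu_r + sigma_r ** 2 / (2.0 * tau) + future
        # Boltzmann exploration policy induced by the K-values
        logits = K[l] / tau
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        policy[l] = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return K, policy

# Tiny usage example with made-up numbers (hypothetical problem sizes)
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, L = 3, 2, 4
    mu_r = rng.normal(size=(S, A))
    sigma_r = 0.5 * np.ones((S, A))
    P_hat = rng.dirichlet(np.ones(S), size=(S, A))  # each (s, a) row sums to 1
    K, pi = k_learning_sketch(mu_r, sigma_r, P_hat, tau=1.0, horizon=L)
    print(pi[0])  # exploration policy at the first timestep
```

In the paper the temperature is scheduled over time to obtain the stated regret bound; the fixed `tau` here is purely for readability of the sketch.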
Author Information
Brendan O'Donoghue (DeepMind)
More from the Same Authors
- 2021 Spotlight: Variational Bayesian Optimistic Sampling »
  Brendan O'Donoghue · Tor Lattimore
- 2021 Spotlight: Reward is enough for convex MDPs »
  Tom Zahavy · Brendan O'Donoghue · Guillaume Desjardins · Satinder Singh
- 2021 Poster: Reward is enough for convex MDPs »
  Tom Zahavy · Brendan O'Donoghue · Guillaume Desjardins · Satinder Singh
- 2021 Poster: Practical Large-Scale Linear Programming using Primal-Dual Hybrid Gradient »
  David Applegate · Mateo Diaz · Oliver Hinder · Haihao Lu · Miles Lubin · Brendan O'Donoghue · Warren Schudy
- 2021 Poster: Variational Bayesian Optimistic Sampling »
  Brendan O'Donoghue · Tor Lattimore
- 2019 Poster: Hamiltonian descent for composite objectives »
  Brendan O'Donoghue · Chris Maddison