Poster in Workshop: OPT 2023: Optimization for Machine Learning
From 6235149080811616882909238708 to 29: Vanilla Thompson Sampling Revisited
Bingshan Hu · Tianyue Zhang
Abstract:
In this work, we derive a new problem-dependent regret bound for Thompson Sampling with Gaussian priors (Algorithm 2 in [Agrawal and Goyal, 2017]), a classical stochastic bandit algorithm that has demonstrated excellent empirical performance and has been widely deployed in real-world applications. The existing regret bound is $\sum_{i \in A: \Delta_i > 0}\left(\frac{288\,(e^{64}+6)\ln(T\Delta_i^2+e^{32})}{\Delta_i}+\frac{10.5}{\Delta_i}+\Delta_i\right)$, where $A$ is the arm set, $\Delta_i$ denotes the single-round performance loss when pulling a sub-optimal arm $i$ instead of the optimal arm, and $T$ is the learning time horizon. Since real-world learning tasks care about performance within a finite learning time horizon $T$, the existing regret bound is only non-vacuous when $T > 288 \cdot e^{64}$, which may not be practical. Our new regret bound is $\sum_{i \in A: \Delta_i > 0}\left(\frac{1252\ln(T\Delta_i^2+100^{\frac{1}{3}})}{\Delta_i}+\frac{18\ln(T\Delta_i^2)}{\Delta_i}+\frac{182.5}{\Delta_i}+\Delta_i\right)$, which significantly tightens the coefficient of the leading term. Despite these improvements, we emphasize that the goal of this work is to deepen the theoretical understanding of Thompson Sampling in order to unlock the full potential of this classical learning algorithm for solving challenging real-world learning problems.
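For readers less familiar with the algorithm under study, below is a minimal sketch of Thompson Sampling with Gaussian priors in the spirit of Algorithm 2 of Agrawal and Goyal [2017]: each arm receives a Gaussian sample centered at its empirical mean with variance shrinking as the arm is pulled, and the arm with the largest sample is played. The Bernoulli reward model, the arm means, and the horizon used here are illustrative assumptions for this sketch, not the paper's setup.

```python
import numpy as np

def gaussian_thompson_sampling(true_means, T, rng=None):
    """Sketch of Thompson Sampling with Gaussian priors.

    Each round, every arm i receives a sample
        theta_i ~ N(mu_hat_i, 1 / (n_i + 1)),
    where mu_hat_i is arm i's empirical mean reward (with one prior
    pseudo-pull), and the arm with the largest sample is pulled.
    Returns the accumulated pseudo-regret sum of Delta_i over pulls.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(true_means)
    pulls = np.zeros(K)          # n_i: number of times arm i was pulled
    reward_sums = np.zeros(K)    # cumulative observed reward of arm i
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(T):
        mu_hat = reward_sums / (pulls + 1)                     # empirical means with a pseudo-pull
        theta = rng.normal(mu_hat, 1.0 / np.sqrt(pulls + 1))   # Gaussian posterior-style samples
        arm = int(np.argmax(theta))                            # play the arm with the largest sample
        reward = float(rng.random() < true_means[arm])         # illustrative Bernoulli reward
        pulls[arm] += 1
        reward_sums[arm] += reward
        regret += best_mean - true_means[arm]

    return regret

# Illustrative run with a gap of Delta = 0.1 between the two arms.
print(gaussian_thompson_sampling(true_means=[0.5, 0.6], T=10_000))
```

In this illustrative two-armed instance the gap is $\Delta = 0.1$, and the regret bounds discussed above describe how the expectation of the returned pseudo-regret scales with $T$ and the gaps $\Delta_i$.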