Spotlight Poster
Stochastic Multi-armed Bandits: Optimal Trade-off among Optimality, Consistency, and Tail Risk
David Simchi-Levi · Zeyu Zheng · Feng Zhu
Great Hall & Hall B1+B2 (level 1) #1820
Abstract:
We consider the stochastic multi-armed bandit problem and fully characterize the interplays among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of expected regret exactly affects the decaying rate of the regret tail probability for both the worst-case and instance-dependent scenario. A novel policy is proposed to achieve the optimal regret tail risk for any regret threshold. Concretely, for any given and , our policy achieves a worst-case expected regret of and instance-dependent expected regret of , while enjoys a probability of incurring an regret that decays exponentially with a polynomial term. Such decaying rate is proved to be best achievable. We also generalize our analysis to the stochastic multi-armed bandit problem with non-stationary baseline rewards, where in each time period , the decision maker pulls one of arms and collects a reward which is the sum of three terms: the mean of the pulled arm, an independent noise, and a non-stationary baseline reward as a function of . Our results reveal insights on the trade-off between expected regret and tail risk for both worst-case and instance-dependent scenario, indicating that more sub-optimality and inconsistency leaves space for more light-tailed risk of incurring a large regret.
Chat is not available.