NeurIPS Poster Stochastic Multi-armed Bandits: Optimal Trade-off among Optimality, Consistency, and Tail Risk

Spotlight Poster

Stochastic Multi-armed Bandits: Optimal Trade-off among Optimality, Consistency, and Tail Risk

David Simchi-Levi · Zeyu Zheng · Feng Zhu

Great Hall & Hall B1+B2 (level 1) #1820

[ Abstract ]

[ Paper] [ Slides] [ Poster] [ OpenReview]

Abstract: We consider the stochastic multi-armed bandit problem and fully characterize the interplays among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of expected regret exactly affects the decaying rate of the regret tail probability for both the worst-case and instance-dependent scenario. A novel policy is proposed to achieve the optimal regret tail risk for any regret threshold. Concretely, for any given

α \in [1 / 2, 1)

$\alpha\in[1/2, 1)$ and

β \in [0, 1)

$\beta\in[0, 1)$ , our policy achieves a worst-case expected regret of

\tilde{O} (T^{α})

$\tilde O(T^\alpha)$ and instance-dependent expected regret of

\tilde{O} (T^{β})

$\tilde O(T^\beta)$ , while enjoys a probability of incurring an

Ω (T^{δ})

$\Omega(T^\delta)$ regret that decays exponentially with a polynomial

T

$T$ term. Such decaying rate is proved to be best achievable. We also generalize our analysis to the stochastic multi-armed bandit problem with non-stationary baseline rewards, where in each time period

t

$t$ , the decision maker pulls one of

K

$K$ arms and collects a reward which is the sum of three terms: the mean of the pulled arm, an independent noise, and a non-stationary baseline reward as a function of

t

$t$ . Our results reveal insights on the trade-off between expected regret and tail risk for both worst-case and instance-dependent scenario, indicating that more sub-optimality and inconsistency leaves space for more light-tailed risk of incurring a large regret.

Chat is not available.