Timezone: »
Poster
A Simple and Optimal Policy Design for Online Learning with Safety against Heavy-tailed Risk
David Simchi-Levi · Zeyu Zheng · Feng Zhu
We consider the classical multi-armed bandit problem and design simple-to-implement new policies that simultaneously enjoy two properties: worst-case optimality for the expected regret, and safety against heavy-tailed risk for the regret distribution. Recently, Fan and Glynn (2021) showed that information-theoretic optimized bandit policies as well as standard UCB policies suffer from some serious heavy-tailed risk; that is, the probability of incurring a linear regret slowly decays at a polynomial rate of $1/T$, as $T$ (the time horizon) increases. Inspired by their result, we further show that any policy that incurs an instance-dependent $O(\ln T)$ regret must incur a linear regret with probability $\Omega(\mathrm{poly}(1/T))$ and that the heavy-tailed risk actually exists for all "instance-dependent consistent" policies. Next, for the two-armed bandit setting, we provide a simple policy design that (i) has the worst-case optimality for the expected regret at order $\tilde O(\sqrt{T})$ and (ii) has the worst-case tail probability of incurring a linear regret decay at an exponential rate $\exp(-\Omega(\sqrt{T}))$. We further prove that this exponential decaying rate of the tail probability is optimal across all policies that have worst-case optimality for the expected regret. Finally, we generalize the policy design and analysis to the general setting with an arbitrary $K$ number of arms. We provide detailed characterization of the tail probability bound for any regret threshold under our policy design. Numerical experiments are conducted to illustrate the theoretical findings. Our results reveal insights on the incompatibility between consistency and light-tailed risk, whereas indicate that worst-case optimality on expected regret and light-tailed risk are compatible.
Author Information
David Simchi-Levi (MIT)
Zeyu Zheng (University of California Berkeley)
Feng Zhu (Massachusetts Institute of Technology)
More from the Same Authors
-
2021 : Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation »
Dylan Foster · Akshay Krishnamurthy · David Simchi-Levi · Yunzong Xu -
2022 : Mind Your Step: Continuous Conditional GANs with Generator Regularization »
Yunkai Zhang · Yufeng Zheng · Amber Ma · Siyuan Teng · Zeyu Zheng -
2022 Poster: Gradient-Free Methods for Deterministic and Stochastic Nonsmooth Nonconvex Optimization »
Tianyi Lin · Zeyu Zheng · Michael Jordan -
2022 Poster: Learning Mixed Multinomial Logits with Provable Guarantees »
Yiqun Hu · David Simchi-Levi · Zhenzhen Yan -
2022 Poster: Context-Based Dynamic Pricing with Partially Linear Demand Model »
Jinzhi Bu · David Simchi-Levi · Chonghuan Wang -
2021 : Contributed Talk 3: Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation »
Yunzong Xu · Akshay Krishnamurthy · David Simchi-Levi -
2021 Poster: Stochastic $L^\natural$-convex Function Minimization »
Haixiang Zhang · Zeyu Zheng · Javad Lavaei -
2019 Poster: Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints »
David Simchi-Levi · Yunzong Xu -
2018 Poster: The Lingering of Gradients: How to Reuse Gradients Over Time »
Zeyuan Allen-Zhu · David Simchi-Levi · Xinshang Wang