NeurIPS Poster A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

Saeed Masoudian · Julian Zimmert · Yevgeny Seldin

Hall J (level 1) #818

Keywords: [ Best-of-both-worlds ] [ Delayed Bandit ] [ multi-armed bandit ]

Abstract: We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multi-armed bandits with delayed feedback, which, in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin [2020], simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_{i}}\right) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which is based on oracle knowledge of the maximal delay $d_{max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{max}K^{1/3} \log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{max}}{\Delta_{i}}\right) + d_{max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.
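To get a feel for how the two fixed-delay guarantees compare, the expressions inside the $\mathcal{O}(\cdot)$ can be evaluated directly. The sketch below is only an illustration: the function names `adversarial_bound` and `stochastic_bound` are hypothetical, all hidden constants are assumed to be 1, and the gap values are made up, so the outputs indicate growth rates rather than actual regret.

```python
import math


def adversarial_bound(T, K, d):
    """sqrt(TK) + sqrt(d T log K), with hidden constants assumed to be 1."""
    return math.sqrt(T * K) + math.sqrt(d * T * math.log(K))


def stochastic_bound(T, K, d, gaps):
    """Sum over suboptimal arms of (log T)/gap + d/gap, plus d K^(1/3) log K.

    `gaps` lists the suboptimality gaps of the suboptimal arms only,
    so it is expected to hold K - 1 strictly positive values.
    """
    per_arm = sum(math.log(T) / g + d / g for g in gaps)
    return per_arm + d * K ** (1 / 3) * math.log(K)


if __name__ == "__main__":
    # Illustrative (assumed) problem instance: 10 arms, fixed delay 10,
    # and every suboptimal arm has gap 0.1.
    T, K, d = 10**6, 10, 10
    gaps = [0.1] * (K - 1)
    print("adversarial expression:", adversarial_bound(T, K, d))
    print("stochastic expression: ", stochastic_bound(T, K, d, gaps))
```

On this assumed instance the stochastic expression is smaller than the adversarial one, which reflects the logarithmic versus square-root dependence on $T$. With small enough gaps the ordering reverses, since the stochastic expression scales with $1/\Delta_i$.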
