We consider a general adversarial multi-armed blocking bandit setting, where each played arm can be blocked (unavailable) for some time periods and the reward per arm is chosen adversarially at each time period without obeying any distribution. The setting models scenarios of allocating scarce, limited supplies (e.g., arms) where the supplies replenish and can be reused only after certain time periods. We first show that, in the optimization setting, when the blocking durations and rewards are known in advance, finding an optimal policy (i.e., determining which arm to play per round) that maximises the cumulative reward is strongly NP-hard, eliminating the possibility of a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. To complement this result, we show that a greedy algorithm that plays the best available arm at each round provides an approximation guarantee that depends on the blocking durations and the path variance of the rewards. In the bandit setting, when the blocking durations and rewards are not known, we design two algorithms, RGA and RGA-META, for the case of bounded blocking durations and path variation. In particular, when the variation budget B_T is known in advance, RGA can achieve O(\sqrt{T(2\tilde{D}+K)B_T}) dynamic approximate regret. On the other hand, when B_T is not known, we show that the dynamic approximate regret of RGA-META is at most O((K+\tilde{D})^{1/4}\tilde{B}^{1/2}T^{3/4}), where \tilde{B} is the maximal path variation budget within each batch of RGA-META (which is provably of order o(\sqrt{T})). We also prove that if either the variation budget or the maximal blocking duration is unbounded, the approximate regret will be at least \Theta(T). Finally, we show that the regret upper bound of RGA is tight if the blocking durations are bounded above by an order of O(1).
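The greedy algorithm referenced in the abstract, in the offline setting where rewards and blocking durations are known, simply plays the highest-reward arm among those currently available. The sketch below is an illustrative reconstruction under that description only (function name, data layout, and the convention that an arm played at round t with duration d becomes available again at round t + d are our assumptions, not taken from the paper):

```python
def greedy_blocking_bandit(rewards, durations):
    """Greedy policy for the offline blocking-bandit problem (a sketch).

    rewards:   T x K table, rewards[t][k] = reward of arm k at round t.
    durations: length-K list; an arm played at round t is assumed to be
               blocked until round t + durations[k].

    At each round, play the available arm with the highest reward for
    that round. Returns total collected reward and the played sequence
    (None when every arm is blocked).
    """
    T, K = len(rewards), len(durations)
    free_at = [0] * K          # earliest round at which each arm is playable
    total, plays = 0.0, []
    for t in range(T):
        avail = [k for k in range(K) if free_at[k] <= t]
        if not avail:
            plays.append(None)  # all arms blocked this round
            continue
        k = max(avail, key=lambda a: rewards[t][a])
        total += rewards[t][k]
        free_at[k] = t + durations[k]   # arm k is now blocked
        plays.append(k)
    return total, plays
```

For example, with two arms, rewards [[5, 1], [4, 1], [3, 1]] and durations [2, 1], the policy plays arm 0, then arm 1 (arm 0 is blocked), then arm 0 again, collecting 9 in total. The abstract's hardness result implies no such myopic rule can be optimal in general; it only achieves the stated approximation guarantee.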
Author Information
Nicholas Bishop (University of Southampton)
Hau Chan (University of Nebraska-Lincoln)
Debmalya Mandal (Columbia University)
Long Tran-Thanh (University of Warwick)
More from the Same Authors

2022 : Multiresolution Mesh Networks For Learning Dynamical Fluid Simulations
Bach Nguyen · Truong Son Hy · Long Tran-Thanh · Risi Kondor
2022 : Fuzzy c-Means Clustering in Persistence Diagram Space for Deep Learning Model Selection
Thomas Davies · Jack Aspinall · Bryan Wilder · Long Tran-Thanh
2022 Poster: Expected Improvement for Contextual Bandits
Hung Tran-The · Sunil Gupta · Santu Rana · Tuan Truong · Long Tran-Thanh · Svetha Venkatesh
2020 : Spotlight: Hypothesis Classes with a Unique Persistence Diagram are Non-uniformly Learnable
Nicholas Bishop · Long Tran-Thanh · Thomas Davies
2020 Poster: Optimal Learning from Verified Training Data »
Nicholas Bishop · Long Tran-Thanh · Enrico Gerding