Algorithm design and sharper bounds for improving-bandits
Avrim Blum · Marten Garicano · Kavya Ravichandran · Dravyansh Sharma
Abstract
The improving multi-armed bandits problem is a non-stationary bandits setting that can model a wide range of applications, including online advertising, content and product recommendation, clinical trials, and hyperparameter selection from learning curves. The reward from each arm is assumed to increase monotonically with every pull of that arm, but with diminishing returns. A growing line of work has designed algorithms for improving-bandits, albeit with somewhat pessimistic worst-case guarantees: strong lower bounds show that, relative to the optimal arm, deterministic algorithms must incur an $\Omega(k)$ multiplicative approximation factor and randomized algorithms an $\Omega(\sqrt{k})$ factor, where $k$ is the number of bandit arms. In this work, we propose a new parameterized family of bandit algorithms that includes the optimal randomized algorithm from prior work. We show that an appropriately chosen algorithm from this family achieves stronger guarantees, with optimal dependence on $k$, when the arm reward curves satisfy additional properties related to the strength of their concavity. We further bound the sample complexity of learning a near-optimal algorithm from the family using offline data. By taking a statistical learning perspective on the bandit reward optimization problem, we obtain stronger data-dependent guarantees without needing to verify whether these assumptions actually hold.
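To make the setting concrete, the following is a minimal sketch, not the paper's model or algorithm: each arm's reward grows monotonically in its own pull count with diminishing returns (here an assumed saturating-exponential curve), and an algorithm must decide how long to persist with an arm before abandoning it. The threshold rule below is a hypothetical heuristic for intuition only.

```python
import numpy as np

# Illustrative sketch only (not the paper's algorithm): a toy improving-bandits
# environment in which each arm's reward increases monotonically with each pull
# but with diminishing returns. The concave parametric form
# c_i * (1 - exp(-t / s_i)) and the threshold rule are assumptions made purely
# to visualize the setting described in the abstract.

rng = np.random.default_rng(0)
k, horizon = 5, 200
ceilings = rng.uniform(0.2, 1.0, size=k)   # asymptotic reward of each arm
scales = rng.uniform(5.0, 50.0, size=k)    # how quickly each arm improves
pulls = np.zeros(k, dtype=int)

def reward(i: int) -> float:
    """Reward of arm i on its next pull; monotone and concave in pulls[i]."""
    return ceilings[i] * (1.0 - np.exp(-pulls[i] / scales[i]))

# Hypothetical heuristic: keep pulling an arm while its per-pull improvement
# exceeds a threshold; otherwise abandon it and move on to the next arm.
threshold, total = 0.002, 0.0
arm, last = 0, None
for _ in range(horizon):
    r = reward(arm)
    improvement = float("inf") if last is None else r - last
    total += r
    pulls[arm] += 1
    last = r
    if improvement < threshold and arm < k - 1:
        arm, last = arm + 1, None   # the curve has flattened; try the next arm

print(f"total reward over {horizon} pulls: {total:.2f}")
```

The sketch illustrates the core tension the abstract alludes to: abandoning an arm too early forfeits its eventual high reward, while persisting too long with a flattening arm wastes pulls that lower bounds show can cost factors of $k$ or $\sqrt{k}$ in the worst case.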