NeurIPS Poster Choice Bandits

Choice Bandits

Arpit Agarwal · Nicholas Johnson · Shivani Agarwal

Poster Session 2 #587

Abstract: There has been much interest in recent years in the problem of dueling bandits, where on each round the learner plays a pair of arms and receives as feedback the outcome of a relative pairwise comparison between them. Here we study a natural generalization, which we term choice bandits, where the learner plays a set of up to $k \geq 2$ arms and receives limited relative feedback in the form of a single multiway choice among the pulled arms, drawn from an underlying multiway choice model. We study choice bandits under a very general class of choice models that is characterized by the existence of a unique 'best' arm (which we term the generalized Condorcet winner), and that includes as special cases the well-studied multinomial logit (MNL) and multinomial probit (MNP) choice models and, more generally, the class of random utility models with i.i.d. noise (IID-RUMs). We propose an algorithm for choice bandits, termed Winner Beats All (WBA), with a distribution-dependent $O(\log T)$ regret bound under all these choice models. The challenge in our setting is that the decision space is of size $\Theta(n^k)$, which is large for even moderate $k$. Our algorithm addresses this challenge by extracting just $O(n^2)$ statistics from multiway choices and exploiting the existence of a unique 'best' arm to find arms that are competitive with this arm in order to construct sets with low regret. Since these statistics are extracted from the same choice observations, a careful martingale analysis is needed to show that they are concentrated. We complement our upper bound with a lower bound, which shows that our upper bound is order-wise optimal. Our experiments demonstrate that for the special case $k=2$, our algorithm is competitive with previous dueling bandit algorithms, and for the more general case $k>2$, it outperforms the recently proposed MaxMinUCB algorithm designed for the MNL model.
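
To make the feedback model concrete, the following is a minimal sketch of choice feedback under an MNL model, where the probability that arm $i$ is chosen from a played set $S$ is taken to be $w_i / \sum_{j \in S} w_j$. It also shows one plausible way to reduce multiway choices to $O(n^2)$ pairwise win counts. This is not the WBA algorithm: the sets are played uniformly at random, and the function names, the weights, and the counting rule (credit the chosen arm with one win over every other arm in the set) are assumptions made for illustration only.

```python
import random


def mnl_choice(subset, weights, rng):
    """Draw one winner from `subset` with probability proportional to its MNL weight."""
    return rng.choices(subset, weights=[weights[a] for a in subset], k=1)[0]


def update_pairwise_counts(wins, subset, winner):
    """Illustrative rule (an assumption): the winner beats every other arm in the set."""
    for arm in subset:
        if arm != winner:
            wins[winner][arm] += 1


def pairwise_estimate(wins, i, j):
    """Empirical fraction of i-vs-j outcomes won by i; 0.5 when there is no data."""
    total = wins[i][j] + wins[j][i]
    return wins[i][j] / total if total else 0.5


def simulate(weights, k, rounds, seed=0):
    """Play uniformly random sets of size k and collect an n x n matrix of win counts."""
    rng = random.Random(seed)
    n = len(weights)
    arms = list(range(n))
    wins = [[0] * n for _ in range(n)]
    for _ in range(rounds):
        subset = rng.sample(arms, k)           # hypothetical exploration rule
        winner = mnl_choice(subset, weights, rng)
        update_pairwise_counts(wins, subset, winner)
    return wins


if __name__ == "__main__":
    # Hypothetical instance: arm 0 has the largest weight, so it is the 'best' arm.
    weights = [8.0, 1.0, 1.0, 1.0]
    wins = simulate(weights, k=3, rounds=3000)
    for j in range(1, len(weights)):
        print(f"estimated P(0 beats {j}) = {pairwise_estimate(wins, 0, j):.3f}")
```

Under the MNL assumption, the fraction `wins[i][j] / (wins[i][j] + wins[j][i])` estimates $w_i / (w_i + w_j)$ regardless of which other arms were in the played set, so the $n \times n$ matrix summarizes feedback collected over a much larger space of sets.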
