NeurIPS Poster Nearly Minimax Optimal Regret for Multinomial Logistic Bandit


Joongkyu Lee · Min-hwan Oh

West Ballroom A-D #5510


Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: In this paper, we study the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects an assortment based on contextual information and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{\mathcal{O}}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{\mathcal{O}}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality, in either the uniform or the non-uniform reward setting, and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
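For readers unfamiliar with the feedback model, the following is a minimal sketch of how choice probabilities could be computed under a standard MNL model with a linear utility and an outside (no-choice) option of utility zero. The abstract does not spell out the model, so this form is an assumption, and the names `mnl_choice_probs`, `expected_reward`, `features`, and `theta` are hypothetical. It is not the OFU-MNL+ algorithm.

```python
import math

def mnl_choice_probs(features, theta, assortment):
    """Choice probabilities for the items in `assortment`, plus the outside option.

    Assumes (not stated in the abstract) a linear utility <x_i, theta> per item
    and an outside option with utility 0.
    """
    utilities = {i: sum(x * w for x, w in zip(features[i], theta)) for i in assortment}
    # Subtract the largest utility (the outside option counts as 0) before
    # exponentiating, so that large utilities do not overflow.
    shift = max([0.0] + list(utilities.values()))
    outside_weight = math.exp(-shift)
    weights = {i: math.exp(u - shift) for i, u in utilities.items()}
    denom = outside_weight + sum(weights.values())
    probs = {i: w / denom for i, w in weights.items()}
    return probs, outside_weight / denom

def expected_reward(probs, rewards):
    """Expected reward of an assortment given per-item rewards."""
    return sum(p * rewards[i] for i, p in probs.items())

# Hypothetical toy instance: two items with zero features, so every utility is 0.
features = {0: [0.0, 0.0], 1: [0.0, 0.0]}
theta = [1.0, 1.0]
probs, p_outside = mnl_choice_probs(features, theta, assortment=[0, 1])
uniform_value = expected_reward(probs, {0: 1.0, 1: 1.0})
```

In this toy instance both items and the outside option are equally likely (probability 1/3 each), so with uniform rewards of 1 the expected reward of the assortment is 2/3.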
