NeurIPS Poster Improved Analysis for Bandit Learning in Matching Markets

Fang Kong · Zilong Wang · Shuai Li

West Ballroom A-D #5702

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract: A rich line of work studies the bandit learning problem in two-sided matching markets, where participants on one side of the market (players) are uncertain about their preferences and hope to find a stable matching through iterative matchings with the other side (arms). The state-of-the-art analysis shows that the player-optimal stable regret is of order $O(K\log T/\Delta^2)$, where $K$ is the number of arms, $T$ is the horizon, and $\Delta$ is the players' minimum preference gap. However, this result may be far from the lower bound $\Omega(\max\{N\log T/\Delta^2, K\log T/\Delta\})$, since the number $K$ of arms (workers, publisher slots) may be much larger than the number $N$ of players (employers in labor markets, advertisers in online advertising, respectively). In this paper, we propose a new algorithm and show that the regret can be upper bounded by $O(N^2\log T/\Delta^2 + K \log T/\Delta)$. This result removes the dependence on $K$ in the main order term and improves the state-of-the-art guarantee in common cases where $N$ is much smaller than $K$. Such an advantage is also verified in experiments. In addition, we provide a refined analysis for the existing centralized UCB algorithm and show that, under the $\alpha$-condition, it achieves an improved $O(N \log T/\Delta^2 + K \log T / \Delta)$ regret.
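
To get a feel for how the bounds in the abstract compare, the following minimal sketch evaluates each expression with all hidden constants set to 1. This is an assumption made purely for illustration: big-O and big-Omega notation fix only the growth rate, so the printed numbers are not actual regret values. The function names and the sample values of $N$, $K$, $T$, and $\Delta$ are hypothetical and do not come from the paper.

```python
import math


def prior_upper_bound(K, T, delta):
    """O(K log T / Delta^2): the earlier state-of-the-art order (constant taken as 1)."""
    return K * math.log(T) / delta**2


def new_upper_bound(N, K, T, delta):
    """O(N^2 log T / Delta^2 + K log T / Delta): the order claimed for the new algorithm."""
    return N**2 * math.log(T) / delta**2 + K * math.log(T) / delta


def centralized_ucb_alpha_bound(N, K, T, delta):
    """O(N log T / Delta^2 + K log T / Delta): centralized UCB under the alpha-condition."""
    return N * math.log(T) / delta**2 + K * math.log(T) / delta


def lower_bound(N, K, T, delta):
    """Omega(max{N log T / Delta^2, K log T / Delta}): the lower-bound order."""
    return max(N * math.log(T) / delta**2, K * math.log(T) / delta)


# Hypothetical market: few players, many arms.
N, K, T, delta = 5, 1000, 10**6, 0.1

print("prior upper bound     :", round(prior_upper_bound(K, T, delta)))
print("new upper bound       :", round(new_upper_bound(N, K, T, delta)))
print("centralized UCB (alpha):", round(centralized_ucb_alpha_bound(N, K, T, delta)))
print("lower bound           :", round(lower_bound(N, K, T, delta)))
```

With constants ignored, the new expression is smaller than the prior one exactly when $N^2 + K\Delta < K$, which matches the abstract's point that the gain appears when $N$ is much smaller than $K$. When $N^2$ is comparable to or larger than $K$, these constant-free expressions no longer favor the new bound.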
