Timezone: »

Double Thompson Sampling for Dueling Bandits
Huasen Wu · Xin Liu

Wed Dec 07 09:00 AM -- 12:30 PM (PST) @ Area 5+6+7+8 #108
In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandit problems. As its name suggests, D-TS selects both the first and the second candidates according to Thompson Sampling. Specifically, D-TS maintains a posterior distribution for the preference matrix, and chooses the pair of arms for comparison according to two sets of samples independently drawn from the posterior distribution. This simple algorithm applies to general Copeland dueling bandits, including Condorcet dueling bandits as its special case. For general Copeland dueling bandits, we show that D-TS achieves $O(K^2 \log T)$ regret. Moreover, using a back substitution argument, we refine the regret to $O(K \log T + K^2 \log \log T)$ in Condorcet dueling bandits and many practical Copeland dueling bandits. In addition, we propose an enhancement of D-TS, referred to as D-TS+, that reduces the regret by carefully breaking ties. Experiments based on both synthetic and real-world data demonstrate that D-TS and D-TS$^+$ significantly improve the overall performance, in terms of regret and robustness.

Author Information

Huasen Wu (University of California at Davis)

Huasen Wu received his B.S. and Ph.D. degrees from the School of Electronic and Information Engineering, Beihang University, Beijing, in 2007 and 2014, respectively. He is currently a Postdoctoral Researcher working with Prof. Xin Liu at the Department of Computer Science, University of California, Davis. From December 2010 to January 2012, he was a visiting student at UC Davis, from May 2014 to August 2014, he was a visiting scholar at University of Illinois at Urbana-Champaign (UIUC), and from October 2012 to January 2014, he worked as a research intern at Wireless and Networking Group, Microsoft Research Asia (MSRA).

Xin Liu (University of California)

More from the Same Authors