Pure Exploration via Frank--Wolfe Self-Play
Abstract
We study pure exploration in structured bandits through a game-theoretic lens. For a broad class of queries, the asymptotic instance complexity admits a maximin form that corresponds to a two-player zero-sum game: an experimenter who allocates measurements to rule out alternatives versus a skeptic who proposes them. Allowing the skeptic to mix turns this into a concave–convex saddle-point problem. This reformulation unlocks a simple, projection-free, tuning-free, bandit-friendly algorithm: Frank–Wolfe Self-Play, in which both players take one-hot Frank–Wolfe steps against the other's empirical mixed strategy. Our analysis proceeds via continuous-time dynamics: the limiting differential inclusion admits a Lyapunov function that decays exponentially, yielding a vanishing duality gap and convergence of the game value. We then embed the discrete updates into a perturbed flow and, using stochastic-approximation techniques, show that the game value of the discrete iterates converges as well. A linear-bandit case study highlights phenomena absent in unstructured settings (nonunique optima, optimal allocations that place zero mass on the best arm, bilinear objectives, and boundary nonsmoothness) and shows how Frank–Wolfe Self-Play automatically discovers the correct solution where naive top-two/Thompson-style heuristics may fail. Overall, Frank–Wolfe Self-Play is a minimal modification of the classical Frank–Wolfe algorithm that is provably optimal for pure exploration.
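To make the update concrete, the following is a minimal sketch of Frank–Wolfe Self-Play on a bilinear zero-sum toy game, used here as a stand-in for the paper's pure-exploration objective; the function name fw_self_play and the matching-pennies payoff matrix are illustrative choices, not the paper's implementation. With step size 1/(t+1), each iterate equals the empirical average of the one-hot plays so far, so playing a best-response vertex against the opponent's current iterate is exactly playing against their empirical mixed strategy.

```python
import numpy as np

def fw_self_play(A, num_iters=10000):
    """Frank-Wolfe Self-Play for the bilinear saddle point
    max_{w in simplex} min_{q in simplex} w^T A q.
    Each player takes a one-hot Frank-Wolfe step (a best-response
    vertex of the simplex) against the other's current iterate,
    which equals that player's empirical mixed strategy."""
    m, n = A.shape
    w = np.full(m, 1.0 / m)  # experimenter's allocation
    q = np.full(n, 1.0 / n)  # skeptic's mixture over alternatives
    for t in range(1, num_iters + 1):
        i = int(np.argmax(A @ q))    # vertex maximizing the w-gradient A q
        j = int(np.argmin(A.T @ w))  # vertex minimizing the q-gradient A^T w
        gamma = 1.0 / (t + 1)        # averaging step: iterate = empirical mixture
        w *= 1 - gamma
        w[i] += gamma                # one-hot step toward e_i
        q *= 1 - gamma
        q[j] += gamma                # one-hot step toward e_j
    # duality gap at (w, q): best-response value minus worst-case value
    gap = (A @ q).max() - (A.T @ w).min()
    return w, q, gap

if __name__ == "__main__":
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies, game value 0
    w, q, gap = fw_self_play(A)
    print("w =", w, "q =", q, "duality gap =", gap)
```

In this toy game both empirical mixtures approach (1/2, 1/2) and the duality gap shrinks toward zero, mirroring the vanishing-gap guarantee stated above; the paper's setting replaces the bilinear payoff with the pure-exploration objective and the iterates with bandit-fed estimates.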