The Explore-Exploit Tradeoff Redefined: Balancing Regret and Treatment Effects in Contextual Bandits
Abstract
Estimating average treatment effects (ATE) underlies hypothesis testing in observational studies, and is classically addressed with randomized controlled trials combined with a direct estimator for the difference of average rewards. Doing so mandates that the arms associated with the null and alternative hypotheses receive equitable sampling allocations. This can incur undue cost when one hypothesis is superior in terms of average reward, a cost that is compounded in the presence of multiple comparison groups. Bandit algorithms, on the other hand, prioritize the most promising arm among a collection of alternatives by ranking arms according to their average rewards, but they reason about uncertainty only in pursuit of small regret. That is, there is no certificate that a bandit algorithm's significance or power in concluding the correct hypothesis meets a target level. In this work, we derive explicit tradeoffs between regret and the quality of ATE estimates in terms of their consistency and variance. Our core contribution is to extend contextual bandit algorithms based on inverse gap weighting (IGW), which are attuned to this setting through their explicit control over the probability of arm selection, so that exploration is prioritized in a way that yields convergence guarantees for the induced ATE estimators. In this way, we provide an algorithm that supports optimal adaptive hypothesis testing concurrently with the operation of a sub-linear-regret contextual bandit algorithm.
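To make the mechanism concrete, the sketch below illustrates the two generic ingredients the abstract refers to: the standard inverse gap weighting rule (as used, e.g., in SquareCB-style contextual bandits) and an inverse propensity weighted (IPW) difference-of-means estimate that known selection probabilities enable. This is a minimal illustration under our own assumptions, not the paper's algorithm; the function names and the choice of the plain IPW estimator are hypothetical.

```python
import numpy as np

def igw_probabilities(predicted_rewards: np.ndarray, gamma: float) -> np.ndarray:
    """Inverse gap weighting over K arms (a standard rule; a sketch, not
    the paper's method).

    Each suboptimal arm a is played with probability
        1 / (K + gamma * (f_hat[best] - f_hat[a])),
    and the empirically best arm absorbs the remaining mass. Larger gamma
    shifts mass toward exploitation, but every probability stays bounded
    away from zero, which keeps inverse propensity weights well defined.
    """
    K = len(predicted_rewards)
    best = int(np.argmax(predicted_rewards))
    gaps = predicted_rewards[best] - predicted_rewards
    probs = 1.0 / (K + gamma * gaps)
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()  # remaining mass goes to the leader
    return probs

def ipw_ate(actions, rewards, propensities, arm_treat=1, arm_control=0):
    """IPW estimate of the ATE between two arms from logged bandit data.

    `propensities[t]` is the probability the algorithm assigned to the
    action it actually took at round t; under IGW these are known exactly,
    so the reweighted reward averages are unbiased for each arm's mean.
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    propensities = np.asarray(propensities)
    treat = np.where(actions == arm_treat, rewards / propensities, 0.0)
    ctrl = np.where(actions == arm_control, rewards / propensities, 0.0)
    return treat.mean() - ctrl.mean()
```

Because IGW keeps every arm-selection probability strictly positive and known in closed form, the logged propensities can be reused directly as inverse weights; this explicit control over arm-selection probabilities is what makes the IGW family a natural fit for the ATE guarantees described above.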