Skip to yearly menu bar Skip to main content

Workshop: Adaptive Experimental Design and Active Learning in the Real World

Sequentially Adaptive Experimentation for Learning Optimal Options subject to Unobserved Contexts

Hongju Park · Mohamad Kazem Shirani Faradonbeh


Contextual bandits constitute a classical framework for interactive learning of best decisions subject to context information. In this setting, the goal is to sequentially learn arms of highest reward subject to the contextual information, while the unknown reward parameters of each arm need to be learned by experimenting it. Accordingly, a fundamental problem is that of balancing such experimentation (i.e., pulling different arms to learn the parameters), versus sticking with the best arm learned so far, in order to maximize rewards. To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partially observed contexts remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on observations, which are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that adaptive experiments based on samples from the posterior distribution efficiently learn optimal arms. Specifically, we establish regret bounds that grow logarithmically with time. Extensive simulations for real-world data are presented as well to illustrate this efficacy.

Chat is not available.