Loading [MathJax]/jax/output/CommonHTML/jax.js
Skip to yearly menu bar Skip to main content


Poster

Instance-optimal PAC Algorithms for Contextual Bandits

Zhaoqi Li · Lillian Ratliff · houssam nassif · Kevin Jamieson · Lalit Jain

Hall J (level 1) #839

Keywords: [ Reinforcement Learning ] [ contextual bandits ] [ Active Learning ]


Abstract: In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the (ϵ,δ)-PAC setting: given a policy class Π the goal of the learner is to return a policy πΠ whose expected reward is within ϵ of the optimal policy with probability greater than 1δ. We characterize the first instance-dependent PAC sample complexity of contextual bandits through a quantity ρΠ, and provide matching upper and lower bounds in terms of ρΠ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to a cost-sensitive classification oracle.

Chat is not available.