Abstract:
The present paper studies agnostic Q-learning with function approximation in deterministic systems, where the optimal Q-function is approximable by a function in a class F with approximation error δ ≥ 0. We propose a novel recursion-based algorithm and show that if δ = O(ρ/√dim_E(F)), then one can find the optimal policy using O(dim_E(F)) trajectories, where ρ is the gap between the optimal Q-value of the best actions and that of the second-best actions, and dim_E(F) is the Eluder dimension of F. Our result has two implications, and in particular helps address the open problem on agnostic Q-learning posed in [Wen and Van Roy, 2013]. We further extend our algorithm to the stochastic reward setting and obtain similar results.
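In symbols, the main guarantee can be summarized as follows. This is a reconstruction from the abstract's statement, with F the function class, δ the approximation error, ρ the optimal-action gap, and dim_E(F) the Eluder dimension; constants and logarithmic factors are omitted.

```latex
% Reconstructed statement of the sample-complexity guarantee;
% constants and logarithmic factors omitted.
\[
  \delta \;=\; O\!\left(\frac{\rho}{\sqrt{\dim_E(\mathcal{F})}}\right)
  \quad\Longrightarrow\quad
  \text{the optimal policy is found within }
  O\!\big(\dim_E(\mathcal{F})\big)\text{ trajectories.}
\]
```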