NeurIPS Poster Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP


Zihan Zhang · Jiaqi Yang · Xiangyang Ji · Simon Du

Keywords: [ Bandits ] [ Theory ] [ Reinforcement Learning and Planning ]


Abstract: This paper presents new \emph{variance-aware} confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). With the new confidence sets, we obtain the following regret bounds:

For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{k=1}^{K}\sigma_k^2})$ data-dependent regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_k^2$ is the \emph{unknown} variance of the reward at the $k$-th round. This is the first regret bound that scales only with the variance and the dimension, with \emph{no explicit polynomial dependency on $K$}. When variances are small, this bound can be significantly smaller than the $\widetilde{\Theta}\left(d\sqrt{K}\right)$ worst-case regret bound.

For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d, \log H)\sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the number of episodes, and $H$ is the planning horizon. This is the first regret bound that scales only \emph{logarithmically} with $H$ in the setting of reinforcement learning with linear function approximation, thus \emph{exponentially improving} existing results and resolving an open problem in \citep{zhou2020nearly}.

We develop three technical ideas that may be of independent interest: 1) applications of the peeling technique to both the input norm and the variance magnitude, 2) a recursion-based estimator for the variance, and 3) a new convex potential lemma that generalizes the seminal elliptical potential lemma.
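To give a feel for how much the variance-dependent term in the linear bandit bound can gain over the worst-case $\sqrt{K}$ term, here is a minimal Python sketch. It compares only the $K$-dependent factors, $\sqrt{1 + \sum_{k=1}^{K}\sigma_k^2}$ and $\sqrt{K}$. The $\mathrm{poly}(d)$ and logarithmic factors are deliberately omitted because the abstract does not specify them, so the numbers are illustrative scales, not regret values. The function names and the two noise profiles are hypothetical and chosen only for illustration.

```python
import math


def variance_aware_scale(variances):
    """Return sqrt(1 + sum_k sigma_k^2).

    This is only the K-dependent factor of the variance-aware bound;
    poly(d) and log factors are omitted (an assumption of this sketch).
    """
    return math.sqrt(1.0 + sum(variances))


def worst_case_scale(num_rounds):
    """Return sqrt(K), the K-dependent factor of the d * sqrt(K) bound."""
    return math.sqrt(num_rounds)


K = 10_000

# Hypothetical noise profiles, not taken from the paper.
low_noise = [1.0 / K] * K   # per-round variance 1/K, total variance about 1
unit_noise = [1.0] * K      # per-round variance 1, total variance K

print("low noise, variance-aware factor:", variance_aware_scale(low_noise))
print("unit noise, variance-aware factor:", variance_aware_scale(unit_noise))
print("worst-case factor:", worst_case_scale(K))
```

With total variance near 1, the variance-aware factor stays close to $\sqrt{2}$ regardless of $K$, while the worst-case factor grows to 100 for $K = 10{,}000$. With unit variance in every round, the two factors are essentially the same size, which matches the claim that the gain appears when variances are small.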
