NeurIPS Poster Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs

Yeoneung Kim · Insoon Yang · Kwang-Sung Jun

Hall J (level 1) #826

Abstract: In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$, where $d$ is the dimension of the features, $K$ is the time horizon, $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence; this is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$, where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential 'count' lemma.
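
As a rough illustration of how the linear bandit bound adapts to the noise level (a sketch, not taken from the paper): write $S = \sum_{k=1}^K \sigma_k^2$, a shorthand introduced here, and assume for this example only that $\sigma_k^2 \le 1$ for every $k$, so that $0 \le S \le K$. The two arguments of the minimum cross at $S = K/d$, since

$$
d\sqrt{K} \le d^{1.5}\sqrt{S} \iff K \le dS \iff S \ge K/d .
$$

Under that assumption the stated bound splits into two regimes:

$$
\tilde O\Big(\min\{d\sqrt{K},\, d^{1.5}\sqrt{S}\} + d^2\Big) =
\begin{cases}
\tilde O\big(d^{1.5}\sqrt{S} + d^2\big), & S < K/d, \\
\tilde O\big(d\sqrt{K} + d^2\big), & S \ge K/d .
\end{cases}
$$

In the noiseless case $S = 0$, the expression reduces to $\tilde O(d^2)$, which does not grow with $K$ beyond polylogarithmic factors. In the worst case $S = K$, it reads $\tilde O(d\sqrt{K} + d^2)$.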
