Poster
Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
Ming Yin · Yu-Xiang Wang
We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs. Prior works derive information-theoretical lower bounds based on different data-coverage assumptions, and their upper bounds are expressed by covering coefficients which lack an explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matches \[O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right).\] We also prove an information-theoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single-policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
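The intrinsic bound above is a plain weighted sum over horizon steps and state-action pairs, so it can be evaluated directly once the tabular quantities are known. The sketch below (the function name and array layout are ours for illustration, not from the paper) computes the bound for a finite-horizon tabular MDP, under the paper's weak coverage assumption that $d^\mu_h(s_h,a_h)>0$ wherever $d^{\pi^\star}_h(s_h,a_h)>0$:

```python
import math

def intrinsic_bound(d_pi_star, d_mu, var, n):
    """Evaluate the intrinsic offline-RL bound

        sum_{h,s,a} d^{pi*}_h(s,a) * sqrt(Var_{P_{s,a}}(V*_{h+1} + r_h) / d^mu_h(s,a)) * sqrt(1/n)

    for a tabular MDP. Inputs are nested lists indexed [h][s][a]:
      d_pi_star : marginal state-action occupancy of an optimal policy pi*
      d_mu      : marginal state-action occupancy of the behavior policy mu
      var       : conditional variance Var_{P_{s,a}}(V*_{h+1} + r_h)
      n         : number of offline episodes
    Assumes d_mu[h][s][a] > 0 wherever d_pi_star[h][s][a] > 0.
    """
    total = 0.0
    for h in range(len(d_pi_star)):
        for s in range(len(d_pi_star[h])):
            for a in range(len(d_pi_star[h][s])):
                w = d_pi_star[h][s][a]
                if w == 0.0:
                    # pairs the optimal policy never visits contribute nothing,
                    # so mu need not cover them
                    continue
                total += w * math.sqrt(var[h][s][a] / d_mu[h][s][a])
    return total / math.sqrt(n)
```

Note how the per-pair weight is the optimal policy's occupancy divided by the square root of the behavior occupancy: the bound is small whenever $\mu$ covers the pairs that $\pi^\star$ actually visits, even if other pairs are never explored.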
Author Information
Ming Yin (UC Santa Barbara)
Yu-Xiang Wang (UC Santa Barbara)
More from the Same Authors

2021 Spotlight: Logarithmic Regret in Feature-based Dynamic Pricing
Jianyu Xu · Yu-Xiang Wang
2021 : Instance-dependent Offline Reinforcement Learning: From tabular RL to linear MDPs
Ming Yin · Yu-Xiang Wang
2022 : Offline Reinforcement Learning with Closed-Form Policy Improvement Operators
Jiachen Li · Edwin Zhang · Ming Yin · Qinxun Bai · Yu-Xiang Wang · William Yang Wang
2021 Poster: Privately Publishable Per-instance Privacy
Rachel Redberg · Yu-Xiang Wang
2021 Poster: Logarithmic Regret in Feature-based Dynamic Pricing
Jianyu Xu · Yu-Xiang Wang
2021 Poster: Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings
Ming Yin · Yu-Xiang Wang
2021 Poster: Near-Optimal Offline Reinforcement Learning via Double Variance Reduction
Ming Yin · Yu Bai · Yu-Xiang Wang
2017 Poster: Higher-Order Total Variation Classes on Grids: Minimax Theory and Trend Filtering Methods
Veeranjaneyulu Sadhanala · Yu-Xiang Wang · James Sharpnack · Ryan Tibshirani
2016 : Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
Yu-Xiang Wang
2016 Poster: Total Variation Classes Beyond 1d: Minimax Rates, and the Limitations of Linear Smoothers
Veeranjaneyulu Sadhanala · Yu-Xiang Wang · Ryan Tibshirani
2015 : Yu-Xiang Wang: Learning with differential privacy: stability, learnability and the sufficiency and necessity of ERM principle
Yu-Xiang Wang
2015 Poster: Differentially private subspace clustering
Yining Wang · Yu-Xiang Wang · Aarti Singh