NeurIPS Poster Corruption-Robust Offline Reinforcement Learning with General Function Approximation

Poster

Corruption-Robust Offline Reinforcement Learning with General Function Approximation

Chenlu Ye · Rui Yang · Quanquan Gu · Tong Zhang

Great Hall & Hall B1+B2 (level 1) #2014

[ Abstract ]

[ Paper] [ Poster] [ OpenReview]

Abstract: We investigate the problem of corruption robustness in offline reinforcement learning (RL) with general function approximation, where an adversary can corrupt each sample in the offline dataset, and the corruption level

ζ \geq 0

$\zeta\geq0$ quantifies the cumulative corruption amount over

n

$n$ episodes and

H

$H$ steps. Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs). Drawing inspiration from the uncertainty-weighting technique from the robust online RL setting \citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight iteration procedure to efficiently compute on batched samples and propose a corruption-robust algorithm for offline RL. Notably, under the assumption of single policy coverage and the knowledge of

ζ

$\zeta$ , our proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of

O (ζ \cdot (C C (λ, \hat{F}, Z_{n}^{H}))^{1 / 2} (C (\hat{F}, μ))^{- 1 / 2} n^{- 1})

$\mathcal O(\zeta \cdot (\text CC(\lambda,\hat{\mathcal F},\mathcal Z_n^H))^{1/2} (C(\hat{\mathcal F},\mu))^{-1/2} n^{-1})$ due to the corruption. Here

C C (λ, \hat{F}, Z_{n}^{H})

$\text CC(\lambda,\hat{\mathcal F},\mathcal Z_n^H)$ is the coverage coefficient that depends on the regularization parameter

λ

$\lambda$ , the confidence set

\hat{F}

$\hat{\mathcal F}$ , and the dataset

Z_{n}^{H}

$\mathcal Z_n^H$ , and

C (\hat{F}, μ)

$C(\hat{\mathcal F},\mu)$ is a coefficient that depends on

\hat{F}

$\hat{\mathcal F}$ and the underlying data distribution

μ

$\mu$ . When specialized to linear MDPs, the corruption-dependent error term reduces to

O (ζ d n^{- 1})

$\mathcal O(\zeta d n^{-1})$ with

d

$d$ being the dimension of the feature map, which matches the existing lower bound for corrupted linear MDPs. This suggests that our analysis is tight in terms of the corruption-dependent term.

Chat is not available.