NeurIPS 2024 Statistical Efficiency of Distributional Temporal Difference Learning Oral

Oral

Statistical Efficiency of Distributional Temporal Difference Learning

Yang Peng · Liangyu Zhang · Zhihua Zhang

East Meeting Room 1-3

[ Abstract ] [ Visit Oral Session 2C: Reinforcement Learning ]

Wed 11 Dec 3:50 p.m. — 4:10 p.m. PST

[ OpenReview]

Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains.One of the core tasks in the field of DRL is distributional policy evaluation, which involves estimating the return distribution

$\eta^\pi$ for a given policy

$\pi$ .The distributional temporal difference learning has been accordingly proposed, whichis an extension of the temporal difference learning (TD) in the classic RL area.In the tabular case, Rowland et al. [2018] and Rowland et al. [2023] proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD), respectively.In this paper, we go a step further and analyze the finite-sample performance of distributional TD.To facilitate theoretical analysis, we propose a non-parametric distributional TD learning (NTD).For a

$\gamma$ -discounted infinite-horizon tabular Markov decision process,we show that for NTD we need

$\widetilde O\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an

$\varepsilon$ -optimal estimator with high probability, when the estimation error is measured by the

$p$ -Wasserstein distance.This sample complexity bound is minimax optimal (up to logarithmic factors) in the case of the

$1$ -Wasserstein distance.To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest.In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the

$p$ -Wasserstein distance.

Chat is not available.