On the Convergence of GTD($\lambda$) with General $\lambda$
Huizhen Yu
2019 Plenary Talk
in
Workshop: The Optimization Foundations of Reinforcement Learning
Abstract
Gradient temporal-difference (GTD) algorithms are an important family of extensions of the standard TD algorithm, overcoming the stability issues in off-policy learning with function approximation by minimizing the projected Bellman error using stochastic gradient descent (SGD). In this talk, we consider GTD($\lambda$) for policy evaluation in the case of linear function approximation and finite-space Markov decision processes (MDPs) with discounted rewards. GTD($\lambda$) differs from the typical SGD algorithm for minimizing an objective function in the following way. In an MDP, each stationary policy is associated with a family of Bellman equations. By selecting the values of the eligibility-trace parameters, $\lambda$, we determine the Bellman operator that defines the objective function. This choice influences both the approximation error and the learning behavior of the resulting algorithm. We first describe a dynamic scheme to judiciously set $\lambda$ in a history-dependent way. This scheme is based on the idea of randomized stopping times and generalized Bellman equations, and it is useful for balancing the bias-variance tradeoff in off-policy learning. We then present asymptotic convergence results for several variations of GTD($\lambda$), including saddle-point and mirror-descent TD variants, for different choices of $\lambda$: constant, state-dependent, and history-dependent. These convergence results are obtained by combining special properties of the joint state and eligibility-trace process in TD learning with ergodic theory for weak Feller Markov chains, mean-ODE-based proof methods, and convex optimization theory. Our results not only resolve long-standing open questions regarding the convergence of GTD($\lambda$) but also provide the basis for using a broader class of generalized Bellman operators for approximate policy evaluation with linear TD methods.
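To make the setting concrete, the sketch below illustrates one common form of the GTD($\lambda$) two-timescale updates for off-policy evaluation with linear features and a constant $\lambda$ (it does not implement the talk's history-dependent $\lambda$ scheme). The MDP, feature matrix, policies, and step sizes are synthetic placeholders chosen only for illustration, and the specific update variant is an assumption, not the exact algorithm analyzed in the talk.

```python
# Minimal sketch (assumption: one common GTD(lambda)/TDC(lambda)-style variant) of
# off-policy policy evaluation with linear features and a constant lambda.
# All quantities below (MDP, features, policies, step sizes) are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)

# --- small synthetic MDP: 5 states, 2 actions, random dynamics (placeholder data)
n_states, n_actions, n_features = 5, 2, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a]: next-state distribution
R = rng.normal(size=(n_states, n_actions))                        # expected rewards
phi = rng.normal(size=(n_states, n_features))                     # linear features per state

gamma, lam = 0.9, 0.7          # discount factor and (constant) eligibility-trace parameter
alpha, beta = 0.005, 0.05      # two-timescale step sizes (theta slower, w faster)

# target policy pi and behavior policy b (both fixed and stochastic)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)
b = np.full((n_states, n_actions), 1.0 / n_actions)

theta = np.zeros(n_features)   # main weights: V(s) is approximated by phi[s] @ theta
w = np.zeros(n_features)       # auxiliary weights used by the gradient-correction term
e = np.zeros(n_features)       # eligibility trace

s = 0
for t in range(200_000):
    a = rng.choice(n_actions, p=b[s])
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    rho = pi[s, a] / b[s, a]                       # importance-sampling ratio

    # accumulate the trace along the behavior-policy trajectory
    e = rho * (gamma * lam * e + phi[s])

    # one-step TD error under the current value estimate
    delta = r + gamma * phi[s_next] @ theta - phi[s] @ theta

    # two-timescale GTD(lambda)-style updates (one common form from the literature)
    theta += alpha * (delta * e - gamma * (1 - lam) * (e @ w) * phi[s_next])
    w += beta * (delta * e - (phi[s] @ w) * phi[s])

    s = s_next

print("learned weights:", theta)
```

The two step sizes reflect the two-timescale structure: the auxiliary vector `w` tracks a quantity used to correct the gradient estimate and is updated faster than the main weights `theta`. With a state- or history-dependent $\lambda$, the scalar `lam` in the trace update would be replaced by a value chosen as a function of the current state or of the trajectory history, as discussed in the talk.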