Neural Temporal-Difference Learning Converges to Global Optima
Qi Cai · Zhuoran Yang · Jason Lee · Zhaoran Wang

Wed Dec 11th 05:00 -- 07:00 PM @ East Exhibition Hall B + C #211

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.
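To make the setting concrete, the following is a minimal sketch of a semi-gradient TD(0) update for policy evaluation with a wide two-layer ReLU network. It is an illustration under stated assumptions, not the authors' algorithm: the function names (init_network, value, grad_value, td0_step), the Gaussian initialization scale, the fixed random output signs, and the hidden width m are hypothetical choices made here, and the sketch omits any projection or iterate-averaging step that a convergence analysis may require.

```python
import numpy as np

def init_network(d, m, rng):
    """Two-layer ReLU network with m hidden units (assumed setup).

    Hidden weights W are trained; the output signs b are drawn once and kept fixed.
    """
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
    b = rng.choice([-1.0, 1.0], size=m)
    return W, b

def value(W, b, x):
    """Value estimate f(x; W) = (1/sqrt(m)) * sum_r b_r * relu(W_r . x)."""
    return float(b @ np.maximum(W @ x, 0.0)) / np.sqrt(len(b))

def grad_value(W, b, x):
    """Gradient of f(x; W) with respect to W, with the same shape as W."""
    active = (W @ x > 0.0).astype(float)  # ReLU activation pattern
    return (b * active)[:, None] * x[None, :] / np.sqrt(len(b))

def td0_step(W, b, x, reward, x_next, gamma, lr):
    """One semi-gradient TD(0) step on the transition (x, reward, x_next).

    The bootstrapped target reward + gamma * f(x_next; W) is treated as a
    constant, so only f(x; W) is differentiated.
    """
    delta = value(W, b, x) - (reward + gamma * value(W, b, x_next))
    return W - lr * delta * grad_value(W, b, x)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b = init_network(d=4, m=512, rng=rng)
    x = np.array([1.0, 0.0, 0.0, 0.0])       # hypothetical state features
    x_next = np.array([0.0, 1.0, 0.0, 0.0])  # hypothetical next-state features
    for _ in range(100):
        W = td0_step(W, b, x, reward=1.0, x_next=x_next, gamma=0.9, lr=0.1)
    print(value(W, b, x))
```

The width m is the overparametrization knob in this sketch: with a large m, each hidden weight moves only slightly per update, which is the regime the abstract associates with global convergence.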

Author Information

Qi Cai (Northwestern University)
Zhuoran Yang (Princeton University)
Jason Lee (Princeton University)
Zhaoran Wang (Northwestern University)
