Global Convergence of Gradient Descent for Deep Linear Residual Networks
Lei Wu · Qingcan Wang · Chao Ma
Keywords:
Optimization for Deep Networks
Deep Learning
Optimization -> Non-Convex Optimization
Theory -> Computational Complexity
Theory -> Learning Theory
2019 Poster
Abstract
We analyze the global convergence of gradient descent for deep linear residual
networks by proposing a new initialization: zero-asymmetric (ZAS)
initialization. This initialization is motivated by avoiding the stable manifolds of saddle points.
We prove that under the ZAS initialization, for an arbitrary target matrix,
gradient descent converges to an $\varepsilon$-optimal point in $O\left( L^3
\log(1/\varepsilon) \right)$ iterations, which scales polynomially with the
network depth $L$. Our result and the $\exp(\Omega(L))$ convergence time for the
standard initialization (Xavier or near-identity)
\cite{shamir2018exponential} together demonstrate the importance of both the
residual structure and the initialization for optimizing deep linear neural
networks, especially when $L$ is large.
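To make the setting concrete, below is a minimal NumPy sketch of gradient descent on a deep linear residual network $x \mapsto W_{\mathrm{out}}(I + W_L)\cdots(I + W_1)x$ fit to a random target matrix $\Phi$. Only the residual structure and the zero-at-initialization idea are taken from the abstract; the specific architecture with a trainable output layer, the identity-block initialization of `W_out`, the dimensions `d_x`, `d_y`, `L`, and the step size `lr` are illustrative assumptions, not the paper's exact ZAS construction or learning-rate choice.

```python
import numpy as np

# Hypothetical illustration of the setting in the abstract: gradient descent on
# a deep linear residual network  f(x) = W_out (I + W_L) ... (I + W_1) x,
# trained on the squared loss against an arbitrary target matrix Phi.
# The zero residual blocks and identity-block output layer below are one
# reading of "zero-asymmetric"; the paper's exact construction may differ.

rng = np.random.default_rng(0)
d_x, d_y, L = 10, 4, 20                        # input width, output width, depth
Phi = rng.standard_normal((d_y, d_x))          # arbitrary target matrix

I = np.eye(d_x)
W = [np.zeros((d_x, d_x)) for _ in range(L)]   # residual blocks start at zero
W_out = np.hstack([np.eye(d_y), np.zeros((d_y, d_x - d_y))])  # identity block + zeros

# Conservative heuristic step size (not the paper's choice).
lr = 0.1 / (L * (1.0 + np.linalg.norm(Phi, 2)) ** 2)

for step in range(5001):
    # prefixes[l] = (I + W[l-1]) ... (I + W[0]);  prefixes[0] = I, prefixes[L] = full residual product
    prefixes = [I]
    for Wl in W:
        prefixes.append((I + Wl) @ prefixes[-1])
    R = prefixes[L]
    E = W_out @ R - Phi                        # gradient of 0.5 * ||f - Phi||_F^2 w.r.t. the end-to-end map

    # suffixes[l] = (I + W[L-1]) ... (I + W[l+1]);  suffixes[L-1] = I
    suffixes = [I]
    for Wl in W[:0:-1]:                        # W[L-1], ..., W[1]
        suffixes.append(suffixes[-1] @ (I + Wl))
    suffixes.reverse()

    # Gradients of the squared loss w.r.t. each residual block and the output layer.
    grads = [suffixes[l].T @ W_out.T @ E @ prefixes[l].T for l in range(L)]
    grad_out = E @ R.T

    for l in range(L):
        W[l] -= lr * grads[l]
    W_out -= lr * grad_out

    if step % 1000 == 0:
        print(f"step {step:5d}   loss {0.5 * np.linalg.norm(E, 'fro') ** 2:.3e}")
```

With the residual blocks at zero, the end-to-end map at initialization is simply $W_{\mathrm{out}}$, so the starting point is independent of depth; the printed loss should then decrease geometrically, consistent in spirit with the $O(L^3 \log(1/\varepsilon))$ iteration bound stated above, though this sketch makes no attempt to match the paper's constants.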