Global Convergence of Gradient Descent for Deep Linear Residual Networks
Lei Wu · Qingcan Wang · Chao Ma
Keywords:
Optimization for Deep Networks
Deep Learning
Optimization -> Non-Convex Optimization
Theory -> Computational Complexity
Theory -> Learning Theory
2019 Poster
Abstract
We analyze the global convergence of gradient descent for deep linear residual
networks by proposing a new initialization: zero-asymmetric (ZAS)
initialization. This initialization is motivated by avoiding the stable manifolds of saddle points.
We prove that under the ZAS initialization, for an arbitrary target matrix,
gradient descent converges to an $\varepsilon$-optimal point in $O\left( L^3
\log(1/\varepsilon) \right)$ iterations, which scales polynomially with the
network depth $L$. Our result and the $\exp(\Omega(L))$ convergence time for the
standard initialization (Xavier or near-identity)
\cite{shamir2018exponential} together demonstrate the importance of both the
residual structure and the initialization for optimizing deep linear neural
networks, especially when $L$ is large.
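To make the setting concrete, below is a minimal NumPy sketch of gradient descent on a deep linear residual network $x \mapsto W_{\mathrm{out}}(I + W_L)\cdots(I + W_1)x$ fit to a random target matrix $\Phi$. Only the residual structure and the zero-at-initialization idea are taken from the abstract; the specific architecture with a trainable output layer, the identity-block initialization of `W_out`, the dimensions `d_x`, `d_y`, `L`, and the step size `lr` are illustrative assumptions, not the paper's exact ZAS construction or learning-rate choice.

```python
import numpy as np

# Hypothetical illustration of the setting in the abstract: gradient descent on
# a deep linear residual network  f(x) = W_out (I + W_L) ... (I + W_1) x,
# trained on the squared loss against an arbitrary target matrix Phi.
# The zero residual blocks and identity-block output layer below are one
# reading of "zero-asymmetric"; the paper's exact construction may differ.

rng = np.random.default_rng(0)
d_x, d_y, L = 10, 4, 20                        # input width, output width, depth
Phi = rng.standard_normal((d_y, d_x))          # arbitrary target matrix

I = np.eye(d_x)
W = [np.zeros((d_x, d_x)) for _ in range(L)]   # residual blocks start at zero
W_out = np.hstack([np.eye(d_y), np.zeros((d_y, d_x - d_y))])  # identity block + zeros

# Conservative heuristic step size (not the paper's choice).
lr = 0.1 / (L * (1.0 + np.linalg.norm(Phi, 2)) ** 2)

for step in range(5001):
    # prefixes[l] = (I + W[l-1]) ... (I + W[0]);  prefixes[0] = I, prefixes[L] = full residual product
    prefixes = [I]
    for Wl in W:
        prefixes.append((I + Wl) @ prefixes[-1])
    R = prefixes[L]
    E = W_out @ R - Phi                        # gradient of 0.5 * ||f - Phi||_F^2 w.r.t. the end-to-end map

    # suffixes[l] = (I + W[L-1]) ... (I + W[l+1]);  suffixes[L-1] = I
    suffixes = [I]
    for Wl in W[:0:-1]:                        # W[L-1], ..., W[1]
        suffixes.append(suffixes[-1] @ (I + Wl))
    suffixes.reverse()

    # Gradients of the squared loss w.r.t. each residual block and the output layer.
    grads = [suffixes[l].T @ W_out.T @ E @ prefixes[l].T for l in range(L)]
    grad_out = E @ R.T

    for l in range(L):
        W[l] -= lr * grads[l]
    W_out -= lr * grad_out

    if step % 1000 == 0:
        print(f"step {step:5d}   loss {0.5 * np.linalg.norm(E, 'fro') ** 2:.3e}")
```

With the residual blocks at zero, the end-to-end map at initialization is simply $W_{\mathrm{out}}$, so the starting point is independent of depth; the printed loss should then decrease geometrically, consistent in spirit with the $O(L^3 \log(1/\varepsilon))$ iteration bound stated above, though this sketch makes no attempt to match the paper's constants.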