Poster
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
Yuanzhi Li · Yang Yuan
In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, a formal theoretical understanding of why SGD can train neural networks in practice is largely missing. In this paper, we make progress on understanding this mystery by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called "identity mapping". We prove that, if the input follows a Gaussian distribution, then with the standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps. Unlike normal vanilla networks, the "identity mapping" makes our network asymmetric, and thus the global minimum is unique. To complement our theory, we also show experimentally that multi-layer networks with this mapping perform better than normal vanilla networks. Our convergence theorem differs from traditional non-convex optimization techniques. We show that SGD converges to the optimum in two phases: in phase I, the gradient points in the wrong direction, but a potential function $g$ gradually decreases; then in phase II, SGD enters a nice one-point convex region and converges. We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization. Experiments verify our claims.
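The setting in the abstract (a two-layer ReLU network with an identity mapping, Gaussian inputs, standard $O(1/\sqrt{d})$ initialization, plain SGD) can be sketched in a few lines. The snippet below is a minimal illustration under assumed details, not the paper's exact formulation: the teacher-student target, the squared loss, the network form `f(W, x) = sum(ReLU((I + W) x))`, and all names (`W_star`, `eta`) are choices made for this example.

```python
# Hypothetical sketch of the abstract's setup: two-layer ReLU network with an
# "identity mapping" (weights I + W), Gaussian inputs, O(1/sqrt(d)) init, SGD.
# The teacher-student setup and squared loss are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 20                                    # input dimension

def f(W, x):
    """Network output: sum of ReLU units over the shifted weights I + W."""
    return np.maximum((np.eye(d) + W) @ x, 0.0).sum()

# Teacher weights and the standard O(1/sqrt(d)) initialization of the student.
W_star = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
eta = 1e-3                                # SGD step size

for step in range(20000):
    x = rng.normal(size=d)                # Gaussian input, as in the theorem
    residual = f(W, x) - f(W_star, x)     # squared-loss residual vs. teacher
    active = ((np.eye(d) + W) @ x > 0).astype(float)  # ReLU (sub)gradient mask
    W -= eta * residual * np.outer(active, x)         # one SGD step

print("||W - W*||_F =", np.linalg.norm(W - W_star))
```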
Author Information
Yuanzhi Li (Princeton University)
Yang Yuan (Cornell University)
More from the Same Authors
- 2017 Poster: Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls (Zeyuan Allen-Zhu · Elad Hazan · Wei Hu · Yuanzhi Li)
- 2017 Spotlight: Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls (Zeyuan Allen-Zhu · Elad Hazan · Wei Hu · Yuanzhi Li)
- 2016 Poster: Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods (Andrej Risteski · Yuanzhi Li)
- 2016 Poster: Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates (Yuanzhi Li · Yingyu Liang · Andrej Risteski)
- 2016 Poster: Even Faster SVD Decomposition Yet Without Agonizing Pain (Zeyuan Allen-Zhu · Yuanzhi Li)
- 2016 Poster: Algorithms and matching lower bounds for approximately-convex optimization (Andrej Risteski · Yuanzhi Li)