Poster in Workshop: OPT 2022: Optimization for Machine Learning
Neural Networks Efficiently Learn Low-Dimensional Representations with SGD
Alireza Mousavi-Hosseini · Sejun Park · Manuela Girotti · Ioannis Mitliagkas · Murat Erdogdu
Abstract:
We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD), where the input x ∈ ℝ^d is Gaussian and the target y ∈ ℝ follows a multiple-index model, i.e., y = g(⟨u_1, x⟩, …, ⟨u_k, x⟩) with a noisy link function g. We prove that when online SGD with weight decay is used for training, the first-layer weights of the NN converge to the k-dimensional principal subspace spanned by the vectors u_1, …, u_k of the true model. This phenomenon has several important consequences when k ≪ d. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of O(√(kd/T)) after T iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form y = f(⟨u, x⟩) + ϵ by recovering the principal direction, with a sample complexity linear in d (up to log factors), where f is a monotonic function with at most polynomial growth and ϵ is the noise. This is in contrast to the known d^{Ω(p)} sample requirement to learn any degree-p polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization.
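To make the single-index setting concrete, here is a minimal, self-contained sketch (not the authors' experimental code) of the setup described above: Gaussian inputs, a noisy monotone link, a two-layer ReLU network trained with online SGD plus weight decay, and a simple check of how well the first-layer rows align with the hidden direction u. The width, step size, weight-decay strength, and choice of link function are illustrative assumptions, not values from the paper.

```python
import torch

torch.manual_seed(0)

d, m, T = 64, 512, 20000            # input dim, network width, online SGD steps (illustrative)
u = torch.randn(d)
u = u / u.norm()                    # hidden single-index direction u

def sample_batch(n=1):
    """Fresh sample from the single-index model y = f(<u, x>) + noise, with f = ReLU here."""
    x = torch.randn(n, d)                          # Gaussian input x ~ N(0, I_d)
    y = torch.relu(x @ u) + 0.1 * torch.randn(n)   # monotone link plus Gaussian noise
    return x, y

# Two-layer ReLU network of width m: x -> a^T relu(Wx + b)
W = (torch.randn(m, d) / d ** 0.5).requires_grad_()
a = (torch.randn(m) / m ** 0.5).requires_grad_()
b = torch.zeros(m, requires_grad=True)

# Online SGD with weight decay (hyperparameters are assumptions for the sketch)
opt = torch.optim.SGD([W, a, b], lr=1e-2, weight_decay=1e-3)

def alignment(W, u):
    """Average squared cosine similarity between rows of W and the true direction u."""
    with torch.no_grad():
        rows = W / (W.norm(dim=1, keepdim=True) + 1e-12)
        return (rows @ u).pow(2).mean().item()

for t in range(T):
    x, y = sample_batch()                      # one fresh sample per step (online SGD)
    pred = torch.relu(x @ W.T + b) @ a
    loss = ((pred - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if t % 5000 == 0:
        print(f"step {t:6d}  loss {loss.item():.4f}  alignment {alignment(W, u):.3f}")
```

If the phenomenon in the abstract holds in this toy run, the reported alignment should increase over training as the first-layer rows concentrate around the one-dimensional subspace spanned by u.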