Track: Track 3 Session 5

Thu 12 Dec. 10:05 - 10:20 PST

Oral

Exponentially convergent stochastic k-PCA without variance reduction

Cheng Tang

We present Matrix Krasulina, an algorithm for online k-PCA, by gen- eralizing the classic Krasulina’s method (Krasulina, 1969) from vector to matrix case. We show, both theoretically and empirically, that the algorithm naturally adapts to data low-rankness and converges exponentially fast to the ground-truth principal subspace. Notably, our result suggests that despite various recent efforts to accelerate the convergence of stochastic-gradient based methods by adding a O(n)-time variance reduction step, for the k- PCA problem, a truly online SGD variant suffices to achieve exponential convergence on intrinsically low-rank data.

Thu 12 Dec. 10:20 - 10:25 PST

Spotlight

Implicit Regularization in Deep Matrix Factorization

Sanjeev Arora · Nadav Cohen · Wei Hu · Yuping Luo

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.

Thu 12 Dec. 10:25 - 10:30 PST

Spotlight

SGD on Neural Networks Learns Functions of Increasing Complexity

Dimitris Kalimeris · Gal Kaplun · Preetum Nakkiran · Benjamin Edelman · Tristan Yang · Boaz Barak · Haofeng Zhang

We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks. We show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier. More generally, we give evidence for the hypothesis that, as iterations progress, SGD learns functions of increasing complexity. This hypothesis can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime. We also show that the linear classifier learned in the initial stages is ``retained'' throughout the execution even if training is continued to the point of zero training error, and complement this with a theoretical result in a simplified model. Key to our work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information.

Thu 12 Dec. 10:30 - 10:35 PST

Spotlight

When does label smoothing help?

Rafael Müller · Simon Kornblith · Geoffrey E Hinton

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.

Thu 12 Dec. 10:35 - 10:40 PST

Spotlight

Splitting Steepest Descent for Growing Neural Architectures

Lemeng Wu · Dilin Wang · Qiang Liu

We develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons to multiple off-springs. By leveraging a functional steepest descent idea, we derive a simple criterion for deciding the best subset of neurons to split and a \emph{splitting gradient} for optimally updating the off-springs. Theoretically, our splitting strategy is a second order functional steepest descent for escaping saddle points in an $\Linfty$-Wasserstein metric space, on which the standard parametric gradient descent is a first-order steepest descent. Our method provides a new computationally efficient approach for optimizing neural network structures, especially for learning lightweight neural architectures in resource-constrained settings.

Thu 12 Dec. 10:40 - 10:45 PST

Spotlight

Positional Normalization

Boyi Li · Felix Wu · Kilian Weinberger · Serge Belongie

A widely deployed method for reducing the training time of deep neural networks is to normalize activations at each layer. Although various normalization schemes have been proposed, they all follow a common theme: normalize across spatial dimensions and discard the extracted statistics. In this paper, we propose a novel normalization method that deviates from this theme. Our approach, which we refer to as Positional Normalization (PONO), normalizes exclusively across channels, which allows us to capture structural information of the input image in the first and second moments. Instead of disregarding this information, we inject it into later layers to preserve or transfer structural information in generative networks. We show that PONO significantly improves the performance of deep networks across a wide range of model architectures and image generation tasks.