This paper underlines an elegant property of batch normalization (BN): successive batch normalizations with random linear updates make samples increasingly orthogonal. We establish a non-asymptotic characterization of the interplay between depth, width, and the orthogonality of deep representations. More precisely, we prove, under a mild assumption, that the deviation of the representations from orthogonality decays rapidly with depth, up to a term inversely proportional to the network width. This result has two main implications. 1) Theoretically, as the depth grows, the distribution of the outputs contracts to a Wasserstein-2 ball around an isotropic normal distribution, and the radius of this ball shrinks with the width of the network. 2) Practically, the orthogonality of the representations directly influences the performance of stochastic gradient descent (SGD): when representations are initially aligned, SGD wastes many iterations disentangling them before classification can proceed. Nevertheless, we show experimentally that starting optimization from orthogonal representations is sufficient to accelerate SGD, with no need for BN.
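As a rough illustration of the claim above, the minimal sketch below (not the authors' code) pushes a batch of nearly aligned samples through random linear layers interleaved with batch normalization and tracks how far the samples' normalized Gram matrix is from the identity. The simplified BN (per-coordinate rescaling over the batch, without mean subtraction) and the width, batch size, depth, and seed are our own illustrative assumptions; the gap should shrink with depth, down to a floor governed by the width, as the result predicts.

    # Minimal sketch (assumptions, not the authors' code): successive random
    # linear layers followed by batch normalization make a batch of nearly
    # identical samples increasingly orthogonal with depth.
    import numpy as np

    rng = np.random.default_rng(0)
    width, batch, depth = 512, 4, 50

    # Nearly aligned samples: a hard starting point for orthogonalization.
    base = rng.standard_normal(width)
    H = np.stack([base + 0.01 * rng.standard_normal(width) for _ in range(batch)], axis=1)

    def batch_norm(H):
        # Simplified BN: rescale each coordinate (row) to unit second moment
        # across the batch; mean subtraction is omitted for simplicity.
        return H / (np.sqrt((H ** 2).mean(axis=1, keepdims=True)) + 1e-12)

    def orthogonality_gap(H):
        # Frobenius distance between the samples' normalized Gram matrix
        # and the identity; zero means perfectly orthogonal samples.
        norms = np.linalg.norm(H, axis=0)
        G = (H.T @ H) / np.outer(norms, norms)
        return np.linalg.norm(G - np.eye(H.shape[1]))

    print(f"depth   0: gap = {orthogonality_gap(H):.4f}")
    for layer in range(1, depth + 1):
        W = rng.standard_normal((width, width)) / np.sqrt(width)  # random linear update
        H = batch_norm(W @ H)
        if layer % 10 == 0:
            print(f"depth {layer:3d}: gap = {orthogonality_gap(H):.4f}")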
Author Information
Hadi Daneshmand (INRIA Paris)
Amir Joudaki (Swiss Federal Institute of Technology)
Francis Bach (INRIA - École Normale Supérieure)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Poster: Batch Normalization Orthogonalizes Representations in Deep Random Networks
  Wed. Dec 8th, 08:30 -- 10:00 AM
More from the Same Authors
- 2021: PCA Subspaces Are Not Always Optimal for Bayesian Learning
  Alexandre Bense · Amir Joudaki · Tim G. J. Rudner · Vincent Fortuin
- 2023 Poster: On the impact of activation and normalization in obtaining isometric embeddings at initialization
  Amir Joudaki · Hadi Daneshmand · Francis Bach
- 2023 Poster: Transformers learn to implement preconditioned gradient descent for in-context learning
  Kwangjun Ahn · Xiang Cheng · Hadi Daneshmand · Suvrit Sra
- 2021 Test Of Time: Online Learning for Latent Dirichlet Allocation
  Matthew Hoffman · Francis Bach · David Blei
- 2021 Poster: Overcoming the curse of dimensionality with Laplacian regularization in semi-supervised learning
  Vivien Cabannes · Loucas Pillaud-Vivien · Francis Bach · Alessandro Rudi
- 2021 Poster: Rethinking the Variational Interpretation of Accelerated Optimization Methods
  Peiyuan Zhang · Antonio Orvieto · Hadi Daneshmand
- 2021 Oral: Continuized Accelerations of Deterministic and Stochastic Gradient Descents, and of Gossip Algorithms
  Mathieu Even · Raphaël Berthier · Francis Bach · Nicolas Flammarion · Hadrien Hendrikx · Pierre Gaillard · Laurent Massoulié · Adrien Taylor
- 2021 Poster: Continuized Accelerations of Deterministic and Stochastic Gradient Descents, and of Gossip Algorithms
  Mathieu Even · Raphaël Berthier · Francis Bach · Nicolas Flammarion · Hadrien Hendrikx · Pierre Gaillard · Laurent Massoulié · Adrien Taylor
- 2015 Poster: Rethinking LDA: Moment Matching for Discrete ICA
  Anastasia Podosinnikova · Francis Bach · Simon Lacoste-Julien
- 2015 Poster: Spectral Norm Regularization of Orthonormal Representations for Graph Transduction
  Rakesh Shivanna · Bibaswan K Chatterjee · Raman Sankaran · Chiranjib Bhattacharyya · Francis Bach