Timezone: »
Poster
Implicit Regularization and Convergence for Weight Normalization
Xiaoxia Wu · Edgar Dobriban · Tongzheng Ren · Shanshan Wu · Zhiyuan Li · Suriya Gunasekar · Rachel Ward · Qiang Liu
Normalization methods such as batch, weight, instance, and layer normalization are commonly used in modern machine learning. Here, we study the weight normalization (WN) method \cite{salimans2016weight} and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least squares regression and some more general loss functions. WN and rPGD reparametrize the weights with a scale $g$ and a unit vector such that the objective function becomes \emph{non-convex}. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and \emph{converge linearly} close to the minimum $\ell_2$ norm solution even for initializations far from zero. For certain two-phase variants, they can converge to the min norm solution. This is different from the behavior of gradient descent, which only converges to the min norm solution when started at zero, and thus more sensitive to initialization.
Author Information
Xiaoxia Wu (The University of Texas at Austin)
Edgar Dobriban (University of Pennsylvania)
Tongzheng Ren (UT Austin)
Shanshan Wu (University of Texas at Austin)
Here is my homepage: http://wushanshan.github.io/
Zhiyuan Li (Princeton University)
Suriya Gunasekar (Microsoft Research Redmond)
Rachel Ward (UT Austin)
Qiang Liu (Dartmouth College)
More from the Same Authors
-
2021 Spotlight: Bootstrapping the Error of Oja's Algorithm »
Robert Lunde · Purnamrita Sarkar · Rachel Ward -
2021 Spotlight: Profiling Pareto Front With Multi-Objective Stein Variational Gradient Descent »
Xingchao Liu · Xin Tong · Qiang Liu -
2022 Poster: Fair Bayes-Optimal Classifiers Under Predictive Parity »
Xianli Zeng · Edgar Dobriban · Guang Cheng -
2022 : How Sharpness-Aware Minimization Minimizes Sharpness? »
Kaiyue Wen · Tengyu Ma · Zhiyuan Li -
2022 : How Sharpness-Aware Minimization Minimizes Sharpness? »
Kaiyue Wen · Tengyu Ma · Zhiyuan Li -
2022 : BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach »
Mao Ye · Bo Liu · Stephen Wright · Peter Stone · Qiang Liu -
2022 : Diffusion-based Molecule Generation with Informative Prior Bridges »
Chengyue Gong · Lemeng Wu · Xingchao Liu · Mao Ye · Qiang Liu -
2022 : HotProtein: A Novel Framework for Protein Thermostability Prediction and Editing »
Tianlong Chen · Chengyue Gong · Daniel Diaz · Xuxi Chen · Jordan Wells · Qiang Liu · Zhangyang Wang · Andrew Ellington · Alex Dimakis · Adam Klivans -
2022 : First hitting diffusion models »
Mao Ye · Lemeng Wu · Qiang Liu -
2022 : Neural Volumetric Mesh Generator »
Yan Zheng · Lemeng Wu · Xingchao Liu · Zhen Chen · Qiang Liu · Qixing Huang -
2022 : Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow »
Xingchao Liu · Chengyue Gong · Qiang Liu -
2022 : Let us Build Bridges: Understanding and Extending Diffusion Generative Models »
Xingchao Liu · Lemeng Wu · Mao Ye · Qiang Liu -
2023 Poster: FAMO: Fast Adaptive Multitask Optimization »
Bo Liu · Yihao Feng · Peter Stone · Qiang Liu -
2023 Poster: (S)GD over Diagonal Linear Networks: Implicit bias, Large Stepsizes and Edge of Stability »
Mathieu Even · Scott Pesme · Suriya Gunasekar · Nicolas Flammarion -
2023 Poster: Wasserstein Quantum Monte Carlo: A Novel Approach for Solving the Quantum Many-Body Schrödinger Equation »
Kirill Neklyudov · Jannes Nys · Luca Thiede · Juan Carrasquilla · Qiang Liu · Max Welling · Alireza Makhzani -
2023 Poster: Convergence of Alternating Gradient Descent for Matrix Factorization »
Rachel Ward · Tamara Kolda -
2023 Poster: Nearly Optimal Bounds for Cyclic Forgetting »
William Swartworth · Deanna Needell · Rachel Ward · Mark Kong · Halyun Jeong -
2023 Poster: Cluster-aware Semi-supervised Learning: Relational Knowledge Distillation Provably Learns Clustering »
Yijun Dong · Kevin Miller · Qi Lei · Rachel Ward -
2023 Poster: Markovian Sliced Wasserstein Distances: Beyond Independent Projections »
Khai Nguyen · Tongzheng Ren · Nhat Ho -
2023 Poster: Designing Robust Transformers using Robust Kernel Density Estimation »
Xing Han · Tongzheng Ren · Tan Nguyen · Khai Nguyen · Joydeep Ghosh · Nhat Ho -
2023 Poster: LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning »
Bo Liu · Yifeng Zhu · Chongkai Gao · Yihao Feng · Qiang Liu · Yuke Zhu · Peter Stone -
2022 : Contributed Talks 3 »
Cristóbal Guzmán · Fangshuo Liao · Vishwak Srinivasan · Zhiyuan Li -
2022 Poster: PAC Prediction Sets for Meta-Learning »
Sangdon Park · Edgar Dobriban · Insup Lee · Osbert Bastani -
2022 Poster: First Hitting Diffusion Models for Generating Manifold, Graph and Categorical Data »
Mao Ye · Lemeng Wu · Qiang Liu -
2022 Poster: Collaborative Learning of Discrete Distributions under Heterogeneity and Communication Constraints »
Xinmeng Huang · Donghwan Lee · Edgar Dobriban · Hamed Hassani -
2022 Poster: Sampling in Constrained Domains with Orthogonal-Space Variational Gradient Descent »
Ruqi Zhang · Qiang Liu · Xin Tong -
2022 Poster: Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent »
Zhiyuan Li · Tianhao Wang · Jason Lee · Sanjeev Arora -
2022 Poster: BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach »
Bo Liu · Mao Ye · Stephen Wright · Peter Stone · Qiang Liu -
2022 Poster: Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction »
Kaifeng Lyu · Zhiyuan Li · Sanjeev Arora -
2022 Poster: Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay »
Zhiyuan Li · Tianhao Wang · Dingli Yu -
2022 Poster: Diffusion-based Molecule Generation with Informative Prior Bridges »
Lemeng Wu · Chengyue Gong · Xingchao Liu · Mao Ye · Qiang Liu -
2021 Poster: On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs) »
Zhiyuan Li · Sadhika Malladi · Sanjeev Arora -
2021 Poster: Conflict-Averse Gradient Descent for Multi-task learning »
Bo Liu · Xingchao Liu · Xiaojie Jin · Peter Stone · Qiang Liu -
2021 Poster: Sampling with Trusthworthy Constraints: A Variational Gradient Framework »
Xingchao Liu · Xin Tong · Qiang Liu -
2021 Poster: Federated Reconstruction: Partially Local Federated Learning »
Karan Singhal · Hakim Sidahmed · Zachary Garrett · Shanshan Wu · John Rush · Sushant Prakash -
2021 Poster: Scalable Quasi-Bayesian Inference for Instrumental Variable Regression »
Ziyu Wang · Yuhao Zhou · Tongzheng Ren · Jun Zhu -
2021 Poster: Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias »
Kaifeng Lyu · Zhiyuan Li · Runzhe Wang · Sanjeev Arora -
2021 Poster: Bootstrapping the Error of Oja's Algorithm »
Robert Lunde · Purnamrita Sarkar · Rachel Ward -
2021 Poster: Automatic and Harmless Regularization with Constrained and Lexicographic Optimization: A Dynamic Barrier Approach »
Chengyue Gong · Xingchao Liu · Qiang Liu -
2021 Poster: argmax centroid »
Chengyue Gong · Mao Ye · Qiang Liu -
2021 Poster: Profiling Pareto Front With Multi-Objective Stein Variational Gradient Descent »
Xingchao Liu · Xin Tong · Qiang Liu -
2021 Poster: Nearly Horizon-Free Offline Reinforcement Learning »
Tongzheng Ren · Jialian Li · Bo Dai · Simon Du · Sujay Sanghavi -
2020 : Rachel Ward »
Rachel Ward -
2020 : Invited speaker: Concentration for matrix products, and convergence of Oja’s algorithm for streaming PCA, Rachel Ward »
Rachel Ward -
2020 Poster: Stein Self-Repulsive Dynamics: Benefits From Past Samples »
Mao Ye · Tongzheng Ren · Qiang Liu -
2020 Poster: Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate »
Zhiyuan Li · Kaifeng Lyu · Sanjeev Arora -
2020 Poster: Optimal Iterative Sketching Methods with the Subsampled Randomized Hadamard Transform »
Jonathan Lacotte · Sifan Liu · Edgar Dobriban · Mert Pilanci -
2020 Poster: A Group-Theoretic Framework for Data Augmentation »
Shuxiao Chen · Edgar Dobriban · Jane Lee -
2020 Poster: Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy »
Edward Moroshko · Blake Woodworth · Suriya Gunasekar · Jason Lee · Nati Srebro · Daniel Soudry -
2020 Spotlight: Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy »
Edward Moroshko · Blake Woodworth · Suriya Gunasekar · Jason Lee · Nati Srebro · Daniel Soudry -
2020 Oral: A Group-Theoretic Framework for Data Augmentation »
Shuxiao Chen · Edgar Dobriban · Jane Lee -
2019 : Break / Poster Session 1 »
Antonia Marcu · Yao-Yuan Yang · Pascale Gourdeau · Chen Zhu · Thodoris Lykouris · Jianfeng Chi · Mark Kozdoba · Arjun Nitin Bhagoji · Xiaoxia Wu · Jay Nandy · Michael T Smith · Bingyang Wen · Yuege Xie · Konstantinos Pitas · Suprosanna Shit · Maksym Andriushchenko · Dingli Yu · Gaël Letarte · Misha Khodak · Hussein Mozannar · Chara Podimata · James Foulds · Yizhen Wang · Huishuai Zhang · Ondrej Kuzelka · Alexander Levine · Nan Lu · Zakaria Mhammedi · Paul Viallard · Diana Cai · Lovedeep Gondara · James Lucas · Yasaman Mahdaviyeh · Aristide Baratin · Rishi Bommasani · Alessandro Barp · Andrew Ilyas · Kaiwen Wu · Jens Behrmann · Omar Rivasplata · Amir Nazemi · Aditi Raghunathan · Will Stephenson · Sahil Singla · Akhil Gupta · YooJung Choi · Yannic Kilcher · Clare Lyle · Edoardo Manino · Andrew Bennett · Zhi Xu · Niladri Chatterji · Emre Barut · Flavien Prost · Rodrigo Toro Icarte · Arno Blaas · Chulhee Yun · Sahin Lale · YiDing Jiang · Tharun Kumar Reddy Medini · Ashkan Rezaei · Alexander Meinke · Stephen Mell · Gary Kazantsev · Shivam Garg · Aradhana Sinha · Vishnu Lokhande · Geovani Rizk · Han Zhao · Aditya Kumar Akash · Jikai Hou · Ali Ghodsi · Matthias Hein · Tyler Sypherd · Yichen Yang · Anastasia Pentina · Pierre Gillot · Antoine Ledent · Guy Gur-Ari · Noah MacAulay · Tianzong Zhang -
2019 Poster: Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets »
Rohith Kuditipudi · Xiang Wang · Holden Lee · Yi Zhang · Zhiyuan Li · Wei Hu · Rong Ge · Sanjeev Arora -
2019 Poster: Sparse Logistic Regression Learns All Discrete Pairwise Graphical Models »
Shanshan Wu · Sujay Sanghavi · Alex Dimakis -
2019 Spotlight: Sparse Logistic Regression Learns All Discrete Pairwise Graphical Models »
Shanshan Wu · Sujay Sanghavi · Alex Dimakis -
2019 Poster: Learning Distributions Generated by One-Layer ReLU Networks »
Shanshan Wu · Alex Dimakis · Sujay Sanghavi -
2019 Poster: On Exact Computation with an Infinitely Wide Neural Net »
Sanjeev Arora · Simon Du · Wei Hu · Zhiyuan Li · Russ Salakhutdinov · Ruosong Wang -
2019 Spotlight: On Exact Computation with an Infinitely Wide Neural Net »
Sanjeev Arora · Simon Du · Wei Hu · Zhiyuan Li · Russ Salakhutdinov · Ruosong Wang -
2018 Poster: Online Improper Learning with an Approximation Oracle »
Elad Hazan · Wei Hu · Yuanzhi Li · Zhiyuan Li -
2018 Poster: Implicit Bias of Gradient Descent on Linear Convolutional Networks »
Suriya Gunasekar · Jason Lee · Daniel Soudry · Nati Srebro -
2018 Poster: On preserving non-discrimination when combining expert advice »
Avrim Blum · Suriya Gunasekar · Thodoris Lykouris · Nati Srebro -
2017 Poster: Implicit Regularization in Matrix Factorization »
Suriya Gunasekar · Blake Woodworth · Srinadh Bhojanapalli · Behnam Neyshabur · Nati Srebro -
2017 Spotlight: Implicit Regularization in Matrix Factorization »
Suriya Gunasekar · Blake Woodworth · Srinadh Bhojanapalli · Behnam Neyshabur · Nati Srebro -
2016 Poster: Leveraging Sparsity for Efficient Submodular Data Summarization »
Erik Lindgren · Shanshan Wu · Alex Dimakis -
2016 Poster: Preference Completion from Partial Rankings »
Suriya Gunasekar · Sanmi Koyejo · Joydeep Ghosh -
2016 Poster: Single Pass PCA of Matrix Products »
Shanshan Wu · Srinadh Bhojanapalli · Sujay Sanghavi · Alex Dimakis -
2016 Poster: Solving Marginal MAP Problems with NP Oracles and Parity Constraints »
Yexiang Xue · zhiyuan li · Stefano Ermon · Carla Gomes · Bart Selman -
2016 Poster: Learning in Games: Robustness of Fast Convergence »
Dylan Foster · zhiyuan li · Thodoris Lykouris · Karthik Sridharan · Eva Tardos -
2015 Poster: Unified View of Matrix Completion under General Structural Constraints »
Suriya Gunasekar · Arindam Banerjee · Joydeep Ghosh