The surprising discovery of the BYOL method shows that negative samples can be replaced by adding a prediction head to the network. It remains mysterious why, even though trivial collapsed global optima exist, neural networks trained by (stochastic) gradient descent can still learn competitive representations. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we characterize two effects of the trainable, identity-initialized prediction head: a substitution effect and an acceleration effect. The substitution effect occurs when learning the stronger features in some neurons can substitute for learning those features in other neurons through updates to the prediction head. The acceleration effect occurs when the substituted features accelerate the learning of other, weaker features, preventing them from being ignored. Together, these two effects enable the network to learn diversified features rather than focusing only on the strongest features, which is the likely cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.
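To make the parameterization concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' released code) of a linear prediction head initialized as the identity matrix with only its off-diagonal entries trainable; the module and parameter names are ours.

```python
import torch
import torch.nn as nn

class OffDiagonalPredictionHead(nn.Module):
    """Sketch of an identity-initialized linear prediction head in which
    only the off-diagonal entries receive gradients (hypothetical name;
    illustrates the setting described in the abstract)."""

    def __init__(self, dim: int):
        super().__init__()
        # Trainable off-diagonal entries, initialized to zero so the
        # effective weight starts as the identity matrix.
        self.off_diag = nn.Parameter(torch.zeros(dim, dim))
        # Fixed identity component and a mask that zeroes the diagonal
        # of the trainable part; buffers are not updated by the optimizer.
        self.register_buffer("identity", torch.eye(dim))
        self.register_buffer("mask", 1.0 - torch.eye(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight: ones on the diagonal (fixed), trainable elsewhere.
        # The mask kills gradients to the diagonal entries of off_diag.
        weight = self.identity + self.off_diag * self.mask
        return x @ weight.t()

if __name__ == "__main__":
    head = OffDiagonalPredictionHead(dim=8)
    z = torch.randn(4, 8)
    # At initialization the head acts as the identity map.
    print(torch.allclose(head(z), z))  # True
```

In this sketch, gradient descent can only move the off-diagonal entries, so any feature mixing across neurons must be learned rather than present at initialization, matching the setup the abstract describes.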
Author Information
Zixin Wen (Carnegie Mellon University)
Yuanzhi Li (Carnegie Mellon University)
More from the Same Authors
- 2021: Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
  Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu
- 2022: Toward Understanding Why Adam Converges Faster Than SGD for Transformers
  Yan Pan · Yuanzhi Li
- 2022: Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions
  Sitan Chen · Sinho Chewi · Jerry Li · Yuanzhi Li · Adil Salim · Anru Zhang
- 2023 Poster: Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals
  Yue Wu · Yewen Fan · Paul Pu Liang · Amos Azaria · Yuanzhi Li · Tom Mitchell
- 2023 Poster: SPRING: Studying Papers and Reasoning to play Games
  Yue Wu · So Yeon Min · Shrimai Prabhumoye · Yonatan Bisk · Russ Salakhutdinov · Amos Azaria · Tom Mitchell · Yuanzhi Li
- 2023 Poster: How Does Adaptive Optimization Impact Local Neural Network Geometry?
  Kaiqi Jiang · Dhruv Malik · Yuanzhi Li
- 2023 Poster: The probability flow ODE is provably fast
  Sitan Chen · Sinho Chewi · Holden Lee · Yuanzhi Li · Jianfeng Lu · Adil Salim
- 2022 Poster: Towards Understanding the Mixture-of-Experts Layer in Deep Learning
  Zixiang Chen · Yihe Deng · Yue Wu · Quanquan Gu · Yuanzhi Li
- 2022 Poster: Vision Transformers provably learn spatial structure
  Samy Jelassi · Michael Sander · Yuanzhi Li
- 2022 Poster: Learning (Very) Simple Generative Models Is Hard
  Sitan Chen · Jerry Li · Yuanzhi Li
- 2021 Poster: Local Signal Adaptivity: Provable Feature Learning in Neural Networks Beyond Kernels
  Stefani Karp · Ezra Winston · Yuanzhi Li · Aarti Singh
- 2021 Poster: When Is Generalizable Reinforcement Learning Tractable?
  Dhruv Malik · Yuanzhi Li · Pradeep Ravikumar