Understanding the Failure of Batch Normalization for Transformers in NLP
Jiaxi Wang · Ji Wu · Lei Huang

Wed Dec 07 09:00 AM -- 11:00 AM (PST)

Batch Normalization (BN) is a core and prevalent technique for accelerating the training of deep neural networks and improving generalization on Computer Vision (CV) tasks. However, it fails to defend its position in Natural Language Processing (NLP), which is dominated by Layer Normalization (LN). In this paper, we try to answer why BN usually performs worse than LN on NLP tasks with Transformer models. We find that the inconsistency between training and inference in BN is the leading cause of its failure in NLP. We define Training Inference Discrepancy (TID) to quantitatively measure this inconsistency and show that TID can indicate BN's performance, supported by extensive experiments including image classification, neural machine translation, language modeling, sequence labeling, and text classification tasks. We find that BN can obtain much better test performance than LN when TID stays small throughout training. To suppress the explosion of TID, we propose Regularized BN (RBN), which adds a simple regularization term to narrow the gap between the batch statistics and population statistics of BN. RBN consistently improves the performance of BN and outperforms or is on par with LN in 17 out of 20 settings, covering ten datasets and two common variants of the Transformer.
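
To make the idea of a gap between batch statistics and population statistics concrete, here is a minimal sketch in plain Python. It is not the paper's definition of TID or of the RBN regularizer, neither of which is given in this abstract. The relative L2 distance in `discrepancy`, the squared-gap penalty in `regularizer`, the momentum value, and all function names are assumptions chosen for illustration only.

```python
import math
import random


def batch_stats(batch):
    """Per-feature mean and biased variance of a batch of feature vectors."""
    n = len(batch)
    dim = len(batch[0])
    mean = [sum(x[j] for x in batch) / n for j in range(dim)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / n for j in range(dim)]
    return mean, var


def l2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def update_running(running, current, momentum=0.1):
    """Exponential moving average, a common way BN layers estimate
    population statistics during training (momentum is an assumption)."""
    return [(1 - momentum) * r + momentum * c for r, c in zip(running, current)]


def discrepancy(batch_mean, batch_var, pop_mean, pop_var, eps=1e-5):
    """Illustrative gap between batch and population statistics.

    Assumption: relative L2 distance of means and standard deviations.
    This is NOT necessarily how the paper defines TID.
    """
    sigma_b = [math.sqrt(v + eps) for v in batch_var]
    sigma_p = [math.sqrt(v + eps) for v in pop_var]
    scale = l2(sigma_p)
    mean_gap = l2([b - p for b, p in zip(batch_mean, pop_mean)]) / scale
    std_gap = l2([b - p for b, p in zip(sigma_b, sigma_p)]) / scale
    return mean_gap, std_gap


def regularizer(batch_mean, batch_var, pop_mean, pop_var):
    """Hypothetical penalty that shrinks the batch/population gap.

    Assumption: plain squared differences; the paper's RBN term may differ.
    """
    mean_term = sum((b - p) ** 2 for b, p in zip(batch_mean, pop_mean))
    var_term = sum((b - p) ** 2 for b, p in zip(batch_var, pop_var))
    return mean_term + var_term


if __name__ == "__main__":
    random.seed(0)
    dim = 4
    pop_mean, pop_var = [0.0] * dim, [1.0] * dim
    for step in range(50):
        # Synthetic activations; real inputs would be layer outputs.
        batch = [[random.gauss(2.0, 3.0) for _ in range(dim)] for _ in range(32)]
        mean, var = batch_stats(batch)
        if step % 10 == 0:
            print(step, discrepancy(mean, var, pop_mean, pop_var))
        pop_mean = update_running(pop_mean, mean)
        pop_var = update_running(pop_var, var)
```

In a real training loop, a penalty like `regularizer` would be scaled by a coefficient and added to the task loss, and the discrepancy would be logged per BN layer to see whether it stays small during training.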

Author Information

Jiaxi Wang (Tsinghua University)
Ji Wu (Tsinghua University)
Lei Huang (Beihang University)

Lei Huang received his BSc and PhD degrees in 2010 and 2018, respectively, from the School of Computer Science and Engineering, Beihang University, China, under the supervision of Prof. Wei Li. From 2015 to 2016, he visited the Vision and Learning Lab, University of Michigan, Ann Arbor, as a joint PhD student supervised by Prof. Jia Deng. From 2018 to 2020, he was a research scientist at the Inception Institute of Artificial Intelligence (IIAI), UAE. His current research mainly focuses on normalization techniques (methods, theories, and applications) for training DNNs. He also has broad interests in deep learning theory (representation and optimization) and computer vision tasks. He serves as a reviewer for top conferences and journals such as CVPR, ICML, ICCV, ECCV, NeurIPS, AAAI, JMLR, TPAMI, IJCV, and TNNLS.

More from the Same Authors

  • 2022 Poster: An Investigation into Whitening Loss for Self-supervised Learning
    Xi Weng · Lei Huang · Lei Zhao · Rao Anwer · Salman Khan · Fahad Shahbaz Khan
  • 2022 Spotlight: Lightning Talks 3B-3
    Sitao Luan · Zhiyuan You · Ruofan Liu · Linhao Qu · Yuwei Fu · Jiaxi Wang · Chunyu Wei · Jian Liang · xiaoyuan luo · Di Wu · Yun Lin · Lei Cui · Ji Wu · Chenqing Hua · Yujun Shen · Qincheng Lu · XIANGLIN YANG · Benoit Boulet · Manning Wang · Di Liu · Lei Huang · Fei Wang · Kai Yang · Jiaqi Zhu · Jin Song Dong · Zhijian Song · Xin Lu · Mingde Zhao · Shuyuan Zhang · Yu Zheng · Xiao-Wen Chang · Xinyi Le · Doina Precup
  • 2022 Spotlight: Lightning Talks 1B-3
    Chaofei Wang · Qixun Wang · Jing Xu · Long-Kai Huang · Xi Weng · Fei Ye · Harsh Rangwani · shrinivas ramasubramanian · Yifei Wang · Qisen Yang · Xu Luo · Lei Huang · Adrian G. Bors · Ying Wei · Xinglin Pan · Sho Takemori · Hong Zhu · Rui Huang · Lei Zhao · Yisen Wang · Kato Takashi · Shiji Song · Yanan Li · Rao Anwer · Yuhei Umeda · Salman Khan · Gao Huang · Wenjie Pei · Fahad Shahbaz Khan · Venkatesh Babu R · Zenglin Xu
  • 2022 Spotlight: An Investigation into Whitening Loss for Self-supervised Learning
    Xi Weng · Lei Huang · Lei Zhao · Rao Anwer · Salman Khan · Fahad Shahbaz Khan
  • 2021 Poster: Rot-Pro: Modeling Transitivity by Projection in Knowledge Graph Embedding
    Tengwei Song · Jie Luo · Lei Huang