BiT: Robustly Binarized Multi-distilled Transformer

Zechun Liu · Barlas Oguz · Aasish Pappu · Lin Xiao · Scott Yih · Meng Li · Raghuraman Krishnamoorthi · Yashar Mehdad

Hall J #105

Keywords: [ Binary neural networks ] [ Natural Language Processing ] [ BERT ] [ compression ] [ transformers ]

Abstract:
Thu 1 Dec 2 p.m. PST — 4 p.m. PST
Spotlight presentation: Lightning Talks 2A-4
Tue 6 Dec 6:30 p.m. PST — 6:45 p.m. PST


Modern pre-trained transformers have rapidly advanced the state of the art in machine learning, but they have also grown in parameter count and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarizing the weights and activations of the network can significantly alleviate these issues, but it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher-precision models into lower-precision students. Together, these approaches yield, for the first time, fully binarized transformer models at a practical level of accuracy, coming within as little as 5.9% of a full-precision BERT baseline on the GLUE language understanding benchmark. Code and models are available at:
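
To make the idea of a two-set binarization scheme with an elastic activation more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that "two-set" means signed values are mapped to {-alpha, +alpha} and non-negative values (for example attention probabilities) to {0, alpha}, and that the elastic function is parameterized by a scale `alpha` and a threshold `beta`. The function names and both parameters are hypothetical, and here they are fixed arguments rather than learned. Training through the rounding step would additionally need a gradient approximation such as a straight-through estimator, which is not shown.

```python
def binarize_signed(xs, alpha=1.0, beta=0.0):
    """Map each value to -alpha or +alpha, splitting at threshold beta.

    Assumed form for tensors that take both signs (e.g. weights).
    """
    return [alpha if x >= beta else -alpha for x in xs]


def binarize_unsigned(xs, alpha=1.0, beta=0.0):
    """Map each value to 0 or alpha.

    Assumed form for non-negative tensors:
        alpha * clip(round((x - beta) / alpha), 0, 1)
    """
    out = []
    for x in xs:
        level = round((x - beta) / alpha)  # nearest integer level
        level = min(1, max(0, level))      # clip to the two levels {0, 1}
        out.append(alpha * level)
    return out


if __name__ == "__main__":
    # alpha and beta are hypothetical values chosen for illustration only.
    print(binarize_signed([-0.3, 0.2, 1.5], alpha=0.5))
    print(binarize_unsigned([0.1, 0.4, 0.9], alpha=0.6))
```

In this sketch, `alpha` controls the magnitude of the non-zero level and `beta` shifts the decision threshold. Making both learnable is one plausible reading of "elastic", since the binarization can then adapt to each layer's value distribution.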
