The transformer architecture has become the dominant paradigm across a wide range of natural language processing (NLP) tasks. Its headline feature, the multi-head attention (MHA) mechanism, is remarkably effective at capturing pertinent relations within input sequences, but at the cost of quadratic complexity in compute and memory. We address this cost by pruning attention heads. Despite existing work on this line of pruning, it remains unclear whether there is an optimal strategy for creating a pruned transformer model based on head importance. Our initial aim is therefore to evaluate multiple pruning techniques and identify the properties of a method that generally lead to a better trade-off between run-time speed and accuracy in the pruned model. A key constraint, however, is that because of the self-attention operation carried out in transformer heads, head importance is largely input dependent. Under static pruning, heads that may be salient for particular inputs are permanently lost, so pruned models can rarely recover the original level of accuracy. This prompts the question: can we determine the pruning configuration of the model dynamically, based on its inputs at run time? We achieve this by introducing a novel technique for dynamic head pruning in transformers.
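Since the abstract describes the mechanism only at a high level, below is a minimal PyTorch sketch of what input-dependent head pruning can look like: a small gating network scores each attention head from a pooled summary of the current input, and only the top-k heads are kept for that input. The module name `DynamicallyPrunedMHA`, the gate design, and the `keep_k` parameter are illustrative assumptions, not the method proposed in the paper.

```python
# Hypothetical sketch of dynamic (input-dependent) head pruning.
# The gating heuristic and all names here are assumptions for illustration.
import torch
import torch.nn as nn


class DynamicallyPrunedMHA(nn.Module):
    def __init__(self, d_model: int, n_heads: int, keep_k: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.keep_k = keep_k
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Tiny gate that scores each head from a pooled summary of the
        # input; this per-input score is what makes the pruning dynamic.
        self.gate = nn.Linear(d_model, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))

        # Score the heads for this particular input and keep the top-k.
        scores = self.gate(x.mean(dim=1))                  # (batch, heads)
        topk = scores.topk(self.keep_k, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)
        mask = mask.view(b, self.n_heads, 1, 1)            # broadcast over tokens

        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = (attn @ v) * mask                            # zero out pruned heads
        return self.out(ctx.transpose(1, 2).reshape(b, t, d))
```

Note that the hard top-k mask in this sketch is an inference-time construct: it carries no gradient back to the gate, so training such a gate end-to-end would need a differentiable relaxation (e.g., a sigmoid gate with a sparsity penalty). The sketch also only masks head outputs; realizing an actual speed-up would require skipping the pruned heads' computation entirely.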
Author Information
Prisha Satwani (The London School of Economics)
Yiren Zhao (University of Cambridge)
Vidhi Lalchand (University of Cambridge)
Ph.D. student in Machine Learning at Cambridge. I work on Bayesian non-parametrics, Gaussian processes, and kernel learning. Application areas: high-energy physics, astronomy, and science!
Robert Mullins (University of Cambridge)
More from the Same Authors
- 2021 : DAdaQuant: Doubly-adaptive quantization for communication-efficient Federated Learning »
  Robert Hönig · Yiren Zhao · Robert Mullins
- 2022 : Revisiting Graph Neural Network Embeddings »
  Skye Purchase · Yiren Zhao · Robert Mullins
- 2022 : Gaussian Process parameterized Covariance Kernels for Non-stationary Regression »
  Vidhi Lalchand · Talay Cheema · Laurence Aitchison · Carl Edward Rasmussen
- 2022 : Wide Attention Is The Way Forward For Transformers »
  Jason Brown · Yiren Zhao · I Shumailov · Robert Mullins
- 2022 : DARTFormer: Finding The Best Type Of Attention »
  Jason Brown · Yiren Zhao · I Shumailov · Robert Mullins
- 2022 Poster: Sparse Gaussian Process Hyperparameters: Optimize or Integrate? »
  Vidhi Lalchand · Wessel Bruinsma · David Burt · Carl Edward Rasmussen
- 2022 Poster: Rapid Model Architecture Adaption for Meta-Learning »
  Yiren Zhao · Xitong Gao · I Shumailov · Nicolo Fusi · Robert Mullins
- 2021 Poster: Kernel Identification Through Transformers »
  Fergus Simpson · Ian Davies · Vidhi Lalchand · Alessandro Vullo · Nicolas Durrande · Carl Edward Rasmussen
- 2021 Poster: Marginalised Gaussian Processes with Nested Sampling »
  Fergus Simpson · Vidhi Lalchand · Carl Edward Rasmussen
- 2019 Poster: Focused Quantization for Sparse CNNs »
  Yiren Zhao · Xitong Gao · Daniel Bates · Robert Mullins · Cheng-Zhong Xu