
Affinity Workshop: Women in Machine Learning

Dynamic Head Pruning in Transformers

Prisha Satwani · Yiren Zhao · Vidhi Lalchand · Robert Mullins


The transformer architecture has become a dominant paradigm across a wide range of natural language processing (NLP) tasks. Its headline feature, the multi-head attention (MHA) mechanism, is remarkably effective at capturing pertinent relations within input sequences, but at the cost of compute and memory that grow quadratically with sequence length. We address this cost by pruning attention heads. Despite existing work on head pruning, it remains unclear whether there is an optimal strategy for creating a pruned transformer model based on head importance. Our initial aim is to evaluate multiple pruning techniques to understand which aspects of a method generally lead to a better trade-off between run-time speed and accuracy of the pruned model. A key constraint, however, is that because of the self-attention operation carried out in transformer heads, head importance is to a large extent input dependent. In current pruning designs, heads that may be salient for particular inputs are permanently lost, which means that pruned models will rarely be able to recover the original level of accuracy. This prompts the question: can we dynamically determine the pruning configuration of the model based on its inputs at run time? We pursue this by introducing a novel technique for dynamic head pruning in transformers.
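To make the contrast between a fixed pruning configuration and an input-dependent one concrete, the following is a minimal NumPy sketch. It is not the authors' technique, which the abstract does not specify. The gating function `dynamic_head_mask`, its weight matrix `w_gate`, the mean-pooling step, and the top-`keep` selection rule are all hypothetical choices made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask):
    """Single-sequence multi-head self-attention with a per-head on/off mask.

    x:          (seq_len, d_model) input sequence
    w_q/w_k/w_v:(n_heads, d_model, d_head) per-head projections
    w_o:        (n_heads, d_head, d_model) per-head output projections
    head_mask:  (n_heads,) entries of 0 (pruned) or 1 (kept)
    """
    seq_len, d_model = x.shape
    out = np.zeros((seq_len, d_model))
    # Pruned heads are skipped entirely, so they cost no compute.
    for h in np.flatnonzero(head_mask):
        q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out += (attn @ v) @ w_o[h]
    return out


def dynamic_head_mask(x, w_gate, keep):
    """Hypothetical input-dependent gate (an assumption, not from the source).

    Scores each head from the mean-pooled input and keeps the `keep`
    highest-scoring heads for this particular input.
    """
    pooled = x.mean(axis=0)               # (d_model,)
    head_scores = pooled @ w_gate         # (n_heads,)
    mask = np.zeros(w_gate.shape[1], dtype=int)
    mask[np.argsort(head_scores)[::-1][:keep]] = 1
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_heads, d_model, d_head, seq_len = 4, 8, 2, 5
    x = rng.normal(size=(seq_len, d_model))
    w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
    w_o = rng.normal(size=(n_heads, d_head, d_model))
    w_gate = rng.normal(size=(d_model, n_heads))

    static_mask = np.array([1, 1, 0, 0])              # fixed for every input
    dyn_mask = dynamic_head_mask(x, w_gate, keep=2)   # chosen per input
    print("static mask :", static_mask)
    print("dynamic mask:", dyn_mask)
    print(multi_head_attention(x, w_q, w_k, w_v, w_o, dyn_mask).shape)
```

With a static mask, the same heads are removed for every input. With the gate, a different subset of heads can be kept for each input while the number of active heads, and hence the attention cost, stays the same.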
