
Workshop: Machine Learning for Systems

HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression

Jiaqi Gu · Ben Keller · Jean Kossaifi · Anima Anandkumar · Brucek Khailany · David Pan


Self-attention and feedforward layers in large-scale Transformer models are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by expressing weight matrices in an efficiently factorized form. Prior efforts used manual or heuristic decomposition settings without hardware-aware customization, resulting in poor hardware efficiency and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of tensor decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. We find that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
