HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression
Jiaqi Gu · Ben Keller · Jean Kossaifi · Anima Anandkumar · Brucek Khailany · David Pan

Sat Dec 03 09:25 AM -- 09:35 AM (PST)
Event URL: https://openreview.net/forum?id=x9ZP4pnlZHo

Self-attention and feedforward layers in large-scale Transformer models are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by expressing weight matrices in an efficiently factorized form. Prior efforts used manual or heuristic decomposition settings without hardware-aware customization, resulting in poor hardware efficiency and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of tensor decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. We find that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
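The core idea of replacing a dense weight matrix with a tensorized, low-rank factorization applied through a single fused einsum can be illustrated with a minimal sketch. The code below is not the authors' HEAT implementation: the `FactorizedLinear` class, the tensorization shape `(m1, m2, n1, n2)`, and the rank are hypothetical choices, shown only to convey the mechanics of a factorized layer.

```python
# Minimal sketch (illustrative, not the HEAT implementation): a dense weight
# W of shape (m1*m2, n1*n2) is replaced by two low-rank cores that are
# contracted with the input in a single einsum.
import torch
import torch.nn as nn


class FactorizedLinear(nn.Module):
    def __init__(self, m1, m2, n1, n2, rank):
        super().__init__()
        # W[(m1*m2), (n1*n2)] is approximated by contracting two cores:
        #   core1: (m1, n1, rank), core2: (rank, m2, n2)
        self.core1 = nn.Parameter(torch.randn(m1, n1, rank) * 0.02)
        self.core2 = nn.Parameter(torch.randn(rank, m2, n2) * 0.02)
        self.n1, self.n2 = n1, n2

    def forward(self, x):
        # x: (batch, n1*n2) -> reshape to match the tensorized weight
        b = x.shape[0]
        x = x.reshape(b, self.n1, self.n2)
        # Fused contraction:
        #   out[b, i, j] = sum_{k, l, r} x[b, k, l] * core1[i, k, r] * core2[r, j, l]
        out = torch.einsum("bkl,ikr,rjl->bij", x, self.core1, self.core2)
        return out.reshape(b, -1)


# Usage: a 1024x1024 dense layer (1,048,576 parameters) is replaced by
# 2 * 32 * 32 * 8 = 16,384 parameters with these example settings.
layer = FactorizedLinear(m1=32, m2=32, n1=32, n2=32, rank=8)
y = layer(torch.randn(4, 1024))  # output shape: (4, 1024)
```

In such a layer, the tensorization shape and rank determine the accuracy-efficiency trade-off, and the order in which the cores and the input are contracted determines the actual hardware cost; these are exactly the knobs that HEAT's automated search, contraction path optimization, and fused Einsum mapping are described as co-optimizing.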

Author Information

Jiaqi Gu (The University of Texas at Austin)
Ben Keller (NVIDIA)
Jean Kossaifi (NVIDIA Research)
Anima Anandkumar (NVIDIA / Caltech)
Brucek Khailany (NVIDIA)
David Pan (The University of Texas at Austin)
