Poster
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu Zhang · Ying Sheng · Tianyi Zhou · Tianlong Chen · Lianmin Zheng · Ruisi Cai · Zhao Song · Yuandong Tian · Christopher Ré · Clark Barrett · Zhangyang "Atlas" Wang · Beidi Chen
Great Hall & Hall B1+B2 (level 1) #534
Abstract:
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H2). Through a comprehensive investigation, we find that (i) the emergence of H2 is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H2O), a KV cache eviction policy that dynamically retains a balance of recent and H2 tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H2O with 20% heavy hitters improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29x, 29x, and 3x on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce the latency by up to 1.9x.
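To make the eviction policy described in the abstract concrete, below is a minimal, illustrative sketch of a single H2O-style eviction step, not the authors' released implementation. It assumes an accumulated attention score is available for every cached token and keeps a fixed KV budget split between the most recent tokens and the highest-scoring "heavy hitter" tokens; the function name and parameters (h2o_keep_mask, cache_budget, recent_budget) are hypothetical, and the NumPy setting is only for readability.

```python
# Illustrative sketch of an H2O-style KV cache eviction step.
# acc_scores[i] is the attention mass token i has received, accumulated
# over all decoding steps so far (a proxy for how "heavy" the token is).
import numpy as np

def h2o_keep_mask(acc_scores: np.ndarray, cache_budget: int, recent_budget: int) -> np.ndarray:
    """Return a boolean mask over cached positions to KEEP.

    cache_budget  -- total number of KV entries allowed in the cache.
    recent_budget -- number of most recent tokens that are always kept.
    """
    n = acc_scores.shape[0]
    keep = np.zeros(n, dtype=bool)
    if n <= cache_budget:
        keep[:] = True          # nothing to evict yet
        return keep

    # Always keep the local window of most recent tokens.
    keep[n - recent_budget:] = True

    # Fill the remaining budget with the older tokens that have the
    # largest accumulated attention scores (the heavy hitters).
    heavy_budget = cache_budget - recent_budget
    older = acc_scores[: n - recent_budget]
    heavy_idx = np.argsort(older)[::-1][:heavy_budget]
    keep[heavy_idx] = True
    return keep

# Toy usage: 10 cached tokens, budget of 5 (3 heavy hitters + 2 recent).
scores = np.array([5.0, 0.1, 3.2, 0.05, 0.4, 2.8, 0.2, 0.3, 0.1, 0.1])
mask = h2o_keep_mask(scores, cache_budget=5, recent_budget=2)
print(np.nonzero(mask)[0])  # kept positions, e.g. [0 2 5 8 9]
```

In practice such a mask would be recomputed per attention head at each decoding step, with the accumulated scores updated from the new token's attention weights before eviction is applied.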