Poster in Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models
Snakes and Ladders: Accelerating SSM Inference with Speculative Decoding
Yangchao Wu · Yonatan Dukler · Matthew Trager · Alessandro Achille · Wei Xia · Stefano Soatto
Keywords: [ Efficient Inference ]
Abstract:
Speculative decoding is a method for accelerating inference in large language models (LLMs) by predicting multiple tokens using a smaller draft model and validating them in parallel using the base model. If a draft token is inconsistent with what the base model would have generated, speculative decoding "backtracks" to the last consistent token before resuming generation. This is straightforward in autoregressive Transformer architectures, whose state is a sliding window of past tokens, but their baseline inference complexity is quadratic in the number of input tokens. State Space Models (SSMs) have linear inference complexity, yet they maintain a separate Markov state that makes backtracking non-trivial. We propose two methods to perform speculative decoding in SSMs: "Joint Attainment and Advancement" and "Activation Replay." Both methods utilize idle computational resources to speculate and verify multiple tokens, allowing us to produce 6 tokens for 1.47× the cost of one, corresponding to an average 1.82× wall-clock speed-up on three different benchmarks using a simple n-gram for drafting. Furthermore, as model size increases, the relative overhead of speculation and verification decreases: scaling from 1.3B parameters to 13B reduces the relative overhead from 1.98× to 1.22×. Unlike Transformers, speculative decoding in SSMs can be easily applied to batches of sequences, allowing dynamic allocation of resources to fill gaps in compute utilization and thereby improving efficiency and throughput under variable inference traffic.
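For intuition, below is a minimal sketch of the speculate-and-verify loop the abstract describes, with an n-gram table as the draft model and greedy verification. Everything here is an illustrative assumption rather than the authors' implementation: `NGramDraft`, `base_step`, and `speculative_step` are hypothetical names, the base model is a toy, and verification runs sequentially, whereas the paper verifies drafted tokens in parallel through the SSM scan and recovers the rolled-back state via its proposed methods (Joint Attainment and Advancement, Activation Replay).

```python
from collections import defaultdict


class NGramDraft:
    """Drafts continuations by looking up the most frequent next token
    for the last (n - 1) tokens of the current prefix."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Record every n-gram in `tokens`.
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.counts[ctx][tokens[i + self.n - 1]] += 1

    def propose(self, prefix, k):
        # Greedily chain up to k most-frequent continuations.
        drafted, out = list(prefix), []
        for _ in range(k):
            cont = self.counts.get(tuple(drafted[-(self.n - 1):]))
            if not cont:
                break
            tok = max(cont, key=cont.get)
            out.append(tok)
            drafted.append(tok)
        return out


def speculative_step(state, cur, drafted, base_step):
    """One speculate-and-verify round with greedy verification.

    base_step(state, token) -> (state, next_token) stands in for the SSM's
    single-token update. Because `state` is only ever committed for tokens
    the base model agrees with, a rejected draft needs no recomputation --
    the rollback that the paper's methods achieve for parallel (scanned)
    verification is trivial in this sequential sketch.
    """
    accepted = []
    for tok in drafted:
        state, predicted = base_step(state, cur)
        if predicted != tok:
            # Mismatch: keep the base model's own token and end the round.
            return state, accepted + [predicted], predicted
        accepted.append(tok)
        cur = tok
    # Every draft was accepted; the base model's next prediction is a
    # free "bonus" token, so a round can yield up to k + 1 tokens.
    state, bonus = base_step(state, cur)
    return state, accepted + [bonus], bonus


if __name__ == "__main__":
    # Toy base model over a 5-token vocabulary: next = (token + 1) % 5.
    # The state is just a step counter here; a real SSM would carry its
    # recurrent activations, prefilled over the prompt before decoding.
    def base_step(state, token):
        return state + 1, (token + 1) % 5

    seq = [0, 1, 2, 3, 4, 0, 1]        # warm-up prompt
    draft = NGramDraft(n=2)
    draft.update(seq)                   # build n-gram stats from the prompt
    state, cur = 0, seq[-1]
    for _ in range(3):
        proposed = draft.propose(seq, k=5)
        state, accepted, cur = speculative_step(state, cur, proposed, base_step)
        draft.update(seq[-(draft.n - 1):] + accepted)
        seq.extend(accepted)
        print(f"accepted {len(accepted)} tokens: {accepted}")
    print("sequence:", seq)
```

Note the accounting: when all k drafted tokens pass verification, the round emits k + 1 tokens, since the base model's prediction at the last verified position comes for free. This is how 5 accepted drafts can yield the 6 tokens per verification cycle cited in the abstract.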