Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding
Yue Guan ⋅ Changming Yu ⋅ Shihan Fang ⋅ Weiming Hu ⋅ Zaifeng Pan ⋅ Zheng Wang ⋅ Zihan Liu ⋅ Yangjie Zhou ⋅ Yufei Ding ⋅ Minyi Guo ⋅ Jingwen Leng
Abstract
Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel, but existing systems underperform because dynamic speculation is mismatched with the static assumptions of the runtime. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.
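To make the idea of a latency-aware objective concrete, the following is a minimal sketch and not Yggdrasil's actual formulation. It assumes a single-chain draft rather than a tree, an independent per-token acceptance probability, and fixed per-call latencies. All function names and parameters (`accept_prob`, `draft_latency`, `verify_latency`, `max_depth`) are hypothetical. The sketch picks the draft depth that maximizes expected tokens per unit of time instead of expected accepted tokens alone.

```python
def expected_tokens_per_step(accept_prob: float, depth: int) -> float:
    """Expected tokens emitted by one draft-then-verify step.

    Assumption: each drafted token is accepted independently with
    probability accept_prob, and the verifier always contributes one
    extra token (a correction or a bonus token). The expectation is
    therefore sum_{k=0}^{depth} accept_prob**k.
    """
    return sum(accept_prob ** k for k in range(depth + 1))


def tokens_per_time(accept_prob: float, depth: int,
                    draft_latency: float, verify_latency: float) -> float:
    """Latency-aware objective: expected tokens divided by step latency."""
    # Assumption: drafting costs draft_latency per token, and verification
    # costs a fixed verify_latency regardless of depth.
    step_latency = depth * draft_latency + verify_latency
    return expected_tokens_per_step(accept_prob, depth) / step_latency


def best_depth(accept_prob: float, draft_latency: float,
               verify_latency: float, max_depth: int = 16) -> int:
    """Return the draft depth in [0, max_depth] that maximizes throughput."""
    return max(
        range(max_depth + 1),
        key=lambda d: tokens_per_time(accept_prob, d,
                                      draft_latency, verify_latency),
    )
```

Under these assumptions, a deeper draft only pays off while the marginal expected token outweighs the added drafting latency, which is the trade-off a latency-aware objective is meant to capture.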