Optimizing LLM Inference: Fluid-Based Online Scheduling under Memory Constraints
Ruicheng Ao · Gan Luo · David Simchi-Levi · Xinshang Wang
Abstract
Large Language Model (LLM) inference poses unique scheduling challenges because the Key-Value (KV) cache grows dynamically during token generation, rendering traditional scheduling algorithms ineffective. We develop a fluid dynamics approximation to establish an optimal throughput benchmark and propose the WAIT (Waiting for Accumulated Inference Threshold) algorithm, which achieves near-optimal throughput when output lengths are known at arrival. For practical scenarios with unknown output lengths, we introduce Nested WAIT, which maintains asymptotic optimality through hierarchical segmentation. Experiments on Llama-7B demonstrate 20-30% throughput improvements over state-of-the-art systems such as vLLM.
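To make the threshold-based idea concrete, below is a minimal Python sketch of WAIT-style scheduling: requests accumulate in a queue and a batch is launched only once the number of waiting requests reaches a threshold, with a simplified check that the growing KV cache stays within a memory budget. The class and parameter names (e.g. `WaitScheduler`, `kv_budget_tokens`) and the memory model are illustrative assumptions, not the paper's implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int    # prompt tokens (KV cache occupied at admission)
    output_len: int    # expected number of tokens to generate
    generated: int = 0  # tokens generated so far


class WaitScheduler:
    """Illustrative threshold-triggered batching in the spirit of WAIT.

    Requests wait in a queue; a batch is admitted only after `threshold`
    requests have accumulated. Because each request's KV cache grows by
    one token per decoding step, admission also checks a simplified
    worst-case memory budget. This is a sketch under assumed parameters,
    not the algorithm as specified in the paper.
    """

    def __init__(self, threshold: int, kv_budget_tokens: int):
        self.threshold = threshold
        self.kv_budget = kv_budget_tokens
        self.waiting: deque = deque()
        self.running: list = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def _kv_usage(self) -> int:
        # Current KV cache footprint: prompt plus tokens generated so far.
        return sum(r.prompt_len + r.generated for r in self.running)

    def maybe_launch_batch(self) -> None:
        # WAIT: hold off until enough requests have accumulated.
        if len(self.waiting) < self.threshold:
            return
        while self.waiting:
            req = self.waiting[0]
            peak = req.prompt_len + req.output_len  # worst-case KV footprint
            if self._kv_usage() + peak > self.kv_budget:
                break  # admitting this request could overflow memory
            self.running.append(self.waiting.popleft())

    def decode_step(self) -> None:
        # One decoding iteration: every running request emits one token.
        for r in self.running:
            r.generated += 1
        self.running = [r for r in self.running if r.generated < r.output_len]


# Example usage with made-up sizes: accumulate 4 requests before batching.
if __name__ == "__main__":
    sched = WaitScheduler(threshold=4, kv_budget_tokens=8192)
    for _ in range(4):
        sched.submit(Request(prompt_len=128, output_len=256))
    sched.maybe_launch_batch()
    sched.decode_step()
    print(f"running: {len(sched.running)}, waiting: {len(sched.waiting)}")
```

The sketch keeps known output lengths, matching the setting in which WAIT is analyzed; handling unknown output lengths (as in Nested WAIT) would require grouping requests into nested segments by how many tokens they have generated so far.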