Optimizing LLM Inference: Fluid-Based Online Scheduling under Memory Constraints
Ruicheng Ao · Gan Luo · David Simchi-Levi · Xinshang Wang
Abstract
Large Language Model (LLM) inference poses unique scheduling challenges because the Key-Value (KV) cache grows dynamically during token generation, rendering traditional scheduling algorithms ineffective. We develop a fluid dynamics approximation to establish an optimal throughput benchmark and propose the WAIT (Waiting for Accumulated Inference Threshold) algorithm, which achieves near-optimal throughput when output lengths are known at arrival. For practical scenarios with unknown output lengths, we introduce Nested WAIT, which maintains asymptotic optimality through hierarchical segmentation. Experiments on Llama-7B demonstrate 20-30% throughput improvements over state-of-the-art systems such as vLLM.
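To make the threshold-based idea concrete, below is a minimal Python sketch of WAIT-style scheduling: requests accumulate in a queue and a batch is launched only once the number of waiting requests reaches a threshold, with a simplified check that the growing KV cache stays within a memory budget. The class and parameter names (e.g. `WaitScheduler`, `kv_budget_tokens`) and the memory model are illustrative assumptions, not the paper's implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int    # prompt tokens (KV cache occupied at admission)
    output_len: int    # expected number of tokens to generate
    generated: int = 0  # tokens generated so far


class WaitScheduler:
    """Illustrative threshold-triggered batching in the spirit of WAIT.

    Requests wait in a queue; a batch is admitted only after `threshold`
    requests have accumulated. Because each request's KV cache grows by
    one token per decoding step, admission also checks a simplified
    worst-case memory budget. This is a sketch under assumed parameters,
    not the algorithm as specified in the paper.
    """

    def __init__(self, threshold: int, kv_budget_tokens: int):
        self.threshold = threshold
        self.kv_budget = kv_budget_tokens
        self.waiting: deque = deque()
        self.running: list = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def _kv_usage(self) -> int:
        # Current KV cache footprint: prompt plus tokens generated so far.
        return sum(r.prompt_len + r.generated for r in self.running)

    def maybe_launch_batch(self) -> None:
        # WAIT: hold off until enough requests have accumulated.
        if len(self.waiting) < self.threshold:
            return
        while self.waiting:
            req = self.waiting[0]
            peak = req.prompt_len + req.output_len  # worst-case KV footprint
            if self._kv_usage() + peak > self.kv_budget:
                break  # admitting this request could overflow memory
            self.running.append(self.waiting.popleft())

    def decode_step(self) -> None:
        # One decoding iteration: every running request emits one token.
        for r in self.running:
            r.generated += 1
        self.running = [r for r in self.running if r.generated < r.output_len]


# Example usage with made-up sizes: accumulate 4 requests before batching.
if __name__ == "__main__":
    sched = WaitScheduler(threshold=4, kv_budget_tokens=8192)
    for _ in range(4):
        sched.submit(Request(prompt_len=128, output_len=256))
    sched.maybe_launch_batch()
    sched.decode_step()
    print(f"running: {len(sched.running)}, waiting: {len(sched.waiting)}")
```

The sketch keeps known output lengths, matching the setting in which WAIT is analyzed; handling unknown output lengths (as in Nested WAIT) would require grouping requests into nested segments by how many tokens they have generated so far.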