Scale Test-Time Compute on Modern Hardware
Abstract
Large language models have achieved significant breakthroughs on reasoning tasks by making effective use of test-time compute. Techniques such as chain-of-thought and sampling-based strategies have shown that increasing test-time computation can dramatically enhance model performance. Our recent scaling-law analyses highlight the critical role of test-time compute in enabling advanced reasoning, beyond what pretraining alone can offer. We also provide a practical analysis of hardware efficiency, revealing where bottlenecks arise and how they differ fundamentally from those in pretraining. Scaling test-time compute on modern hardware presents unique challenges. Compared to training workloads, test-time compute often exhibits low parallelism, irregular workloads, frequent memory I/O, and dynamic execution paths, all of which make efficient deployment difficult. As a result, practical scalability is often bottlenecked by system constraints such as attention-related memory overheads and limited compute utilization. To address these challenges, the community has explored solutions across both systems and algorithms. On the system side, advances include memory-efficient key-value cache management, optimized attention kernels, and scheduling mechanisms for adaptive resource allocation. On the algorithm side, emerging work has proposed model architectures and parallel generation paradigms that better align with hardware characteristics. This tutorial aims to provide a comprehensive overview of the landscape of scalable test-time compute. We will cover foundational challenges, review recent progress from both system and algorithm perspectives, and discuss principles for building solutions that are truly compatible with modern hardware. By bridging theory with deployment realities, we hope this tutorial will inspire and accelerate the development of practical, scalable LLM agent systems.
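To make the attention-related memory bottleneck concrete, below is a minimal back-of-the-envelope sketch (not part of the tutorial materials) that estimates the key-value cache footprint of a hypothetical decoder-only model; the configuration values and function name are illustrative assumptions, not a real system's API.

```python
# Illustrative sketch: estimate KV-cache memory for a hypothetical model.
# All parameter values below are assumptions chosen for illustration.

def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # fp16 / bf16
) -> int:
    """Total bytes needed to cache keys and values for every layer and token."""
    # Factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical 7B-class configuration with grouped-query attention.
    cache = kv_cache_bytes(
        num_layers=32, num_kv_heads=8, head_dim=128,
        seq_len=32_768,   # a long chain-of-thought trace
        batch_size=16,    # parallel samples, e.g. best-of-N
    )
    print(f"KV cache: {cache / 2**30:.1f} GiB")  # -> 64.0 GiB
    # The cache grows linearly in both seq_len (longer reasoning) and
    # batch_size (more samples), which is why attention-related memory,
    # rather than raw FLOPs, often bounds test-time compute in deployment.
```

Under these assumed settings the cache alone reaches tens of GiB, dwarfing the model weights and motivating the memory-efficient cache management and scheduling techniques surveyed in the tutorial.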
Schedule