

Qualcomm AI Research

Expo Demonstration

Disaggregated LLM Serving on AI Accelerators

Ron Tindall

Upper Level Room 29A-D
Tue 2 Dec noon PST — 3 p.m. PST

Abstract:

This demo showcases disaggregated serving on the Qualcomm Cloud AI 100 Ultra card, a power-efficient AI inference accelerator purpose-built for large language model (LLM) serving. The accelerator has been deployed across multiple cloud service providers (CSPs) globally and is actively serving state-of-the-art LLMs and other generative AI workloads.

LLM inference typically involves two distinct stages: prefill and decode. The prefill stage is compute-bound, while the decode stage is memory-bound. Applying uniform parallelism strategies across both stages often results in suboptimal performance, particularly in key metrics such as Time to First Token (TTFT) and Requests Per Minute (RPM) at the cluster level.

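The split is visible in a minimal generation loop. The Python sketch below is illustrative only, assuming a hypothetical `model` object with `prefill` and `decode_step` methods (not an actual Cloud AI 100 API): prefill processes every prompt token in one arithmetic-heavy pass, while decode emits one token per step and must re-read the growing KV cache each time.

```python
# Minimal sketch of the two inference stages. The `model` interface
# (prefill / decode_step) is hypothetical, chosen only to show where
# the compute-bound and memory-bound phases occur.

def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: one pass over the whole prompt. All prompt positions are
    # processed in parallel, so arithmetic throughput dominates (compute-bound).
    kv_cache, next_token = model.prefill(prompt_tokens)

    output = [next_token]
    # Decode: one token per step. Each step reads the entire KV cache to
    # produce a single token, so memory bandwidth dominates (memory-bound).
    for _ in range(max_new_tokens - 1):
        kv_cache, next_token = model.decode_step(kv_cache, next_token)
        output.append(next_token)
    return output
```
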
This demo highlights the performance benefits of disaggregated parallelism strategies tailored to the unique characteristics of each stage. By optimizing the execution of prefill and decode independently, we demonstrate significant improvements in TTFT and overall throughput.

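One way to picture the approach is a scheduler that routes each request through separate prefill and decode worker pools, as in the sketch below. The `prefill_pool`/`decode_pool` objects, `least_loaded` selection, and KV-cache handoff are hypothetical placeholders rather than the actual Cloud AI 100 serving stack; the point is only that each pool can be provisioned and parallelized for its own bottleneck.

```python
# Minimal sketch of disaggregated scheduling with hypothetical worker pools.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    prompt_tokens: list
    max_new_tokens: int


class DisaggregatedScheduler:
    def __init__(self, prefill_pool, decode_pool):
        # Prefill workers are provisioned for compute (e.g., wider parallelism
        # to cut TTFT); decode workers for memory bandwidth and large batches.
        self.prefill_pool = prefill_pool
        self.decode_pool = decode_pool

    def submit(self, request: Request):
        # Stage 1: run the prompt on a prefill worker; this bounds TTFT.
        prefiller = self.prefill_pool.least_loaded()
        kv_cache, first_token = prefiller.prefill(request.prompt_tokens)

        # Stage 2: hand the KV cache to a decode worker, which batches many
        # in-flight requests to drive cluster-level throughput (RPM).
        decoder = self.decode_pool.least_loaded()
        return decoder.decode(kv_cache, first_token, request.max_new_tokens)
```
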
Key benefits:

- Improved TTFT: Faster initial response times for LLM queries.
- Higher throughput: Increased number of requests served per minute at the cluster level.
- Optimized resource utilization: Efficient mapping of compute and memory resources to match workload characteristics.
- SLA-adherent performance: Maintains service quality and responsiveness within strict latency and throughput requirements.
