Attention, Please: Single-Head Cross-Attention for Unified LLM Routing
Abstract
The growing diversity of language models, ranging from lightweight, cost-efficient models to large, powerful, but expensive ones, has made dynamic model selection essential for scalable and cost-effective deployment. We propose a unified LLM routing framework that jointly models query and model embeddings using a single-head cross-attention mechanism. We evaluate the router's decision-making capabilities on RouterBench, a large-scale, publicly available dataset that enables evaluation across multiple LLM pools and domains. By capturing fine-grained query-model interactions, our router learns to predict both response quality and generation cost, outperforming existing predictive routers by up to 6.6\% in Average Improvement in Quality (AIQ) and 2.9\% in maximum performance. To better reflect the trade-off between performance and cost, we adopt a new exponential reward function with improved robustness. Our architecture is lightweight, generalizes well across domains, and is more efficient than existing routing architectures.
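For concreteness, the sketch below shows one plausible layout for a single-head cross-attention router that scores a query against a pool of candidate models and emits per-model quality and cost predictions. The class name `CrossAttentionRouter`, the hidden dimensionality, and the two linear prediction heads are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a single-head cross-attention router (PyTorch).
# Dimensions, layer names, and the two-head output design are assumptions
# made for illustration; the paper's architecture may differ in detail.
import torch
import torch.nn as nn


class CrossAttentionRouter(nn.Module):
    def __init__(self, query_dim: int, model_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Project query and model embeddings into a shared space.
        self.q_proj = nn.Linear(query_dim, hidden_dim)
        self.k_proj = nn.Linear(model_dim, hidden_dim)
        self.v_proj = nn.Linear(model_dim, hidden_dim)
        # Per-(query, model) prediction heads for response quality and cost.
        self.quality_head = nn.Linear(hidden_dim, 1)
        self.cost_head = nn.Linear(hidden_dim, 1)

    def forward(self, query_emb: torch.Tensor, model_embs: torch.Tensor):
        # query_emb: (batch, query_dim); model_embs: (num_models, model_dim)
        q = self.q_proj(query_emb)                      # (batch, hidden)
        k = self.k_proj(model_embs)                     # (num_models, hidden)
        v = self.v_proj(model_embs)                     # (num_models, hidden)
        # Single attention head: each query attends over the candidate models.
        scores = q @ k.t() / k.shape[-1] ** 0.5         # (batch, num_models)
        attn = torch.softmax(scores, dim=-1)
        # Fuse attended model context back into each query-model pair.
        fused = q.unsqueeze(1) + attn.unsqueeze(-1) * v.unsqueeze(0)
        quality = self.quality_head(fused).squeeze(-1)  # (batch, num_models)
        cost = self.cost_head(fused).squeeze(-1)        # (batch, num_models)
        return quality, cost


if __name__ == "__main__":
    router = CrossAttentionRouter(query_dim=768, model_dim=64)
    q_pred, c_pred = router(torch.randn(4, 768), torch.randn(6, 64))
    print(q_pred.shape, c_pred.shape)  # torch.Size([4, 6]) for both heads
```

Under this reading, routing reduces to picking, for each query, the candidate model whose predicted quality and cost maximize the chosen reward; the exponential reward function itself is not reproduced here since the abstract does not give its closed form.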