Attention, Please: Single-Head Cross-Attention for Unified LLM Routing
Abstract
The growing diversity of language models, ranging from lightweight, cost-efficient models to large, powerful, but expensive ones, has made dynamic model selection essential for scalable and cost-effective deployment. We propose a unified LLM routing framework that jointly models query and model embeddings using a single-head cross-attention mechanism. We evaluate the router's decision-making capabilities on RouterBench, a large-scale, publicly available dataset that enables evaluation across multiple LLM pools and domains. By capturing fine-grained query-model interactions, our router learns to predict both response quality and generation cost, outperforming existing predictive routers by up to 6.6\% in Average Improvement in Quality (AIQ) and 2.9\% in maximum performance. To better reflect the trade-off between performance and cost, we adopt a new exponential reward function with improved robustness. Our architecture is lightweight, generalizes well across domains, and is more efficient than existing routing architectures.
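For concreteness, the sketch below shows one plausible layout for a single-head cross-attention router that scores a query against a pool of candidate models and emits per-model quality and cost predictions. The class name `CrossAttentionRouter`, the hidden dimensionality, and the two linear prediction heads are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a single-head cross-attention router (PyTorch).
# Dimensions, layer names, and the two-head output design are assumptions
# made for illustration; the paper's architecture may differ in detail.
import torch
import torch.nn as nn


class CrossAttentionRouter(nn.Module):
    def __init__(self, query_dim: int, model_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Project query and model embeddings into a shared space.
        self.q_proj = nn.Linear(query_dim, hidden_dim)
        self.k_proj = nn.Linear(model_dim, hidden_dim)
        self.v_proj = nn.Linear(model_dim, hidden_dim)
        # Per-(query, model) prediction heads for response quality and cost.
        self.quality_head = nn.Linear(hidden_dim, 1)
        self.cost_head = nn.Linear(hidden_dim, 1)

    def forward(self, query_emb: torch.Tensor, model_embs: torch.Tensor):
        # query_emb: (batch, query_dim); model_embs: (num_models, model_dim)
        q = self.q_proj(query_emb)                      # (batch, hidden)
        k = self.k_proj(model_embs)                     # (num_models, hidden)
        v = self.v_proj(model_embs)                     # (num_models, hidden)
        # Single attention head: each query attends over the candidate models.
        scores = q @ k.t() / k.shape[-1] ** 0.5         # (batch, num_models)
        attn = torch.softmax(scores, dim=-1)
        # Fuse attended model context back into each query-model pair.
        fused = q.unsqueeze(1) + attn.unsqueeze(-1) * v.unsqueeze(0)
        quality = self.quality_head(fused).squeeze(-1)  # (batch, num_models)
        cost = self.cost_head(fused).squeeze(-1)        # (batch, num_models)
        return quality, cost


if __name__ == "__main__":
    router = CrossAttentionRouter(query_dim=768, model_dim=64)
    q_pred, c_pred = router(torch.randn(4, 768), torch.randn(6, 64))
    print(q_pred.shape, c_pred.shape)  # torch.Size([4, 6]) for both heads
```

Under this reading, routing reduces to picking, for each query, the candidate model whose predicted quality and cost maximize the chosen reward; the exponential reward function itself is not reproduced here since the abstract does not give its closed form.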