Towards Dynamic KV-Cache Compression: Fine-Grained Evaluation of Key and Value Ranks in LLMs
Abstract
Large language models rely on the KV-cache to avoid redundant computation during autoregressive decoding, but reading and writing the growing cache quickly overwhelms GPU memory bandwidth as context length increases. Recent studies therefore explore KV-cache compression; however, existing work overlooks either the data-dependent nature of key/value features or their layer-level differences. In this work, we propose a method that directly computes the optimal data-dependent compression of key and value activations via singular value decomposition during inference. Our approach is gradient-free and incremental, enabling independent per-layer decomposition with batched computation and low memory cost. Using this method, we conduct a comprehensive analysis across multiple models and datasets spanning diverse domains and languages, uncovering fine-grained patterns of KV-cache compressibility. Our method serves as a valuable evaluation tool that reveals how LLMs allocate their representational capacity, offering actionable insights for designing dynamic, data-aware KV-cache compression strategies for deployment.
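As a rough illustration of the kind of data-dependent rank measurement the abstract describes, the sketch below estimates how many singular values of a layer's cached key (or value) matrix are needed to retain most of its spectral energy. This is a minimal sketch, not the paper's method; the function name, tensor shapes, and the 99% energy threshold are illustrative assumptions.

```python
# Sketch: per-layer, data-dependent effective rank of KV-cache activations via SVD.
# All names, shapes, and thresholds are assumptions for illustration only.
import torch

def effective_rank(cache: torch.Tensor, energy: float = 0.99) -> int:
    """Smallest rank whose leading singular values retain `energy` of the
    total squared spectral energy of the cache matrix."""
    # cache: (num_tokens, hidden_dim) -- one layer's keys or values,
    # flattened over attention heads; cast to fp32 for a stable SVD.
    s = torch.linalg.svdvals(cache.float())                  # descending singular values
    cum = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)    # cumulative energy fraction
    return int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1

# Usage example with a random stand-in for a layer's key cache
# (2048 cached tokens, 1024 hidden dimensions).
keys = torch.randn(2048, 1024)
print("effective key rank at 99% energy:", effective_rank(keys))
```

Because each layer's cache is decomposed independently, such a measurement can be computed per layer and per input batch, which is what makes layer-level and data-dependent compressibility patterns observable in the first place.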