The Singular Anchor: First Token Dominance in Large Language Model Attention Sinks
Khurram Khalil
Abstract
Large language models (LLMs) rely on "attention sinks" (initial sequence tokens that accumulate disproportionate attention) for efficient context management. However, the formation and positional dominance of these natural sinks remain under-characterized. We present the first systematic empirical study of attention sink patterns across three LLM families (GPT-2, Llama, Mistral) and five text categories. Our analysis reveals that the absolute first token (P1) overwhelmingly serves as the dominant natural attention sink, attracting significantly more attention ($p < 0.001$, Cohen's $d > 6.0$) than subsequent initial tokens across all architectures. While P1 dominance is universal, its strength varies by model family, with Mistral exhibiting the strongest P1 reliance, and is significantly modulated by input characteristics: short texts elicit the strongest P1 attention and code texts the weakest. These findings challenge assumptions about distributed sink importance and provide foundational insights for designing efficient long-context models.