Regularizing the Entropy Landscape of Self-Attention: Towards a Soft Inductive Bias in LLMs
Abstract
Self-attention often looks under-utilized: many heads can be pruned with little loss. We revisit this inefficiency through the entropy landscape of multi-head attention and ask whether a {\em soft inductive bias} can steer optimization toward more useful regimes. To this end, we employ a head-wise entropy regularizer, with learnable per-head strengths and optional softmax temperatures, that penalizes only excess entropy. On decoder-only language models (e.g., GPT-2), this simple training-time prior improves perplexity by up to {\bf 20\%} without inference overhead. Internally, it reshapes the entropy profile: early layers shift toward lower entropy, the high-entropy tail disappears, and head-role heterogeneity increases, reducing overlap among heads. Our findings suggest that standard training induces a high-entropy attractor in early self-attention layers and convergence toward homogeneous attention values in deeper layers. A soft entropic bias gently redirects this path, transforming redundancy into functional specialization while keeping inference cost unchanged.
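To make the idea of a head-wise, training-time-only penalty concrete, the following is a minimal PyTorch sketch of one possible instantiation. It is not the paper's exact formulation: the names (\texttt{HeadwiseEntropyRegularizer}, \texttt{log\_strength}, \texttt{tau}) and the hinge form used to penalize only entropy above a threshold are illustrative assumptions.

\begin{verbatim}
# Illustrative sketch only: a head-wise entropy regularizer with learnable
# per-head strengths that penalizes only "excess" entropy above a threshold.
# The threshold `tau` and the hinge penalty are assumptions, not the paper's
# exact loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseEntropyRegularizer(nn.Module):
    def __init__(self, num_heads: int, tau: float = 1.0):
        super().__init__()
        # One learnable penalty strength per head, kept positive via softplus.
        self.log_strength = nn.Parameter(torch.zeros(num_heads))
        self.tau = tau  # entropy threshold; only entropy above tau is penalized

    def forward(self, attn_probs: torch.Tensor) -> torch.Tensor:
        # attn_probs: (batch, num_heads, query_len, key_len), rows sum to 1.
        eps = 1e-9
        entropy = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # (B, H, Q)
        excess = torch.clamp(entropy - self.tau, min=0.0)   # penalize only excess
        strength = F.softplus(self.log_strength)             # (H,) positive weights
        # Average excess entropy per head, weighted by that head's strength.
        return (strength * excess.mean(dim=(0, 2))).sum()
\end{verbatim}

In such a setup, the returned scalar would simply be added to the language-modeling loss during training; since the module plays no role at inference time, the deployed model incurs no additional cost.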