

San Diego Oral Session

Oral 4A Language Model 2

Exhibit Hall F,G,H

Moderators: Samet Oymak · Muhan Zhang

Thu 4 Dec 3:30 p.m. PST — 4:30 p.m. PST

Thu 4 Dec. 15:30 - 15:50 PST

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

David Chanin · James Wilken-Smith · Tomáš Dulka · Hardik Bhatnagar · Satvik Golechha · Joseph Bloom

Sparse Autoencoders (SAEs) aim to decompose the activation space of large language models (LLMs) into human-interpretable latent directions or features. As we increase the number of features in the SAE, hierarchical features tend to split into finer features (“math” may split into “algebra”, “geometry”, etc.), a phenomenon referred to as feature splitting. However, we show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get “absorbed” into their children features. We coin this phenomenon feature absorption, and show that it is caused by optimizing for sparsity in SAEs whenever the underlying features form a hierarchy. We introduce a metric to detect absorption in SAEs, and validate our findings empirically on hundreds of LLM SAEs. Our investigation suggests that varying SAE sizes or sparsity is insufficient to solve this issue. We discuss the implications of feature absorption in SAEs and some potential approaches to solve the fundamental theoretical issues before SAEs can be used for interpreting LLMs robustly and at scale.
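
The sparsity argument in this abstract can be illustrated with a hand-built toy case. The sketch below is not the authors' metric or code: the 2-D activation space, the two inputs, and both dictionaries are assumptions chosen for illustration. It compares a dictionary whose parent latent always fires with one where the child latent's decoder row has absorbed the parent direction. Both reconstruct the inputs exactly, but the absorbing one uses fewer active latents, which is what a sparsity penalty rewards.

```python
# Toy illustration (hypothetical setup, not the paper's metric or code).
# Activation space is 2-D: axis 0 = parent feature, axis 1 = child feature.
# The child never occurs without the parent, so the features form a hierarchy.
INPUTS = [(1.0, 0.0),   # parent only
          (1.0, 1.0)]   # parent and child

def decode(codes, decoder):
    """Reconstruct an activation as a code-weighted sum of decoder rows."""
    return tuple(sum(c * row[d] for c, row in zip(codes, decoder))
                 for d in range(2))

def summarize(encode, decoder):
    """Return (worst reconstruction error, mean number of active latents)."""
    worst, active = 0.0, 0
    for x in INPUTS:
        codes = encode(x)
        recon = decode(codes, decoder)
        worst = max(worst, max(abs(a - b) for a, b in zip(x, recon)))
        active += sum(1 for c in codes if c != 0.0)
    return worst, active / len(INPUTS)

# Faithful dictionary: latent 0 fires whenever the parent is present.
faithful = summarize(lambda x: [x[0], x[1]],
                     [(1.0, 0.0), (0.0, 1.0)])

# Absorbing dictionary: the child latent's decoder row also carries the parent
# direction, so the parent latent stays silent whenever the child is present.
absorbing = summarize(lambda x: [x[0] - x[1], x[1]],
                      [(1.0, 0.0), (1.0, 1.0)])

print(faithful)   # (0.0, 1.5)
print(absorbing)  # (0.0, 1.0)
```

In the absorbing dictionary the parent latent looks monosemantic on parent-only inputs yet fails to fire on inputs that also carry the child feature, which matches the failure mode the abstract describes.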

Thu 4 Dec. 15:50 - 16:10 PST

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu · Zekun Wang · Bo Zheng · Zeyu Huang · Kaiyue Wen · Songlin Yang · Rui Men · Le Yu · Fei Huang · Suozhi Huang · Dayiheng Liu · Jingren Zhou · Junyang Lin

Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find that this sparse gating mechanism mitigates massive activations and attention sinks, and enhances long-context extrapolation performance. We also release related code (https://github.com/qiuzh20/gatedattention) and models (https://huggingface.co/QwQZh/gatedattention) to facilitate future research. Furthermore, the most effective SDPA output gating is used in the Qwen3-Next models (https://huggingface.co/collections/Qwen/qwen3-next).
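
The central modification, a sigmoid gate applied to the SDPA output, can be sketched for a single causal head in plain Python. This is a minimal sketch and not the released implementation. The function name `gated_attention_head`, the gate weight matrix `Wg`, and the choice to compute the gate from the current token's input vector (one reading of "query-dependent") are assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_attention_head(X, Wq, Wk, Wv, Wg):
    """One causal attention head followed by an elementwise sigmoid gate.

    X is a list of token vectors; Wq, Wk, Wv, Wg are (head_dim x d_model).
    Wg is a hypothetical gate projection introduced for this sketch.
    """
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    head_dim = len(Wq)
    scale = 1.0 / math.sqrt(head_dim)
    outputs = []
    for i, q in enumerate(Q):
        # Standard scaled dot-product attention over positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(q, K[j]))
                  for j in range(i + 1)]
        weights = softmax(scores)
        attn = [sum(w * V[j][d] for j, w in enumerate(weights))
                for d in range(head_dim)]
        # Gate computed from the current token, applied after SDPA.
        gate = [sigmoid(g) for g in matvec(Wg, X[i])]
        outputs.append([g * a for g, a in zip(gate, attn)])
    return outputs

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(gated_attention_head(tokens, identity, identity, identity, zeros))
```

With `Wg` set to zero every gate value is 0.5, so the output is exactly half of the ungated attention output. Strongly negative gate logits push the output toward zero, which gives a token a way to contribute little without routing attention mass to a sink position.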

Thu 4 Dec. 16:10 - 16:30 PST

Superposition Yields Robust Neural Scaling

Yizhou Liu · Ziming Liu · Jeff Gore

The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law, that loss decreases as a power law with model size, remains unclear. We propose that representation superposition, meaning that LLMs represent more features than they have dimensions, can be a key contributor to loss and cause neural scaling. Based on Anthropic's toy model, we use weight decay to control the degree of superposition, allowing us to systematically study how loss scales with model size. When superposition is weak, the loss follows a power law only if data feature frequencies are power-law distributed. In contrast, under strong superposition, the loss generically scales inversely with model dimension across a broad class of frequency distributions, due to geometric overlaps between representation vectors. We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling inversely with model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.
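
The geometric intuition behind the inverse scaling with model dimension can be checked numerically. The sketch below is not the paper's toy model (it uses no training and no weight decay). It assumes independent, uniformly random unit vectors as stand-ins for feature representations, for which the expected squared overlap between two vectors in m dimensions is 1/m. The function names and sample sizes are illustrative choices.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a uniformly random direction by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_squared_overlap(num_vectors, dim, seed=0):
    """Average squared dot product over all distinct pairs of vectors."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    total, pairs = 0.0, 0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            dot = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            total += dot * dot
            pairs += 1
    return total / pairs

# 128 vectors in at most 64 dimensions: more "features" than dimensions.
for dim in (8, 16, 32, 64):
    overlap = mean_squared_overlap(num_vectors=128, dim=dim)
    print(dim, round(overlap * dim, 3))  # the product stays close to 1
```

Doubling the dimension roughly halves the mean squared overlap, so any loss term driven by this interference would shrink in proportion to 1/m under these assumptions.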