Learning In-context n-grams with Transformers: Near-stationarity of sub-n-grams and saddle-to-saddle dynamics
Nicolas Flammarion
Abstract
Transformers acquire in-context learning capabilities through a distinctive training pattern: prolonged plateaus followed by sudden drops in loss, during which specialized circuits such as induction heads emerge. To investigate the mechanisms underlying this phenomenon, we first analyze the loss landscape of transformers trained on next-token prediction to account for these plateaus. We then characterize the associated training dynamics that give rise to the formation of induction heads. In particular, we focus on learning in-context n-gram language models with a simplified two-layer attention-only transformer.
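To make the setup concrete, below is a minimal sketch (not the authors' code) of the kind of experiment the abstract describes: each training sequence is drawn from a freshly sampled in-context n-gram model (here n = 2 for simplicity), and a two-layer attention-only transformer is trained on next-token prediction. All names and hyperparameters (vocabulary size, sequence length, widths, step counts) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, DIM = 16, 64, 64  # illustrative choices, not the paper's

def sample_in_context_bigrams(batch):
    """Each sequence gets its own random bigram (n = 2) transition matrix."""
    trans = torch.distributions.Dirichlet(torch.ones(VOCAB)).sample((batch, VOCAB))
    seqs = torch.zeros(batch, SEQ_LEN, dtype=torch.long)
    seqs[:, 0] = torch.randint(VOCAB, (batch,))
    for t in range(1, SEQ_LEN):
        probs = trans[torch.arange(batch), seqs[:, t - 1]]
        seqs[:, t] = torch.multinomial(probs, 1).squeeze(-1)
    return seqs

class AttentionOnlyBlock(nn.Module):
    """Single-head causal attention with a residual connection; no MLP."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        attn = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
        return x + self.out(attn @ v)

class TwoLayerTransformer(nn.Module):
    """Two attention-only layers over token + positional embeddings."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(SEQ_LEN, DIM)
        self.blocks = nn.ModuleList([AttentionOnlyBlock(DIM) for _ in range(2)])
        self.head = nn.Linear(DIM, VOCAB, bias=False)

    def forward(self, idx):
        x = self.tok(idx) + self.pos(torch.arange(idx.shape[1], device=idx.device))
        for block in self.blocks:
            x = block(x)
        return self.head(x)

model = TwoLayerTransformer()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
for step in range(200):  # in this regime the loss typically plateaus before induction heads form
    seqs = sample_in_context_bigrams(64)
    logits = model(seqs[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), seqs[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the transition matrix changes from sequence to sequence, the model can only reduce the loss below the plateau by using the context itself to estimate the n-gram statistics, which is the behavior attributed to induction-head circuits in the abstract.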