
San Diego Oral

Learning Linear Attention in Polynomial Time

Morris Yau · Ekin Akyürek · Jiayuan Mao · Josh Tenenbaum · Stefanie Jegelka · Jacob Andreas

Upper Level Ballroom 20D
Thu 4 Dec 10:40 a.m. PST — 11 a.m. PST

Abstract:

Previous research has explored the expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the efficient learnability of Transformers from data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that learning the optimal multi-head linear attention can be recast as finding the optimal kernel predictor in a suitably defined reproducing kernel Hilbert space (RKHS). Moving to generalization, we construct an algorithm that, given a dataset, checks in polynomial time whether all best-fit multi-head linear attention networks on this data perform an identical computation, a powerful notion for out-of-distribution generalization. We empirically validate our theoretical findings on several canonical tasks: learning random linear attention networks, learning key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between the theoretical expressivity and the learnability of Transformer models.
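The sketch below is a minimal illustration of the kernel-regression view mentioned in the abstract, not the paper's algorithm: it assumes a particular teacher parameterization of single-layer multi-head linear attention and uses an explicit degree-3 monomial feature map (the finite-dimensional analogue of a polynomial-kernel RKHS), then fits the kernel predictor by ordinary least squares. All dimensions and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, heads = 3, 4, 2      # small dims so the cubic feature map stays tiny
N_train = 2000             # number of (sequence -> output) training examples

# Assumed teacher: single-layer multi-head linear attention, read out at the last token:
#   out(X) = sum_h sum_j (x_last @ Wq_h @ Wk_h.T @ x_j) * (x_j @ Wv_h)
Wq = rng.normal(size=(heads, d, d))
Wk = rng.normal(size=(heads, d, d))
Wv = rng.normal(size=(heads, d, d))

def linear_attention(X):
    """Teacher output at the last position of a sequence X of shape (n, d)."""
    q = X[-1]                                 # query token
    out = np.zeros(d)
    for h in range(heads):
        scores = (q @ Wq[h]) @ (X @ Wk[h]).T  # unnormalized linear-attention scores
        out += scores @ (X @ Wv[h])           # weighted sum of value vectors
    return out

def cubic_features(X):
    """Degree-3 monomial features of the flattened sequence.

    The last-position output of linear attention is a cubic polynomial in the
    entries of X, so it lies in the span of these features; least squares over
    them is a kernel predictor for the corresponding polynomial-kernel RKHS.
    """
    x = X.ravel()
    return np.einsum('i,j,k->ijk', x, x, x).ravel()

# Generate data from the teacher and fit the kernel predictor in closed form.
Xs = rng.normal(size=(N_train, n, d))
Phi = np.stack([cubic_features(X) for X in Xs])   # N_train x (n*d)^3
Y = np.stack([linear_attention(X) for X in Xs])   # N_train x d
W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)       # convex ERM over the feature map

# The learned predictor matches the attention network on fresh inputs.
X_test = rng.normal(size=(n, d))
print(linear_attention(X_test))
print(cubic_features(X_test) @ W)
```

Because the target function is exactly a cubic polynomial in the input entries, the least-squares fit over monomial features recovers it from generic data, which is the intuition behind reducing attention learning to polynomial-time kernel regression.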
