Poster
Fast Attention Requires Bounded Entries
Josh Alman · Zhao Song
Great Hall & Hall B1+B2 (level 1) #2003
Abstract:
In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B, B]^{n \times d}$, and the goal is to construct the matrix $\mathsf{Att}(Q,K,V) := \mathrm{diag}(A \mathbf{1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the `attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by \emph{implicitly} making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$. If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathsf{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error. If $d = O(\log n)$ and $B = \Theta(\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathsf{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2-\Omega(1)}$. This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
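For concreteness, here is a minimal NumPy sketch of the straightforward method described in the abstract; the function name `attention_naive` and the random bounded-entry inputs are illustrative choices, not from the paper. It materializes the full $n \times n$ matrix $A$, which is exactly the $\Omega(n^2)$ bottleneck under discussion.

import numpy as np

def attention_naive(Q, K, V):
    # Straightforward O(n^2 d) attention, following the abstract's definition:
    # Att(Q, K, V) = diag(A 1_n)^{-1} A V, where A = exp(Q K^T / d) entry-wise.
    n, d = Q.shape
    A = np.exp(Q @ K.T / d)          # n x n attention matrix: the quadratic bottleneck
    D_inv = 1.0 / A.sum(axis=1)      # the diagonal of diag(A 1_n)^{-1}, as a vector
    return D_inv[:, None] * (A @ V)  # row-normalized product, an n x d matrix

# Hypothetical usage with entries bounded by B, mirroring the paper's setup
rng = np.random.default_rng(0)
n, d, B = 512, 16, 1.0
Q, K, V = (rng.uniform(-B, B, (n, d)) for _ in range(3))
out = attention_naive(Q, K, V)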
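The paper's $n^{1+o(1)}$ algorithm works by replacing the entry-wise $\exp$ with a low-degree polynomial approximation, which makes $A$ (approximately) low-rank, so $AV$ and $A\mathbf{1}_n$ can be formed without materializing $A$. The sketch below is only a crude illustration of that idea using the degree-1 truncation $\exp(x) \approx 1 + x$; the paper's actual algorithm uses a higher-degree polynomial and a careful error analysis, and the helper name here is hypothetical.

import numpy as np

def attention_low_rank_sketch(Q, K, V):
    # Illustrative sketch (NOT the paper's algorithm): approximate exp by its
    # degree-1 Taylor truncation, so A ~ J + Q K^T / d has rank <= d + 1.
    # Then A V and A 1_n are computable in O(n d^2) time with no n x n matrix.
    n, d = Q.shape
    KtV = K.T @ V                         # d x d, computed once
    AV = V.sum(axis=0) + (Q @ KtV) / d    # row i: sum_j V_j + q_i (K^T V) / d
    A1 = n + (Q @ K.sum(axis=0)) / d      # row sums of the approximate A
    return AV / A1[:, None]

This truncation is only sensible when the entries of $QK^\top/d$ are small, which is precisely the bounded-entry regime $B = o(\sqrt{\log n})$ where the paper's upper bound applies.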