NeurIPS Poster Fast Attention Requires Bounded Entries

Josh Alman · Zhao Song

Great Hall & Hall B1+B2 (level 1) #2003

Abstract: In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V) := \mathrm{diag}(A {\bf 1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the 'attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$.

$\bullet$ If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error.

$\bullet$ If $d = O(\log n)$ and $B = \Theta(\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2 - \Omega(1)}$.

This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
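For concreteness, the "straightforward method" mentioned in the abstract can be sketched as follows. This is only an illustration of the definition $\mathrm{Att}(Q,K,V) = \mathrm{diag}(A {\bf 1}_n)^{-1} A V$ with $A = \exp(QK^\top/d)$, not the subquadratic algorithm from the paper. The function name `naive_attention` and the list-of-lists matrix representation are assumptions made for this example.

```python
import math


def naive_attention(Q, K, V):
    """Quadratic-time baseline: diag(A 1_n)^{-1} A V with A = exp(Q K^T / d).

    Q, K, V are n x d matrices given as lists of rows (a hypothetical
    representation chosen to keep the sketch dependency-free).
    """
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Row i of the attention matrix A: n entries, each an O(d) inner product,
        # so building all of A costs on the order of n^2 * d operations.
        row = [
            math.exp(sum(Q[i][t] * K[j][t] for t in range(d)) / d)
            for j in range(n)
        ]
        row_sum = sum(row)  # i-th entry of the vector A 1_n
        # Row i of A V, rescaled by 1 / (A 1_n)_i.
        out.append(
            [sum(row[j] * V[j][c] for j in range(n)) / row_sum
             for c in range(len(V[0]))]
        )
    return out
```

Each output row is a weighted average of the rows of $V$, with weights given by the normalized row of $A$. Since every entry of $Q$ and $K$ lies in $[-B, B]$, each exponent satisfies $|\langle q_i, k_j \rangle| / d \le B^2$, so every entry of $A$ lies in $[e^{-B^2}, e^{B^2}]$; for $B = o(\sqrt{\log n})$ this range is $n^{o(1)}$.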
