Skip to yearly menu bar Skip to main content

Workshop: All Things Attention: Bridging Different Perspectives on Attention

Foundations of Attention Mechanisms in Deep Neural Network Architectures

Pierre Baldi · Roman Vershynin

Keywords: [ transformers ] [ gating ] [ capacity ] [ attention capacity ] [ attention mechanisms taxonomy ] [ Attention Mechanisms ] [ foundations of attention ]

Abstract: We consider the foundations of attention mechanisms in deep neural network architectures and present three main results. First, we provide a systematic taxonomy of all possible attention mechanisms within, or as extensions of, the McCulloch and Pitt standard model into 18 classes depending on the origin type of the attention signal, the target type of the attention signal, and whether the interaction type is additive or multiplicative. Second, using this taxonomy, we identify three key attention mechanisms: output gating, synaptic gating, and multiplexing. Output gating and synaptic gating are extensions of the standard model and all current attention-based architectures, including transformers, use either output gating or synaptic gating, or a combination of both. Third, we develop a theory of attention capacity and derive mathematical results about the capacity of basic attention networks. For example, the output gating of a linear threshold gate of $n$ variables by another linear threshold gate of the same $n$ variables has capacity $2n^2 (1+o(1))$. Perhaps surprisingly, multiplexing attention is used in the proofs of these results. Synaptic and output gating provide computationally efficient extensions of the standard model allowing for {\it sparse} quadratic activation functions. They can also be viewed as primitives enabling the concise collapsing of multiple layers of processing in the standard model.

Chat is not available.