The quadratic computational and memory complexity of the Transformer's attention mechanism has limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. The packed sequence is then unpacked using the second attention function. Compared with a more traditional attention mechanism, Luna introduces an additional fixed-length sequence as input and an additional corresponding output, which allows Luna to perform attention operations in linear time while still storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods, including full-rank attention and other efficient sparse and dense attention methods.
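The pack-and-unpack pattern sketched in the abstract can be illustrated in a few lines. The following is a minimal, single-head NumPy sketch, not the paper's exact formulation: the names (attend, luna_style_attention, proj sequence p), the absence of learned projections, multiple heads, and causal masking, and the choice of plain softmax attention for both steps are simplifying assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Standard scaled dot-product attention: q (m, d), k/v (n, d) -> (m, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def luna_style_attention(x, p):
    """x: (n, d) input sequence; p: (l, d) extra fixed-length sequence, l << n."""
    packed = attend(p, x, x)              # pack: O(l * n) time, output (l, d)
    unpacked = attend(x, packed, packed)  # unpack: O(n * l) time, output (n, d)
    return unpacked, packed               # packed serves as the extra output sequence

n, l, d = 1024, 16, 64
x = np.random.randn(n, d)
p = np.random.randn(l, d)
y, p_next = luna_style_attention(x, p)
print(y.shape, p_next.shape)  # (1024, 64) (16, 64)
```

Because the extra sequence has a fixed length l, each of the two attention calls costs O(n * l) rather than O(n^2), which is the source of the linear time and space complexity claimed above.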
Author Information
Xuezhe Ma (University of Southern California)
Xiang Kong (Carnegie Mellon University)
Sinong Wang (Facebook AI)
Chunting Zhou (Language Technologies Institute, Carnegie Mellon University)
Jonathan May (University of Southern California)
Hao Ma (Facebook AI)
Luke Zettlemoyer (University of Washington and Facebook)
More from the Same Authors
- 2022 Workshop: Transfer Learning for Natural Language Processing »
  Alon Albalak · Colin Raffel · Chunting Zhou · Deepak Ramachandran · Xuezhe Ma · Sebastian Ruder
- 2022 Poster: GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale »
  Tim Dettmers · Mike Lewis · Younes Belkada · Luke Zettlemoyer
- 2022 Poster: Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models »
  Kushal Tirumala · Aram Markosyan · Luke Zettlemoyer · Armen Aghajanyan
- 2022 Poster: Improving Policy Learning via Language Dynamics Distillation »
  Victor Zhong · Jesse Mu · Luke Zettlemoyer · Edward Grefenstette · Tim Rocktäschel
- 2021: Panel Discussion »
  Pascal Poupart · Ali Ghodsi · Luke Zettlemoyer · Sameer Singh · Kevin Duh · Yejin Choi · Lu Hou
- 2021: Toward Efficient Training of Large Language Models with Balanced Conditional Compute »
  Luke Zettlemoyer
- 2021 Poster: SILG: The Multi-domain Symbolic Interactive Language Grounding Benchmark »
  Victor Zhong · Austin W. Hanjie · Sida Wang · Karthik Narasimhan · Luke Zettlemoyer
- 2020: Invited talk - De-noising Sequence-to-Sequence Pre-training »
  Luke Zettlemoyer
- 2020 Poster: Pre-training via Paraphrasing »
  Mike Lewis · Marjan Ghazvininejad · Gargi Ghosh · Armen Aghajanyan · Sida Wang · Luke Zettlemoyer
- 2017: End-to-end Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond »
  Luke Zettlemoyer
- 2008 Poster: Multi-Agent Filtering with Infinitely Nested Beliefs »
  Luke Zettlemoyer · Brian Milch · Leslie Kaelbling
- 2008 Spotlight: Multi-Agent Filtering with Infinitely Nested Beliefs »
  Luke Zettlemoyer · Brian Milch · Leslie Kaelbling