Efficient Transformers: State of the art in pruning, sparse attention, and transformer funneling
Abstract
Transformer architectures consume the lion's share of the computational budgets behind today's most powerful language and vision models, making research into their efficiency both active and essential. Our proposed tutorial surveys the state of the art in three complementary research threads that together form a significant part of the current industrial toolkit for efficient Transformers: (1) pruning, the structured or unstructured removal of weights, layers, and attention heads; (2) sparse attention and routing, including block-sparse, sliding-window, and locality-sensitive-hashing attention; and (3) funneling, which pools intermediate representations to shorten sequences as depth increases. The tutorial will then feature an expert panel of industrial and academic speakers from Google DeepMind, MIT, UC Berkeley, and Columbia, discussing the latest trends in top industrial labs. Attendees will leave with actionable recipes for building sub-10B-parameter models that match or exceed dense baselines on language, vision, and multimodal benchmarks.
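To make the three threads concrete, the minimal PyTorch sketch below is our own illustration (the helper names `magnitude_prune`, `sliding_window_mask`, and `funnel_pool` are hypothetical, not taken from any specific library): unstructured magnitude pruning, a sliding-window attention mask, and a funnel-style pooling step. The tutorial itself covers production-grade variants of each.

```python
import torch
import torch.nn.functional as F

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Unstructured pruning: zero out the smallest-magnitude fraction of weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Sparse attention: boolean mask letting position i attend only within +/- `window`."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def funnel_pool(hidden: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Funneling: shorten the sequence by mean-pooling `stride` adjacent token states."""
    batch, seq_len, d_model = hidden.shape
    pad = (-seq_len) % stride  # pad so seq_len divides evenly by stride
    if pad:
        hidden = F.pad(hidden, (0, 0, 0, pad))
    return hidden.reshape(batch, -1, stride, d_model).mean(dim=2)

# Toy usage: prune a weight matrix, build an 8-token window mask, halve a sequence.
w_sparse = magnitude_prune(torch.randn(64, 64), sparsity=0.9)
mask = sliding_window_mask(seq_len=8, window=2)        # shape (8, 8), bool
pooled = funnel_pool(torch.randn(1, 8, 16), stride=2)  # shape (1, 4, 16)
```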
The tutorial targets researchers and practitioners who build or deploy Transformer models and assumes familiarity with basic deep-learning concepts but not with any specific efficiency method. All slides and publication materials will be released under a permissive license.
Schedule
| Time | Session |
| --- | --- |
| 11:00 AM | |
| 11:30 AM | |