Poster

Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Andres Potapczynski ⋅ Shikai Qiu ⋅ Marc Finzi ⋅ Christopher Ferri ⋅ Charlie Chen ⋅ Micah Goldblum ⋅ C. Bayan Bruss ⋅ Christopher De Sa ⋅ Andrew Wilson

2024 Poster


Abstract

Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts to develop alternatives have focused on a small number of hand-crafted structured matrices, and have neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, and Monarch, along with many novel structures. We develop a taxonomy of all such operators based on their computational and algebraic properties, which provides insights into their scaling laws. Combining these insights with empirical evaluation, we identify a subset of structures that achieve equal or better performance than dense layers as a function of training compute. To further improve their compute efficiency, we develop a natural extension of these performant structures that converts them into a sparse Mixture-of-Experts layer. The resulting layer significantly outperforms dense layers in compute-optimal training efficiency for GPT-2 language models.
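To make "linear operators expressible via an Einstein summation" concrete, the following minimal sketch writes two of the structures named above, low-rank and Kronecker, as explicit index contractions and checks each against the dense matrix it represents. It is an illustration only, not the authors' implementation: the function names, the sizes (`n = 16`, `rank = 2`, `m = 4`), and the multiply-add counts for this particular contraction order are assumptions chosen for the example, and the paper's general parametrization of the search space is not reproduced here.

```python
import random

def dense_apply(W, x):
    """y_j = sum_i x_i W_ij  (einsum 'i,ij->j')."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def low_rank_apply(U, V, x):
    """y_j = sum_{i,r} x_i U_ir V_rj  (einsum 'i,ir,rj->j')."""
    # Contract over the input index i first, then over the rank index r.
    h = [sum(x[i] * U[i][r] for i in range(len(x))) for r in range(len(V))]
    return [sum(h[r] * V[r][j] for r in range(len(V))) for j in range(len(V[0]))]

def kronecker_apply(A, B, x):
    """y_cd = sum_{a,b} x_ab A_ac B_bd  (einsum 'ab,ac,bd->cd').

    The flat input index is split as i = a * nb + b, and the flat
    output index as j = c * nd + d.
    """
    na, nc, nb, nd = len(A), len(A[0]), len(B), len(B[0])
    # Contract over b first: t[a][d] = sum_b x_ab B_bd.
    t = [[sum(x[a * nb + b] * B[b][d] for b in range(nb)) for d in range(nd)]
         for a in range(na)]
    # Then contract over a: y_cd = sum_a t[a][d] A_ac.
    return [sum(t[a][d] * A[a][c] for a in range(na))
            for c in range(nc) for d in range(nd)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes for illustration; n must equal m * m for the Kronecker case.
rng = random.Random(0)
n, rank, m = 16, 2, 4
x = [rng.uniform(-1, 1) for _ in range(n)]
U, V = rand_matrix(n, rank, rng), rand_matrix(rank, n, rng)
A, B = rand_matrix(m, m, rng), rand_matrix(m, m, rng)

# Materialize the dense n-by-n matrix that each structure represents.
W_low_rank = [[sum(U[i][r] * V[r][j] for r in range(rank)) for j in range(n)]
              for i in range(n)]
W_kron = [[A[i // m][j // m] * B[i % m][j % m] for j in range(n)]
          for i in range(n)]

# The structured contraction and the dense product should agree up to rounding.
err_low_rank = max(abs(p - q) for p, q in
                   zip(dense_apply(W_low_rank, x), low_rank_apply(U, V, x)))
err_kron = max(abs(p - q) for p, q in
               zip(dense_apply(W_kron, x), kronecker_apply(A, B, x)))

# Parameter and multiply-add counts per input vector for these sizes.
params = {"dense": n * n, "low_rank": 2 * n * rank, "kronecker": 2 * m * m}
macs = {"dense": n * n, "low_rank": 2 * n * rank, "kronecker": 2 * m ** 3}

print(err_low_rank < 1e-9, err_kron < 1e-9)
print(params, macs)
```

In this toy setting each structured layer computes the same map as its dense counterpart while using fewer parameters and fewer multiply-adds, which is the kind of trade-off the abstract's compute-focused comparison is concerned with. Whether a given structure actually matches dense layers at scale is an empirical question that this sketch does not address.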
