Poster
Building on Efficient Foundations: Effective Training of LLMs with Structured Feedforward Layers
Xiuying Wei · Skander Moalla · Razvan Pascanu · Caglar Gulcehre
East Exhibit Hall A-C #2010
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called self-guided training, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. We also examine the scaling behavior of structured matrices, which show steeper curves when scaling training FLOPs, along with a favorable trend in the overtraining regime. Specifically, we show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off. Our code is available at https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.
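To make the two structured parameterizations mentioned above concrete, the sketch below shows a low-rank and a block-diagonal replacement for a dense FFN projection in PyTorch. This is a minimal illustrative example, not the authors' implementation; all class names, shapes, and hyperparameters here are assumptions, and the official code is in the linked repository.

```python
# Minimal sketch (PyTorch) of structured FFN projections: not the paper's code.
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Replaces a dense d_in x d_out weight with two rank-r factors: y = V(U(x))."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.U = nn.Linear(d_in, rank, bias=False)
        self.V = nn.Linear(rank, d_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.V(self.U(x))


class BlockDiagonalLinear(nn.Module):
    """Splits features into `blocks` groups and applies an independent dense map per group."""

    def __init__(self, d_in: int, d_out: int, blocks: int):
        super().__init__()
        assert d_in % blocks == 0 and d_out % blocks == 0
        self.blocks = blocks
        self.maps = nn.ModuleList(
            nn.Linear(d_in // blocks, d_out // blocks, bias=False) for _ in range(blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = x.chunk(self.blocks, dim=-1)
        return torch.cat([m(c) for m, c in zip(self.maps, chunks)], dim=-1)


class StructuredFFN(nn.Module):
    """Transformer FFN with its up/down projections swapped for structured layers."""

    def __init__(self, d_model: int, d_hidden: int, rank: int):
        super().__init__()
        self.up = LowRankLinear(d_model, d_hidden, rank)
        self.down = LowRankLinear(d_hidden, d_model, rank)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))


# Usage example with illustrative dimensions.
ffn = StructuredFFN(d_model=1024, d_hidden=4096, rank=256)
out = ffn(torch.randn(2, 16, 1024))  # (batch, sequence, d_model)
```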