Workshop: Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023 (FL@FM-NeurIPS'23)

Parameter Averaging Laws for Multitask Language Models

Woojin Chung · Hyowon Cho · James Thorne · Se-Young Yun

Keywords: [ multilingual language model ] [ multitask language model ] [ pretrained models ] [ parameter averaging ] [ federated learning ]


Parameter-averaging, a method for combining multiple models into a single one, has emerged as a promising approach to enhance performance without requiring additional space or retraining. Nonetheless, the conditions under which parameter-averaging succeeds remain undefined and call for further characterization. In this study, we empirically investigate the factors that influence successful parameter-averaging and reveal positive correlations between representation power and the performance gain of parameter-averaging. Specifically, we evaluate how computational budget, data diversity, and vocabulary size contribute to representation power, and how they influence the success of parameter-averaging. Our results demonstrate that parameter-averaging improves generalization on both in-domain and out-of-domain data. Additionally, to reduce the computational cost of parameter-averaging, we introduce partial averaging, which assumes arbitrary participation of a subset of contributors. We observe that partial averaging outperforms fine-tuning for models with sufficient representation power. Furthermore, we find that the impact of data heterogeneity, which arises from the different data distributions of contributors, diminishes as the representation power of the model increases. These findings provide valuable insights into the principles governing parameter-averaging and its potential for enhancing model performance.
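
To make the two operations named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes every contributor's model has the same parameter names and shapes, stored here as a hypothetical `StateDict` mapping names to flat lists of floats. The helper names `average_parameters` and `partial_average` are illustrative, and the uniform random choice of participants is only one possible reading of "arbitrary participation of a subset of contributors".

```python
import random
from typing import Dict, List, Optional, Sequence

# Hypothetical layout: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def average_parameters(models: Sequence[StateDict],
                       weights: Optional[Sequence[float]] = None) -> StateDict:
    """Element-wise (optionally weighted) average of models with identical keys and shapes."""
    if not models:
        raise ValueError("need at least one model")
    if weights is None:
        weights = [1.0] * len(models)  # plain mean by default
    if len(weights) != len(models):
        raise ValueError("one weight per model is required")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    averaged: StateDict = {}
    for name, reference in models[0].items():
        acc = [0.0] * len(reference)
        for model, weight in zip(models, weights):
            values = model[name]
            if len(values) != len(reference):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            for i, value in enumerate(values):
                acc[i] += weight * value
        averaged[name] = [x / total for x in acc]
    return averaged


def partial_average(models: Sequence[StateDict],
                    num_participants: int,
                    seed: int = 0) -> StateDict:
    """Average only a subset of contributors (assumption: sampled uniformly at random)."""
    rng = random.Random(seed)
    chosen = rng.sample(list(models), num_participants)
    return average_parameters(chosen)
```

For example, `average_parameters([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])` combines two contributors into a single model of the same size, which is why averaging needs no additional storage at inference time. `partial_average` applies the same step to fewer contributors, so fewer models have to be collected and summed.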
