

Poster

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Zichun Yu · Spandan Das · Chenyan Xiong


Abstract:

Selecting high-quality pretraining data can significantly enhance the pretraining efficiency and effectiveness of large language models. Current data selection methods, whether they rely on hand-crafted rules or on larger reference models, operate statically and do not capture the evolving data preferences of the model during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the main pretraining model, thus selecting the data most effective for the model's current learning progress. Specifically, we use a small data influence model to approximate oracle data preference signals collected by locally probing with the main model, and to select more effective pretraining data for the next pretraining stage. Experiments on Pythia and the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks, doubling the gains obtained by recent LLM-based data selection approaches in both zero- and few-shot settings, while halving the total FLOPs required to reach certain performance levels. Further analysis validates the ever-changing data preferences of the pretraining models and the effectiveness of our data influence models in capturing them. We will open-source our code, data, and model checkpoints via GitHub.
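
The stage-wise loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a one-parameter linear regressor stands in for the main pretraining model, a two-feature least-squares fit stands in for the small data influence model, and the oracle signal is assumed to be the drop in reference-set loss after one gradient step on a candidate example (the abstract only says "locally probing"). All function names and the synthetic data are hypothetical.

```python
import random

def ref_loss(w, ref):
    """Mean squared error of the toy main model y = w * x on a reference set."""
    return sum((w * x - y) ** 2 for x, y in ref) / len(ref)

def sgd_step(w, example, lr=0.05):
    """One gradient step of the toy main model on a single (x, y) example."""
    x, y = example
    return w - lr * 2.0 * (w * x - y) * x

def probe_influence(w, example, ref):
    """Assumed oracle signal: reference-loss drop after one step on `example`."""
    return ref_loss(w, ref) - ref_loss(sgd_step(w, example), ref)

def features(example):
    """Hand-picked features for the stand-in influence model (an assumption)."""
    x, y = example
    return (x * x, x * y)

def fit_influence_model(examples, targets, ridge=1e-8):
    """Ridge-regularized least squares on two features, solved in closed form."""
    a = b = d = e = f = 0.0
    for ex, t in zip(examples, targets):
        f1, f2 = features(ex)
        a += f1 * f1
        b += f1 * f2
        d += f2 * f2
        e += f1 * t
        f += f2 * t
    a += ridge
    d += ridge
    det = a * d - b * b
    return ((d * e - b * f) / det, (a * f - b * e) / det)

def predict(theta, example):
    """Predicted influence of one example under the fitted influence model."""
    f1, f2 = features(example)
    return theta[0] * f1 + theta[1] * f2

def make_toy_data(seed=0, n=100):
    """Synthetic pool: n useful examples (y = 2x) and n harmful ones (y = -3x)."""
    rng = random.Random(seed)
    clean = [(x, 2.0 * x) for x in (rng.uniform(-1, 1) for _ in range(n))]
    junk = [(x, -3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(n))]
    ref = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]
    return clean + junk, ref

def run_stages(pool, ref, stages=3, probe_size=20, select_k=10, seed=0):
    """Alternate between probing, refitting the influence model, and training."""
    rng = random.Random(seed)
    pool = list(pool)
    w = 0.0
    history = [ref_loss(w, ref)]
    for _ in range(stages):
        # 1. Probe a small sample with the current main model.
        probe = rng.sample(pool, probe_size)
        targets = [probe_influence(w, ex, ref) for ex in probe]
        # 2. Refit the influence model to the current model's preferences.
        theta = fit_influence_model(probe, targets)
        # 3. Score the whole pool cheaply and keep the top-k examples.
        ranked = sorted(pool, key=lambda ex: predict(theta, ex), reverse=True)
        selected, pool = ranked[:select_k], ranked[select_k:]
        # 4. Train the main model on the selected data for this stage.
        for ex in selected:
            w = sgd_step(w, ex)
        history.append(ref_loss(w, ref))
    return w, history

if __name__ == "__main__":
    pool, ref = make_toy_data()
    w, history = run_stages(pool, ref)
    print("final w:", w)
    print("reference loss per stage:", history)
```

The point of the design is that the expensive probe runs only on a small sample each stage, while the cheap fitted model scores the full pool. Refitting at every stage is what makes the selection depend on the main model's current state rather than on a fixed, static score.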
