Train-before-Test Harmonizes Language Model Rankings
Abstract
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that capture similar skills. This hampers model selection and adds confusion to the growing ecosystem of competing models. We propose a fundamental shift in evaluation methodology: rather than measuring out-of-the-box performance, we assess model potential, that is, the performance achievable after task-specific fine-tuning. Our train-before-test approach provides each model with identical benchmark-specific fine-tuning prior to evaluation. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. While traditional rankings show little external validity under direct evaluation, they enjoy significant external validity with train-before-test: model potential rankings transfer gracefully between benchmarks. Second, train-before-test restores the connection between perplexity and downstream task performance. For base models, even pre-fine-tuning perplexity predicts post-fine-tuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by a single latent factor.