Train-before-Test Harmonizes Language Model Rankings
Abstract
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that capture similar skills. This hampers model selection and adds confusion to the growing ecosystem of competing models. We propose a fundamental shift in evaluation methodology: rather than measuring out-of-the-box performance, we assess model potential, that is, the performance achievable after task-specific fine-tuning. Our train-before-test approach provides each model with identical benchmark-specific fine-tuning prior to evaluation. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. While traditional rankings show little external validity under direct evaluation, they enjoy significant external validity with train-before-test: model potential rankings transfer gracefully between benchmarks. Second, train-before-test restores the connection between perplexity and downstream task performance. For base models, even pre-fine-tuning perplexity predicts post-fine-tuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by a single latent factor.