The Contamination Paradox: Why Test Set Leakage Can Be Both Potent and Negligible
Abstract
Accurately evaluating the capabilities of large language models is critical for machine learning research and society alike, but is undermined by leakage of benchmark test data into pretraining corpora. Both circumstantial and causal evidence demonstrate that benchmark performance increases with model size and with the number of benchmark replicas in pretraining corpora. However, recent work by Bordt et al. (2025) demonstrated that test set contamination has little-to-no impact in the "overtrained" regime common to frontier AI systems, raising an apparent paradox: how can test set leakage be both potent and negligible? We resolve this paradox with a simple explanation: a language model memorizes a benchmark test set based on its capacity (number of parameters) and its incentive (the relative training-loss reduction from memorizing test data). We introduce a novel dose-response framework that quantitatively relates the "response" of benchmark performance to the "dose" of benchmark tokens contaminating the pretraining data, mediated by model size. This framework allows us to extract precise scaling relationships that clarify the effect of test set contamination on model performance.