Reviving Legacy Scientific Reasoning Benchmarks by Growing Perturbations
Abstract
Capability evaluation of large language models is increasingly shadowed by rising concerns of data contamination, which cast doubt on whether static legacy benchmarks measure genuine reasoning or mere memorization. Benchmark perturbation emerges as a new frontier for dynamic evaluation by systematically modifying benchmark questions, such as altering numerical values in math problems, varying narrative contexts, or introducing adversarial triggers. This approach offers a promising way to distinguish authentic understanding from pattern matching. We propose to study benchmark perturbation in a unified framework, comparing the efficacy of various perturbation techniques on a collection of legacy benchmarks adopted by leading model developers. We further plan to modularize these perturbation techniques to enable plug-and-play refreshing of benchmarks at scale for future evaluation. We aim to move beyond simple data collection for LLM benchmarks and ultimately shift the paradigm toward contamination-resistant dynamic evaluation.
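As a concrete illustration of the plug-and-play refreshing described above, the sketch below shows a minimal perturbation interface in Python. The names (`Perturbation`, `NumericPerturbation`, `refresh`) and the numeric-shift rule are illustrative assumptions, not the proposal's actual framework; a real system would also recompute reference answers after perturbation.

```python
# A minimal sketch of a plug-and-play perturbation interface.
# Class and function names are hypothetical, not the proposal's implementation.
import random
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    """A single benchmark question paired with its reference answer."""
    question: str
    answer: str


class Perturbation(ABC):
    """Interface every perturbation module implements, enabling plug-and-play use."""

    @abstractmethod
    def apply(self, item: BenchmarkItem, rng: random.Random) -> BenchmarkItem:
        ...


class NumericPerturbation(Perturbation):
    """Alter numerical values in the question; the answer must be recomputed downstream."""

    def apply(self, item: BenchmarkItem, rng: random.Random) -> BenchmarkItem:
        def shift(match: re.Match) -> str:
            # Replace each integer with a slightly shifted value.
            return str(int(match.group()) + rng.randint(1, 9))

        perturbed = re.sub(r"\d+", shift, item.question)
        # Placeholder answer: a solver or answer template must regenerate it.
        return BenchmarkItem(question=perturbed, answer="<recompute>")


def refresh(items: list[BenchmarkItem],
            perturbations: list[Perturbation],
            seed: int = 0) -> list[BenchmarkItem]:
    """Apply each registered perturbation in sequence to every benchmark item."""
    rng = random.Random(seed)
    refreshed = []
    for item in items:
        for p in perturbations:
            item = p.apply(item, rng)
        refreshed.append(item)
    return refreshed


if __name__ == "__main__":
    legacy = [BenchmarkItem("A train travels 60 km in 2 hours. What is its speed?", "30 km/h")]
    print(refresh(legacy, [NumericPerturbation()]))
```

Under this design, adding a new perturbation type (for example, narrative-context rewriting or adversarial trigger insertion) only requires implementing another `Perturbation` subclass, so a legacy benchmark can be refreshed by composing modules without touching the evaluation harness.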