Poster Wed, Dec 3, 2025 • 11:00 AM – 2:00 PM PST Exhibit Hall C,D,E #5500

Position: Benchmarking is Broken - Don't Let AI be Its Own Judge

Zerui Cheng · Stella Wohnig · Ruchika Gupta · Samiul Alam · Tassallah Abdullahi · João Alves Ribeiro · Christian Nielsen-Garcia · Saif Mir · Siran Li · Jason Orender · Seyed Ali Bahrainian · Daniel Kirste · Aaron Gokaslan · Carsten Eickhoff · Ruben Wolff


Abstract

The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that a laissez-faire approach is untenable. For true and sustainable AI advancement, we call for a paradigm shift to a unified, live, and quality-controlled benchmarking framework that is robust by construction rather than reliant on courtesy or goodwill. Accordingly, we dissect the systemic flaws undermining today’s evaluation ecosystem and distill the essential requirements for next-generation assessments. To concretize this position, we introduce the idea of PeerBench, a community-governed, proctored evaluation blueprint that seeks to improve security and credibility through sealed execution, item banking with rolling renewal, and delayed transparency. PeerBench is presented as a complementary, certificate-grade layer alongside open benchmarks, not a replacement.
We discuss trade-offs and limits and call for further research on mechanism design, governance, and reliability guarantees. Our goal is to lay the groundwork for evaluations that restore integrity and deliver genuinely trustworthy measures of AI progress.
