Poster in Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

Prompt Genotyping: Quantifying the Evaluation Gap Between Synthetic Benchmarks and Real LLM Performance

Sohum Mehta ⋅ Saaketh Bhojanam

Abstract

LLM evaluation relies heavily on synthetic benchmarks, but how well do these predict real-world performance? We introduce Prompt Genotyping, a framework that treats prompts as measurable "genomes" of 14 linguistic features to predict LLM "phenotypes" (performance outcomes). Using 1,112 real prompt-response pairs from MT-Bench and HELM plus 1,388 synthetic controls, we reveal a dramatic predictability gap: surface features explain 86% of variance on algorithmic labels (R² = 0.86 ± 0.02) but achieve worse-than-random performance on authentic GPT-4o-mini outputs (R² = -0.134). This R² gap of roughly 1.0 quantifies a fundamental challenge in LLM evaluation methodology: optimizing for synthetic benchmarks may not generalize to deployment scenarios. We establish the first leakage-free baseline for prompt failure prediction (F1 = 0.56, AUC = 0.65) and release comprehensive evaluation resources to advance systematic, data-driven prompt assessment.
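To make the negative R² concrete, here is a minimal sketch of the standard coefficient of determination, R² = 1 - SS_res / SS_tot. It is not the authors' evaluation code. The toy scores and the names `r_squared`, `mean_baseline`, and `misleading` are assumptions invented for illustration. The sketch shows that a predictor that always outputs the mean scores R² = 0, so a negative value means the model does worse than that constant baseline. It also computes the gap between the two R² values reported in the abstract.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy scores, invented for illustration only (not data from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5] * 4          # always predict the mean of y_true
misleading = [4.0, 3.0, 2.0, 1.0]  # predictions anti-correlated with the truth

r2_mean = r_squared(y_true, mean_baseline)  # constant-mean predictor
r2_bad = r_squared(y_true, misleading)      # worse than the mean predictor

# Gap between the two R² values reported in the abstract.
reported_gap = 0.86 - (-0.134)

print(r2_mean, r2_bad, round(reported_gap, 3))
```

On the toy data the mean baseline yields exactly 0 and the anti-correlated predictor yields a negative score. The reported values 0.86 and -0.134 differ by 0.994.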
