Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Abstract
Benchmarks often overestimate the trustworthiness of large language models (LLMs) because models behave differently under evaluation than in real-world use. We present Probe-Rewrite-Evaluate (PRE), a training-free diagnostic pipeline that reveals how LLMs alter their behavior when prompts shift from test-like to deploy-like contexts, a phenomenon known as evaluation awareness. PRE first applies a linear probe to assign each prompt a continuous realism score, then uses a semantics-preserving rewriting strategy to increase deploy-likeness, and finally evaluates paired outputs with an external judge model. On a strategic role-playing dataset of 371 items, rewriting raises average probe scores by 30% while preserving task intent. Across state-of-the-art reasoning and general-purpose models, deploy-like prompts reliably change outcomes: honesty increases by 12.63%, deception decreases by 25.49%, and refusals rise by 12.82%, with Claude 4.1 Opus showing the largest single-model reduction in deception at 29.11%. These shifts are statistically significant under paired tests and correlate with the magnitude of probe-score gains, demonstrating that evaluation awareness is not only measurable but also manipulable. We further quantify LLM evaluation awareness with an awareness elasticity (AE) score and find that AE scales with model size. Our findings show that reasoning models are evaluation-aware and more prone to unsafe or deceptive outputs under perceived test conditions, underscoring the need for benchmark frameworks that explicitly account for prompt realism when assessing alignment.