Explainable AI-Generated Image Detection RewardBench
Abstract
Generative models now power large portions of the creative pipeline, but the same capabilities enable image forgeries that erode provenance, mislead audiences, and threaten creators’ rights. Protective AI therefore needs detectors that explain, in language people can trust, why an image is real or AI-generated, and, crucially, reliable judges that can score the quality of those explanations at scale. We introduce XAIGID-RewardBench, a domain-specific benchmark for evaluating multimodal reward models (“MLLM-as-a-judge”) on the task of assessing explanations for AI-generated image detection. The benchmark contains approximately 3,000 human-verified triplets, each consisting of an image and two competing explanations; it spans recent generators as well as real images and is designed to stress artifact-aware, evidence-grounded reasoning. On XAIGID-RewardBench, the best current model attains 88.76\% accuracy on clear-winner cases, while human inter-annotator agreement reaches 98.30\%, revealing a sizable reliability gap. We analyze frequent failure modes (e.g., generalizing over localized artifacts, making unsupported claims) and discuss how better judges can strengthen content provenance pipelines and platform governance. Code and data will be released upon acceptance.