REVEAL – Reasoning and Evaluation of Visual Evidence through Aligned Language
Abstract
The rapid development of generative models has made visual forgeries harder to detect and interpret, creating a need for robust image forgery detection frameworks that provide reasoning as well as localisation. Current approaches typically rely on supervised training for specific manipulation types or on anomaly detection in the embedding space, and generalisation across domains remains difficult. We frame forgery detection as a prompt-driven visual reasoning task that leverages the semantic alignment capabilities of large vision-language models. We introduce a framework, 'REVEAL' (Reasoning and Evaluation of Visual Evidence through Aligned Language), which incorporates generalised guidelines. Additionally, we propose two related methods: (1) Holistic Scene-level Evaluation, which assesses the physics, semantics, perspective, and realism of the entire image; and (2) Region-wise anomaly detection, which divides the image into grid regions for zoomed-in analysis. Experiments are conducted on datasets from multiple domains (Photoshop, DeepFake, and AIGC editing). We compare vision-language models with competitive baselines and evaluate the reasoning they provide. DOI: https://doi.org/10.48550/arXiv.2508.12543
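As an illustration of the grid division used in region-wise anomaly detection, the following minimal sketch computes the bounding boxes of a rows × cols grid over an image. The function name `grid_regions`, the `Box` alias, and the grid dimensions are hypothetical choices made for this example; the abstract does not specify the grid size or how regions are passed to the model.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def grid_regions(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into a rows x cols grid of boxes.

    Hypothetical helper: the grid size is an assumption of this sketch,
    not a value taken from the paper.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if not (1 <= rows <= height and 1 <= cols <= width):
        raise ValueError("grid must have between 1 and height/width cells per axis")

    # Integer cell edges; leftover pixels are spread across cells so the
    # boxes tile the image exactly, with no gaps and no overlap.
    xs = [width * c // cols for c in range(cols + 1)]
    ys = [height * r // rows for r in range(rows + 1)]

    # Row-major order: left to right, then top to bottom.
    return [
        (xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    for box in grid_regions(100, 60, rows=2, cols=2):
        print(box)
```

Each box follows the (left, upper, right, lower) convention, so it could be handed to an image library's crop routine and the crop enlarged before being sent to a vision-language model with a region-specific prompt. That prompting step is outside the scope of this sketch.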