Poster
in
Workshop: Multimodal Algorithmic Reasoning Workshop

Aha Moment Revisited: Are Vision Language Models Truly Capable of Self-verification in Inference Scaling?

Mingyuan Wu ⋅ Meitang Li ⋅ Jingcheng Yang ⋅ Jize Jiang ⋅ Kaizhuo Yan ⋅ Zhaoheng Li ⋅ Hanchao Yu ⋅ Minjia Zhang ⋅ Klara Nahrstedt

2025 Poster

Abstract

Inference-time techniques such as decoding-time scaling and self-refinement have been shown to substantially improve reasoning in large language models (LLMs), driven by emergent self-correction and self-verification behaviors that are often elicited through reinforcement learning (RL). In this work, we investigate whether these inference-time scaling methods similarly benefit vision-language models (VLMs), especially those fine-tuned with RL. Through extensive evaluation, we find that although both majority vote and best-of-N with self-verification improve VLM performance, majority vote significantly outperforms the verification-centric strategies. Furthermore, inference-time scaling behaviors commonly associated with RL-tuned models, such as the “aha moment,” do not yield consistent performance gains. Our analysis identifies a key limitation: current RL-trained VLMs exhibit weak self-verification across both visual and textual modalities, which limits the effectiveness of inference-time scaling.
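
To make the two strategy families in the abstract concrete, the following is a minimal Python sketch of majority vote and best-of-N selection over a set of sampled answers. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sampled answers, and the verifier scores are all hypothetical, and in practice the score would come from the model judging its own candidate answers.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the answer with the highest verifier score (first one on ties)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return max(answers, key=score)


# Hypothetical data: six sampled answers to one multiple-choice question.
sampled_answers = ["B", "A", "B", "C", "A", "B"]

# Hypothetical self-verification scores; a real system would query the model.
verifier_scores = {"A": 0.9, "B": 0.6, "C": 0.1}

voted = majority_vote(sampled_answers)
verified = best_of_n(sampled_answers, lambda a: verifier_scores[a])
print(voted, verified)
```

In this toy case the two strategies disagree: voting picks the most frequent answer, while best-of-N follows the verifier. Best-of-N is therefore only as reliable as the scores it is given, which is the dependence the abstract's finding about weak self-verification concerns.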
