Visual Abstract Thinking Empowers Multimodal Reasoning
Abstract
Images usually convey richer detail than text, but they often contain redundant information that can degrade performance on multimodal reasoning tasks. When faced with lengthy or complex messages, humans tend to employ abstract thinking, converting them into simpler and more concise abstracts. Inspired by this cognitive strategy, we design Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstracts instead of explicit verbal thoughts or elaborate guidance, enabling a more concentrated visual reasoning process. Compared with explicit thinking methods such as Chain-of-Thought (CoT) prompting and tool-augmented approaches, VAT reduces visual redundancy and thereby encourages models to focus on the most essential visual elements and concepts. Results show that VAT consistently improves different MLLMs across multiple visual perception and reasoning tasks. VAT achieves an average gain of 8.72% over the GPT-4o baseline, surpassing the gain of CoT (5.46%) and demonstrating that VAT is more effective at enhancing the visual reasoning abilities of MLLMs.