ChartAgent: A Multimodal Agent for Complex Visual Question Answering in Charts
Abstract
Recent multimodal LLMs have shown promise in chart-based visual question answering (Chart VQA), but their performance declines sharply on unannotated charts: those that require precise visual interpretation rather than offering textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent actively manipulates and interacts with chart images through chart-specialized actions such as drawing annotations, cropping relevant regions (e.g., segmenting pie slices, isolating bars of interest), localizing axes, and visually validating intermediate reasoning steps. Using a ReAct-style iterative loop, ChartAgent decomposes a complex query into clear visual subtasks and fulfills each subtask by selecting from a library of chart-specialized vision tools (e.g., segmentation, detection, localization). This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension: the agent progressively interprets visual elements, visually validates tool outcomes, and dynamically adjusts its strategy when those outcomes are inconclusive or insufficient. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, outperforming existing approaches by up to 16.07% overall and by as much as 17.31% on unannotated, numeric-intensive queries. This work is among the first to systematically demonstrate the explicit, visually grounded reasoning capabilities of tool-augmented multimodal agents in the chart domain.
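To make the ReAct-style loop described above concrete, the following is a minimal Python sketch of the reason-act-observe-validate pattern over a library of chart-specialized tools. All names here (the `Step` structure, the tool entries, the `chart_agent` driver, and the `planner.propose`/`planner.validate` interface) are hypothetical illustrations of the pattern, not ChartAgent's actual API.

```python
# Hypothetical sketch of a ReAct-style chart-reasoning loop.
# Tool names and the planner interface are illustrative assumptions,
# not the actual ChartAgent implementation.

from dataclasses import dataclass

@dataclass
class Step:
    thought: str  # natural-language reasoning for this iteration
    tool: str     # which chart-specialized tool to invoke
    args: dict    # tool arguments (e.g., region of interest)

# Library of chart-specialized vision tools (placeholder callables).
TOOLS = {
    "localize_axes":   lambda img, **kw: {"x_axis": ..., "y_axis": ...},
    "segment_slices":  lambda img, **kw: {"slices": [...]},
    "crop_region":     lambda img, bbox=None, **kw: {"crop": ...},
    "draw_annotation": lambda img, **kw: {"annotated": ...},
}

def chart_agent(image, question, planner, max_steps=8):
    """ReAct-style loop: plan a visual subtask, act with a vision tool,
    observe the result, visually validate it, then revise the plan.

    `planner` is assumed to wrap a multimodal LLM exposing two methods:
    propose(image, question, trace) -> Step, and
    validate(image, step, observation) -> bool.
    """
    trace = []
    for _ in range(max_steps):
        step = planner.propose(image, question, trace)      # "Reason"
        if step.tool == "answer":                           # terminate
            return step.args["value"], trace
        observation = TOOLS[step.tool](image, **step.args)  # "Act"
        valid = planner.validate(image, step, observation)  # visual check
        trace.append((step, observation, valid))
        # If validation fails, the growing trace lets the planner
        # adjust its strategy on the next iteration.
    return None, trace
```

The key design choice this sketch illustrates is that validation is itself a visual operation: the planner inspects the tool's output against the chart image before committing to it, rather than trusting intermediate results as textual chain-of-thought would.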