Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Hyunsik Chae · Seungwoo Yoon · Chloe Yewon Chun · Gyehun Go · Yongin Cho · Gyeongmin Lee · Ernest Ryu
Keywords:
Benchmarking
Vision Language Models
Large Language Models
Large Multimodal Models
Vision-Language Reasoning
Visual Geometry
Abstract
Recent Vision Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we introduce the Atomic Visual Skills Benchmark (AVSBench) to evaluate whether VLMs possess the capability to understand basic geometric features, which we refer to as atomic visual skills. Specifically, we systematically categorize these atomic visual skills and handcraft a set of 5,073 diverse questions designed to assess each individual skill. Using AVSBench, we evaluate current leading VLMs and find that they struggle with most of these atomic visual skills, even though the skills are obvious to humans.