Poster
Right this way: Can VLMs Guide Us to See More to Answer Questions?
Li Liu · Diji Yang · Sijia Zhong · Kalyana Suma Sree Tholeti · Lei Ding · Yi Zhang · Leilani Gilpin
East Exhibit Hall A-C #1806
In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical yet challenging task in the Visual Question Answering (VQA) setting: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals. To evaluate current VLMs on this task, we introduce a human-labeled dataset as a benchmark. Additionally, we present an automated pipeline that generates synthetic training data by simulating "where to know" scenarios. Our empirical results demonstrate significant performance improvements when this synthetic data is used to fine-tune mainstream VLMs. Our study highlights the potential to narrow the gap between VLMs and the human-like process of information assessment and acquisition.