MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
Abstract
We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the domain of agriculture, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models in real-world, expert-guided domains. Unlike existing benchmarks that rely on well-specified user inputs, MIRAGE features underspecified, context-rich scenarios that require models to infer latent knowledge gaps and either proactively guide the interaction (e.g., by seeking clarification) or respond directly. Our benchmark comprises two core components: a Single-Turn Challenge, which requires reasoning over a single user turn and image set to identify relevant entities, infer causal explanations, and generate actionable recommendations; and a Multi-Turn Challenge, which targets dialogue state tracking, goal-driven generation, and expert-level conversational decision-making. We evaluate more than 20 closed- and open-source frontier vision-language models (VLMs), using three reasoning language models as evaluators; the results highlight the substantial challenges MIRAGE poses in both single-turn and multi-turn interaction settings. Even the advanced GPT-4.1 and GPT-4o models achieve only 44.6% and 40.9% accuracy, respectively, indicating significant room for improvement.