Beyond Accuracy: Evaluating Multimodal Mathematical and Scientific Reasoning Through Error Analysis and Self-Correction
Abstract
While contemporary large vision-language models achieve impressive performance on standard benchmarks, their reasoning depth remains poorly understood. We evaluate multimodal mathematical and scientific reasoning through comprehensive error analysis and self-correction assessment, using challenging bilingual (English and Hindi) problems from the Joint Entrance Examination (JEE) Advanced. Our evaluation of eleven models reveals that while frontier models achieve 76.8-83.9% accuracy, open-source alternatives reach only 10.9-50.9%, a significant performance gap not observed on existing benchmarks such as MMMU. We also observe that state-of-the-art models exhibit instruction-following failures, including responding in English despite prompts in other languages. Most critically, our self-correction pipeline shows that models correct fewer than 10% of their responses, despite error-detection rates of 30-79% and pass@k accuracy improvements of 31-55%. Together, these findings indicate that the cognitive demands of sequential self-reflection exceed current model capabilities. We publicly release our codebase and data: https://anonymous.4open.science/r/mmJEE-Eval-D14F