Beyond Accuracy: A Diagnostic Protocol for Fairly Evaluating Multimodal Reasoning
Abstract
Current benchmarks for Multimodal Large Language Models (MLLMs) rely on a single accuracy score, a metric that is fundamentally flawed for subjective tasks such as emotion recognition. This paradigm creates an "Intelligence-Accuracy Paradox": models that reason sophisticatedly about ambiguous human communication are penalized for not conforming to a single, oversimplified ground-truth label, while less capable models that exploit dataset biases achieve higher scores. This paper argues that high accuracy often masks a "hidden failure" on complex, ambiguous instances. To address this, we propose a two-stage protocol that is both diagnostic and evaluative. Stage 1 is the diagnostic: it exploits a phenomenon we term Dominant Modality Override (DMO), in which one modality's high-confidence signal hijacks the final decision, to automatically partition a dataset into unambiguous and conflict-rich samples. This diagnosis enables Stage 2, a fairer evaluation in which each partition is assessed differently: unambiguous samples are scored on accuracy, while conflict-rich samples are judged on the quality of the model's reasoning using metrics such as clue-based faithfulness and set-based plausibility. The protocol yields a fairer, more faithful "report card" of a model's true capabilities, rewarding intelligent reasoning over brittle pattern matching.
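The sketch below illustrates, in minimal Python, how the two-stage protocol described above could be organized: Stage 1 partitions samples by a DMO-style dominance check, and Stage 2 scores each partition with a different metric. All names and thresholds here (e.g. `modality_confidences`, `dominance_margin`, the `faithfulness` and `plausibility` fields) are illustrative assumptions, not the protocol's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    sample_id: str
    gold_label: str
    predicted_label: str
    # Per-modality confidence in the predicted label, e.g. {"text": 0.95, "audio": 0.40}.
    modality_confidences: Dict[str, float]
    # Reasoning-quality scores in [0, 1], used only for conflict-rich samples (assumed fields).
    faithfulness: float = 0.0
    plausibility: float = 0.0


def stage1_partition(samples: List[Sample], dominance_margin: float = 0.4):
    """Stage 1 (diagnostic): partition samples via a DMO-style check.

    A sample is treated as conflict-rich when one modality's confidence exceeds
    every other modality's by at least `dominance_margin`, i.e. a single
    high-confidence signal could override the fused decision. The margin value
    is a placeholder assumption.
    """
    unambiguous, conflict_rich = [], []
    for s in samples:
        confs = sorted(s.modality_confidences.values(), reverse=True)
        dmo_detected = len(confs) > 1 and (confs[0] - confs[1]) >= dominance_margin
        (conflict_rich if dmo_detected else unambiguous).append(s)
    return unambiguous, conflict_rich


def stage2_evaluate(unambiguous: List[Sample], conflict_rich: List[Sample]):
    """Stage 2 (evaluative): score each partition with a different metric."""
    accuracy = (
        sum(s.predicted_label == s.gold_label for s in unambiguous) / len(unambiguous)
        if unambiguous else None
    )
    # Simple average of faithfulness and plausibility as a stand-in for the
    # paper's reasoning-quality metrics.
    reasoning_quality = (
        sum(0.5 * (s.faithfulness + s.plausibility) for s in conflict_rich) / len(conflict_rich)
        if conflict_rich else None
    )
    return {
        "unambiguous_accuracy": accuracy,
        "conflict_reasoning_quality": reasoning_quality,
    }
```

The key design point this sketch captures is that no single number is reported over the full dataset: the diagnostic split in Stage 1 determines which metric each sample contributes to in Stage 2.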