Beyond Accuracy: A Diagnostic Protocol for Fairly Evaluating Multimodal Reasoning
Abstract
Current benchmarks for Multimodal Large Language Models (MLLMs) rely on a single accuracy score, a metric that is fundamentally flawed for subjective tasks such as emotion recognition. This paradigm creates an "Intelligence-Accuracy Paradox": models that reason sophisticatedly about ambiguous human communication are penalized for not conforming to a single, oversimplified ground-truth label, while less capable models that exploit dataset biases achieve higher scores. This paper argues that high accuracy often masks a "hidden failure" on complex, ambiguous instances. To address this, we propose a two-stage protocol that is both diagnostic and evaluative. Stage 1 is the diagnostic: it exploits a phenomenon we term Dominant Modality Override (DMO), in which one modality's high-confidence signal hijacks the final decision, to automatically partition a dataset into unambiguous and conflict-rich samples. This diagnosis enables Stage 2, a fairer evaluation in which each partition is assessed differently: unambiguous samples are scored on accuracy, while conflict-rich samples are judged on the quality of the model's reasoning using metrics such as clue-based faithfulness and set-based plausibility. The protocol yields a fairer, more faithful "report card" of a model's true capabilities, rewarding intelligent reasoning over brittle pattern matching.
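The sketch below illustrates, in minimal Python, how the two-stage protocol described above could be organized: Stage 1 partitions samples by a DMO-style dominance check, and Stage 2 scores each partition with a different metric. All names and thresholds here (e.g. `modality_confidences`, `dominance_margin`, the `faithfulness` and `plausibility` fields) are illustrative assumptions, not the protocol's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    sample_id: str
    gold_label: str
    predicted_label: str
    # Per-modality confidence in the predicted label, e.g. {"text": 0.95, "audio": 0.40}.
    modality_confidences: Dict[str, float]
    # Reasoning-quality scores in [0, 1], used only for conflict-rich samples (assumed fields).
    faithfulness: float = 0.0
    plausibility: float = 0.0


def stage1_partition(samples: List[Sample], dominance_margin: float = 0.4):
    """Stage 1 (diagnostic): partition samples via a DMO-style check.

    A sample is treated as conflict-rich when one modality's confidence exceeds
    every other modality's by at least `dominance_margin`, i.e. a single
    high-confidence signal could override the fused decision. The margin value
    is a placeholder assumption.
    """
    unambiguous, conflict_rich = [], []
    for s in samples:
        confs = sorted(s.modality_confidences.values(), reverse=True)
        dmo_detected = len(confs) > 1 and (confs[0] - confs[1]) >= dominance_margin
        (conflict_rich if dmo_detected else unambiguous).append(s)
    return unambiguous, conflict_rich


def stage2_evaluate(unambiguous: List[Sample], conflict_rich: List[Sample]):
    """Stage 2 (evaluative): score each partition with a different metric."""
    accuracy = (
        sum(s.predicted_label == s.gold_label for s in unambiguous) / len(unambiguous)
        if unambiguous else None
    )
    # Simple average of faithfulness and plausibility as a stand-in for the
    # paper's reasoning-quality metrics.
    reasoning_quality = (
        sum(0.5 * (s.faithfulness + s.plausibility) for s in conflict_rich) / len(conflict_rich)
        if conflict_rich else None
    )
    return {
        "unambiguous_accuracy": accuracy,
        "conflict_reasoning_quality": reasoning_quality,
    }
```

The key design point this sketch captures is that no single number is reported over the full dataset: the diagnostic split in Stage 1 determines which metric each sample contributes to in Stage 2.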