Poster in Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

Human-Centric Framework for Large Multimodal Models Evaluation

Shaina Raza ⋅ Aravind Narayanan ⋅ Vahid Reza Khazaie ⋅ Ashmal Vayani ⋅ Mukund Chettiar ⋅ Deval Pandya

Abstract

Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding, but lack rigorous evaluation for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. Labels are generated via an AI-assisted pipeline and validated by experts. HumaniBench assesses LMMs across seven key alignment principles (fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality) through open-ended and closed-ended VQA tasks. Grounded in AI ethics and real-world needs, these principles provide a holistic lens for societal impact. Benchmarking results across different LMMs show that proprietary models generally lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. HumaniBench offers a rigorous testbed to diagnose limitations and promote responsible LMM development. Code and data are available for reproducibility.
