BEAR: Benchmarking Multimodal Language Models for Atomic Embodied Reasoning Abilities
Abstract
Embodied reasoning abilities refer to an agent's capabilities to perceive, comprehend, and interact effectively with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, their embodied reasoning capabilities remain underexplored: a thorough and systematic evaluation is still lacking, as existing benchmarks primarily focus on isolated domains such as planning or spatial understanding. To bridge this gap, we propose BEAR, a comprehensive and fine-grained benchmark designed to evaluate MLLMs' atomic embodied reasoning abilities. BEAR comprises 4,469 interleaved video–image–text entries across 14 domains in 6 categories, spanning tasks from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. Evaluation of 15 state-of-the-art MLLMs reveals persistent limitations across all domains of embodied reasoning.