Foundation Models for Embodied Agents
Abstract
This challenge invites participants to enhance Large Language Models (LLMs) for embodied reasoning through our standardized Embodied Agent Interface evaluation protocol. The framework systematically evaluates four critical embodied reasoning capabilities: goal interpretation (understanding objectives and grounding them to environment states), subgoal decomposition (breaking complex goals into intermediate steps), action sequencing (planning executable action sequences), and transition/world modeling (predicting how actions change world state).

Despite growing interest in using LLMs for robotics and agent planning, current evaluations lack standardization and fail to pinpoint fine-grained reasoning failures. Our challenge builds a unified approach to task formulation, input/output structures, and evaluation metrics on top of the well-established BEHAVIOR and VirtualHome simulators, enhanced with detailed annotations including Linear Temporal Logic (LTL) goal specifications and comprehensive error analysis.

Unlike typical evaluations that report only an overall success rate and leave unclear which specific abilities LLMs struggle with, our framework uses fine-grained metrics that check both whether proposed actions are executable in the environment and whether they actually accomplish the intended goals. The evaluation system decomposes each reasoning component into a separate module, giving a clear picture of exactly where and how models succeed or fail. Baseline results from state-of-the-art LLMs reveal significant performance gaps and motivate further innovation.

The competition aims to advance the understanding of how LLMs reason in embodied environments, promote the development of robust and interpretable LLM agents, and foster collaboration between the language modeling and robotics communities. More info at https://neurips25-eai.github.io/.
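To make the distinction between executability and goal satisfaction concrete, here is a minimal sketch in symbolic, STRIPS-style terms. It is not the official Embodied Agent Interface API; the class names, predicates, and the simplified final-state check standing in for full LTL evaluation are all illustrative assumptions.

```python
# Illustrative sketch (not the official Embodied Agent Interface API) of scoring a
# predicted plan on two separate axes: executability (can every action be applied
# in sequence?) and goal satisfaction (does the final state meet the goal?).
from dataclasses import dataclass
from typing import Callable

State = frozenset[str]  # a state is a set of ground predicates, e.g. "on(plate, table)"

@dataclass
class Action:
    name: str
    preconditions: frozenset[str]  # predicates that must hold before the action
    add_effects: frozenset[str]    # predicates made true by the action
    del_effects: frozenset[str]    # predicates made false by the action

    def applicable(self, state: State) -> bool:
        return self.preconditions <= state

    def apply(self, state: State) -> State:
        return (state - self.del_effects) | self.add_effects


def evaluate_plan(initial: State, plan: list[Action],
                  goal: Callable[[State], bool]) -> dict[str, bool]:
    """Return separate executability and goal-satisfaction judgments for a plan."""
    state = initial
    for action in plan:
        if not action.applicable(state):
            # Plan breaks here: not executable, so the goal is not credited either.
            return {"executable": False, "goal_satisfied": False}
        state = action.apply(state)
    return {"executable": True, "goal_satisfied": goal(state)}


if __name__ == "__main__":
    initial = frozenset({"at(agent, kitchen)", "in(plate, cabinet)"})
    pick = Action("pick(plate)", frozenset({"in(plate, cabinet)"}),
                  frozenset({"holding(plate)"}), frozenset({"in(plate, cabinet)"}))
    place = Action("place(plate, table)", frozenset({"holding(plate)"}),
                   frozenset({"on(plate, table)"}), frozenset({"holding(plate)"}))
    # Goal corresponding to the LTL formula F on(plate, table), checked on the final state.
    goal = lambda s: "on(plate, table)" in s
    print(evaluate_plan(initial, [pick, place], goal))
    # -> {'executable': True, 'goal_satisfied': True}
```

Reporting the two judgments separately is what lets the benchmark distinguish plans that fail because an action's preconditions are violated from plans that run to completion but never reach the specified goal.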
Schedule
11:00 AM
11:05 AM
12:15 PM