DEPART: Hierarchical Multi-Agent System for Multi-Turn Interaction
Abstract
Large Language Models (LLMs) perform well on short-horizon tasks but struggle with long-horizon, multimodal scenarios that require multi-step reasoning, perception, and adaptive planning. We identify two key challenges in these settings: the difficulty of coordinating planning and execution over long horizons within single-agent architectures, and the inefficiency of indiscriminate visual grounding. To address these challenges, we propose \textbf{DEPART}, a hierarchical multi-agent framework that separates planning, perception, and execution across three specialized agents. These agents collaborate through an iterative loop: \textbf{D}ivide, \textbf{E}valuate, \textbf{P}lan, \textbf{A}ct, \textbf{R}eflect, and \textbf{T}rack, which supports dynamic task decomposition, selective visual grounding, and feedback-driven control. Evaluated on two web-based benchmarks, DEPART outperforms strong baselines, including agents enhanced with reinforcement learning, while improving efficiency by invoking vision only when needed. Our results highlight the value of modular, multi-agent reasoning for complex multimodal tasks.
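To make the loop concrete, the following is a minimal Python sketch of how the Divide/Evaluate/Plan/Act/Reflect/Track cycle could be wired across three agents; all class and method names (Planner, Perceiver, Executor, ground, reflect) are illustrative assumptions for exposition, not the implementation described in this paper.
\begin{verbatim}
# Hypothetical sketch of the DEPART control loop; agent interfaces,
# method names, and the stopping criterion are assumptions.
from dataclasses import dataclass, field

@dataclass
class State:
    subtasks: list = field(default_factory=list)
    history: list = field(default_factory=list)

class Planner:
    def divide(self, task):             # Divide: decompose the task
        return [f"{task}:step{i}" for i in range(3)]
    def plan(self, subtask, feedback):  # Plan: choose the next action
        return {"action": "click", "target": subtask,
                "needs_vision": bool(feedback)}

class Perceiver:
    def ground(self, action):           # selective grounding, on demand
        return {"bbox": (0, 0, 10, 10), "target": action["target"]}

class Executor:
    def act(self, action, grounding):   # Act: execute in the environment
        return {"ok": True, "obs": f"executed {action['action']}"}
    def reflect(self, result):          # Reflect: outcome -> feedback
        return None if result["ok"] else "retry with vision"

def depart(task):
    planner, perceiver, executor = Planner(), Perceiver(), Executor()
    state = State(subtasks=planner.divide(task))        # Divide
    feedback = None
    for subtask in list(state.subtasks):                # Evaluate pending subtasks
        while True:
            action = planner.plan(subtask, feedback)    # Plan
            grounding = (perceiver.ground(action)       # vision only when needed
                         if action["needs_vision"] else None)
            result = executor.act(action, grounding)    # Act
            feedback = executor.reflect(result)         # Reflect
            state.history.append((subtask, action, result))  # Track
            if feedback is None:
                break
    return state.history

if __name__ == "__main__":
    for record in depart("book a flight"):
        print(record)
\end{verbatim}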