ActorMind: Emulating Human Actor Reasoning for Role-Playing in Large Language-Audio Models
Abstract
Significant progress has been made in role-playing with LLMs. However, existing approaches are confined to the textual modality and neglect speech, which plays a predominant role in daily interactions, thereby limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark role-playing in Large Language-Audio Models (LLAMs) through a benchmark called ActorMindBench, and we present a corresponding reasoning framework called ActorMind. Specifically, (1) Role-Playing in LLAMs is the task of having LLAMs deliver spontaneous responses with personalized verbal traits based on their role, the scene, and the spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. Notably, we provide the corresponding data construction pipeline to facilitate expansion by users. (3) ActorMind is an off-the-shelf, multi-agent, CoT-style reasoning framework that emulates how human actors perform in the theater. Concretely, ActorMind first reads its assigned role description via the Eye Agent, then comprehends emotional cues in the contextual spoken dialogue through the Ear Agent. Subsequently, the Brain Agent generates a descriptive emotional state, and finally the Mouth Agent delivers the script infused with the corresponding emotional state. Experimental results on ActorMindBench demonstrate the effectiveness of ActorMind in enhancing role-playing in LLAMs. The project page is available at https://github.com/ActorMind/ActorMind.
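To make the Eye/Ear/Brain/Mouth decomposition concrete, the sketch below shows one possible way such a multi-agent pipeline could be wired together. It is a minimal illustration only, not the authors' implementation: all class, function, and prompt names (Scene, eye_agent, ear_agent, brain_agent, mouth_agent, the LLAMBackend callable) are hypothetical assumptions, and the backend is a placeholder for whatever LLAM call a user would actually make.

```python
# Minimal sketch of an Eye -> Ear -> Brain -> Mouth pipeline in the spirit of
# ActorMind. All names and prompts are illustrative assumptions; `LLAMBackend`
# stands in for any call to a Large Language-Audio Model.

from dataclasses import dataclass
from typing import Callable, List

# A backend is any function mapping (text prompt, audio clip paths) -> text.
LLAMBackend = Callable[[str, List[str]], str]


@dataclass
class Scene:
    role_description: str      # Role-Level content: who the character is
    scene_description: str     # Scene-Level content: where/when the dialogue happens
    dialogue_audio: List[str]  # Utterance-Level content: paths to prior spoken turns


def eye_agent(backend: LLAMBackend, scene: Scene) -> str:
    """Read the assigned role description and summarize persona and verbal traits."""
    prompt = f"Summarize the persona and verbal traits of this role:\n{scene.role_description}"
    return backend(prompt, [])


def ear_agent(backend: LLAMBackend, scene: Scene) -> str:
    """Comprehend emotional cues in the contextual spoken dialogue."""
    prompt = (f"Scene: {scene.scene_description}\n"
              "Describe the emotional cues conveyed by the attached spoken dialogue.")
    return backend(prompt, scene.dialogue_audio)


def brain_agent(backend: LLAMBackend, persona: str, cues: str) -> str:
    """Fuse persona and emotional cues into a descriptive emotional state."""
    prompt = (f"Persona: {persona}\nEmotional cues: {cues}\n"
              "Describe the emotional state the character should be in right now.")
    return backend(prompt, [])


def mouth_agent(backend: LLAMBackend, persona: str, state: str, scene: Scene) -> str:
    """Deliver the next line, infused with the inferred emotional state."""
    prompt = (f"Persona: {persona}\nEmotional state: {state}\n"
              f"Scene: {scene.scene_description}\n"
              "Produce the character's next spoken response, in character.")
    return backend(prompt, scene.dialogue_audio)


def actor_mind_pipeline(backend: LLAMBackend, scene: Scene) -> str:
    """Chain the four agents: read the role, listen, reason about emotion, then speak."""
    persona = eye_agent(backend, scene)
    cues = ear_agent(backend, scene)
    state = brain_agent(backend, persona, cues)
    return mouth_agent(backend, persona, state, scene)


if __name__ == "__main__":
    # Dummy backend so the sketch runs standalone; replace with a real LLAM call.
    def dummy_backend(prompt: str, audio: List[str]) -> str:
        return f"[model output for prompt of {len(prompt)} chars, {len(audio)} audio clip(s)]"

    demo = Scene(
        role_description="A weary detective with a dry sense of humor.",
        scene_description="A rainy interrogation room, late at night.",
        dialogue_audio=["suspect_turn_01.wav"],
    )
    print(actor_mind_pipeline(dummy_backend, demo))
```

The design point the sketch illustrates is that each agent consumes only the inputs named in the abstract (role text, scene context, spoken dialogue, or the intermediate emotional state), so the chain-of-thought is carried as explicit intermediate text between agents rather than inside a single prompt.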