Pawgaze: A Benchmark for Fine-Grained Multimodal Analysis of Canine Behavior
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in zero-shot understanding of diverse inputs such as video, audio, and text. But can they accurately understand complex animal behavior? Answering this question is hampered by the lack of comprehensive datasets that capture real-world animal behaviors, combining visual and auditory cues with insights into physical conditions and emotional states. To address this gap, we present Pawgaze, a novel benchmark for fine-grained analysis of dog activities, comprising 7,120 question–answer pairs across 923 videos. The benchmark includes real-world dog videos with synchronized audio–visual streams, each paired with five-way multiple-choice questions that require frame-level reasoning, interpretation of behavioral cues, and understanding of human–dog interactions. We introduce a scalable, LLM-based pipeline for automated question–answer generation, guided by insights developed in collaboration with canine behavior experts. We benchmark a range of MLLMs, including proprietary models. Experimental results and analyses indicate that closed-source MLLMs achieve superior zero-shot performance in multimodal understanding of canine-centered behaviors but rely heavily on prior knowledge. A detailed failure analysis highlights the remaining challenges and opportunities for improvement. Pawgaze paves the way for extending MLLM capabilities beyond traditional scene understanding tasks, with promising applications in pet-care robotics, animal health, and behavior modeling. We provide a link to the anonymized dataset: https://huggingface.co/datasets/pawgaze/pawgaze.