Data Scaling Isn't Enough: Towards Improving Compositional Reasoning in Video-Language Models
Abstract
Research in Video-Language Models has focused on developing Video Foundation Models (ViFMs) that achieve strong zero-shot performance by scaling video-text pair datasets. Meanwhile, the compositional reasoning abilities of ViFMs have gained increasing attention, raising a critical question: Does scaling video-text pair data consistently enhance compositional reasoning? Based on our finding that simply increasing the dataset size does not necessarily improve compositional reasoning, we explore whether it can instead be enhanced with a small, high-quality dataset rather than further dataset scaling. To this end, we focus on video scene graph (VidSG) datasets, which provide rich, structured relational information, and propose SGCR-Vid, a method designed to leverage this information effectively. To evaluate SGCR-Vid, we apply it to two state-of-the-art ViFMs and demonstrate significant performance improvements on compositional reasoning benchmarks while using less than 0.5% of the pretraining data scale. Our results show that compositional reasoning can be effectively enhanced with an extremely small-scale dataset.