Invited Talk 4: Advances in Multimodal Video Understanding
Abstract
The field of multimodal learning has witnessed significant progress in recent years, enabled mainly by advances in contrastive and autoregressive learning techniques. This talk discusses some of those findings in the context of video understanding. First, I will discuss the abilities of vision-language models (VLMs), particularly with respect to spatial grounding and the problem of vision-language alignment. Second, the talk will show how VLMs can be used to analyze biases in current video evaluation benchmarks and how to address them. Third, I will discuss recent work that extends the concept of vision-language learning to video-audio learning. The talk will close with an outlook on the challenges and future directions in multimodal video understanding, with a focus on improving the efficiency and scalability of future systems.
Bio
Hilde Kuehne is a Professor of Computer Vision and Multimodal Learning at the Tübingen AI Center and an affiliated professor at the MIT–IBM Watson AI Lab. Her research focuses on video understanding, learning without labels, and the analysis of multimodal representations. She has created several highly cited datasets for large-scale video analysis, including the HMDB51 dataset, which received both the ICCV 2021 Helmholtz Prize and the IEEE PAMI Mark Everingham Prize. She was a General Chair of ICCV 2025 and currently serves as an Associate Editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).