Keynote Speaker: Toward Efficient and Reliable Vision–Language Models for Real-World Autonomous Systems
Abstract
Vision–language models (VLMs) are becoming central components in autonomous systems, enabling capabilities such as scene understanding, instruction following, and high-level decision support. However, their deployment in real-world environments remains constrained by two fundamental challenges: the need for reliable visual perception and the need for computational efficiency under strict latency and resource budgets. In this talk, I will present a unified research trajectory that addresses these challenges through new designs for knowledge-efficient, token-efficient, and expert-efficient multimodal models. I will showcase three complementary advances: HAWAII, a hierarchical distillation framework that transfers the strengths of multiple vision experts into a single lightweight encoder for improved visual understanding; LEO-MINI, an efficient multimodal model that dramatically reduces visual token redundancy using conditional token reduction and multimodal expert routing; and LEO, a principled mixture-of-vision-encoders design that reveals simple, effective fusion strategies for high-resolution perception. Together, these contributions show how to achieve strong visual reasoning under tight computational and latency constraints, offering practical, deployable VLM solutions for autonomous systems operating in complex, dynamic, and safety-critical environments.