Skip to yearly menu bar Skip to main content

Workshop Sun, Nov 30, 2025 • 11:00 AM – 6:00 PM PST Don Alberto 4

Vision Language Models: Challenges of Real World Deployment

Mozhgan Nasr Azadani · Yimu Wang · Krzysztof Czarnecki · Negar Arabzadeh · Richard H Bai · Lukas Schmid

Project Page [ OpenReview]

Abstract

Vision language models (VLMs) have demonstrated remarkable capabilities in integrating visual perception with natural language understanding, powering applications such as multimodal assistants, robotics, autonomous systems, and accessibility tools. However, their real-world deployment faces significant challenges in efficiency, scalability, and reliability. This workshop will bring together researchers and practitioners from academia and industry to highlight cutting-edge research, systems-level optimizations, and evaluation methodologies that are often overlooked yet pivotal for robust real-world integration. Efficiency, robustness, and reliability will be emphasized as core design principles, essential to advancing VLMs from experimental systems to dependable deployed technologies. By convening researchers at the intersection of multimodal learning, efficient inference and training, robustness and uncertainty estimation, and large-scale systems design, the workshop aims to establish concrete pathways toward building VLMs that can operate reliably under practical constraints. We hope this workshop will serve as a venue for exchanging insights on model design, efficiency techniques, and robustness evaluation that bridge the gap between research and real-world systems.

Video

Chat is not available.

Schedule

Timezone: America/Los_Angeles

11:00 AM

Opening remarks

Video

11:10 AM

Keynote: Dr. Roozbeh Mottaghi - Meta FAIR/UWashington

Video

11:40 AM

Keynote: Dr. Jiachen Li - University of California, Riverside

Video

12:10 PM

Keynote: Dr. Krzysztof Czarnecki - University of Waterloo

Video

12:30 PM

Keynote: Dr. Elahe Arani - Wayve

Video

1:00 PM

Poster session

→ Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Hangxuan Li · Renjun Jia · Xuezhang Wu · zeqi zheng · Yunjie Qian · Lily (Xianling) Zhang

→ AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs

Aahana Basappa · Pranay Goel · Anusri Karra · Anish Karra · Asa Gilmore · Kevin Zhu

→ Do Vision–Language Models Understand Visual Persuasiveness?

Gyuwon Park

→ Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning

Qingyuan Wu · Jianheng Liu · Jianye Hao · Jun Wang · Kun Shao

→ Don’t Lag, RAG: Training-Free Adversarial Detection Using RAG

Roie Kazoom · Raz Lapid · Moshe Sipper · Ofer Hadar

→ MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models

Yuqing Lei · Yingjun Du · Yawen Huang · Xiantong Zhen · Ling Shao

→ Efficient Vision-Language Reasoning via Adaptive Token Pruning

Xue li · Xiaonan Song

→ Scene Understanding via Scene Representation Generation with Vision-Language Models

Yuan Chen · Peng Shi

→ Efficient Inference Scaling for Safety Assurance

Ruizhong Qiu · Gaotang Li · Ting-Wei Li · Tianxin Wei · Jingrui He · Hanghang Tong

→ Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

Zuyao You

→ From Vision to Action: Enabling Real-World Agentic VLMs

Atchuta Ram Aravilli

→ A Comprehensive Survey of Multimodal LLMs for Scientific Discovery

Liang Yan · Xu Jiang · Jian Ma · Yuhang Liu · Tian Bian · Qichao Wang · Abhishek Basu · Yu Rong · Tingyang Xu · Pengcheng Wu · Le Song · Imran Razzak · Junchi Yan · zengfeng Huang · Yutong Xie

→ UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

Jason Nguyen · Alexander Chang · Ameet Rao · Ishaan Kumar · Erin Tan

→ MedVCTP: Improving Accuracy and Explainability in Medical Visual Reasoning

Aman Syed · Siwon Ryu · Nayan Saxena · Kevin Zhu

1:40 PM

Keynote: Dr. Yan Wang - NVIDIA

Video

2:10 PM

Panel Discussion: Causal and Temporal Reasoning for Video Understanding

Krzysztof Czarnecki

Video

2:40 PM

Keynote: Dr. Behrad Toghi - General Motors

Video

3:10 PM

Coffee Break

3:40 PM

Oral presentations

3:40 PM

From Scenes to Semantics: PersianCLEVR for Bilingual 3D Visual Reasoning

Kianoosh Vadaei · Melika Shirian · Arshia Hemmat · Mohammad Heydari · Ali Mamanpoosh · Afsaneh Fatemi

Video

3:40 PM

VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning

Jingkun Ma · Runzhe Zhan · Yang Li · Di Sun · Hou Pong (Ken) Chan · Lidia Chao · Derek Wong

Video

3:40 PM

Closed-Task Validation: A More Robust and Efficient Proxy for Guiding VLM Training

Enci Zhang · Zongqiang ZHANG · Jiahao Xie · Ruiqi Lu · Boyan Zhou · Cheng Yang

Video

4:10 PM

Closing remarks and best paper award

Video

4:20 PM

Poster Sessions II