Keynote
Systematizing the Unusual: A Taxonomy-Driven Dataset for Vision–Language Model Reasoning About Edge Cases in Traffic
Abstract
One of the central challenges in developing robust vision–language models (VLMs) for real-world autonomy is enabling them to recognize, interpret, and reason about rare and hazardous situations—so-called edge cases. Unlike routine traffic patterns, which are well represented in large-scale datasets, these scenarios occur infrequently, are highly diverse, and often involve subtle contextual cues that challenge both object detection and semantic understanding.
EdgeScenes is a new dataset under development at the WISE Lab that directly targets this limitation. It systematically captures and organizes rare road situations using a structured, ontology-driven taxonomy of hazardous conditions spanning infrastructure anomalies, abnormal road-user behavior, foreign objects, environmental extremes, and complex interactions. The dataset is constructed from crowdsourced video footage and annotated with a rich multimodal schema covering over 300 fine-grained hazardous conditions, along with temporal extents and contextual tags. As such, it establishes a testbed specifically designed for evaluating VLMs on out-of-distribution, safety-critical driving scenarios.
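To make the annotation schema concrete, the minimal sketch below shows what one taxonomy-driven annotation record might look like. All field and label names (HazardAnnotation, taxonomy_path, context_tags, and the example labels) are illustrative assumptions for exposition, not the actual EdgeScenes schema.

# Illustrative sketch only: the field names below are assumptions meant to
# convey the flavor of taxonomy-driven, temporally grounded annotation,
# not the real EdgeScenes schema.
from dataclasses import dataclass, field

@dataclass
class HazardAnnotation:
    clip_id: str                 # source video clip (crowdsourced footage)
    taxonomy_path: list[str]     # path through the hazard taxonomy, from
                                 # coarse category down to a fine-grained condition
    start_s: float               # temporal extent of the hazard (seconds)
    end_s: float
    context_tags: list[str] = field(default_factory=list)  # e.g. lighting, weather

example = HazardAnnotation(
    clip_id="clip_00042",
    taxonomy_path=["foreign_object", "debris_on_lane"],
    start_s=3.2,
    end_s=7.8,
    context_tags=["highway", "heavy_rain"],
)
print(example.taxonomy_path[-1])  # fine-grained condition label

A path through the taxonomy, rather than a single flat label, is what lets annotators place a rare event at whatever level of specificity the footage supports while keeping it comparable to related hazards.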
This talk will present the motivation, design principles, and annotation methodology behind EdgeScenes, with particular emphasis on how taxonomy-based labeling enables systematic identification of edge cases and coverage gaps. I will discuss key insights gathered during dataset construction, including patterns in real-world hazard emergence and challenges in visual–semantic grounding. Finally, I will report early experimental results with frontier VLMs, showing that while current models can detect a broader range of hazards than closed-vocabulary vision systems, they still exhibit severe limitations, with high false-positive and false-negative rates in recognizing road hazards. These findings underscore an urgent need to improve the reliability of VLMs as a foundation for trustworthy multimodal perception in autonomous systems.
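As a rough illustration of the error metrics mentioned above, the sketch below computes false-positive and false-negative rates from per-clip sets of hazard labels, reading "false positive rate" as the fraction of predicted hazards that are spurious and "false negative rate" as the fraction of ground-truth hazards that are missed. This is an assumed formulation for exposition; the actual EdgeScenes evaluation protocol may differ.

# Minimal sketch of a per-clip evaluation harness; entirely illustrative.
def hazard_error_rates(predictions: dict[str, set[str]],
                       ground_truth: dict[str, set[str]]) -> tuple[float, float]:
    """Return (false-positive rate, false-negative rate) over all clips.

    predictions / ground_truth map clip_id -> set of hazard labels.
    FP rate: fraction of predicted labels absent from the ground truth.
    FN rate: fraction of ground-truth labels the model failed to predict.
    """
    fp = fn = n_pred = n_true = 0
    for clip_id, true_labels in ground_truth.items():
        pred_labels = predictions.get(clip_id, set())
        fp += len(pred_labels - true_labels)   # spurious detections
        fn += len(true_labels - pred_labels)   # missed hazards
        n_pred += len(pred_labels)
        n_true += len(true_labels)
    return (fp / n_pred if n_pred else 0.0,
            fn / n_true if n_true else 0.0)

preds = {"clip_00042": {"debris_on_lane", "pedestrian_jaywalking"}}
truth = {"clip_00042": {"debris_on_lane"}}
print(hazard_error_rates(preds, truth))  # (0.5, 0.0)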