WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models
Pavan Kalyan Tankala · Piyush Pasi · Sahil Dharod · Azeem Motiwala · Preethi Jyothi · Aditi Chaudhary · Krishna Srinivasan
Abstract
Cross-modal retrieval tasks, such as image-to-text and text-to-image, are crucial for evaluating vision-language models (VLMs). State-of-the-art VLMs like CLIP and BLIP-2 achieve impressive performance on benchmarks such as MSCOCO and Flickr30K. However, due to the high similarity between evaluation datasets (e.g., Flickr30K) and fine-tuning datasets (e.g., MSCOCO), these benchmarks are insufficient for assessing the out-of-distribution (OOD) generalization capabilities of VLMs. We introduce $\textbf{WIKIDO}$ (derived from $\textbf{Wiki}$pedia $\textbf{D}$iversity $\textbf{O}$bservatory), a new benchmark featuring 384K image-text pairs, alongside carefully curated, human-verified in-distribution (ID) and OOD test sets of size 3K each. Our evaluations show that BLIP-2 achieves a zero-shot recall at 1 (R@1) of 66\% on WIKIDO's OOD test set, compared to 81\% on MSCOCO and 95\% on Flickr30K. Fine-tuning on WIKIDO yields modest improvements, further demonstrating the benchmark's utility in testing OOD generalization. Our code and benchmark datasets will be released publicly.
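The abstract reports zero-shot Recall@1 (R@1) for image-to-text retrieval. As a minimal illustrative sketch (not the authors' released evaluation code), R@1 is typically computed by ranking all candidate captions against each image embedding produced by a VLM such as CLIP or BLIP-2 and checking whether the top-ranked caption is the paired one; the function name `recall_at_1` and the toy embeddings below are hypothetical placeholders.

```python
import numpy as np

def recall_at_1(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Image-to-text R@1: fraction of images whose top-ranked caption
    (by cosine similarity) is the paired caption at the same index."""
    # L2-normalize so the dot product equals cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = image_embs @ text_embs.T           # (num_images, num_texts)
    top1 = sims.argmax(axis=1)                # best-matching caption per image
    return float((top1 == np.arange(len(image_embs))).mean())

# Toy usage with random features standing in for model embeddings.
rng = np.random.default_rng(0)
img = rng.standard_normal((3000, 512))
txt = img + 0.1 * rng.standard_normal((3000, 512))  # noisy "paired" captions
print(f"R@1: {recall_at_1(img, txt):.2%}")
```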