WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models
Pavan Kalyan Tankala · Piyush Pasi · Sahil Dharod · Azeem Motiwala · Preethi Jyothi · Aditi Chaudhary · Krishna Srinivasan
Abstract
Cross-modal retrieval tasks, such as image-to-text and text-to-image, are crucial for evaluating vision-language models (VLMs). State-of-the-art VLMs like CLIP and BLIP-2 achieve impressive performance on benchmarks such as MSCOCO and Flickr30K. However, due to the high similarity between evaluation datasets (e.g., Flickr30K) and fine-tuning datasets (e.g., MSCOCO), these benchmarks are insufficient for assessing the out-of-distribution (OOD) generalization capabilities of VLMs. We introduce $\textbf{WIKIDO}$ (derived from $\textbf{Wiki}$pedia $\textbf{D}$iversity $\textbf{O}$bservatory), a new benchmark featuring 384K image-text pairs, alongside carefully curated, human-verified in-distribution (ID) and OOD test sets of size 3K each. Our evaluations show that BLIP-2 achieves a zero-shot recall at 1 (R@1) of 66\% on WIKIDO's OOD test set, compared to 81\% on MSCOCO and 95\% on Flickr30K. Fine-tuning on WIKIDO yields modest improvements, further demonstrating the benchmark's utility in testing OOD generalization. Our code and benchmark datasets will be released publicly.
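The abstract reports zero-shot Recall@1 (R@1) for image-to-text retrieval. As a minimal illustrative sketch (not the authors' released evaluation code), R@1 is typically computed by ranking all candidate captions against each image embedding produced by a VLM such as CLIP or BLIP-2 and checking whether the top-ranked caption is the paired one; the function name `recall_at_1` and the toy embeddings below are hypothetical placeholders.

```python
import numpy as np

def recall_at_1(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Image-to-text R@1: fraction of images whose top-ranked caption
    (by cosine similarity) is the paired caption at the same index."""
    # L2-normalize so the dot product equals cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = image_embs @ text_embs.T           # (num_images, num_texts)
    top1 = sims.argmax(axis=1)                # best-matching caption per image
    return float((top1 == np.arange(len(image_embs))).mean())

# Toy usage with random features standing in for model embeddings.
rng = np.random.default_rng(0)
img = rng.standard_normal((3000, 512))
txt = img + 0.1 * rng.standard_normal((3000, 512))  # noisy "paired" captions
print(f"R@1: {recall_at_1(img, txt):.2%}")
```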