DCVLR: Data Curation for Vision Language Reasoning
Abstract
We propose a new data-centric competition that aims to advance the visual reasoning capabilities of vision-language models (VLMs) through instruction-tuning dataset curation. Participants are provided with a pool of 1 million image-text pairs and tasked with curating a small (1K) or large (10K) instruction-tuning dataset using any method of their choice. Submissions will be evaluated by fine-tuning a fixed VLM (Molmo) on the curated data and measuring performance on VMCBench, a newly released benchmark composed of multiple-choice visual reasoning questions spanning six diverse datasets. The competition provides all necessary resources, including the image-text pool, fine-tuning scripts, evaluation code, baselines generated using GPT-4o and Claude, and 400 USD of GPU compute from Lambda Labs. The evaluation metric is accuracy, and all training and evaluation will be reproduced by the organizers on standardized infrastructure. This challenge reframes data curation as the primary variable for scientific investigation, with implications for adapting foundation models to real-world domains such as education, biomedicine, and scientific reasoning. We aim to foster broad participation across academia and industry, democratizing model adaptation by focusing on data quality rather than computational scale.
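To make the curation task concrete, the sketch below shows a trivial random-selection baseline that draws a 1K subset from a placeholder pool and writes it out as JSON Lines. The record fields, file names, and pool size used here are illustrative assumptions, not the competition's actual data schema or submission format, which are defined by the starter kit.

```python
import json
import random

# Placeholder pool of image-text records; the real competition pool contains
# 1 million examples, and its field names may differ from these assumptions.
pool = [
    {"id": f"sample_{i:07d}", "image": f"images/{i:07d}.jpg",
     "text": f"placeholder caption {i}"}
    for i in range(10_000)
]

def curate_random_subset(pool, k, seed=0):
    """Trivial baseline: sample k examples uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(pool, k)

# Small-track curation (1K examples); the 10K track would simply change k.
subset = curate_random_subset(pool, k=1_000)

# Write the curated instruction-tuning set as JSON Lines (assumed format).
with open("curated_1k.jsonl", "w") as f:
    for example in subset:
        f.write(json.dumps(example) + "\n")
```

Any stronger curation strategy (e.g., quality scoring, diversity-aware selection, or model-based filtering) would replace the random sampler while producing output of the same shape.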