

Poster

Image2Struct: A Benchmark for Evaluating Vision-Language Models in Extracting Structured Information from Images

Josselin Roberts · Tony Lee · Chi Heem Wong · Michihiro Yasunaga · Yifan Mai · Percy Liang


Abstract:

A good benchmark for vision-language models (VLMs) should 1) be automatic, 2) present realistic tasks, 3) use fresh, real data, and 4) be difficult to game. This paper introduces Image2Struct, a benchmark for evaluating vision-language models on the practical task of extracting structured information from images. In our tasks, VLMs are prompted to generate the underlying structured information (i.e., code) from an input image. The code can be compiled, and the output image is evaluated against the input image to produce a score. This round-trip evaluation allows us to quantitatively evaluate VLMs on complex tasks with multiple correct answers. We create a pipeline that downloads fresh, user-submitted data from active online communities upon execution, evaluates the VLMs shortly thereafter, and produces a leaderboard. We introduce three tasks in the domains of web pages, LaTeX, and music, and two new metrics that allow efficient and automatic comparison between pairs of images. Our initial run on twelve of the most popular VLMs shows that our preferred metric correlates with structural similarity between images. The VLMs produce a range of scores on each subtask, indicating that Image2Struct can differentiate between the performances of the VLMs. The best-performing model's score also varies across domains (e.g., 0.402 on sheet music vs. 0.822 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2structure/v1.0.0/.
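To make the round-trip evaluation concrete, here is a minimal sketch of the idea in Python. The functions `query_vlm` and `render` are hypothetical placeholders for the benchmark's actual model-querying and compilation/rendering steps (e.g., compiling LaTeX or rendering a web page), and structural similarity (SSIM) is used here only as a stand-in for the paper's own image-comparison metrics.

```python
# Sketch of round-trip evaluation: image -> code -> image -> score.
# `query_vlm` and `render` are hypothetical placeholders, not part of
# the released Image2Struct pipeline.
import numpy as np
from skimage.metrics import structural_similarity


def query_vlm(image: np.ndarray, prompt: str) -> str:
    """Hypothetical: ask a vision-language model to reproduce the image as code."""
    raise NotImplementedError


def render(code: str) -> np.ndarray:
    """Hypothetical: compile/render the generated code back into an image."""
    raise NotImplementedError


def round_trip_score(input_image: np.ndarray) -> float:
    """Generate code from the input image, re-render it, and score the match."""
    code = query_vlm(input_image, "Generate code that reproduces this image.")
    output_image = render(code)
    # Compare the rendered output against the original input image
    # (SSIM used here as a stand-in for the benchmark's metrics).
    return structural_similarity(input_image, output_image, channel_axis=-1)
```

Because the score is computed by comparing images rather than matching a single reference answer, tasks with many valid solutions (e.g., different HTML that renders identically) can still be graded automatically.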
