
Workshop: Synthetic Data Generation with Generative AI

Evaluating VLMs for Property-Specific Annotation of 3D Objects

Rishabh Kabra · Loic Matthey · Alexander Lerchner · Niloy Mitra

Keywords: [ 3d objects ] [ semantic annotation ] [ physical properties ] [ vision language models ]


3D objects, which often lack clean text descriptions, present an opportunity to evaluate pretrained vision language models (VLMs) on a range of annotation tasks, from describing object semantics to inferring physical properties. An accurate response must take into account the full appearance of the object in 3D, the various ways of phrasing the question or prompt, and changes in other factors that affect the response. We present a method to marginalize over arbitrary factors varied across VLM queries; it relies on the VLM’s scores for sampled responses. First, we show that this aggregation method can outperform a language model (e.g., GPT4) used for summarization, for instance by avoiding hallucinations when responses contain contrasting details. Second, we show that aggregated annotations are useful for prompt-chaining: they help improve downstream VLM predictions (e.g., of object material when the object’s type is specified as an auxiliary input in the prompt). Such auxiliary inputs also allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show that VLMs approach the quality of human-verified annotations on both type and material inference on the large-scale Objaverse dataset.
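
The abstract does not spell out the aggregation rule, so the following is only a minimal sketch of one way score-based marginalization could look, not the authors' implementation. It assumes each VLM query (for example, one rendered view paired with one prompt phrasing) returns sampled responses with log-likelihood scores. The function name `aggregate_responses`, the input layout, the within-query softmax normalization, and the uniform averaging over queries are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def aggregate_responses(queries):
    """Rank responses by score mass averaged over queries.

    `queries` is a list with one entry per VLM query (hypothetical layout);
    each entry is a list of (response_text, log_score) pairs.
    Returns a list of (response, aggregated_score), highest score first.
    """
    totals = defaultdict(float)
    n_used = 0
    for samples in queries:
        if not samples:
            continue  # skip queries that produced no samples
        n_used += 1
        # Normalize scores within the query (softmax over log-scores),
        # subtracting the max for numerical stability.
        max_log = max(score for _, score in samples)
        weights = [(resp, math.exp(score - max_log)) for resp, score in samples]
        z = sum(w for _, w in weights)
        for resp, w in weights:
            # Assumption: responses are merged by case-insensitive exact match.
            totals[resp.strip().lower()] += w / z
    if n_used == 0:
        return []
    # Uniform average over queries approximates marginalizing over the
    # factors (view, prompt phrasing, ...) that were varied across them.
    ranked = [(resp, total / n_used) for resp, total in totals.items()]
    return sorted(ranked, key=lambda item: -item[1])


# Two hypothetical queries about the same object.
example = [
    [("mug", math.log(0.6)), ("cup", math.log(0.4))],
    [("Mug", math.log(0.5)), ("bowl", math.log(0.5))],
]
ranking = aggregate_responses(example)
```

Because each query's scores are normalized before averaging, a response that is only moderately likely under every view or prompt can outrank one that is highly likely under a single query, which is the behavior one would want when individual responses contain contrasting details.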
