
Workshop: I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models

Zero-shot capabilities of visual language models with prompt engineering for images of animals

Andrea Tejeda Ocampo · Eric C. Orenstein · Kakani Katija


Visual Language Models have exhibited impressive performance on new tasks in a zero-shot setting. Language queries enable these large models to classify or detect objects even when presented with a novel concept in a shifted domain. We explore the limits of this capability by presenting Grounding DINO with images and concepts from field images of marine and terrestrial animals. By manipulating the language prompts, we found that the embedding space does not necessarily encode Latinate scientific names, but still yields potentially useful localizations due to a strong sense of general objectness. Grounding DINO struggled with objects in a challenging underwater setting, but improved when fed expressive prompts that explicitly described morphology. These experiments suggest that large models still have room to grow in domain-specific use cases, and they illuminate avenues for strengthening such models' understanding of shape to further improve zero-shot performance.
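The prompt manipulation described above can be illustrated with a minimal sketch. Grounding DINO accepts a free-text query of lowercase object phrases separated by periods; the `build_prompt` helper below is hypothetical (not from the paper), and the three example phrasings simply contrast the prompt styles the abstract mentions: a Latinate scientific name, a common name, and an expressive morphological description.

```python
def build_prompt(phrases):
    """Format object phrases into a Grounding DINO-style text query.

    Grounding DINO's documented convention is lowercase phrases
    separated by periods, e.g. "a fish. a jellyfish."
    (build_prompt itself is a hypothetical helper for illustration.)
    """
    return ". ".join(p.strip().lower().rstrip(".") for p in phrases) + "."

# Three prompt styles one might compare in a zero-shot setting:
scientific = build_prompt(["Aurelia aurita"])   # Latinate scientific name
common = build_prompt(["a jellyfish"])          # everyday common name
morphological = build_prompt(                   # explicit morphology
    ["a translucent bell-shaped animal with trailing tentacles"]
)

print(scientific)     # -> "aurelia aurita."
print(common)         # -> "a jellyfish."
print(morphological)  # -> "a translucent bell-shaped animal with trailing tentacles."
```

Each query string would then be passed, together with an image, to a grounded detector; the abstract's finding is that the third, morphology-rich style tends to recover localizations that the scientific-name style misses.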
