Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Jae Sung Park · Jack Hessel · Khyathi Chandu · Paul Pu Liang · Ximing Lu · Peter West · Youngjae Yu · Qiuyuan Huang · Jianfeng Gao · Ali Farhadi · Yejin Choi

Wed Dec 13 03:00 PM -- 05:00 PM (PST) @ Great Hall & Hall B1+B2 #404

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to “point to” and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build a Localized Visual Commonsense model, which allows users to specify (multiple) regions-as-input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image and text pairs are required. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus expanded solely from images can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression.
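The data-generation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Example`, `build_prompt`, and `distill_corpus`, the prompt wording, and the `threshold` parameter are all hypothetical, and the LLM and critic are passed in as plain callables rather than tied to any real API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Example:
    """One candidate training example (hypothetical structure)."""
    image_description: str   # global literal description of the image
    region_description: str  # local literal description of a region
    commonsense: str         # commonsense text sampled from the LLM


def build_prompt(image_description: str, region_description: str) -> str:
    # Hypothetical prompt wording; the abstract does not give the actual prompt.
    return (
        f"Image: {image_description}\n"
        f"Region: {region_description}\n"
        "State commonsense knowledge about the region."
    )


def distill_corpus(
    records: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    critic_score: Callable[[Example], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the source
) -> List[Example]:
    """Sample commonsense text per (image, region) pair and keep
    only the examples the critic scores at or above the threshold."""
    kept: List[Example] = []
    for image_desc, region_desc in records:
        text = generate(build_prompt(image_desc, region_desc))
        example = Example(image_desc, region_desc, text)
        if critic_score(example) >= threshold:
            kept.append(example)
    return kept
```

In this sketch, `generate` stands in for the LLM call and `critic_score` for the separately trained critic; the filtered list plays the role of the localized commonsense corpus used for training.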

Author Information

Jae Sung Park (University of Washington)
Jack Hessel (Samaya.ai)
Khyathi Chandu (Allen Institute for Artificial Intelligence)
Paul Pu Liang (Carnegie Mellon University)
Ximing Lu (Department of Computer Science, University of Washington)
Peter West (University of Washington, Seattle)
Youngjae Yu (Yonsei University)
Qiuyuan Huang (Microsoft Research AI)
Jianfeng Gao (Microsoft Research, Redmond, WA)
Ali Farhadi (University of Washington, Allen Institute for Artificial Intelligence)
Yejin Choi (University of Washington)