

Poster

UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

Chuanhao Li · Zhen Li · Chenchen Jing · Shuo Liu · Wenqi Shao · Yuwei Wu · Ping Luo · Yu Qiao · Kaipeng Zhang


Abstract: Large vision-language models (LVLMs), such as the LLaVA series, lack up-to-date knowledge because they cannot be updated frequently given the large amount of resources required, and they therefore fail in many cases. For example, an LVLM released in January 2024 would not know the detailed plot of the new movie Dune 2, such as "the dead of an important duel", because the movie had not yet been released when the LVLM was trained. A promising solution is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG), which is already integrated into some closed-source commercial LVLMs such as GPT-4V. However, the specific mechanics underpinning these systems remain a mystery. In this paper, we propose UDKAG, a plug-and-play framework for augmenting existing LVLMs to handle visual question answering (VQA) about up-to-date knowledge. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content from the websites returned by a search engine, and this content is used to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline that automatically generates news-related VQA samples to construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is introduced to label the usefulness of websites and their content for VQA samples in order to construct the training set. For the test set, we perform manual screening to ensure the correctness of test samples. Experimental results demonstrate significant improvements of our framework over LVLMs, outperforming the self-contained IAG-capable GPT-4V by $\sim$25\% in accuracy on the UDK-VQA test set.
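
The two-stage filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the trained hierarchical filtering model is replaced here by a toy word-overlap scorer, and the names `Website`, `overlap_score`, `hierarchical_filter`, and `build_prompt`, as well as the prompt wording, are all hypothetical. The sketch only shows the coarse-to-fine structure: rank whole websites first, then rank content segments from the retained websites, then place the best segments in the prompt.

```python
from typing import Callable, List, TypedDict


class Website(TypedDict):
    """Hypothetical container for one search result."""
    snippet: str          # short summary returned by the search engine
    segments: List[str]   # content segments extracted from the page


def overlap_score(question: str, text: str) -> float:
    """Toy stand-in for a trained filtering model: fraction of question words found in text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def hierarchical_filter(
    question: str,
    websites: List[Website],
    score: Callable[[str, str], float] = overlap_score,
    top_sites: int = 2,
    top_segments: int = 1,
) -> List[str]:
    """Stage 1 keeps the best-scoring websites; stage 2 keeps the best segments within them."""
    kept_sites = sorted(
        websites, key=lambda w: score(question, w["snippet"]), reverse=True
    )[:top_sites]
    candidates = [seg for site in kept_sites for seg in site["segments"]]
    return sorted(candidates, key=lambda s: score(question, s), reverse=True)[:top_segments]


def build_prompt(question: str, context: List[str]) -> str:
    """Prepend the retrieved segments to the question (assumed prompt format)."""
    context_block = "\n".join(f"- {seg}" for seg in context)
    return f"Context:\n{context_block}\n\nQuestion: {question}"
```

Scoring short snippets before full page content keeps the expensive segment-level pass restricted to a few websites, which is one plausible reading of why a hierarchical design would be both effective and efficient.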
