Poster

Efficient Multi-modal Models via Stage-wise Visual Context Compression

Jieneng Chen · Luoxin Ye · Ju He · Zhaoyang Wang · Daniel Khashabi · Alan Yuille

Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

While significant advances have been made in compressing text representations in large language models (LLMs), compressing visual tokens in multi-modal LLMs (MLLMs) has remained largely overlooked. In this work, we present the first study analyzing visual-token redundancy and efficient training in these models. Our initial experiments show that eliminating up to 70% of visual tokens at test time via simple average pooling reduces visual question answering accuracy on the GQA benchmark by only 3%, indicating substantial redundancy in the visual context. To exploit this, we introduce the Visual Context Compressor, which reduces the number of visual tokens during training to improve training efficiency without sacrificing performance. To further avoid the information loss caused by compressing visual tokens while retaining training efficiency, we develop stage-wise MLLM training, which progressively relaxes visual context compression from heavy to light, and finally to none at the end of training, so that no information is lost at test time. Extensive experiments demonstrate that our approach not only improves MLLM performance but also significantly reduces training costs.
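The redundancy probe described above (dropping a fraction of visual tokens via average pooling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the keep-ratio interface, and the fixed-window pooling scheme are assumptions made for the example.

```python
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Average-pool a (num_tokens, dim) sequence of visual tokens along the
    token axis, keeping roughly `keep_ratio` of the original token count.

    Illustrative sketch: windows are chosen by ceiling division and the tail
    window is averaged over its real (unpadded) tokens only.
    """
    n, d = tokens.shape
    keep = max(1, int(round(n * keep_ratio)))
    window = int(np.ceil(n / keep))          # tokens merged per output token
    pad = window * keep - n                  # zero-pad so n divides evenly
    padded = np.concatenate([tokens, np.zeros((pad, d))]) if pad else tokens
    # count real tokens per window so padding does not skew the average
    counts = np.concatenate([np.ones(n), np.zeros(pad)])
    counts = counts.reshape(keep, window).sum(axis=1, keepdims=True)
    pooled = padded.reshape(keep, window, d).sum(axis=1) / counts
    return pooled

# Example: keep ~30% of 576 visual tokens (i.e. eliminate ~70%).
vis = np.random.rand(576, 64)
out = compress_visual_tokens(vis, keep_ratio=0.3)
```

Under the paper's stage-wise schedule, such a compressor would be applied with a heavy ratio early in training and progressively relaxed toward no compression.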
