

Spotlight Poster

Unveiling Encoder-Free Vision-Language Models

Haiwen Diao · Yufeng Cui · Xiaotong Li · Yueze Wang · Huchuan Lu · Xinlong Wang


Abstract:

Existing large vision-language models (LVLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs) for vision-language tasks. However, vision encoders impose a strong inductive bias on the visual representation, e.g., resolution, aspect ratio, and semantic priors, which can impede the flexibility and efficiency of LVLMs. Training pure LVLMs that accept vision and language inputs seamlessly, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models and present a simple yet effective training recipe for pure LVLMs. Specifically, through thorough experiments we identify two key aspects of training encoder-free LVLMs efficiently: (1) bridging vision-language representations inside one unified decoder from scratch; (2) enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that is efficient in both training and inference. Notably, using only 35M publicly accessible data samples, EVE rivals encoder-based LVLMs of similar capacity across multiple vision-language benchmarks. It significantly outperforms its counterpart Fuyu-8B, whose training procedure and training data are undisclosed. We believe EVE provides a transparent and efficient route for developing pure LVLMs and exploring decoder-only architectures across modalities. Code and pre-trained models will be publicly available.
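
To illustrate what feeding vision and language inputs to a single decoder without a vision encoder might look like, here is a minimal toy sketch. It is not EVE's implementation: the abstract does not describe the patch layer, so the `patchify` and `build_sequence` helpers, the grayscale nested-list image format, and the patch size are all hypothetical. The sketch only shows that raw image patches of any resolution or aspect ratio can be flattened into a variable-length token sequence and placed alongside text tokens.

```python
from typing import List, Tuple, Union

# Hypothetical toy format: an H x W grayscale image as nested lists.
Image = List[List[float]]
Token = Tuple[str, Union[List[float], int]]


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Each tile is flattened to a vector; tiles are returned in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("toy sketch: image sides must be multiples of patch")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def build_sequence(image: Image, patch: int, text_ids: List[int]) -> List[Token]:
    """Place raw patch vectors and text token ids in one sequence.

    In a real model each entry would be projected to the decoder's hidden
    size; that projection is omitted here.
    """
    visual = [("img", vec) for vec in patchify(image, patch)]
    textual = [("txt", tid) for tid in text_ids]
    return visual + textual
```

With this sketch, a 4 x 6 image and a patch size of 2 give six visual tokens, while a 2 x 2 image gives one, so the sequence length follows the input resolution rather than a fixed encoder input size.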
