Poster in Affinity Event: Muslims in ML

ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models

Ahmed Masry ⋅ Enamul Hoque

Keywords: Multimodal Document Retrieval, Information Retrieval, Document Understanding, Vision Large Language Models

Abstract

Traditional document retrieval systems for PDFs, charts, and infographics rely heavily on Optical Character Recognition (OCR) pipelines to extract textual content, a process that is both error-prone and resource-intensive. Recent multimodal models such as ColPali have enabled OCR-free retrieval by processing documents directly as images, but their large size (three billion parameters) makes them computationally expensive and impractical for large-scale applications. To address this limitation, we introduce ColFlor, an efficient OCR-free visual document retrieval model with only 174 million parameters. ColFlor achieves performance comparable to ColPali on text-rich English documents, with only a 1.8% decrease in NDCG@5, while encoding images 5.25 times faster and queries 9.8 times faster. This makes OCR-free document retrieval systems more cost-effective for large-scale applications and more accessible to users with limited computational resources.
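
For readers unfamiliar with the NDCG@5 metric cited above, the following is a minimal sketch of how it can be computed for a single query. It is not the authors' evaluation code: the function names are hypothetical, the linear-gain form of DCG is assumed, and the input list is assumed to hold relevance labels for every candidate document in ranked order, so that the ideal ordering can be derived from it.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, assumed)."""
    # Rank 0 is the top result; its discount is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 5) -> float:
    """NDCG@k for one query.

    ranked_relevances: relevance label of each candidate document, in the
    order the retriever ranked them (assumed to cover all candidates).
    """
    # The ideal ranking places the most relevant documents first.
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        # No relevant document exists for this query.
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # One relevant page, retrieved at rank 2 out of 5 candidates.
    print(ndcg_at_k([0, 1, 0, 0, 0], k=5))
```

A benchmark score would typically be the mean of this per-query value over all queries; how the reported 1.8% decrease is aggregated is not stated in the abstract.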
