Cross-Lingual Multimodal Retrieval-Augmented Generation for Open Question Answering in Tamil and Yoruba
Abstract
As large language models (LLMs) with retrieval-augmented generation (RAG) gain traction in multimodal knowledge-base question answering (KBQA), concerns about how well they transfer to low-resource languages (LRLs) remain largely unaddressed. We introduce LR-MMQA (Low-Resource Multimodal Question Answering), a benchmark that assesses multilingual multimodal retrieval and reasoning under the compounded challenges of LRLs. Using a state-of-the-art LLM, we translated the hardest questions from WebQA and MultimodalQA, two benchmarks for multi-hop, multimodal, open-domain KBQA, yielding a dataset that stresses cross-evidence aggregation, fine-grained grounding, and multi-hop inference, and that exposes the brittleness of current methods. Human post-editing ensured that the translated questions remained answerable and their answers remained accurate. We also introduce XM-RAG, a state-of-the-art cross-lingual multimodal RAG pipeline optimized for KBQA in LRLs. Our findings reveal significant biases and discrepancies in both retrieval quality and answer accuracy for LRLs. By releasing LR-MMQA and XM-RAG, we pinpoint current failure points and provide resources to measure and address them, moving toward equitable multimodal KBQA.