Topological Alignment of Shared Vision-Language Embedding Space
Abstract
Vision-Language Models (VLMs) have shown strong performance on multimodal tasks by aligning image and text representations through contrastive learning. However, their cross-modal capabilities remain predominantly biased toward English due to the scarcity of high-quality multilingual multimodal data. Although recent multilingual extensions of VLMs attempt to bridge this gap through knowledge distillation and continual learning, they focus on instance-level alignment and fail to preserve the global structure of the embedding space. In this paper, we propose ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware training framework that aligns the shared vision-language embedding space using persistent homology. To ensure scalability, we construct sparse graphs from point clouds to approximate topological features. We validate our approach, showing enhanced structural coherence of multilingual representations, higher zero-shot classification accuracy on CIFAR-100, and improved retrieval performance on xFlickr\&CO.
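To make the abstract's central idea concrete, the following is a minimal, non-differentiable sketch of topology-aware alignment: approximate the topology of the image- and text-embedding point clouds with sparse Rips complexes (GUDHI's `sparse` parameter standing in for the paper's sparse-graph construction, which is not detailed here), then compare their dimension-0 persistence diagrams via the bottleneck distance as a structural-alignment score. Function names and parameters are illustrative assumptions, not the actual ToMCLIP training objective.

```python
# Hedged sketch: topological comparison of two embedding point clouds.
# Assumes the `gudhi` package; the sparse Rips approximation stands in for
# the paper's unspecified sparse-graph construction.
import numpy as np
import gudhi


def persistence_diagram(embeddings: np.ndarray, dim: int = 0, sparse_eps: float = 0.5):
    """Dimension-`dim` persistence intervals of a sparse Rips complex."""
    rips = gudhi.RipsComplex(points=embeddings, sparse=sparse_eps)
    simplex_tree = rips.create_simplex_tree(max_dimension=dim + 1)
    simplex_tree.compute_persistence()
    return simplex_tree.persistence_intervals_in_dimension(dim)


def topological_alignment_score(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """Bottleneck distance between the two diagrams (lower = closer topology)."""
    d_img = persistence_diagram(img_emb)
    d_txt = persistence_diagram(txt_emb)
    # Drop the single infinite dimension-0 interval before comparing diagrams.
    d_img = d_img[np.isfinite(d_img[:, 1])]
    d_txt = d_txt[np.isfinite(d_txt[:, 1])]
    return gudhi.bottleneck_distance(d_img, d_txt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(256, 64))                # stand-in image embeddings
    txt = img + 0.05 * rng.normal(size=img.shape)   # nearly aligned text embeddings
    print(topological_alignment_score(img, txt))
```

In a training setting, this score would have to be replaced by a differentiable surrogate of the persistence computation so it can serve as a regularizer alongside the contrastive loss; the sketch above only illustrates the structural comparison itself.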