

Poster

Empowering Visible-Infrared Person Re-Identification with Large Foundation Models

Zhangyi Hu · Bin Yang · Mang Ye

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Visible-Infrared Person Re-identification (VI-ReID) often underperforms RGB-based ReID due to the significant modality gap, caused primarily by the absence of detailed appearance information in the infrared modality. The development of Large Language Models (LLMs) and Language Vision Models (LVMs) motivates us to investigate a feasible solution for boosting VI-ReID performance with off-the-shelf foundation models. To this end, we propose a novel text-enhanced VI-ReID framework driven by Foundation Models (TVI-FM). The basic idea is to enrich the representation of the infrared modality with automatically generated textual descriptions. Specifically, we incorporate a pretrained LVM to extract textual features and incrementally fine-tune the text encoder to minimize the domain gap between the generated texts and the original visual images. Meanwhile, to enhance the infrared modality with text, we employ an LLM to augment the textual descriptions, leveraging the modality alignment capabilities of LVMs and LVM-generated feature-level filters. This allows the text modality to learn complementary features from the infrared modality, ensuring semantic structural consistency between the fused modality and the visible modality. Furthermore, we introduce modality joint learning to align the features of all modalities, ensuring that textual features maintain a stable semantic representation of the overall pedestrian appearance while learning complementary information. Additionally, a modality ensemble retrieval strategy is proposed that considers each query modality, leveraging their complementary strengths to improve retrieval effectiveness and robustness. Extensive experiments demonstrate that our method significantly improves retrieval performance on three expanded cross-modal re-identification datasets, paving the way for utilizing LLMs in downstream data-demanding tasks. The code will be released.
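The abstract describes enhancing infrared features with LLM-augmented text via feature-level filters, but does not specify the filter's form. Below is a minimal sketch of one plausible reading: a learned sigmoid gate that decides, per feature dimension, how much of the text feature to add to the infrared feature. The module name, gate design, and feature dimension are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class GatedTextFusion(nn.Module):
    """Hypothetical feature-level filter: fuse infrared and text features
    with a sigmoid gate (illustrative sketch, not the paper's module)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Gate produces a per-dimension weight in [0, 1] from both features.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, ir_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # ir_feat, text_feat: (batch, dim) outputs of the image and text encoders
        g = self.gate(torch.cat([ir_feat, text_feat], dim=-1))  # (batch, dim)
        # Infrared feature enriched by the filtered text semantics.
        return ir_feat + g * text_feat
```

An additive gated residual like this keeps the infrared feature intact when the gate closes, which is one simple way to preserve semantic consistency with the visible modality during fusion.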
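The modality ensemble retrieval strategy is likewise only named in the abstract. A minimal sketch of one natural interpretation follows: score the gallery with each available query modality and rank by a weighted sum of the per-modality cosine similarities. The function names, weighting scheme, and dimensions are assumptions for illustration only.

```python
from typing import Dict, Optional

import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalize rows so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def ensemble_retrieve(query_feats: Dict[str, np.ndarray],
                      gallery_feats: np.ndarray,
                      weights: Optional[Dict[str, float]] = None) -> np.ndarray:
    """Hypothetical ensemble retrieval: rank gallery entries by a weighted
    sum of per-modality cosine similarities.

    query_feats: per-modality queries, e.g. {"infrared": (Q, D), "text": (Q, D)}
    gallery_feats: (G, D) visible-gallery features
    Returns a (Q, G) array of gallery indices sorted by descending score.
    """
    gallery = l2_normalize(gallery_feats)
    num_q = next(iter(query_feats.values())).shape[0]
    scores = np.zeros((num_q, gallery.shape[0]))
    for name, feats in query_feats.items():
        w = 1.0 if weights is None else weights.get(name, 1.0)
        scores += w * (l2_normalize(feats) @ gallery.T)  # (Q, G) similarities
    return np.argsort(-scores, axis=1)

# Toy usage with random features (all sizes are arbitrary):
rng = np.random.default_rng(0)
queries = {"infrared": rng.normal(size=(4, 256)), "text": rng.normal(size=(4, 256))}
gallery = rng.normal(size=(10, 256))
print(ensemble_retrieve(queries, gallery)[:, :5])  # top-5 gallery ids per query
```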
