Poster

CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Dongzhi Jiang · Guanglu Song · Xiaoshi Wu · Renrui Zhang · Dazhong Shen · Zhuofan Zong · Yu Liu · Hongsheng Li


Abstract:

Diffusion models have demonstrated great success in text-to-image generation. However, alleviating the misalignment between text prompts and generated images remains challenging. We break the problem down into two causes: concept ignorance and concept mismapping. To tackle both, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. First, we introduce a novel image-to-text concept activation module to guide the diffusion model in revisiting ignored concepts. Second, we propose an attribute concentration module to correctly map the text conditions of each entity to its corresponding image area. Extensive experimental evaluations across three distinct text-to-image alignment benchmarks demonstrate the superior efficacy of our proposed method, CoMat-SDXL, over the baseline model, SDXL (Podell et al., 2023). We also show that our method enhances general condition utilization capability and generalizes to long and complex prompts despite not being specifically trained on them.
