Poster
Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function
Chenyi Zhuang · Ying Hu · Pan Gao
East Exhibit Hall A-C #1609
Text-to-image diffusion models, Stable Diffusion in particular, have revolutionized the computer vision community and facilitated a wide range of downstream applications. However, producing text-aligned images from complicated prompts remains challenging. Prior studies have pointed to blended text embeddings as the cause of improper binding, but few have provided a detailed explanation. In this work, we rethink the CLIP text encoder and aim to answer how its lack of attribute understanding affects attribute binding during diffusion-based generation. By investigating different types of text embeddings, we observe the attribute bias phenomenon and further point out the context issue of the padding embedding. To tackle the binding problem, we propose Magnet, which applies a binding vector to the object embedding to disentangle different concepts. The positive vector pulls the object toward the target attribute, and the negative vector pushes it away from unrelated attributes. We further introduce neighbor objects to improve estimation accuracy. Extensive experiments demonstrate the effectiveness and efficiency of Magnet. The proposed binding vector enhances binding and shows an anti-prior ability to generate unnatural concepts. Without training or fine-tuning, our textual manipulation improves synthesis quality and text alignment at negligible cost.
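The pull and push described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `apply_binding_vector`, the weights `alpha` and `beta`, and the way each vector is formed (as a plain difference from the object embedding) are all hypothetical, and the abstract does not specify how Magnet actually estimates its vectors or uses neighbor objects.

```python
from typing import List, Sequence


def apply_binding_vector(
    obj: Sequence[float],
    pos: Sequence[float],
    neg: Sequence[float],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> List[float]:
    """Shift an object embedding toward `pos` and away from `neg`.

    Assumptions (not taken from the paper):
      - `obj` is the embedding of the object token,
      - `pos` is an embedding that carries the target attribute,
      - `neg` is an embedding that carries an unrelated attribute,
      - `alpha` and `beta` are hypothetical strength weights.
    """
    if not (len(obj) == len(pos) == len(neg)):
        raise ValueError("all embeddings must have the same dimension")
    shifted = []
    for o, p, n in zip(obj, pos, neg):
        positive_vec = p - o  # direction toward the target attribute
        negative_vec = n - o  # direction toward the unrelated attribute
        shifted.append(o + alpha * positive_vec - beta * negative_vec)
    return shifted


# Toy 2-D example: the object sits at the origin, the target attribute
# lies along the first axis, the unrelated attribute along the second.
result = apply_binding_vector([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In this toy case the object moves along the first axis and away from the second, which is the intended disentangling effect. In a real pipeline the modified embedding would replace the object token's embedding before it is passed to the diffusion model's cross-attention.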