Rapid progress in text-to-image generation has often been measured by Fréchet Inception Distance (FID), which captures how realistic the generated images are, or by R-Precision, which assesses whether they are well conditioned on the given textual descriptions. However, a systematic study of how well text-to-image synthesis models generalize to novel word compositions is missing. In this work, we focus on assessing how faithful the generated images are to the input texts in the particularly challenging scenario of novel compositions. We present the first systematic study of text-to-image generation on zero-shot compositional splits targeting two scenarios: unseen object-color (e.g. "blue petal") and object-shape (e.g. "long beak") phrases. We create new benchmarks building on the existing CUB and Oxford Flowers datasets. We also propose a new metric, based on the powerful vision-and-language model CLIP, which we leverage to compute R-Precision. This contrasts with the common approach, in which the same retrieval model is used during both training and evaluation, potentially leading to biased behavior. We experiment with several recent text-to-image generation methods. Our automatic and human evaluations confirm that there is indeed a performance gap when models encounter previously unseen phrases, and we show that image correctness, rather than purely perceptual quality, is especially impacted. Finally, our CLIP-R-Precision metric correlates better with human judgments than the commonly used metric.
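The CLIP-R-Precision idea is to score each generated image against its ground-truth caption plus a pool of distractor captions using CLIP embeddings, and to count how often the ground truth ranks first. The following is a minimal Python sketch under stated assumptions: the ViT-B/32 backbone, the distractor sampling, and the function name clip_r_precision are illustrative choices, not the paper's exact protocol.

import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_r_precision(image_paths, gt_captions, distractor_captions):
    """Fraction of generated images whose ground-truth caption is
    ranked first by CLIP among the ground truth plus distractors."""
    hits = 0
    for path, gt, distractors in zip(image_paths, gt_captions, distractor_captions):
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        texts = clip.tokenize([gt] + distractors).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(image)
            txt_feat = model.encode_text(texts)
        # Cosine similarity between the image and each candidate caption
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ txt_feat.T).squeeze(0)
        hits += int(sims.argmax().item() == 0)  # index 0 is the ground truth
    return hits / len(image_paths)

Because CLIP is trained independently of any of the evaluated text-to-image models, ranking with its embeddings avoids the bias that arises when the same retrieval model is reused for both training and evaluation.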
Author Information
Dong Huk Park (UC Berkeley)
Samaneh Azadi (UC Berkeley)
Xihui Liu (The Chinese University of Hong Kong)
Trevor Darrell (UC Berkeley)
Anna Rohrbach (UC Berkeley)
More from the Same Authors
- 2022 Workshop: Workshop on Machine Learning for Creativity and Design
  Tom White · Yingtao Tian · Lia Coleman · Samaneh Azadi
- 2022 Poster: K-LITE: Learning Transferable Visual Models with External Knowledge
  Sheng Shen · Chunyuan Li · Xiaowei Hu · Yujia Xie · Jianwei Yang · Pengchuan Zhang · Zhe Gan · Lijuan Wang · Lu Yuan · Ce Liu · Kurt Keutzer · Trevor Darrell · Anna Rohrbach · Jianfeng Gao
- 2022 Poster: Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
  Elad Ben Avraham · Roei Herzig · Karttikeya Mangalam · Amir Bar · Anna Rohrbach · Leonid Karlinsky · Trevor Darrell · Amir Globerson
- 2022 Poster: Visual Prompting via Image Inpainting
  Amir Bar · Yossi Gandelsman · Trevor Darrell · Amir Globerson · Alexei Efros
- 2021 Workshop: Machine Learning for Creativity and Design
  Tom White · Mattie Tesfaldet · Samaneh Azadi · Daphne Ippolito · Lia Coleman · David Ha
- 2021 Poster: CLIP-It! Language-Guided Video Summarization
  Medhini Narasimhan · Anna Rohrbach · Trevor Darrell
- 2021 Poster: Early Convolutions Help Transformers See Better
  Tete Xiao · Mannat Singh · Eric Mintun · Trevor Darrell · Piotr Dollar · Ross Girshick
- 2021 Poster: Teachable Reinforcement Learning via Advice Distillation
  Olivia Watkins · Abhishek Gupta · Trevor Darrell · Pieter Abbeel · Jacob Andreas
- 2020 Workshop: Machine Learning for Creativity and Design 4.0
  Luba Elliott · Sander Dieleman · Adam Roberts · Tom White · Daphne Ippolito · Holly Grimm · Mattie Tesfaldet · Samaneh Azadi
- 2019 Poster: Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis
  Xihui Liu · Guojun Yin · Jing Shao · Xiaogang Wang · Hongsheng Li
- 2018 Poster: Speaker-Follower Models for Vision-and-Language Navigation
  Daniel Fried · Ronghang Hu · Volkan Cirik · Anna Rohrbach · Jacob Andreas · Louis-Philippe Morency · Taylor Berg-Kirkpatrick · Kate Saenko · Dan Klein · Trevor Darrell