What Makes a Good Generated Image? Studying Human & LLM Image Preference Alignment
Abstract
Automated evaluation of text-to-image generation remains a challenging problem. Recent works use multimodal LLMs to judge image quality, but offer little insight into how these assessments are made. In this work, we study which attributes of an image--aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style--are important to LLMs and humans when making image quality judgements. We curate a dataset of human and LLM preferences over generated image pairs, finding that LLMs learn much weaker relationships between image quality attributes than humans do. We then study individual attributes by generating synthetic datasets with a high degree of control along each axis. While humans find all of these datasets easy to judge, some attributes, such as anatomy, are much more difficult for multimodal LLMs to assess. Taken together, these findings reveal key differences in how humans and multimodal LLMs perceive images.