The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation
Abstract
Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, such a task requires language understanding to bind a source concept to a target concept in a way that preserves meaning while ensuring visual coherence. Current systems often literalize the input text and lack reliable signals for figurative alignment. We present a self-evaluating framework that guides generation using multiple rewards, combining standard metrics with two new ones: a Metaphor Decomposition Score and a Meaning Alignment metric. Within this reward-based framework, we propose two pipelines: 1) a training-free pipeline that decomposes the input text into source, target, and intended meaning (S-T-M) for synthesis, and 2) a pipeline that uses lightweight reinforcement learning (RL) to optimize the proposed self-evaluation rewards without large-scale retraining. In our evaluation, the training-free pipeline outperforms strong closed baselines (GPT-4o and Imagen) on decomposition, CLIP, and meaning alignment scores, with the RL-based pipeline close behind. A user study shows that participants still prefer GPT-4o overall, yet our training-free pipeline leads open-source methods and edges out Imagen on abstract metaphors. Analyses indicate that S-T-M prompting benefits longer or more abstract metaphors, and that results are sensitive to sampler settings. The framework enables controllable visual metaphor generation with modest computational requirements, and the reward design offers a practical path to aligning creative image synthesis with intended meaning. Code is available at https://github.com/gak97/visual-metaphor.