MLPEdit-Bench: Benchmarking Reasoning-Based Layer-wise Poster Editing
Abstract
Reasoning-based image editing requires multimodal models to interpret user intent and then modify images accordingly. While multimodal large language models have been incorporated into image editing systems to advance reasoning capabilities, prior work has emphasized complex textual instructions while neglecting the inherent complexity of images, such as multi-layer poster compositions. To address this, we introduce the Multi-Layer Poster Editing dataset (MLPEdit-Bench), containing 19K curated examples for evaluating visually rich, multi-layer, reasoning-based image editing. We further propose a VQA-based metric to assess models' instruction-following capability. Experiments on five state-of-the-art models show persistent challenges in identifying poster components and preserving non-edited regions. To overcome these issues, we present MLPE-Agent, a framework combining a GRPO-trained vision-language reasoner with an image editor, which significantly outperforms existing open-source methods. We expect MLPEdit-Bench and MLPE-Agent to advance research on visually rich, reasoning-based poster editing.