Multimodal Robustness Benchmark for Concept Erasure in Diffusion Models
Abstract
Text-to-image diffusion models can generate harmful or copyrighted content, motivating research on concept erasure. However, existing methods mainly focus on erasing concepts expressed in text prompts, overlooking other input sources such as learnable embeddings from textual inversion and inverted noise from DDIM inversion. To bridge this gap, we introduce a novel multimodal evaluation framework that systematically assesses concept erasure across text, hybrid (textual inversion), and latent (DDIM inversion) spaces, under both white-box and black-box settings. Our analysis shows that existing methods achieve strong suppression in the text space but fail in the hybrid and latent settings. To mitigate these vulnerabilities, we further propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a lightweight, plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. IRECE reduces concept-reproduction rates without retraining while preserving visual quality. To the best of our knowledge, the proposed framework is not only the first to offer a comprehensive and systematic evaluation of concept erasure beyond text prompts, but it also provides a practical enhancement toward more reliable protective AI.