Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Abstract
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs by addressing a key challenge: sparse reward signals and wasted samples when the model fails to discover any correct solution. We introduce IGPO (Inpainting-Guided Policy Optimization), an RL framework that strategically injects partial ground-truth reasoning traces during online sampling, guiding exploration toward promising trajectory spaces while preserving self-generated reasoning. Applied to group-based optimization methods such as GRPO, IGPO restores meaningful gradients when exploration failures yield all-zero advantages. Combined with supervised fine-tuning on synthetically rewritten concise traces and entropy-based filtering, our approach achieves state-of-the-art performance on four mathematical benchmarks across full-attention-based dLLMs.
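To make the core mechanism concrete, below is a minimal sketch of inpainting-guided group sampling under stated assumptions: the function name `igpo_group_sample`, the `sample_fn` and `reward_fn` interfaces, the prefix-based hint schedule, and the default `group_size` and `hint_fractions` are all illustrative placeholders rather than the paper's actual implementation, which specifies its own injection, masking, and filtering details.

```python
from typing import Callable, List, Sequence


def igpo_group_sample(
    prompt: str,
    ground_truth_trace: str,
    sample_fn: Callable[[str, str], str],    # hypothetical dLLM sampler: (prompt, inpainted hint) -> completion
    reward_fn: Callable[[str, str], float],  # hypothetical verifier: (prompt, completion) -> scalar reward
    group_size: int = 8,
    hint_fractions: Sequence[float] = (0.25, 0.5),
) -> List[dict]:
    """Sample one GRPO group; if every rollout receives the same (e.g. zero)
    reward, the group-relative advantages vanish, so resample a few rollouts
    with a partial ground-truth reasoning trace inpainted to restore a signal."""
    group = []
    for _ in range(group_size):
        completion = sample_fn(prompt, "")  # ordinary on-policy rollout, no hint
        group.append({
            "completion": completion,
            "reward": reward_fn(prompt, completion),
            "inpainted": False,
        })

    # Exploration failure: identical rewards across the group -> zero advantages in GRPO.
    if len({g["reward"] for g in group}) == 1:
        for i, frac in enumerate(hint_fractions):
            # Inject a prefix of the ground-truth reasoning trace as an inpainting hint;
            # the model self-generates the remaining tokens around it.
            hint = ground_truth_trace[: int(len(ground_truth_trace) * frac)]
            completion = sample_fn(prompt, hint)
            group[i] = {
                "completion": completion,
                "reward": reward_fn(prompt, completion),
                "inpainted": True,
            }
    return group
```

The guiding intuition: only when the group would otherwise yield no gradient are some rollouts replaced by partially inpainted ones, so self-generated reasoning is preserved whenever ordinary exploration already succeeds.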