Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Abstract
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs by addressing a key challenge: sparse reward signals and wasted samples when the model fails to discover any correct solution. We introduce IGPO (Inpainting-Guided Policy Optimization), an RL framework that strategically injects partial ground-truth reasoning traces during online sampling, guiding exploration toward promising trajectory spaces while preserving self-generated reasoning. Applied to group-based optimization methods such as GRPO, IGPO restores meaningful gradients when exploration failures yield all-zero advantages. Combined with supervised fine-tuning on synthetically rewritten concise traces and entropy-based filtering, our approach achieves state-of-the-art performance on four mathematical benchmarks across full-attention-based dLLMs.
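To make the core mechanism concrete, below is a minimal sketch of inpainting-guided group sampling under stated assumptions: the function name `igpo_group_sample`, the `sample_fn` and `reward_fn` interfaces, the prefix-based hint schedule, and the default `group_size` and `hint_fractions` are all illustrative placeholders rather than the paper's actual implementation, which specifies its own injection, masking, and filtering details.

```python
from typing import Callable, List, Sequence


def igpo_group_sample(
    prompt: str,
    ground_truth_trace: str,
    sample_fn: Callable[[str, str], str],    # hypothetical dLLM sampler: (prompt, inpainted hint) -> completion
    reward_fn: Callable[[str, str], float],  # hypothetical verifier: (prompt, completion) -> scalar reward
    group_size: int = 8,
    hint_fractions: Sequence[float] = (0.25, 0.5),
) -> List[dict]:
    """Sample one GRPO group; if every rollout receives the same (e.g. zero)
    reward, the group-relative advantages vanish, so resample a few rollouts
    with a partial ground-truth reasoning trace inpainted to restore a signal."""
    group = []
    for _ in range(group_size):
        completion = sample_fn(prompt, "")  # ordinary on-policy rollout, no hint
        group.append({
            "completion": completion,
            "reward": reward_fn(prompt, completion),
            "inpainted": False,
        })

    # Exploration failure: identical rewards across the group -> zero advantages in GRPO.
    if len({g["reward"] for g in group}) == 1:
        for i, frac in enumerate(hint_fractions):
            # Inject a prefix of the ground-truth reasoning trace as an inpainting hint;
            # the model self-generates the remaining tokens around it.
            hint = ground_truth_trace[: int(len(ground_truth_trace) * frac)]
            completion = sample_fn(prompt, hint)
            group[i] = {
                "completion": completion,
                "reward": reward_fn(prompt, completion),
                "inpainted": True,
            }
    return group
```

The guiding intuition: only when the group would otherwise yield no gradient are some rollouts replaced by partially inpainted ones, so self-generated reasoning is preserved whenever ordinary exploration already succeeds.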