Structured Response Diversity with Mutual Information
Devan Shah · Owen Yang · Daniel Yang · Chongyi Zheng · Benjamin Eysenbach
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has greatly improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, often by maximizing pass@1 correctness. However, optimizing single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We adapt *Mutual Information Skill Learning* (MISL) to LLMs and develop training-time rewards that induce *structured response diversity*: a discrete latent $z$ selects a reproducible "strategy" that steers the token distribution toward distinct modes. We propose two complementary rewards for Group Relative Policy Optimization (GRPO): a *token-level* mutual information (MI) reward that encourages trajectory specificity to $z$, and a *semantic* MI reward that encourages separation in an embedding space. In experiments on GSM8K (2,000 training problems) with three open-weight models (Llama-3.1-8B, Qwen-2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B), token-level MISL improves multi-attempt metrics, yielding median gains of $\sim$4\% in pass@k and $\sim$12\% in consensus@k without degrading pass@1. We further outline a theoretical connection showing that the improvement in pass@k is upper-bounded linearly by the mutual information. We close by discussing practical considerations, the instability of the semantic MI estimator, and open directions.
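To make the MI-based reward concrete, the sketch below shows one common way such rewards are instantiated in skill-discovery methods like MISL/DIAYN: a variational lower bound $r(\tau, z) = \log q(z \mid \tau) - \log p(z)$, where $q$ is a learned discriminator predicting the latent $z$ from the sampled trajectory. The abstract does not specify the paper's exact estimator, so the function names (`token_level_mi_reward`), the uniform prior over skills, and the toy discriminator logits are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def token_level_mi_reward(discriminator_logits: np.ndarray, z: int, num_skills: int) -> float:
    """Variational lower-bound reward on I(z; trajectory).

    Illustrative sketch only: reward = log q(z | trajectory) - log p(z),
    with a uniform prior p(z) = 1 / num_skills (an assumption here).

    discriminator_logits: unnormalized scores over skills produced by a
        learned classifier q(z | trajectory) applied to the sampled response.
    z: index of the latent skill that conditioned this response.
    """
    # log-softmax over skills gives log q(z | trajectory)
    log_q = discriminator_logits - np.logaddexp.reduce(discriminator_logits)
    log_p_z = -np.log(num_skills)  # uniform prior assumption
    return float(log_q[z] - log_p_z)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_skills = 4
    z = 2  # skill that conditioned the sampled response
    logits = rng.normal(size=num_skills)  # stand-in for discriminator output
    # Positive when the response is more identifiable with z than chance,
    # negative when the discriminator cannot recover z from the response.
    print(token_level_mi_reward(logits, z, num_skills))
```

In a GRPO-style setup, a shaped reward of this form would be added to (or traded off against) the verifiable correctness reward for each sampled response, so that responses which make their conditioning skill $z$ identifiable are reinforced alongside correct ones; the exact weighting is a design choice not specified in the abstract.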