Structured Response Diversity with Mutual Information
Devan Shah ⋅ Owen Yang ⋅ Daniel Yang ⋅ Chongyi Zheng ⋅ Benjamin Eysenbach
2025 Poster
in
Workshop: Workshop on Scaling Environments for Agents
in
Workshop: Workshop on Scaling Environments for Agents
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has greatly improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, often by maximizing pass@1 correctness. However, optimizing single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We adapt *Mutual Information Skill Learning* (MISL) to LLMs and develop training-time rewards that induce *structured response diversity*: a discrete latent $z$ selects a reproducible "strategy" that steers the token distribution toward distinct modes. We propose two complementary rewards for Group Relative Policy Optimization (GRPO): a *token-level* mutual information (MI) reward that encourages trajectory specificity to $z$, and a *semantic* MI reward that encourages separation in an embedding space. Experiments on GSM8K with three open-weight models, Llama-3.1-8B, Qwen-2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, with 2,000 training problems show that token-level MISL improves multi-attempt metrics, yielding median gains of $\sim$4\% in pass@k and $\sim$12\% in consensus@k without degrading pass@1. We further outline a theoretical connection that shows that improvement in pass@k is upper-bounded linearly by the mutual information. We discuss practical considerations, the instability of the semantic MI estimator, and open directions.
Chat is not available.
Successful Page Load