CoDaPO: Confidence- and Difficulty-Adaptive Policy Optimization for Language Models
Abstract
Reinforcement learning (RL) post-training strengthens reasoning in large language models (LLMs), yet the prevailing GRPO algorithm exhibits persistent issues. Using a PRAG lens (Probability, Reward, Advantage, Gradient), we diagnose three mechanisms: (i) probability inflation, where clipping induces one-way confidence drift with weak KL correction, collapsing entropy; (ii) advantage contraction, where group normalization dulls update signals as accuracy rises; and (iii) hierarchical convergence, where easy questions improve quickly while hard ones advance slowly via rare discoveries. We then introduce CoDaPO, a confidence- and difficulty-adaptive policy optimization framework that rescales per-trajectory advantages by confidence (curbing overconfidence and drift) and difficulty (sustaining learning on hard questions). Across seven mathematical reasoning benchmarks, CoDaPO improves performance for both small- and mid-scale models.
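The abstract does not give the exact rescaling rule, so the following is only a minimal sketch of the idea: GRPO-style group-normalized advantages are multiplied by a confidence weight and a difficulty weight. The confidence proxy (length-normalized per-token probability), the difficulty proxy (one minus group accuracy), the function name codapo_advantages, and the alpha/beta exponents are all illustrative assumptions, not taken from the paper.

import torch

def codapo_advantages(rewards, logprobs, response_lens, alpha=1.0, beta=1.0):
    """Illustrative confidence- and difficulty-adaptive advantage rescaling.

    rewards:       (G,) scalar rewards for G sampled responses to one prompt
    logprobs:      (G,) summed token log-probabilities of each response
    response_lens: (G,) token counts, used for a length-normalized confidence
    alpha, beta:   hypothetical exponents controlling the two scaling terms
    """
    # Standard GRPO-style group normalization of rewards into advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Confidence proxy: mean per-token probability of each trajectory.
    # Down-weighting trajectories the policy is already confident about is one
    # way to curb overconfidence and one-way probability drift.
    confidence = torch.exp(logprobs / response_lens.clamp(min=1))
    conf_weight = (1.0 - confidence).pow(alpha)

    # Difficulty proxy: one minus group accuracy (assumes 0/1 correctness rewards).
    # Up-weighting hard prompts sustains learning signal where accuracy is low.
    difficulty = 1.0 - rewards.mean()
    diff_weight = (difficulty + 1e-8).pow(beta)

    return adv * conf_weight * diff_weight

In this sketch the difficulty weight is shared by all responses to a prompt, while the confidence weight varies per trajectory, mirroring the abstract's distinction between question-level difficulty and trajectory-level confidence.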