Poster in Workshop: MATH-AI: The 5th Workshop on Mathematical Reasoning and AI

CoDaPO: Confidence and Difficulty-Adaptive Policy Optimization for Language Models

Zhanke Zhou · Xiangyu Lu · Chentao Cao · Brando Miranda · Tongliang Liu · Bo Han · Sanmi Koyejo

Abstract

Reinforcement learning (RL) post-training strengthens reasoning in large language models (LLMs), yet the prevailing GRPO algorithm exhibits persistent issues. Using a PRAG lens (Probability, Reward, Advantage, Gradient), we diagnose three mechanisms: (i) probability inflation, where clipping induces one-way confidence drift with weak KL correction, collapsing entropy; (ii) advantage contraction, where group normalization dulls update signals as accuracy rises; and (iii) hierarchical convergence, where easy questions improve quickly while hard ones advance slowly via rare discoveries. We then introduce CoDaPO, a confidence- and difficulty-adaptive policy optimization framework that rescales per-trajectory advantages by confidence (curbing overconfidence and drift) and difficulty (sustaining learning on hard questions). Across seven benchmarks, CoDaPO improves mathematical reasoning performance for small and mid-scale models.
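
As a rough illustration of mechanism (ii), the sketch below standardizes binary rewards within a sampled group, as GRPO-style group normalization does, and reports the mean absolute advantage as group accuracy rises. The abstract does not give the CoDaPO weighting formula, so `rescale_advantages` is a purely hypothetical placeholder: its name, its inputs (`confidences`, `group_accuracy`), and its multiplicative form are assumptions made here only to show where a per-trajectory rescaling would plug in.

```python
import math


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def mean_abs_advantage(num_correct, group_size=8):
    """Average |advantage| for a group with binary (0/1) rewards."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    advantages = group_normalized_advantages(rewards)
    return sum(abs(a) for a in advantages) / group_size


def rescale_advantages(advantages, confidences, group_accuracy):
    """HYPOTHETICAL rescaling, not the formula from the paper.

    Assumes confidences lie in [0, 1] (e.g. a sequence-level probability)
    and treats difficulty as 1 - group accuracy. High-confidence
    trajectories are down-weighted; harder questions are up-weighted.
    """
    difficulty = 1.0 - group_accuracy
    return [
        a * (1.0 - c) * (1.0 + difficulty)
        for a, c in zip(advantages, confidences)
    ]


if __name__ == "__main__":
    # The average update signal shrinks as more of the group is correct.
    for k in (4, 6, 7, 8):
        print(f"{k}/8 correct -> mean |A| = {mean_abs_advantage(k):.3f}")
```

Under these assumptions, the mean absolute advantage is largest when half the group is correct and falls to zero once every sample in the group is correct, which is one way to read "advantage contraction".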
