Poster in Workshop: ML x OR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making

RiskPO: Risk-based Policy Optimization with Verifiable Reward for LLM Post-Training

Tao Ren ⋅ Jinyang Jiang ⋅ Hui Yang ⋅ Wan Tian ⋅ Yijie Peng

Abstract

Reinforcement Learning with Verifiable Reward has become a central paradigm for post-training Large Language Models (LLMs). However, Group Relative Policy Optimization (GRPO), which optimizes a mean-based objective, suffers from limited exploration and limited reasoning gains. We propose Risk-based Policy Optimization (RiskPO), which leverages risk measures from Operations Research to address these issues. In particular, we introduce a Mixed Value-at-Risk objective and adopt a bundle-wise training scheme that groups multiple questions into bundles to provide stable and informative training signals. Numerical results show that RiskPO consistently outperforms GRPO and its variants across multiple mathematical reasoning benchmarks, achieving substantial improvements on both Pass@1 and Pass@k metrics. These results highlight the effectiveness of risk-based optimization in enhancing exploration and expanding the reasoning capabilities of LLMs.
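
The ingredients named in the abstract can be illustrated with a small Python sketch. It is not the authors' RiskPO objective: the abstract does not define the Mixed Value-at-Risk objective, so none is implemented here. The sketch shows only the standard GRPO-style group-relative advantage, a plain empirical Value-at-Risk (a quantile of the sampled scores), and one assumed way of forming bundle scores by summing binary rewards across questions. All function names and the toy reward matrix are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center rewards on the group mean, scale by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def empirical_var(values, alpha):
    """Empirical Value-at-Risk at level alpha: the smallest sample value x such
    that at least a fraction alpha of the samples are <= x."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(values)
    k = math.ceil(alpha * len(ordered)) - 1
    return ordered[k]


def bundle_scores(reward_matrix):
    """Assumed bundling rule: reward_matrix[i][j] is the reward of the j-th
    sampled answer to question i; the j-th bundle score sums over questions."""
    return [sum(column) for column in zip(*reward_matrix)]


# Toy binary (verifiable) rewards: 3 questions, 4 sampled answers each.
rewards = [
    [1, 0, 0, 1],  # question 1
    [0, 0, 1, 1],  # question 2
    [0, 0, 0, 0],  # question 3: every sample fails
]

# A question whose samples all receive the same reward yields zero
# mean-centered advantages, so it contributes no learning signal on its own.
flat = group_relative_advantages(rewards[2])

# Bundling across questions gives scores with more variation.
scores = bundle_scores(rewards)
median_var = empirical_var(scores, 0.5)
```

In this toy example the third question alone gives all-zero advantages, while the bundle scores still vary across samples. That is one way to read the abstract's claim that bundling provides more informative signals, under the summation assumption made here.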
