Poster

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang ⋅ Qing Yang ⋅ Zhiyuan Zeng ⋅ Liliang Ren ⋅ Liyuan Liu ⋅ Baolin Peng ⋅ Hao Cheng ⋅ Xuehai He ⋅ Kuan Wang ⋅ Jianfeng Gao ⋅ Weizhu Chen ⋅ Shuohang Wang ⋅ Simon Du ⋅ Yelong Shen

2025 Poster


Abstract

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% (8.6% improvement beyond format correction) and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7% (7.0% non-format gain). This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which contains the aforementioned example. Furthermore, RLVR with only two examples slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples. In addition, we identify several interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We further discuss related observations about format correction, label robustness, and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, models, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
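
The ingredients named in the abstract (a verifiable reward, a policy gradient loss, and an entropy term with a coefficient) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes an exact-match answer checker, GRPO-style group normalization of rewards over several sampled responses to the single training example, sequence-level log-probabilities, and no clipping or KL term. The function names and the default `entropy_coef` value are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Binary outcome reward (assumed exact string match after trimming)."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize rewards within one group of samples
    drawn for the same prompt (here, the single training example)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(
    logprobs: List[float],
    advantages: List[float],
    entropies: List[float],
    entropy_coef: float = 0.001,  # hypothetical value, not taken from the paper
) -> float:
    """Simplified loss: policy gradient term minus an entropy bonus.

    logprobs  : log-probability of each sampled response under the policy
    entropies : mean token entropy of each sampled response
    Clipping and KL regularization are deliberately omitted in this sketch.
    """
    pg_term = -sum(lp * a for lp, a in zip(logprobs, advantages)) / len(logprobs)
    entropy_term = sum(entropies) / len(entropies)
    return pg_term - entropy_coef * entropy_term


if __name__ == "__main__":
    # Four sampled answers to one training problem, one of them correct.
    answers = ["12.8", "13", "12", "4"]
    rewards = [verifiable_reward(a, "12.8") for a in answers]
    print(group_advantages(rewards))
    # When every sample is correct, all group-normalized advantages are zero.
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

Under these assumptions, a group in which every sampled response earns the same reward produces zero advantages, so the policy gradient term vanishes for that group and only the entropy term contributes to the loss.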
