Let’s Try Again: Eliciting Multi-Turn Reasoning in Language Models via Simplistic Feedback
Abstract
Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs), which must reflect on their reasoning and revise it in response to feedback. Existing Reinforcement Learning with Verifiable Reward (RLVR) methods train LRMs under a single-turn paradigm. However, we observe that models trained with existing RL paradigms often fail to explore alternative reasoning paths across multiple turns and lack the capacity for self-reflection, producing responses that are repetitive and do not adapt to contextual feedback. We ask: Can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (for example, “Let’s try again”) after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving and can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO preserves single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling models to reflect on prior failures and refine their reasoning accordingly. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn.
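To make the setup concrete, the sketch below shows one way a UFO-style multi-turn rollout could look: after each wrong answer, the model receives only a unary retry message, and the final reward is discounted by the number of turns taken. This is a minimal illustration under stated assumptions, not the paper's implementation; the `generate` and `is_correct` callables, the turn budget, and the penalty constant are all hypothetical placeholders.

```python
# Minimal sketch (not the authors' code) of a multi-turn rollout with unary feedback.
# Assumptions: `generate(messages)` queries the policy model and `is_correct(answer)`
# is a verifiable checker; both are placeholders, as are the reward constants.

UNARY_FEEDBACK = "Let's try again."

def ufo_rollout(question, generate, is_correct, max_turns=5, turn_penalty=0.1):
    """Run up to `max_turns` attempts, appending only unary feedback after failures.

    Returns the dialogue transcript and a scalar reward: 1.0 for a first-turn
    correct answer, reduced by a small penalty for each extra turn used.
    """
    transcript = [{"role": "user", "content": question}]
    for turn in range(max_turns):
        answer = generate(transcript)                  # policy produces a full attempt
        transcript.append({"role": "assistant", "content": answer})
        if is_correct(answer):
            return transcript, max(1.0 - turn_penalty * turn, 0.0)
        # Wrong answer: reveal nothing about the mistake, only ask to retry.
        transcript.append({"role": "user", "content": UNARY_FEEDBACK})
    return transcript, 0.0                             # no correct answer within budget
```

In this sketch, the turn-discounted reward plays the role of the paper's turn-minimizing reward shaping: the model is paid most for answering correctly early, while the bare retry message carries no information about what went wrong, so any improvement must come from the model's own reflection on its prior attempts.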