Poster
in
Workshop: Deep Learning for Code in the Agentic Era

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Zhenghai Xue ⋅ Longtao Zheng ⋅ Qian Liu ⋅ Yingru Li ⋅ Xiaosen Zheng ⋅ Zejun MA ⋅ Bo An

Project Page [ OpenReview]

Abstract

Large Language Models (LLMs) can enhance their reasoning by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn settings using Reinforcement Learning (RL) often exhibits training instability and degraded performance. We attribute the instability to harmful negative samples resulting from distributional drift and compounding errors induced by using external tool outputs during multi-turn rollout. To address this issue, we introduce SimpleTIR, a simple method that stabilizes multi-turn TIR training via filtering out trajectories with "void turns", i.e., turns that yield neither a code block nor a final answer. Specifically, we remove those trajectories from the policy update to block harmful gradients, while retaining them in advantage estimation to keep the estimate unbiased. Extensive experiments show that SimpleTIR effectively mitigates gradient norm explosion and stabilizes multi-turn RL training from base models. It achieves state-of-the-art performance on challenging math reasoning benchmarks, including an AIME24 score of 50.5 starting from the Qwen2.5-7B base model. SimpleTIR also promotes more diverse reasoning behaviors such as self-correction and cross-validation, outperforming prior methods trained from stronger instruction-tuned models.

Chat is not available.