$\mathbf{T^3}$: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning

Deyu Zou ⋅ Yongqiang Chen ⋅ Jianxiang Wang ⋅ Garry YANG ⋅ Mufei Li ⋅ Qing Da ⋅ Pan Li ⋅ Yu Gong ⋅ James Cheng

2025 Poster in Workshop: Workshop on Multi-Turn Interactions in Large Language Models


Abstract

Active reasoning requires large language models (LLMs) to interact with external sources and gather missing information to solve a problem. Reinforcement learning (RL) with outcome rewards is the \textit{de facto} approach to incentivizing active reasoning in LLMs; however, agents trained this way often lose track of problem states and generate uninformative, repetitive actions. This leads to growing belief deviation, that is, divergence between the oracle belief and the agent’s internal belief state. To mitigate the issue, it is essential to assign rewards properly to intermediate steps, promoting those that are purposeful and informative for solving the problem, while keeping the agent from being trapped by cumulative belief deviation. Because directly tracking the deviation of belief states is intractable, we introduce $\mathbf{T^3}$, which uses proxy signals of excessive belief deviation to assign intermediate rewards or to truncate rollout trajectories directly during training. Across two recent datasets tailored for active reasoning, $\mathbf{T^3}$ improves both the performance and the stability of diverse RL algorithms, achieving gains of up to 30\%. These results highlight belief control as a key principle for training robust LLM-based active reasoners.
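The abstract does not say which proxy signals $\mathbf{T^3}$ uses, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical proxy in which an action that exactly repeats an earlier one counts as uninformative, and a rollout is cut once `patience` such actions occur in a row. The function names, the `patience` and `penalty` parameters, and the choice to withhold the outcome reward from truncated rollouts are all assumptions made for illustration.

```python
from typing import List, Sequence


def truncate_on_repetition(actions: Sequence[str], patience: int = 2) -> int:
    """Return how many leading steps of a rollout to keep.

    Hypothetical proxy for excessive belief deviation: a step is
    "uninformative" if its action was already issued earlier in the
    rollout. After `patience` consecutive uninformative steps the
    rollout is truncated (the triggering step is kept).
    """
    seen = set()
    streak = 0
    for i, action in enumerate(actions):
        key = action.strip().lower()
        if key in seen:
            streak += 1
        else:
            streak = 0
            seen.add(key)
        if streak >= patience:
            return i + 1
    return len(actions)


def step_rewards(
    actions: Sequence[str],
    outcome_reward: float,
    patience: int = 2,
    penalty: float = 0.1,
) -> List[float]:
    """Assign per-step rewards to the kept prefix of a rollout.

    Assumed scheme: repeated actions receive `-penalty`, other steps 0,
    and only rollouts that were not truncated receive the outcome
    reward on their final step.
    """
    keep = truncate_on_repetition(actions, patience)
    seen = set()
    rewards: List[float] = []
    for action in actions[:keep]:
        key = action.strip().lower()
        rewards.append(-penalty if key in seen else 0.0)
        seen.add(key)
    if rewards and keep == len(actions):
        rewards[-1] += outcome_reward
    return rewards
```

For example, under these assumptions the rollout `["ask A", "ask B", "ask A", "ask B", "ask C"]` is cut after its fourth step, and the two repeated queries are the only steps that receive a nonzero (negative) reward.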
