Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions
Abstract
Recent work applying reinforcement learning (RL) to large language models (LLMs) has led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena (e.g., spurious rewards, one-shot RL) have been reported for LLMs, exhibiting patterns not typically observed in traditional RL settings. However, the precise conditions under which these observations hold remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model–Task Alignment with the target task, as measured by pass@k. Through systematic experiments across diverse models and tasks, we find that while standard RL training remains robust across settings, many counterintuitive results emerge only when model–task alignment is already strong.
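
The abstract uses pass@k as its measure of model–task alignment but does not spell out how it is computed. The sketch below is a minimal illustration, assuming the standard unbiased estimator commonly attributed to Chen et al. (2021): for each task instance, n completions are sampled, c of them are correct, and pass@k is the probability that a random size-k subset contains at least one correct completion. Function names and the example numbers are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task instance with n samples, c correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n_samples, n_correct) pairs, one per task instance."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical example: 3 problems, 16 samples each, with 4, 0, and 16 correct.
print(mean_pass_at_k([(16, 4), (16, 0), (16, 16)], k=8))
```

Under this reading, a model with high pass@k on the target task before any RL training would count as strongly aligned with that task.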