On the Rollout-Training Mismatch in Modern RL Systems
Abstract
Modern reinforcement learning (RL) systems aim for efficiency by employing hybrid designs that pair a dedicated rollout engine (e.g., vLLM) with a separate training backend (e.g., FSDP). However, this implementation gap can implicitly turn on-policy RL into off-policy RL, as the rollout and training policies can produce significantly different token probabilities despite sharing the same model weights. We investigate this rollout-training mismatch problem and propose truncated importance sampling (TIS) as a simple yet effective fix. TIS applies an importance sampling correction to bridge the distribution gap between rollout and training, enabling stable RL training even with quantized rollouts. We demonstrate TIS's effectiveness across multiple settings, showing that it preserves downstream performance while enabling significant speedups through rollout quantization. Our work provides an algorithmic solution to the systematic mismatch problem in efficient RL training.
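To make the correction concrete, the following is a minimal sketch of a per-token truncated importance weight; the notation (training policy $\pi_\theta$, rollout policy $\pi_{\text{rollout}}$, truncation constant $C$, and advantage estimate $\hat{A}_t$) is illustrative and not necessarily the paper's exact formulation:

$$
w_t = \min\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{rollout}}(a_t \mid s_t)},\; C\right),
\qquad
\nabla_\theta J \approx \mathbb{E}\!\left[\, w_t \, \hat{A}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\right].
$$

Here the ratio reweights tokens sampled from the rollout engine toward the training policy's distribution, while the truncation at $C$ bounds the variance introduced by large importance ratios.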