Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments
Bryan Lincoln Marques de Oliveira · Felipe Vieira Frujeri · Marcos Paulo Caetano Mendes Queiroz · Luana Guedes Barros Martins · Telma Lima · Luckeciano Carvalho Melo
Abstract
Group Relative Policy Optimization (GRPO) has emerged as a scalable alternative to Proximal Policy Optimization (PPO) by eliminating the learned critic and instead estimating advantages through group-relative comparisons of trajectories. This simplification raises fundamental questions about the necessity of learned baselines in policy-gradient methods. We present the first systematic study of GRPO in classical single-task reinforcement learning environments, spanning discrete and continuous control tasks. Through controlled ablations isolating baselines, discounting, and group sampling, we establish three key findings: (1) learned critics remain essential for long-horizon tasks, with all critic-free baselines underperforming PPO except in short-horizon environments such as CartPole, where episodic returns can be effective; (2) GRPO benefits from high discount factors ($\gamma = 0.99$) except in HalfCheetah, where the lack of early termination favors moderate discounting ($\gamma = 0.9$); (3) smaller group sizes outperform larger ones, suggesting limitations in batch-based grouping strategies that mix unrelated episodes. These results reveal both the limitations of critic-free methods in classical control and the specific conditions under which they remain viable alternatives to learned value functions.
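To make the critic-free baseline concrete, the sketch below illustrates the core idea of group-relative advantage estimation: standardize episodic returns within a sampled group instead of subtracting a learned value function. This is a minimal illustration under assumed conventions (mean-and-standard-deviation normalization over the group); the function and argument names are hypothetical and do not correspond to the paper's implementation.

```python
import numpy as np


def group_relative_advantages(episodic_returns, eps=1e-8):
    """Critic-free advantage estimate via within-group standardization.

    Each trajectory's advantage is its return minus the group mean,
    scaled by the group standard deviation. The group mean plays the
    role otherwise filled by a learned critic/baseline.
    """
    returns = np.asarray(episodic_returns, dtype=np.float64)
    baseline = returns.mean()        # group mean replaces the learned critic
    scale = returns.std() + eps      # normalize by within-group spread
    return (returns - baseline) / scale


# Example: a group of four sampled trajectories with episodic returns
print(group_relative_advantages([12.0, 30.0, 18.0, 24.0]))
```

Under this scheme, the quality of the advantage signal depends on how the group is formed, which is why the abstract's third finding on group size and batch-based grouping matters.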