Narrow RL Induces Broad Behavior Changes in LLMs
Jo Jiao · Austin Kozlowski · James Evans
Abstract
We study whether reinforcement learning (RL) on a narrow objective induces broader behavioral shifts in large language models. We apply RL to maximize the model's payoff in the iterated Prisoner's Dilemma against a cooperative opponent, which drives the model toward defection. We then evaluate the model on out-of-domain social preference tasks: the Dictator Game, Social Value Orientation, and the Narcissistic Admiration and Rivalry Questionnaire. Relative to the pre-RL model, the RL-trained model shows a consistent increase in selfish and individualistic behavior. The results suggest that narrow RL can shift latent social preferences beyond the optimized task.
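The training environment can be sketched as a toy iterated Prisoner's Dilemma. The payoff values below (T=5, R=3, P=1, S=0) and the unconditionally cooperative opponent are illustrative assumptions, not the paper's exact configuration; the sketch only shows why a payoff-maximizing RL objective rewards defection in this setting.

```python
# Toy iterated Prisoner's Dilemma with standard (assumed) payoffs.
PAYOFF = {  # (my_move, opponent_move) -> my payoff
    ("D", "C"): 5,  # temptation: I defect, opponent cooperates
    ("C", "C"): 3,  # reward for mutual cooperation
    ("D", "D"): 1,  # punishment for mutual defection
    ("C", "D"): 0,  # sucker's payoff
}

def play(policy, opponent_policy, rounds=10):
    """Total payoff of `policy` over `rounds` rounds of the IPD."""
    total = 0
    for _ in range(rounds):
        total += PAYOFF[(policy(), opponent_policy())]
    return total

always_cooperate = lambda: "C"
always_defect = lambda: "D"

# Against an unconditional cooperator, defection strictly dominates:
# this is exactly the behavior a payoff-maximizing objective reinforces.
print(play(always_defect, always_cooperate))     # 50
print(play(always_cooperate, always_cooperate))  # 30
```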