Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
Abstract
Navigating everyday social situations requires juggling conflicting goals, such as conveying harsh truths while maintaining trust and being mindful of others' feelings. In cognitive science, so-called "cognitive models" provide formal accounts of these trade-offs in humans by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. In this work, we use an empirically validated cognitive model of polite speech production in humans to interpret the extent to which LLMs represent human-like trade-offs between being informative, being kind, and saving face. We apply this lens to systematically evaluate value trade-offs in two complementary model settings: degrees of reasoning "effort" in frontier black-box models, and RL post-training dynamics of open-source models. Our results reveal that reasoning-optimized frontier models prioritize informational over social utility compared to standard models, even in our natural language domain. Post-training alignment dynamics show that the largest utility shifts occur within the first 25% of training, with persistent effects of base model choice outweighing those of the feedback dataset or alignment method. We show that this tool provides a valuable mechanism for guiding model development: it enables the formulation of fine-grained hypotheses about high-level behavioral concepts, clarifies how much training is needed to achieve desired model values, and informs recipes for higher-order reasoning and alignment.
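As a rough illustration of the kind of weighted-utility speaker model the abstract refers to (a sketch only; the weight names and exact form here are assumptions, not necessarily the paper's formulation), a polite-speech speaker can be written as trading off informational, social (kindness), and presentational (face-saving) utilities:

$$ U(u; s) \;=\; \phi_{\mathrm{inf}}\, U_{\mathrm{inf}}(u; s) \;+\; \phi_{\mathrm{soc}}\, U_{\mathrm{soc}}(u) \;+\; \phi_{\mathrm{pres}}\, U_{\mathrm{pres}}(u; s) \;-\; C(u), \qquad P_S(u \mid s) \;\propto\; \exp\!\big(\alpha\, U(u; s)\big) $$

where $u$ is an utterance, $s$ the true state, $C(u)$ an utterance cost, $\alpha$ a rationality parameter, and the $\phi$ weights encode how strongly each goal is valued; fitting these weights to a model's outputs is what allows the value trade-offs described above to be quantified and compared across training regimes.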