Mitigating Self-Preference by Authorship Obfuscation
Taslim Mahbub ⋅ Shi Feng
Abstract
Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite their advantages, LM judges display concerning biases, notably self-preference—preferring their own answers over those from other LMs or humans, even when the alternative is objectively better. Following the self-recognition hypothesis, we apply black-box perturbations to obfuscate authorship in pairwise comparisons, aiming to reduce harmful self-preference. Replacing a few words with synonyms reduces the bias, but eliminating all stylistic cues via paraphrasing can reverse the effect, revealing that self-preference operates on multiple semantic levels. These findings highlight both the promise and the challenge of mitigating bias in LM judges.
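The synonym-replacement perturbation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synonym table, the number of swaps, and the function names are all hypothetical stand-ins for whatever word list and budget the actual method uses.

```python
import random

# Hypothetical hand-crafted synonym table; the paper's actual word
# list and perturbation procedure are not specified in this abstract.
SYNONYMS = {
    "big": ["large", "sizable"],
    "quick": ["fast", "rapid"],
    "answer": ["response", "reply"],
}

def obfuscate(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Swap up to n_swaps words for synonyms to mask stylistic authorship cues."""
    rng = random.Random(seed)
    words = text.split()
    # Indices of words for which a replacement is available.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(obfuscate("the quick model gave a big answer"))
```

Both candidate answers in a pairwise comparison would be perturbed this way before being shown to the judge, so that surface style no longer signals which answer the judge itself wrote.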