Mitigating Self-Preference by Authorship Obfuscation
Taslim Mahbub ⋅ Shi Feng
Abstract
Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite their advantages, LM judges display concerning biases, notably self-preference—preferring their own answers over those from other LMs or humans, even when the alternative is objectively better. Following the self-recognition hypothesis, we apply black-box perturbations to obfuscate authorship in pairwise comparisons, aiming to reduce harmful self-preference. Replacing a few words with synonyms reduces the bias, but eliminating all stylistic cues via paraphrasing can reverse the effect, revealing that self-preference operates on multiple semantic levels. These findings highlight both the promise and the challenge of mitigating bias in LM judges.
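The synonym-replacement perturbation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synonym table, the number of swaps, and the function names are all hypothetical stand-ins for whatever word list and budget the actual method uses.

```python
import random

# Hypothetical hand-crafted synonym table; the paper's actual word
# list and perturbation procedure are not specified in this abstract.
SYNONYMS = {
    "big": ["large", "sizable"],
    "quick": ["fast", "rapid"],
    "answer": ["response", "reply"],
}

def obfuscate(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Swap up to n_swaps words for synonyms to mask stylistic authorship cues."""
    rng = random.Random(seed)
    words = text.split()
    # Indices of words for which a replacement is available.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(obfuscate("the quick model gave a big answer"))
```

Both candidate answers in a pairwise comparison would be perturbed this way before being shown to the judge, so that surface style no longer signals which answer the judge itself wrote.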