Breaking the Mirror: Examining Self-Preference in LLM Evaluators through Activation-Based Representations
Abstract
Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from self-preference bias: a tendency to favor their own outputs over those of other models. This bias undermines the trustworthiness of synthetically generated evaluation data. To address this, we propose a methodology based on activation steering to modulate the internal representation of self-preference bias. We release a curated dataset that disentangles this bias into valid and invalid examples of self-preference, construct steering vectors using two state-of-the-art methods, and compare our intervention against prompting and Direct Preference Optimization. Although our approach identifies linear representations of self-preference bias---changing outcomes in up to 97% of biased cases---it comes with a key limitation: a countervailing instability when the same vectors are applied to legitimate evaluations. These findings highlight the need to isolate self-preference bias in LLM-as-judge evaluations, motivating future directions in synthetic evaluation data.
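To make the activation-steering idea mentioned above concrete, the sketch below shows one common way such an intervention is implemented: a steering vector is computed as the difference of mean activations between "biased" and "unbiased" examples at a chosen layer, then added (with a sign and scale) to that layer's activations at inference time via a forward hook. This is a minimal illustration on a toy PyTorch stack; the model, layer index, scaling factor `alpha`, and the difference-of-means construction are illustrative assumptions, not the paper's exact methods or data.

```python
# Minimal sketch of activation steering with a difference-of-means vector.
# Toy stand-in model; not the paper's actual judge model or steering method.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, layer_idx = 64, 3

# Hypothetical stand-in for a stack of decoder layers' residual stream.
model = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(6)])

def run(x):
    """Forward pass through the toy stack, returning per-layer activations."""
    acts = []
    for block in model:
        x = block(x)          # residual-stream stand-in (pre-nonlinearity)
        acts.append(x)
        x = torch.tanh(x)
    return x, acts

# 1) Build the steering vector: mean activation on "biased" (self-preferring)
#    examples minus mean activation on "unbiased" examples at one layer.
biased_inputs = torch.randn(32, d_model)    # placeholder for real activations
unbiased_inputs = torch.randn(32, d_model)
_, acts_b = run(biased_inputs)
_, acts_u = run(unbiased_inputs)
steering_vec = acts_b[layer_idx].mean(0) - acts_u[layer_idx].mean(0)

# 2) Apply the vector at inference with a forward hook, scaled by alpha.
#    A negative alpha pushes activations away from the biased direction.
alpha = -4.0
def steer_hook(module, inputs, output):
    return output + alpha * steering_vec

handle = model[layer_idx].register_forward_hook(steer_hook)
steered_out, _ = run(torch.randn(1, d_model))
handle.remove()
print(steered_out.shape)
```

In practice the same hook pattern transfers to a real LLM by registering it on the chosen transformer block and computing the vector from judge-prompt activations; the key design choice is which layer and scale preserve legitimate evaluations while suppressing the bias direction.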