Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

Dani Roytburg ⋅ Matthew Nguyen ⋅ Matthew Bozoukov ⋅ Hongyu Fu ⋅ Jou Barzdukas ⋅ Narmeen Oozeer

2025 Poster in Workshop: Reliable ML from Unreliable Data


Abstract

Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from self-preference bias: a tendency to favor their own outputs over those of other models. This bias hampers the trustworthiness of synthetically generated evaluation data, affecting downstream alignment tasks such as preference tuning and model routing. We introduce and evaluate a lightweight activation-based safeguard to mitigate this problem at inference time without costly retraining. We release a curated dataset that disentangles self-preference bias into valid and invalid examples of self-preference, construct steering vectors using two state-of-the-art methods, and compare our intervention against prompting and Direct Preference Optimization. We show that while our safeguard reliably deters self-preference bias in up to 97% of cases, it comes with a key limitation: a countervailing instability when applying the same vectors to legitimate evaluations. These findings underscore the need to develop lightweight tooling for reliable LLM-as-judge data, motivating future directions in robustness. We make our code publicly available at https://anonymous.4open.science/r/steeringselfpreference-EEC6 for reproducibility.
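
To make the idea of an activation-based intervention concrete, here is a minimal sketch of one common way to build and apply a steering vector: take the difference of mean activations between two groups of examples and add a scaled copy of it to a hidden state at inference time. The abstract does not name the two construction methods it uses, so difference-of-means is an assumption made here for illustration and may not match the authors' approach. The function names, the `alpha` scale, and the toy activation values are all hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(biased_acts, unbiased_acts):
    """Difference of means, pointing from 'biased' toward 'unbiased' activations."""
    mu_biased = mean_vector(biased_acts)
    mu_unbiased = mean_vector(unbiased_acts)
    return [u - b for u, b in zip(mu_unbiased, mu_biased)]


def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional activations (hypothetical values, not from the paper).
biased = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
unbiased = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(biased, unbiased)
steered = steer([2.0, 0.0, 2.0], v, alpha=1.0)
```

In a real evaluator the hidden states would come from a chosen transformer layer and the scale `alpha` would need tuning. The instability the abstract reports on legitimate evaluations suggests that applying one fixed vector to every input can shift correct judgments as well as biased ones.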
