GUARD: Guiding Unbiased Alignment through Reward Debiasing

Advay Samnerkar ⋅ Sagnik Bhattacharya ⋅ Kailash Ranganathan ⋅ Kevin Zhu ⋅ Ashwinee Panda

2025 Poster in Workshop: Reliable ML from Unreliable Data


Abstract

Reward misspecification in RLHF threatens the reliability of large language models by amplifying spurious correlations and producing unstable or unsafe behavior. Expert-defined harm categories provide a stable signal for post-training evaluation, but reward models often encode categorical biases that undermine trustworthiness. We address this challenge through an information-theoretic reliability objective: minimizing mutual information between reward scores and sensitive categories. Our approach enforces invariance via adversarial training while integrating curiosity-driven intrinsic rewards into PPO to preserve diversity. Framing debiasing as a minimax game yields reward models that are both robust and verifiably category-independent. Empirically, our Fair-RM achieves near-neutral bias on CrowS-Pairs and StereoSet, reduces post-PPO disparity on HH-RLHF, and scales to 19-category fairness in PKU-SafeRLHF. These results demonstrate improved calibration and stability under distribution shift, establishing our method as a practical reliability control for safety-critical RLHF deployment.
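
The reliability objective above is stated in terms of mutual information between reward scores and sensitive categories. As a rough illustration of what that quantity measures, the following is a minimal sketch of a plug-in (histogram) estimate of I(score; category). It is not the estimator or training procedure used in the paper, which the abstract does not specify; the function name `binned_mutual_information`, the equal-width binning, the bin count, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def binned_mutual_information(scores, categories, n_bins=10):
    """Plug-in estimate of I(score; category) in nats, after binning scores.

    Hypothetical helper for illustration only; not the paper's estimator.
    """
    scores = np.asarray(scores, dtype=float)
    categories = np.asarray(categories)

    # Equal-width bins over the observed score range (an illustrative choice).
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    # Digitizing against the interior edges maps scores to ids 0..n_bins-1.
    bin_ids = np.digitize(scores, edges[1:-1])

    # Map arbitrary category labels to consecutive integer ids.
    _, cat_ids = np.unique(categories, return_inverse=True)

    # Empirical joint distribution over (score bin, category).
    joint = np.zeros((n_bins, cat_ids.max() + 1))
    np.add.at(joint, (bin_ids, cat_ids), 1.0)
    joint /= joint.sum()

    # Marginals and the product distribution they would form if independent.
    p_bin = joint.sum(axis=1, keepdims=True)
    p_cat = joint.sum(axis=0, keepdims=True)
    product = p_bin @ p_cat

    # Sum p(x, y) * log(p(x, y) / (p(x) p(y))) over non-empty cells.
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / product[nz])))


# Synthetic data: one reward signal that ignores the category, and one that
# is shifted by the category (a stand-in for a category-biased reward model).
rng = np.random.default_rng(0)
cats = rng.integers(0, 2, size=5000)
independent_scores = rng.normal(size=5000)
biased_scores = independent_scores + 2.0 * cats

mi_independent = binned_mutual_information(independent_scores, cats)
mi_biased = binned_mutual_information(biased_scores, cats)
```

In this toy setup the category-shifted scores yield a clearly larger estimate than the category-independent ones, which is the gap a debiasing objective of this kind aims to close. A histogram estimate like this is only a diagnostic; it is not differentiable, so it would not serve directly as a training loss.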
