BiasEdit: Debiasing Stereotyped Language Models via Model Editing
Xin Xu
Abstract
Existing debiasing strategies, such as retraining a model on counterfactual data, representation projection, and prompting, often fail to efficiently eliminate bias or to directly alter the models' biased internal representations. To address these issues, we propose **BiasEdit**, an efficient debiasing technique via model editing. BiasEdit employs editor networks that conduct local edits on a small subset of a language model's parameters, guided by a *debiasing loss* $\mathcal{L}_d = \text{KL}(P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{stereo}}) \| P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{anti}})) + \text{KL}(P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{anti}}) \| P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{stereo}}))$, while a *retention loss* $\mathcal{L}_r = \text{KL}(P_{\theta_{\mathcal{W}}}(x_{\text{mless}}) \| P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{mless}}))$ preserves the model's language modeling abilities during editing, where $\theta_{\mathcal{W}}$ and $\theta_{\tilde{\mathcal{W}}}$ denote the model before and after editing, and $x_{\text{stereo}}$, $x_{\text{anti}}$, and $x_{\text{mless}}$ are the stereotyped, anti-stereotyped, and meaningless variants of a sentence. Experiments on [StereoSet](https://aclanthology.org/2021.acl-long.416/) and [CrowS-Pairs](https://aclanthology.org/2020.emnlp-main.154/) demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to prior debiasing baselines, with little to no impact on the language models' general capabilities.
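As a concrete reading of the two objectives, the minimal PyTorch sketch below computes $\mathcal{L}_d$ and $\mathcal{L}_r$, assuming each $P_\theta(x)$ is available as a log-probability distribution over a shared support (e.g., candidate attribute tokens at the blank position). The function names, tensor shapes, and the weighting coefficient `alpha` are illustrative assumptions, not the paper's implementation.

```python
import torch

def kl_divergence(log_p: torch.Tensor, log_q: torch.Tensor) -> torch.Tensor:
    # KL(P || Q) = sum_i P(i) * (log P(i) - log Q(i)), averaged over the batch.
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()

def biasedit_losses(
    log_p_stereo: torch.Tensor,      # edited model, stereotyped variant
    log_p_anti: torch.Tensor,        # edited model, anti-stereotyped variant
    log_p_mless_base: torch.Tensor,  # frozen pre-edit model, meaningless variant
    log_p_mless_edit: torch.Tensor,  # edited model, meaningless variant
    alpha: float = 1.0,              # hypothetical weight balancing the two terms
) -> torch.Tensor:
    # Debiasing loss L_d: symmetric KL pushing the edited model to assign
    # matching probabilities to stereotyped and anti-stereotyped variants.
    loss_d = kl_divergence(log_p_stereo, log_p_anti) + kl_divergence(log_p_anti, log_p_stereo)
    # Retention loss L_r: keep the edited model's distribution on meaningless
    # sentences close to the pre-edit model's, preserving language modeling.
    loss_r = kl_divergence(log_p_mless_base, log_p_mless_edit)
    return loss_d + alpha * loss_r

# Toy usage with random distributions over a 10-token support.
if __name__ == "__main__":
    logits = torch.randn(4, 3, 10)  # batch of 4, three variants, support of 10
    log_probs = torch.log_softmax(logits, dim=-1)
    base_log_probs = torch.log_softmax(torch.randn(4, 10), dim=-1)
    loss = biasedit_losses(log_probs[:, 0], log_probs[:, 1], base_log_probs, log_probs[:, 2])
    print(loss.item())
```

The gradient of this combined loss would drive the editor networks' updates; only the edited parameters receive gradients, while the pre-edit log-probabilities act as a fixed reference.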