Poster
in
Workshop: Reliable ML from Unreliable Data

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Shuo Yang ⋅ Qihui Zhang ⋅ Yuyang Liu ⋅ Yue Huang ⋅ Xiaojun Jia ⋅ Kun-Peng Ning ⋅ Jia-Yu Yao ⋅ Jigang Wang ⋅ Dai Hailiang ⋅ Yibing Song ⋅ Li Yuan

2025 Poster

Abstract

Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where even small amounts of malicious or benign data can compromise safeguards. In this paper, we build on the concept of the alignment direction, defined as the weight difference between aligned and unaligned models, and observe that perturbations along this direction preserve model safety. In contrast, perturbations orthogonal to the alignment direction are strongly correlated with harmful updates and rapidly degrade safety, which frames the safe region of parameter space as a "narrow safety basin". Based on this insight, we propose AsFT (Anchoring Safety in Fine-Tuning), a data-free method that formulates safety-preserving fine-tuning as a constrained optimization problem. AsFT uses the alignment direction as an anchor and restricts parameter updates to the "narrow safety basin" through a tractable Lagrangian relaxation, thereby suppressing harmful updates while preserving task-relevant adaptation. Extensive experiments across multiple datasets and models demonstrate that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across diverse fine-tuning scenarios.
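
The geometric idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the squared-norm penalty on the orthogonal component, and the weight `lam` are assumptions chosen for illustration, and the vectors are toy data. The sketch computes an alignment direction as the difference between aligned and unaligned weights, splits a fine-tuning update into components parallel and orthogonal to that direction, and adds a penalty on the orthogonal part to a task loss, in the spirit of a Lagrangian relaxation.

```python
import numpy as np


def split_update(delta, d_align):
    """Split a flattened update into parts parallel and orthogonal to d_align."""
    unit = d_align / np.linalg.norm(d_align)
    parallel = np.dot(delta, unit) * unit
    orthogonal = delta - parallel
    return parallel, orthogonal


def penalized_loss(task_loss, delta, d_align, lam=1.0):
    """Task loss plus lam * ||orthogonal component||^2 (hypothetical penalty form)."""
    _, orth = split_update(delta, d_align)
    return task_loss + lam * float(np.dot(orth, orth))


# Toy weights (assumed values, for illustration only).
w_aligned = np.array([1.0, 2.0, 0.0])
w_unaligned = np.array([0.0, 2.0, 0.0])
d_align = w_aligned - w_unaligned  # alignment direction

# A candidate fine-tuning update with a large off-direction component.
delta = np.array([0.5, 0.0, 2.0])
par, orth = split_update(delta, d_align)

# The penalty grows with the part of the update that leaves the direction.
loss = penalized_loss(task_loss=0.3, delta=delta, d_align=d_align, lam=0.1)
```

In this toy setting an update that lies entirely along `d_align` incurs no penalty, while any component orthogonal to it is discouraged in proportion to `lam`. How the actual method defines and weights its constraint is described in the paper, not here.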
