Spotlight in Workshop: UniReps: Unifying Representations in Neural Models

How does fine-tuning affect your model? Mechanistic analysis on procedural tasks

Samyak Jain ⋅ Robert Kirk ⋅ Ekdeep S Lubana ⋅ Robert Dick ⋅ Hidenori Tanaka ⋅ Tim Rocktäschel ⋅ Edward Grefenstette ⋅ David Krueger


Abstract

Fine-tuning large pre-trained models has become the de facto strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing this capability in a few gradient steps. This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task.
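As a rough illustration of the probing idea mentioned in the abstract, the sketch below fits a linear probe on synthetic hidden activations and checks whether a feature is still linearly decodable. This is not the authors' experimental setup: the activations, the `w_capability` direction, and the helper names `fit_linear_probe` and `probe_accuracy` are all hypothetical, and a least-squares probe stands in for whatever probe the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 16, 2000, 1000

# Hypothetical hidden activations standing in for a model's internal layer.
H_train = rng.normal(size=(n_train, d))
H_test = rng.normal(size=(n_test, d))

# Assumption: the "capability" is a binary feature encoded along one direction.
w_capability = rng.normal(size=d)
y_train = np.sign(H_train @ w_capability)
y_test = np.sign(H_test @ w_capability)


def fit_linear_probe(H, y):
    """Fit a least-squares linear probe mapping activations to +/-1 labels."""
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w


def probe_accuracy(w, H, y):
    """Fraction of examples whose predicted sign matches the label."""
    return float(np.mean(np.sign(H @ w) == y))


# Probe trained on the real labels.
probe = fit_linear_probe(H_train, y_train)
acc = probe_accuracy(probe, H_test, y_test)

# Control: a probe trained on shuffled labels should do much worse.
control = fit_linear_probe(H_train, rng.permutation(y_train))
control_acc = probe_accuracy(control, H_test, y_test)

print(f"probe accuracy:   {acc:.3f}")
print(f"control accuracy: {control_acc:.3f}")
```

In a real study one would replace `H_train` and `H_test` with activations extracted from the pretrained and fine-tuned models. If probe accuracy stays high after fine-tuning while task behaviour changes, that is consistent with the capability persisting beneath a wrapper.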
