Undistillable Open Language Models with Teacher Scrambling
Abstract
Open-weight security requires that post-release foundation models remain resistant to misuse. Even if a model is made unmodifiable, an attacker may distill it into a new model that they can modify. Prior work has examined preventing distillation of closed-access models. We analyze undistillability under the constraint that an attacker has access to unmodifiable language model weights, and we introduce Teacher Scrambling, a novel method that preserves the original model's task utility while preventing information gain from the logit rank distribution via a logit rank scrambling loss. We show that distilling a student from a scrambled teacher yields worse performance than simply training the student with label smoothing, thereby defeating the purpose of attempted distillation.
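
To make the mechanism concrete, below is a minimal PyTorch sketch of how a teacher objective of this kind might combine a task term with a scrambling term. The specific formulation (pushing the non-target logits toward a uniform distribution by maximizing their entropy, so that their rank order carries no learnable signal) and the name scrambled_teacher_loss are illustrative assumptions, not the paper's exact loss.

import torch
import torch.nn.functional as F

def scrambled_teacher_loss(logits, targets, scramble_weight=1.0):
    # logits:  (batch, vocab) teacher logits
    # targets: (batch,) ground-truth token ids
    # Task term: keeps the teacher accurate on its original objective.
    task_loss = F.cross_entropy(logits, targets)

    # Scrambling term (illustrative proxy): maximize the entropy of the
    # distribution over the non-target logits, driving them toward
    # uniformity so their ranking gives a distilling student no signal.
    vocab = logits.size(-1)
    mask = F.one_hot(targets, vocab).bool()       # (batch, vocab)
    non_target = logits.masked_fill(mask, -1e9)   # exclude target token
    log_p = F.log_softmax(non_target, dim=-1)
    p = log_p.exp()
    entropy = -(p * log_p).sum(dim=-1).mean()
    scramble_loss = -entropy                      # minimize negative entropy

    return task_loss + scramble_weight * scramble_loss

A scrambled teacher trained with such an objective would still place high probability on the correct token, while the remaining probability mass, which is what knowledge distillation typically exploits, would be uninformative.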