Un-Distillable LLMs via Entropy-Perturbed Logits
Abstract
Large Language Models (LLMs) are vulnerable to distillation attacks, in which an adversary replicates a proprietary model's knowledge in a smaller student model, enabling intellectual-property theft and weakening security guarantees. We address this challenge by introducing \emph{provably un-distillable LLMs} built on entropy-based obfuscation of output logits. We derive information-theoretic lower bounds on the error floor of any student model trained on the obfuscated outputs, showing that the student's distillation loss grows at least quadratically with the obfuscation strength. Experiments confirm the theory: empirical student losses remain above the derived lower bounds, validating the feasibility of secure, un-distillable architectures. This work establishes the first provable foundations for resisting unauthorized distillation of LLMs.
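As a hedged illustration of the entropy-based logit obfuscation named above (the abstract does not specify the mechanism, so the mixing-with-uniform scheme, the function name \texttt{perturb\_logits}, and the strength parameter \texttt{epsilon} below are assumptions, not the paper's method), a minimal sketch might raise the entropy of the output distribution while preserving the top-ranked token:

\begin{verbatim}
import numpy as np

def perturb_logits(logits: np.ndarray, epsilon: float) -> np.ndarray:
    """Illustrative entropy-raising perturbation (hypothetical sketch).

    Interpolates the softmax distribution toward uniform by a factor
    `epsilon`, which raises output entropy while keeping the argmax
    (and hence greedy decoding) unchanged for epsilon < 1.
    """
    # Numerically stable softmax over the vocabulary dimension.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Mix with the uniform distribution; epsilon is the obfuscation strength.
    uniform = np.full_like(probs, 1.0 / probs.size)
    mixed = (1.0 - epsilon) * probs + epsilon * uniform
    # Return (unnormalized) log-probabilities as the obfuscated logits.
    return np.log(mixed)

# Usage: the perturbation preserves the top token while flattening the tail.
logits = np.array([4.0, 2.0, 1.0, 0.5])
obfuscated = perturb_logits(logits, epsilon=0.3)
assert obfuscated.argmax() == logits.argmax()
\end{verbatim}

In this sketch, \texttt{epsilon} plays the role of the obfuscation strength with which, per the abstract's claim, any student's distillation loss would scale at least quadratically.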