Fundamental Limits of Language Model Distillation under Stochastic Output Perturbations
Abstract
The growing accessibility of large language models (LLMs) has made unauthorized model distillation a central threat to intellectual property and secure deployment. Existing defenses against distillation, such as watermarking, noise injection, and randomized rounding, lack rigorous theoretical foundations. In this work, we present the first information-theoretic analysis of LLM distillation resistance under randomized output perturbations. We formalize the adversarial setting in which a student model attempts to approximate a teacher model from a finite number of perturbed queries, and we derive lower bounds on the achievable student error as a function of the perturbation magnitude and the query budget. Our results establish impossibility thresholds: for certain perturbation mechanisms, no polynomial-query adversary can reduce its approximation error below a constant floor. We further characterize the query complexity required to reach a desired approximation accuracy, yielding provable tradeoffs between model utility and distillation resistance.
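To fix ideas, the setting summarized in the abstract can be sketched formally as follows; the notation (a teacher f, a randomized perturbation mechanism M_sigma with magnitude sigma, a query budget n, and an error metric d) is an illustrative assumption for this sketch rather than the paper's own.

% Minimal LaTeX sketch of the perturbed-query distillation setting.
% All symbols below are assumed notation for illustration only.
\[
  y_i \;=\; M_{\sigma}\!\bigl(f(x_i)\bigr), \qquad i = 1, \dots, n,
\]
\[
  \hat{f}_n \;=\; \mathcal{A}\bigl((x_1, y_1), \dots, (x_n, y_n)\bigr),
  \qquad
  \operatorname{err}\bigl(\hat{f}_n\bigr) \;=\; \mathbb{E}\bigl[\, d\bigl(\hat{f}_n, f\bigr) \bigr].
\]
% The lower bounds referenced in the abstract assert that, for suitable
% mechanisms M_sigma, err(\hat{f}_n) stays bounded away from zero for every
% polynomial-query adversary \mathcal{A}; the exact bounds are derived in
% the body of the paper and are not reproduced here.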