Liminal Training: Characterizing and Mitigating Subliminal Learning in Large Language Models
Abstract
Subliminal learning, the unintended transmission of behavioral traits such as misalignment or preferences through semantically unrelated fine-tuning data, is a critical and poorly understood phenomenon in Large Language Models (LLMs). We provide a detailed characterization of the dynamics of subliminal learning, focusing on the temporal evolution of trait acquisition during fine-tuning of the Qwen2.5-1.5B-Instruct and Qwen2.5-3B-Instruct models. We find that trait acquisition is a batch-invariant, non-linear spike concentrated sharply within the first 10--20 training steps. We hypothesize that these dynamics are symptoms of the model transitioning into a vulnerable parameter region. We then propose liminal training, which adds an annealed KL regularizer to the fine-tuning loss and provably mitigates subliminal learning, preventing the acquisition of unwanted traits.
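As a sketch of the regularized objective (our notation; the abstract does not specify the anchor distribution or the annealing schedule, so both are assumptions here), liminal training can be written as
\[
\mathcal{L}_{\text{liminal}}(\theta) \;=\; \mathcal{L}_{\text{FT}}(\theta) \;+\; \lambda(t)\, \mathrm{KL}\!\left( \pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{\theta_0}(\cdot \mid x) \right),
\]
where $\mathcal{L}_{\text{FT}}$ is the standard fine-tuning loss, $\pi_{\theta_0}$ denotes a plausible anchor policy (e.g., the pre-fine-tuning model), and $\lambda(t)$ is an annealing schedule over training steps $t$, presumably weighted most heavily during the early steps where the trait-acquisition spike is observed.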