Poster in Workshop: NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models

Liminal Training: Characterizing and Mitigating Subliminal Learning in Large Language Models

Atsushi Yanagisawa · Akbarzaib Khan · Thanjeetraaj Kaur Balraj Singh · Yunjong Na · Kevin Zhu · Antonio Mari

Project Page [OpenReview]

Abstract

Subliminal learning, the unintended transmission of behavioral traits such as misalignment or preference through semantically unrelated fine-tuning data, is a critical and poorly understood phenomenon in Large Language Models (LLMs). We provide a detailed characterization of the dynamics of subliminal learning, focusing on the temporal evolution of trait acquisition during fine-tuning of the Qwen2.5-1.5B-Instruct and Qwen2.5-3B-Instruct models. We find that trait acquisition is a batch-invariant, non-linear spike concentrated sharply within the first 10 to 20 training steps. We hypothesize that these dynamics are symptoms of the model transitioning into a vulnerable parameter region. We then propose liminal training, which adds an annealed KL regularizer to the fine-tuning loss and provably mitigates subliminal learning, preventing the acquisition of unwanted traits.
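As an illustration of what an annealed KL regularizer on a fine-tuning loss can look like, here is a minimal sketch. The abstract does not specify the schedule, the KL direction, or the reference distribution, so everything below is an assumption: a linear decay of the weight over the first 20 steps, a KL term from a frozen reference model to the model being fine-tuned, and the hypothetical names `annealed_weight`, `liminal_loss`, `lam0`, and `anneal_steps`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def annealed_weight(step, lam0=1.0, anneal_steps=20):
    """Assumed schedule: linear decay from lam0 at step 0 to 0 at anneal_steps."""
    return lam0 * max(0.0, 1.0 - step / anneal_steps)


def liminal_loss(task_loss, student_logits, ref_logits, step,
                 lam0=1.0, anneal_steps=20):
    """Fine-tuning loss plus an annealed KL penalty toward a frozen reference.

    The KL direction (reference -> student) is an assumption for illustration.
    """
    p_ref = softmax(ref_logits)
    p_student = softmax(student_logits)
    penalty = kl_divergence(p_ref, p_student)
    return task_loss + annealed_weight(step, lam0, anneal_steps) * penalty
```

Under this assumed schedule the penalty is strongest during the early steps, where the abstract reports the trait-acquisition spike, and vanishes afterwards so that later training optimizes the task loss alone.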
