A Granular Study of Safety Pretraining under Model Abliteration
Abstract
Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as REFUSAL or NON-REFUSAL using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
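To make the abliteration setting concrete, the sketch below illustrates the general idea of projecting a refusal-sensitive direction out of a layer's activations. It is a minimal illustration, not the authors' implementation: the function name, the difference-of-means estimate of the direction, and all tensor shapes are assumptions for exposition.

```python
import torch


def abliterate(hidden_states: torch.Tensor, refusal_direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each hidden state along a refusal-sensitive direction.

    hidden_states: (..., d_model) activations at some transformer layer.
    refusal_direction: (d_model,) direction associated with refusal behavior.
    """
    r = refusal_direction / refusal_direction.norm()   # unit-normalize the direction
    coeff = (hidden_states @ r).unsqueeze(-1)          # component of each state along r
    return hidden_states - coeff * r                   # keep only the orthogonal part


if __name__ == "__main__":
    d_model = 2048
    # Hypothetical difference-of-means estimate of the refusal direction from
    # cached activations on harmful vs. harmless prompts (illustrative only;
    # random tensors stand in for real activations here).
    harmful_acts = torch.randn(64, d_model)
    harmless_acts = torch.randn(64, d_model)
    refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)

    layer_output = torch.randn(4, 16, d_model)         # (batch, seq, d_model)
    edited = abliterate(layer_output, refusal_dir)
    # After the edit, activations have (numerically) no component along the direction.
    print((edited @ (refusal_dir / refusal_dir.norm())).abs().max())
```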