The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
Ann-Kathrin Dombrowski · Dillon Bowen · Adam Gleave · Chris Cundy
Abstract
Open-weight LLMs enable innovation and democratization but introduce systemic risks: bad actors can trivially remove safeguards, creating a "safety gap" (the difference in dangerous capabilities between safeguarded and modified models). We open-source a toolkit to measure this gap across state-of-the-art models. Testing Llama-3 and Qwen-2.5 families (0.5B to 405B parameters) on biochemical and cyber capabilities, we find the safety gap widens with model scale, with dangerous capabilities increasing substantially post-modification. The Safety Gap Toolkit provides an evaluation framework for open-source models and motivates tamper-resistant safeguard development.