Poster in Workshop: Lock-LLM Workshop: Prevent Unauthorized Knowledge Use from Large Language Models - Deep Dive into Un-Distillate, Un-Finetunable, Un-Compressible, Un-Editable, and Un-Usable

The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models

Ann-Kathrin Dombrowski ⋅ Dillon Bowen ⋅ Adam Gleave ⋅ Chris Cundy


Abstract

Open-weight LLMs enable innovation and democratization but introduce systemic risks: bad actors can trivially remove safeguards, creating a "safety gap", the difference in dangerous capabilities between safeguarded and modified models. We open-source a toolkit to measure this gap across state-of-the-art models. Testing Llama-3 and Qwen-2.5 families (0.5B to 405B parameters) on biochemical and cyber capabilities, we find the safety gap widens with model scale, with dangerous capabilities increasing substantially post-modification. The Safety Gap Toolkit provides an evaluation framework for open-source models and motivates tamper-resistant safeguard development.
