Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Statistical Frontiers in LLMs and Foundation Models

Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs

Alexander von Recum · Christoph Schnabl · Gabor Hollbeck · Marvin von Hagen · Silas Alberti · Philip Blinde

Keywords: [ Datasets ] [ Refusals ] [ Black Box Audits ] [ LLMs ]

[ ] [ Project Page ]
Sat 14 Dec noon PST — 12:45 p.m. PST

Abstract:

Refusals – instances where large language models (LLMs) decline or fail to fullyexecute user instructions – are crucial for both AI safety and AI capabilities, mostimportantly the reduction of hallucinations. These behaviors are learned duringpost-training, especially in instruction fine-tuning (IFT) and reinforcement learningfrom human feedback (RLHF). However, existing taxonomies and evaluationdatasets for refusals are inadequate, often focusing solely on should-not-related(instead of cannot-related) categories, and lacking tools for auditing refusal contentin black-box LLM outputs.We present a comprehensive framework for classifying LLM refusals: (a) a taxon-omy of 16 refusal categories, (b) a human-annotated dataset of over 8,400 instancesfrom publicly available IFT and RLHF datasets, (c) a synthetic dataset with 5,000examples for each refusal category, and (d) a GPT-4o-mini-based refusal classifierfine-tuned on on both human-annotated and synthetic data.Our work enables precise auditing of refusal behaviors in black-box LLMs andautomatic analyses of refusal patterns in large IFT and RLHF datasets. Thisfacilitates the strategic adjustment of LLM refusals, contributing to the developmentof more safe and reliable LLMs.

Chat is not available.