Poster
in
Workshop: Statistical Frontiers in LLMs and Foundation Models
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
Alexander von Recum · Christoph Schnabl · Gabor Hollbeck · Marvin von Hagen · Silas Alberti · Philip Blinde
Keywords: [ Datasets ] [ Refusals ] [ Black Box Audits ] [ LLMs ]
Refusals – instances where large language models (LLMs) decline or fail to fullyexecute user instructions – are crucial for both AI safety and AI capabilities, mostimportantly the reduction of hallucinations. These behaviors are learned duringpost-training, especially in instruction fine-tuning (IFT) and reinforcement learningfrom human feedback (RLHF). However, existing taxonomies and evaluationdatasets for refusals are inadequate, often focusing solely on should-not-related(instead of cannot-related) categories, and lacking tools for auditing refusal contentin black-box LLM outputs.We present a comprehensive framework for classifying LLM refusals: (a) a taxon-omy of 16 refusal categories, (b) a human-annotated dataset of over 8,400 instancesfrom publicly available IFT and RLHF datasets, (c) a synthetic dataset with 5,000examples for each refusal category, and (d) a GPT-4o-mini-based refusal classifierfine-tuned on on both human-annotated and synthetic data.Our work enables precise auditing of refusal behaviors in black-box LLMs andautomatic analyses of refusal patterns in large IFT and RLHF datasets. Thisfacilitates the strategic adjustment of LLM refusals, contributing to the developmentof more safe and reliable LLMs.