Poster
The Art of Saying No: Contextual Noncompliance in Language Models
Faeze Brahman · Sachin Kumar · Vidhisha Balachandran · Pradeep Dasigi · Valentina Pyatkin · Abhilasha Ravichander · Sarah Wiegreffe · Nouha Dziri · Khyathi Chandu · Jack Hessel · Yulia Tsvetkov · Noah Smith · Yejin Choi · Hannaneh Hajishirzi
Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories, including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test the noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show high compliance rates in certain previously understudied categories, with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter-efficient methods such as low-rank adapters helps strike a good balance between appropriate noncompliance and other capabilities.
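The abstract mentions fine-tuning with low-rank adapters but does not give the training setup. As a rough illustration only, the sketch below shows how LoRA adapters are typically attached to an instruction-tuned model with the Hugging Face peft library before fine-tuning on noncompliance data; the base model name, rank, and target modules are placeholder assumptions, not the authors' settings.

```python
# Illustrative sketch (not the authors' code): attach LoRA adapters to an
# instruction-tuned model so that noncompliance fine-tuning updates only a
# small set of low-rank weights while the base model stays frozen.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Hypothetical LoRA hyperparameters; the paper's actual values may differ.
lora_config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Fine-tune `model` on (request, noncompliant-response) pairs with a standard
# causal-LM loss, e.g. via transformers.Trainer; because the base weights are
# frozen, general capabilities are largely preserved.
```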