Workshop: Socially Responsible Language Modelling Research (SoLaR)

Eliciting Language Model Behaviors using Reverse Language Models

Jacob Pfau · Alex Infanger · Abhay Sheshadri · Ayush Panda · Julian Michael · Curtis Huebner


Language models (LMs) are deployed on an increasingly broad set of tasks, yet they still exhibit erratic behaviors on specific inputs, including adversarial attacks and jailbreaks. We evaluate a reverse language model, pretrained on text with inverted token order, as a tool for automatically identifying an LM's natural-language failure modes. Despite the inherent difficulty of reverse prediction, we find that reverse LMs efficiently identify natural-language prompts that produce specified outputs, outperforming gradient-based techniques. These results suggest that reverse LMs would be effective tools for finding natural-language prompts on which LMs produce incorrect or toxic responses.
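The core mechanic can be sketched as follows. This is an illustrative toy, not the authors' implementation: `StubReverseLM` is a hypothetical placeholder standing in for a real reverse LM pretrained on token-reversed text, and the token lists are whitespace tokens for readability.

```python
# Minimal sketch of prompt elicitation with a reverse language model.
# A reverse LM reads text right-to-left, so to recover a prompt for a
# desired output we feed it the output tokens in inverted order; its
# continuation is the candidate prompt, also in inverted order.

def reverse_elicit(target_tokens, reverse_lm, max_prompt_len=8):
    """Sample a candidate prompt that could precede `target_tokens`."""
    context = list(reversed(target_tokens))          # target, right-to-left
    reversed_prompt = reverse_lm.generate(context, max_prompt_len)
    return list(reversed(reversed_prompt))           # un-invert -> natural order


class StubReverseLM:
    """Hypothetical stand-in for a reverse-pretrained LM: always emits
    a fixed reversed-token continuation instead of sampling."""
    def __init__(self, reversed_prompt):
        self._out = reversed_prompt

    def generate(self, context, max_len):
        return self._out[:max_len]


if __name__ == "__main__":
    target = ["the", "capital", "of", "France", "is", "Paris"]
    lm = StubReverseLM(["?", "France", "of", "capital", "the", "is", "What"])
    print(" ".join(reverse_elicit(target, lm)))
    # prints "What is the capital of France ?"
```

With a real reverse LM, `generate` would sample from the model's distribution over preceding tokens, so repeated calls yield a diverse pool of candidate prompts that can then be scored by checking whether a forward LM actually produces the target output.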
