Workshop: Socially Responsible Language Modelling Research (SoLaR)

Jailbreaking Language Models at Scale via Persona Modulation

Rusheb Shah · Quentin Feuillade Montixi · Soroush Pour · Arush Tagade · Javier Rando


Despite significant efforts to align large language models to produce harmless responses, their safety mechanisms remain vulnerable to prompts that elicit undesirable behaviour: jailbreaks. In this work, we investigate persona modulation as a black-box jailbreak that steers the target model to take on particular personalities (personas) that are more likely to comply with harmful instructions. We demonstrate a range of societally harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. We show that persona modulation can be automated to exploit this vulnerability at scale: rather than manually crafting a jailbreak prompt for each persona, we use a novel jailbreak prompt that gets a language model to generate jailbreak prompts for arbitrary topics. Persona modulation achieves high attack success rates against GPT-4, and we find that the prompts are fully transferable to other state-of-the-art models such as Claude 2 and Vicuna. Our work expands the known attack surface for misuse and highlights new vulnerabilities in large language models.