PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming
Abstract
Recent developments in AI safety and Responsible AI research have called for red-teaming methods that can effectively surface potential risks posed by LLMs. Many of these calls emphasize that red-teaming should draw on diverse expertise and identities, including both red-teamers with domain expertise in adversarial testing of AI models and regular AI users who may encounter problematic model behaviors in everyday interactions. Building upon prior work on automatically generating adversarial prompts for red-teaming AI models, we develop and evaluate a novel red-teaming method, PersonaTeaming, that introduces personas into the adversarial prompt generation process. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates persona types adapted to different seed prompts. In addition, we develop a set of new metrics that explicitly measure the "mutation distance" of adversarial prompts, complementing existing diversity measurements. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. Finally, we analyze the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.