Workshop
Red Teaming GenAI: What Can We Learn from Adversaries?
Valeriia Cherepanova · Bo Li · Niv Cohen · Yifei Wang · Yisen Wang · Avital Shafran · Nil-Jana Akpinar · James Zou
West Meeting Room 301
Sun 15 Dec, 9 a.m. PST
The development and proliferation of modern generative AI models have introduced valuable capabilities, but these models and their applications also pose risks to human safety. How do we identify risks in new systems before they cause harm in deployment? This workshop focuses on red teaming, an emerging adversarial approach to probing model behaviors, and its application to making modern generative AI safe for humans.
Timezone: America/Los_Angeles
Schedule
Sun 9:00 a.m. - 9:30 a.m. | Coffee Break
Sun 9:30 a.m. - 9:35 a.m. | Opening Remarks
Sun 9:35 a.m. - 10:10 a.m. | Invited Talk 1: Andy Zou and Q&A (Invited Talk) | Andy Zou
Sun 10:10 a.m. - 10:45 a.m. | Invited Talk 2: Danqi Chen on Uncovering Simple Failures in Generative Models and How to Fix Them (Invited Talk) | Danqi Chen
Sun 10:45 a.m. - 10:55 a.m. | Contributed Talk 1: iART - Imitation guided Automated Red Teaming (Oral) | Sajad Mousavi · Desik Rengarajan · Ashwin Ramesh Babu · Vineet Gundecha · Avisek Naug · Sahand Ghorbanpour · Ricardo Luna Gutierrez · Antonio Guillen-Perez · Paolo Faraboschi · Soumyendu Sarkar
Sun 10:55 a.m. - 11:05 a.m. | Contributed Talk 2: Failures to Find Transferable Image Jailbreaks Between Vision-Language Models (Oral) | Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez
Sun 11:05 a.m. - 11:15 a.m. | Contributed Talk 3: LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet (Oral) | Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
Sun 11:20 a.m. - 12:00 p.m. | Panel Discussion (Panel) | Roei Schuster · Yaron Singer · Alex Tamkin · Bo Li
Sun 12:00 p.m. - 1:00 p.m. | Lunch
Sun 12:00 p.m. - 1:50 p.m. | Poster Session
Sun 1:50 p.m. - 2:15 p.m. | Invited Talk 3: Niloofar Mireshghallah on A False Sense of Privacy: Semantic Leakage and Non-literal Copying in LLMs (Invited Talk) | Niloofar Mireshghallah
Sun 2:15 p.m. - 3:00 p.m. | Invited Talk 4: Jonas Geiping on When Do Adversarial Attacks Against Language Models Matter? (Invited Talk) | Jonas Geiping
Sun 3:00 p.m. - 3:30 p.m. | Coffee Break
Sun 3:30 p.m. - 4:15 p.m. | Invited Talk 5: Vitaly Shmatikov and Q&A (Invited Talk) | Vitaly Shmatikov
Sun 4:15 p.m. - 4:30 p.m. | Invited Talk 6: Gowthami Somepalli and Q&A (Invited Talk) | Gowthami Somepalli
Sun 4:30 p.m. - 4:40 p.m. | Contributed Talk 4: Rethinking LLM Memorization through the Lens of Adversarial Compression (Oral) | Avi Schwarzschild · Zhili Feng · Pratyush Maini · Zachary Lipton · J. Zico Kolter
Sun 4:40 p.m. - 4:50 p.m. | Contributed Talk 5: A Realistic Threat Model for Large Language Model Jailbreaks (Oral) | Valentyn Boreiko · Alexander Panfilov · Vaclav Voracek · Matthias Hein · Jonas Geiping
Sun 4:50 p.m. - 5:00 p.m. | Contributed Talk 6: Infecting LLM Agents via Generalizable Adversarial Attack (Oral) | Weichen Yu · Kai Hu · Tianyu Pang · Chao Du · Min Lin · Matt Fredrikson
Sun 5:00 p.m. - 5:20 p.m. | Invited Talk 7: Max Kaufmann on Red-Teaming AI Systems in Government (Invited Talk) | Max Kaufmann
Sun 5:20 p.m. - 5:30 p.m. | Closing Remarks
Posters
- Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI | Ambrish Rawat · Stefan Schoepf · Giulio Zizzo · Giandomenico Cornacchia · Muhammad Zaid Hameed · Kieran Fraser · Erik Miehling · Beat Buesser · Elizabeth Daly · Mark Purcell · Prasanna Sattigeri · Pin-Yu Chen · Kush Varshney
- Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | Alex Beutel · Kai Xiao · Johannes Heidecke · Lilian Weng
- MedAIScout: Automated Retrieval of Known Machine Learning Vulnerabilities in Medical Applications | Athish Pranav Dharmalingam · Gargi Mitra
- Infecting LLM Agents via Generalizable Adversarial Attack | Weichen Yu · Kai Hu · Tianyu Pang · Chao Du · Min Lin · Matt Fredrikson
- Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | Jonathan Noether · Adish Singla · Goran Radanovic
- Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System | Julian Collado · Kevin Stangl
- Decoding Biases: An Analysis of Automated Methods and Metrics for Gender Bias Detection in Language Models | Shachi H. Kumar · Saurav Sahay · Sahisnu Mazumder · Eda Okur · Ramesh Manuvinakurike · Nicole Beckage · Hsuan Su · Hung-yi Lee · Lama Nachman
- Interactive Semantic Interventions for VLMs: Breaking VLMs with Human Ingenuity | Lukas Klein · Kenza Amara · Carsten Lüth · Hendrik Strobelt · Mennatallah El-Assady · Paul Jaeger
- Semantic Membership Inference Attack against Large Language Models | Hamid Mozaffari · Virendra Marathe
- Rethinking LLM Memorization through the Lens of Adversarial Compression | Avi Schwarzschild · Zhili Feng · Pratyush Maini · Zachary Lipton · J. Zico Kolter
- Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning | Seanie Lee · Minsu Kim · Lynn Cherif · David Dobre · Juho Lee · Sung Ju Hwang · Kenji Kawaguchi · Gauthier Gidel · Yoshua Bengio · Nikolay Malkin · Moksh Jain
- Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features | Kaivalya Hariharan · Uzay Girit
- Large Language Model Detoxification: Data and Metric Solutions | SungJoo Byun · Hyopil Shin
- SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Anurakt Kumar · Divyanshu Kumar · Jatan Loya · Nitin Aravind Birur · Tanay Baswa · Sahil Agarwal · Prashanth Harshangi
- An Adversarial Perspective on Machine Unlearning for AI Safety | Jakub Łucki · Boyi Wei · Yangsibo Huang · Peter Henderson · Florian Tramer · Javier Rando
- Stability Evaluation of Large Language Models via Distributional Perturbation Analysis | Jiashuo Liu · Jiajin Li · Peng Cui · Jose Blanchet
- Lessons From Red Teaming 100 Generative AI Products | Blake Bullwinkel · Amanda Minnich · Shiven Chawla · Gary Lopez Munoz · Martin Pouliot · Whitney Maxwell · Joris de Gruyter · Katherine Pratt · Saphir Qi · Nina Chikanov · Roman Lutz · Raja Sekhar Rao Dheekonda · Bolor-Erdene Jagdagdorj · Rich Lundeen · Sam Vaughan · Victoria Westerhoff · Pete Bryan · Ram Shankar Siva Kumar · Yonatan Zunger · Mark Russinovich
- LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
- Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage | Rafi Rashid · Jing Liu · Toshiaki Koike-Akino · Shagufta Mehnaz · Ye Wang
- Steganography in Large Language Models: Investigating Emergence and Mitigations | Yohan Mathew · Robert McCarthy · Ollie Matthews · Joan Velja · Nandi Schoots · Dylan Cope
- A Realistic Threat Model for Large Language Model Jailbreaks | Valentyn Boreiko · Alexander Panfilov · Vaclav Voracek · Matthias Hein · Jonas Geiping
- Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries | Julius Broomfield · George Ingebretsen · Reihaneh Iranmanesh · Sara Pieri · Ethan Kosak-Hine · Tom Gibbs · Reihaneh Rabbany · Kellin Pelrine
- Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez
- SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization | Hanxi Guo · Siyuan Cheng · Guanhong Tao · Guangyu Shen · Zhuo Zhang · Shengwei An · Kaiyuan Zhang · Xiangyu Zhang
- Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs | Aly Kassem · Omar Mahmoud · Niloofar Mireshghallah · Hyunwoo Kim · Yulia Tsvetkov · Yejin Choi · Sherif Saad · Santu Rana
- TOFU: A Task of Fictitious Unlearning for LLMs | Pratyush Maini · Zhili Feng · Avi Schwarzschild · Zachary Lipton · J. Zico Kolter
- iART - Imitation guided Automated Red Teaming | Sajad Mousavi · Desik Rengarajan · Ashwin Ramesh Babu · Vineet Gundecha · Avisek Naug · Sahand Ghorbanpour · Ricardo Luna Gutierrez · Antonio Guillen-Perez · Paolo Faraboschi · Soumyendu Sarkar
- Does Refusal Training in LLMs Generalize to the Past Tense? | Maksym Andriushchenko · Nicolas Flammarion
- Plentiful Jailbreaks with String Compositions | Brian Huang
- Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Hongfu Liu · Yuxi Xie · Ye Wang · Michael Qizhe Shieh
- CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation | Tong Chen · Akari Asai · Niloofar Mireshghallah · Sewon Min · James Grimmelmann · Yejin Choi · Hannaneh Hajishirzi · Luke Zettlemoyer · Pang Wei Koh
- Curiosity-driven Red Teaming for Large Language Models | Zhang-Wei Hong · Idan Shenfeld · Tsun-Hsuan Johnson Wang · Yung-Sung Chuang · Aldo Pareja · Jim Glass · Akash Srivastava · Pulkit Agrawal
- Adversarial Negotiation Dynamics in Generative Language Models | Arinbjörn Kolbeinsson · Benedikt Kolbeinsson
- LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" | Som Sagar · Aditya Taparia · Ransalu Senanayake
- Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Haneul Yoo · Yongjin Yang · Hwaran Lee
- What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks | Nathalie Kirch · Severin Field · Stephen Casper
- Algorithmic Oversight for Deceptive Reasoning | Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
- A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | Aviral Srivastava · Sourav Panda