55 Results
Type | Time | Title | Authors
Poster | Fri 16:30 | Many-shot Jailbreaking | Cem Anil · Esin DURMUS · Nina Panickssery · Mrinank Sharma · Joe Benton · Sandipan Kundu · Joshua Batson · Meg Tong · Jesse Mu · Daniel Ford · Francesco Mosconi · Rajashree Agrawal · Rylan Schaeffer · Naomi Bashkansky · Samuel Svenningsen · Mike Lambert · Ansh Radhakrishnan · Carson Denison · Evan Hubinger · Yuntao Bai · Trenton Bricken · Timothy Maxwell · Nicholas Schiefer · James Sully · Alex Tamkin · Tamera Lanham · Karina Nguyen · Tomek Korbak · Jared Kaplan · Deep Ganguli · Samuel Bowman · Ethan Perez · Roger Grosse · David Duvenaud
Poster | Wed 16:30 | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | Xiaomeng Hu · Pin-Yu Chen · Tsung-Yi Ho
Poster | Fri 16:30 | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Liwei Jiang · Kavel Rao · Seungju Han · Allyson Ettinger · Faeze Brahman · Sachin Kumar · Niloofar Mireshghallah · Ximing Lu · Maarten Sap · Yejin Choi · Nouha Dziri
Poster | Thu 16:30 | Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization | Kai Hu · Weichen Yu · Yining Li · Tianjun Yao · Xiang Li · Wenhe Liu · Lijun Yu · Zhiqiang Shen · Kai Chen · Matt Fredrikson
Workshop | Sun 11:05 | Contributed Talk 3: LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
Workshop | | SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization | Hanxi Guo · Siyuan Cheng · Guanhong Tao · Guangyu Shen · Zhuo Zhang · Shengwei An · Kaiyuan Zhang · Xiangyu Zhang
Workshop | Sun 16:40 | Contributed Talk 5: A Realistic Threat Model for Large Language Model Jailbreaks | Valentyn Boreiko · Alexander Panfilov · Vaclav Voracek · Matthias Hein · Jonas Geiping
Workshop | | Infecting LLM Agents via Generalizable Adversarial Attack | Weichen Yu · Kai Hu · Tianyu Pang · Chao Du · Min Lin · Matt Fredrikson
Workshop | | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
Workshop | | TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations | Nathalie Kirch · Konstantin Hebenstreit · Matthias Samwald
Workshop | | What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks | Nathalie Kirch · Severin Field · Stephen Casper
Poster | Thu 16:30 | Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks | Andy Zhou · Bo Li · Haohan Wang