Search All 2024 Events

55 Results (Page 2 of 5)
Poster
Fri 16:30 Many-shot Jailbreaking
Cem Anil · Esin Durmus · Nina Panickssery · Mrinank Sharma · Joe Benton · Sandipan Kundu · Joshua Batson · Meg Tong · Jesse Mu · Daniel Ford · Francesco Mosconi · Rajashree Agrawal · Rylan Schaeffer · Naomi Bashkansky · Samuel Svenningsen · Mike Lambert · Ansh Radhakrishnan · Carson Denison · Evan Hubinger · Yuntao Bai · Trenton Bricken · Timothy Maxwell · Nicholas Schiefer · James Sully · Alex Tamkin · Tamera Lanham · Karina Nguyen · Tomek Korbak · Jared Kaplan · Deep Ganguli · Samuel Bowman · Ethan Perez · Roger Grosse · David Duvenaud
Poster
Wed 16:30 Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Xiaomeng Hu · Pin-Yu Chen · Tsung-Yi Ho
Poster
Fri 16:30 WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang · Kavel Rao · Seungju Han · Allyson Ettinger · Faeze Brahman · Sachin Kumar · Niloofar Mireshghallah · Ximing Lu · Maarten Sap · Yejin Choi · Nouha Dziri
Poster
Thu 16:30 Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization
Kai Hu · Weichen Yu · Yining Li · Tianjun Yao · Xiang Li · Wenhe Liu · Lijun Yu · Zhiqiang Shen · Kai Chen · Matt Fredrikson
Workshop
Sun 11:05 Contributed Talk 3: LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
Workshop
SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization
Hanxi Guo · Siyuan Cheng · Guanhong Tao · Guangyu Shen · Zhuo Zhang · Shengwei An · Kaiyuan Zhang · Xiangyu Zhang
Workshop
Sun 16:40 Contributed Talk 5: A Realistic Threat Model for Large Language Model Jailbreaks
Valentyn Boreiko · Alexander Panfilov · Vaclav Voracek · Matthias Hein · Jonas Geiping
Workshop
Infecting LLM Agents via Generalizable Adversarial Attack
Weichen Yu · Kai Hu · Tianyu Pang · Chao Du · Min Lin · Matt Fredrikson
Workshop
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
Workshop
TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations
Nathalie Kirch · Konstantin Hebenstreit · Matthias Samwald
Workshop
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Kirch · Severin Field · Stephen Casper
Poster
Thu 16:30 Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou · Bo Li · Haohan Wang