55 Results
Poster
|
Wed 11:00 |
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses Xiaosen Zheng · Tianyu Pang · Chao Du · Qian Liu · Jing Jiang · Min Lin |
|
Poster
|
Fri 11:00 |
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models Patrick Chao · Edoardo Debenedetti · Alexander Robey · Maksym Andriushchenko · Francesco Croce · Vikash Sehwag · Edgar Dobriban · Nicolas Flammarion · George J. Pappas · Florian Tramer · Hamed Hassani · Eric Wong |
|
Poster
|
Thu 16:30 |
A StrongREJECT for Empty Jailbreaks Alexandra Souly · Qingyuan Lu · Dillon Bowen · Tu Trinh · Elvis Hsieh · Sana Pandey · Pieter Abbeel · Justin Svegliato · Scott Emmons · Olivia Watkins · Sam Toyer |
|
Poster
|
ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation Yizhuo Ma · Shanmin Pang · Qi Guo · Tianyu Wei · Qing Guo |
|
Poster
|
Thu 11:00 |
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs Zhao Xu · Fan LIU · Hao Liu |
|
Poster
|
Thu 16:30 |
WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs Seungju Han · Kavel Rao · Allyson Ettinger · Liwei Jiang · Bill Yuchen Lin · Nathan Lambert · Yejin Choi · Nouha Dziri |
|
Poster
|
Thu 11:00 |
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs Jingtong Su · Julia Kempe · Karen Ullrich |
|
Poster
|
Thu 16:30 |
Fight Back Against Jailbreaking via Prompt Adversarial Tuning Yichuan Mo · Yuji Wang · Zeming Wei · Yisen Wang |
|
Poster
|
Thu 11:00 |
BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment Jiongxiao Wang · Jiazhao LI · Yiquan Li · Xiangyu Qi · Junjie Hu · Sharon Li · Patrick McDaniel · Muhao Chen · Bo Li · Chaowei Xiao |
|
Poster
|
Fri 16:30 |
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters Haibo Jin · Andy Zhou · Joe Menke · Haohan Wang |
|
Poster
|
Thu 11:00 |
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically Anay Mehrotra · Manolis Zampetakis · Paul Kassianik · Blaine Nelson · Hyrum Anderson · Yaron Singer · Amin Karbasi |
|
Poster
|
Thu 16:30 |
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks Andy Zhou · Bo Li · Haohan Wang |