55 Results
Workshop
|
Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features Kaivalya Hariharan · Uzay Girit |
||
Workshop
|
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference Anton Xue · Avishree Khare · Rajeev Alur · Surbhi Goel · Eric Wong |
||
Workshop
|
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Aidan Ewart · Abhay Sheshadri · Phillip Guo · Aengus Lynch · Cindy Wu · Vivek Hebbar · Henry Sleight · Asa Cooper Stickland · Ethan Perez · Dylan Hadfield-Menell · Stephen Casper |
||
Workshop
|
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models Hongfu Liu · Yuxi Xie · Ye Wang · Michael Qizhe Shieh |
||
Workshop
|
Sat 10:45 |
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Cristobal Eyzaguirre · Zane Durante · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez |
|
Workshop
|
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez |
||
Workshop
|
Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries Julius Broomfield · George Ingebretsen · Reihaneh Iranmanesh · Sara Pieri · Ethan Kosak-Hine · Tom Gibbs · Reihaneh Rabbany · Kellin Pelrine |
||
Workshop
|
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI Ambrish Rawat · Stefan Schoepf · Giulio Zizzo · Giandomenico Cornacchia · Muhammad Zaid Hameed · Kieran Fraser · Erik Miehling · Beat Buesser · Elizabeth Daly · Mark Purcell · Prasanna Sattigeri · Pin-Yu Chen · Kush Varshney |
||
Workshop
|
Plentiful Jailbreaks with String Compositions Brian Huang |
||
Workshop
|
Does Refusal Training in LLMs Generalize to the Past Tense? Maksym Andriushchenko · Nicolas Flammarion |
||
Workshop
|
Universal Jailbreak Backdoors in Large Language Model Alignment Thomas Baumann |