73 Results
Poster
|
Fri 16:30 |
Improving Alignment and Robustness with Circuit Breakers Andy Zou · Long Phan · Justin Wang · Derek Duenas · Maxwell Lin · Maksym Andriushchenko · J. Zico Kolter · Matt Fredrikson · Dan Hendrycks |
|
Poster
|
Thu 16:30 |
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? Richard Ren · Steven Basart · Adam Khoja · Alice Gatti · Long Phan · Xuwang Yin · Mantas Mazeika · Alexander Pan · Gabriel Mukobi · Ryan Kim · Stephen Fitz · Dan Hendrycks |
|
Workshop
|
Position: Addressing Ethical Challenges and Safety Risks in GenAI-Powered Brain-Computer Interfaces Konstantinos Barmpas · Georgios Zoumpourlis · Yannis Panagakis · Dimitrios Adamos · N Laskaris · Stefanos Zafeiriou |
||
Poster
|
Thu 16:30 |
WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs Seungju Han · Kavel Rao · Allyson Ettinger · Liwei Jiang · Bill Yuchen Lin · Nathan Lambert · Yejin Choi · Nouha Dziri |
|
Workshop
|
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Aidan Ewart · Abhay Sheshadri · Phillip Guo · Aengus Lynch · Cindy Wu · Vivek Hebbar · Henry Sleight · Asa Cooper Stickland · Ethan Perez · Dylan Hadfield-Menell · Stephen Casper |
||
Workshop
|
HarmAnalyst: Interpretable, transparent, and steerable LLM safety moderation Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine |
||
Workshop
|
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez |
||
Workshop
|
Simulation System Towards Solving Societal-Scale Manipulation Maximilian Puelma Touzel · Sneheel Sarangi · Austin Welch · Gayatri K · Dan Zhao · Zachary Yang · Hao Yu · Tom Gibbs · Ethan Kosak-Hine · Andreea Musulan · Camille Thibault · Reihaneh Rabbany · Jean-François Godbout · Kellin Pelrine |
||
Workshop
|
Sun 10:55 |
Contributed Talk 2: Failures to Find Transferable Image Jailbreaks Between Vision-Language Models Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez |
|
Workshop
|
Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning Seanie Lee · Minsu Kim · Lynn Cherif · David Dobre · Juho Lee · Sung Ju Hwang · Kenji Kawaguchi · Gauthier Gidel · Yoshua Bengio · Nikolay Malkin · Moksh Jain |
||
Workshop
|
Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries Julius Broomfield · George Ingebretsen · Reihaneh Iranmanesh · Sara Pieri · Ethan Kosak-Hine · Tom Gibbs · Reihaneh Rabbany · Kellin Pelrine |