49 Results
Poster | Fri 16:30 | Rule Based Rewards for Language Model Safety | Tong Mu · Alec Helyar · Johannes Heidecke · Joshua Achiam · Andrea Vallone · Ian Kivlichan · Molly Lin · Alex Beutel · John Schulman · Lilian Weng
Poster | Fri 16:30 | One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Xinmeng Huang · Shuo Li · Edgar Dobriban · Osbert Bastani · Hamed Hassani · Dongsheng Ding
Workshop | Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models | Luxi He · Xiangyu Qi · Inyoung Cheong · Prateek Mittal · Danqi Chen · Peter Henderson
Workshop | Sun 10:55 | Contributed Talk 2: Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez
Workshop | Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Rylan Schaeffer · Dan Valentine · Luke Bailey · James Chua · Zane Durante · Cristobal Eyzaguirre · Joe Benton · Brando Miranda · Henry Sleight · Tony Wang · John Hughes · Rajashree Agrawal · Mrinank Sharma · Scott Emmons · Sanmi Koyejo · Ethan Perez
Workshop | Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage | Rafi Rashid · Jing Liu · Toshiaki Koike-Akino · Shagufta Mehnaz · Ye Wang
Workshop | A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | Aviral Srivastava · Sourav Panda
Workshop | Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning | Seanie Lee · Minsu Kim · Lynn Cherif · David Dobre · Juho Lee · Sung Ju Hwang · Kenji Kawaguchi · Gauthier Gidel · Yoshua Bengio · Nikolay Malkin · Moksh Jain
Poster | Fri 11:00 | Safe LoRA: The Silver Lining of Reducing Safety Risks when Finetuning Large Language Models | Chia-Yi Hsu · Yu-Lin Tsai · Chih-Hsun Lin · Pin-Yu Chen · Chia-Mu Yu · Chun-Ying Huang
Workshop | Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries | Julius Broomfield · George Ingebretsen · Reihaneh Iranmanesh · Sara Pieri · Ethan Kosak-Hine · Tom Gibbs · Reihaneh Rabbany · Kellin Pelrine
Workshop | Mitigating Downstream Model Risks via Model Provenance | Keyu Wang · Scott Schaffter · Abdullah Norozi Iranzad · Doina Precup · Jonathan Lebensold · Megan Risdal