11 Results
Workshop | Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning | Seanie Lee · Minsu Kim · Lynn Cherif · David Dobre · Juho Lee · Sung Ju Hwang · Kenji Kawaguchi · Gauthier Gidel · Yoshua Bengio · Nikolay Malkin · Moksh Jain

Workshop | Safety-Aware Fine-Tuning of Large Language Models | Hyeong Kyu Choi · Xuefeng Du · Sharon Li

Workshop | Representation Tuning | Christopher Ackerman

Workshop | Sun 11:21 | Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | Rui Ye · Jingyi Chai · Xiangrui Liu · Yaodong Yang · Yanfeng Wang · Siheng Chen

Workshop | Sat 15:45 | vTune: Verifiable fine-tuning Through Backdooring | Eva Zhang · Akilesh Potti · Micah Goldblum

Workshop | Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity | David Williams-King · Linh Le · Adam Oberman · Yoshua Bengio

Poster | Thu 11:00 | BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Jiongxiao Wang · Jiazhao Li · Yiquan Li · Xiangyu Qi · Junjie Hu · Sharon Li · Patrick McDaniel · Muhao Chen · Bo Li · Chaowei Xiao

Poster | Wed 16:30 | What Makes and Breaks Safety Fine-tuning? A Mechanistic Study | Samyak Jain · Ekdeep S Lubana · Kemal Oksuz · Tom Joy · Philip Torr · Amartya Sanyal · Puneet Dokania
Workshop | Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | Rui Ye · Jingyi Chai · Xiangrui Liu · Yaodong Yang · Yanfeng Wang · Siheng Chen

Workshop | Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy | Tsung-Huan Yang · Ko-Wei Huang · Yung-Hui Li · Lun-Wei Ku

Poster | Fri 16:30 | Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Tiansheng Huang · Sihao Hu · Fatih Ilhan · Selim Tekin · Ling Liu