68 Results
Workshop | Inducing Human-like Biases in Moral Reasoning Language Models
Austin Meek · Artem Karpov · Seong Cho · Raymond Koopmanschap · Lucy Farnik · Bogdan-Ionut Cirstea

Poster | Wed 16:30 | Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu · Haoyu Zhao · Xinran Gu · Dingli Yu · Anirudh Goyal · Sanjeev Arora

Poster | Wed 11:00 | Evaluating alignment between humans and neural network representations in image-based learning tasks
Can Demircan · Tankred Saanum · Leonardo Pettini · Marcel Binz · Blazej Baczkowski · Christian Doeller · Mona Garvert · Eric Schulz

Poster | Wed 11:00 | Distributional Preference Alignment of LLMs via Optimal Transport
Igor Melnyk · Youssef Mroueh · Brian Belgodere · Mattia Rigotti · Apoorva Nitsure · Mikhail Yurochkin · Kristjan Greenewald · Jiri Navratil · Jarret Ross

Poster | Thu 11:00 | BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
Jiongxiao Wang · Jiazhao LI · Yiquan Li · Xiangyu Qi · Junjie Hu · Sharon Li · Patrick McDaniel · Muhao Chen · Bo Li · Chaowei Xiao

Poster | Fri 16:30 | Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
Tiansheng Huang · Sihao Hu · Fatih Ilhan · Selim Tekin · Ling Liu

Poster | Fri 16:30 | Improving Alignment and Robustness with Circuit Breakers
Andy Zou · Long Phan · Justin Wang · Derek Duenas · Maxwell Lin · Maksym Andriushchenko · J. Zico Kolter · Matt Fredrikson · Dan Hendrycks

Poster | Thu 11:00 | Aligning Large Language Models with Representation Editing: A Control Perspective
Lingkai Kong · Haorui Wang · Wenhao Mu · Yuanqi Du · Yuchen Zhuang · Yifei Zhou · Yue Song · Rongzhi Zhang · Kai Wang · Chao Zhang