73 Results
Workshop
|
Sun 9:00 |
Towards Safe & Trustworthy Agents Alexander Pan · Kimin Lee · Bo Li · Karthik Narasimhan · Dawn Song · Isabelle Barrass |
|
Workshop
|
MISR: Measuring Instrumental Self-Reasoning in Frontier Models Kai Fronsdal · David Lindner |
||
Workshop
|
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy Tong Wu · Shujian Zhang · Kaiqiang Song · Silei Xu · Sanqiang Zhao · Ravi Agrawal · Sathish Indurthi · Chong Xiang · Prateek Mittal · Wenxuan Zhou |
||
Workshop
|
Towards Safe Multilingual Frontier AI Arturs Kanepajs · Vladimir Ivanov · Richard Moulange |
||
Workshop
|
HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions Xuhui Zhou · Hyunwoo Kim · Faeze Brahman · Liwei Jiang · Hao Zhu · Ximing Lu · Frank F. Xu · Bill Yuchen Lin · Niloofar Mireshghallah · Ronan Le Bras · Maarten Sap |
||
Workshop
|
Evaluating Synthetic Activations composed of SAE Latents in GPT-2 Nora Petrova · Giorgi Giglemiani · Chatrik Mangat · Jett Janiak · Stefan Heimersheim |
||
Workshop
|
An Adversarial Perspective on Machine Unlearning for AI Safety Jakub Łucki · Boyi Wei · Yangsibo Huang · Peter Henderson · Florian Tramer · Javier Rando |
||
Poster
|
Thu 11:00 |
'Explaining RL Decisions with Trajectories': A Reproducibility Study Karim Abdel Sadek · Matteo Nulli · Joan Velja · Jort Vincenti |
|
Workshop
|
A Safety-aware Framework for Generative Enzyme Design with Foundation Models Xiaoyi Fu · Tao Han · Yuan Yao · Song Guo |
||
Poster
|
Thu 11:00 |
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense Rui Min · Zeyu Qin · Nevin L. Zhang · Li Shen · Minhao Cheng |
|
Workshop
|
Adversarial Negotiation Dynamics in Generative Language Models Arinbjörn Kolbeinsson · Benedikt Kolbeinsson |