73 Results
Workshop
|
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue |
||
Workshop
|
GPAI Evaluations Standards Taskforce: towards effective AI governance Patricia Paskov · Lukas Berglund · Everett Smith · Lisa Soder |
||
Workshop
|
Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents Samuel Brown · Basil Labib · Codruta Lugoj · Sai Sasank Y |
||
Poster
|
Wed 16:30 |
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study Samyak Jain · Ekdeep S Lubana · Kemal Oksuz · Tom Joy · Philip Torr · Amartya Sanyal · Puneet Dokania |
|
Competition
|
Sun 9:00 |
The NeurIPS 2024 LLM Privacy Challenge Qinbin Li · Junyuan Hong · Chulin Xie · Junyi Hou · Yiqun Diao · Zhun Wang · Dan Hendrycks · Zhangyang "Atlas" Wang · Bo Li · Bingsheng He · Dawn Song |
|
Workshop
|
Sun 11:05 |
Contributed Talk 3: LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue |
|
Workshop
|
Sandbag Detection through Model Impairment Cameron Tice · Philipp Kreer · Nathan Helm-Burger · Prithviraj Singh Shahani · Fedor Ryzhenkov · Teun van der Weij · Felix Hofstätter · Jacob Haimes |
||
Competition
|
Sun 13:30 |
CLAS 2024: The Competition for LLM and Agent Safety Zhen Xiang · Yi Zeng · Mintong Kang · Chejian Xu · Jiawei Zhang · Zhuowen Yuan · Zhaorun Chen · Chulin Xie · Fengqing Jiang · Minzhou Pan · Francesco Pinto · Junyuan Hong · Ruoxi Jia · Radha Poovendran · Bo Li |
|
Workshop
|
A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation Aviral Srivastava · Sourav Panda |
||
Workshop
|
Jailbreak Defense in a Narrow Domain: Failures of existing methods and Improving Transcript-Based Classifiers Tony Wang · John Hughes · Henry Sleight · Rylan Schaeffer · Rajashree Agrawal · Fazl Barez · Mrinank Sharma · Jesse Mu · Nir Shavit · Ethan Perez |
||
Workshop
|
Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers Tony Wang · John Hughes · Henry Sleight · Rylan Schaeffer · Rajashree Agrawal · Fazl Barez · Mrinank Sharma · Jesse Mu · Nir Shavit · Ethan Perez |
||
Workshop
|
Mitigating Downstream Model Risks via Model Provenance Keyu Wang · Scott Schaffter · Abdullah Norozi Iranzad · Doina Precup · Jonathan Lebensold · Megan Risdal |