Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal · Ramneet Kaur · Colin Samplawski · Manoj Acharya · Anirban Roy · Daniel Elenius · Brian Matejek · Adam Cobb · Susmit Jha

Abstract
