NeurIPS 2024

Workshop

Ablation is Not Enough to Emulate DPO: A Mechanistic Analysis of Toxicity Reduction
Yushi Yang · Filip Sondej · Harry Mayne · Adam Mahdi

Workshop

Sun 16:15

Towards Interpretable Scientific Foundation Models: Sparse Autoencoders for Disentangling Dense Embeddings of Scientific Concepts
Charles O'Neill · Christine Ye · Kartheik Iyer · John Wu

Workshop

Towards Interpretable Scientific Foundation Models: Sparse Autoencoders for Disentangling Dense Embeddings of Scientific Concepts
Charles O'Neill · Christine Ye · Kartheik Iyer · John Wu

Poster

Wed 16:30

What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Samyak Jain · Ekdeep S Lubana · Kemal Oksuz · Tom Joy · Philip Torr · Amartya Sanyal · Puneet Dokania

Workshop

Pay Attention to What Matters
Pedro Silva · Fadhel Ayed · Antonio De Domenico · Ali Maatouk

Workshop

Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions
Marc Canby · Adam Davies · Chirag Rastogi · Julia C Hockenmaier

Workshop

Competence-Based Analysis of Language Models
Adam Davies · Jize Jiang · Cheng Xiang Zhai

Workshop

Uncovering Uncertainty in Transformer Inference
Greyson Brothers · Willa Mannering · John Winder · Amber Tien
