Workshop
|
|
Ablation is Not Enough to Emulate DPO: A Mechanistic Analysis of Toxicity Reduction
Yushi Yang · Filip Sondej · Harry Mayne · Adam Mahdi
|
|
Workshop
|
Sun 16:15
|
Towards Interpretable Scientific Foundation Models: Sparse Autoencoders for Disentangling Dense Embeddings of Scientific Concepts
Charles O'Neill · Christine Ye · Kartheik Iyer · John Wu
|
|
Poster
|
Wed 16:30
|
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Samyak Jain · Ekdeep S Lubana · Kemal Oksuz · Tom Joy · Philip Torr · Amartya Sanyal · Puneet Dokania
|
|
Workshop
|
|
Pay Attention to What Matters
Pedro Silva · Fadhel Ayed · Antonio De Domenico · Ali Maatouk
|
|
Workshop
|
|
Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions
Marc Canby · Adam Davies · Chirag Rastogi · Julia C Hockenmaier
|
|
Workshop
|
|
Competence-Based Analysis of Language Models
Adam Davies · Jize Jiang · Cheng Xiang Zhai
|
|
Workshop
|
|
Uncovering Uncertainty in Transformer Inference
Greyson Brothers · Willa Mannering · John Winder · Amber Tien
|
|