NeurIPS 2024

Workshop

In Search of the $\textit{Successful}$ Interpolation: On the Role of $\textit{Sharpness}$ in CLIP Generalization
Alireza Abdollahpour

Workshop

Interpretable AI in Human-Machine Systems: Insights from Human-in-the-Loop Product Recommendation Engines
Pooria Assadi · Nima Safaei

Workshop

Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in LLMs
Aashiq Muhamed · Jake Mendel · Lucius Bushnaq · Mona Diab · Virginia Smith

Poster

Fri 11:00

Interpreting Learned Feedback Patterns in Large Language Models
Luke Marks · Amir Abdullah · Clement Neo · Rauno Arike · David Krueger · Philip Torr · Fazl Barez

Workshop

Towards Interpretable Scientific Foundation Models: Sparse Autoencoders for Disentangling Dense Embeddings of Scientific Concepts
Charles O'Neill · Christine Ye · Kartheik Iyer · John Wu

Workshop

Sun 16:15

Towards Interpretable Scientific Foundation Models: Sparse Autoencoders for Disentangling Dense Embeddings of Scientific Concepts
Charles O'Neill · Christine Ye · Kartheik Iyer · John Wu

Workshop

Demo: Harnessing Generative AI for Comprehensive Evaluation of Medical Imaging AI
Yisak Kim · Seunghyun Jang · Soyeon Kim · Kyungmin Jeon · Chang Min Park

Poster

Fri 11:00

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta · Iván Arcuschin Moreno · Thomas Kwa · Adrià Garriga-Alonso

Affinity Event

“Compassionately”: Increasing Plurality Awareness through Community-powered AI
Hala Sheta · Mohamed Ahmed · Syed Ishtiaque Ahmed

Poster

Wed 16:30

What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Samyak Jain · Ekdeep S Lubana · Kemal Oksuz · Tom Joy · Philip Torr · Amartya Sanyal · Puneet Dokania

Poster

Wed 16:30

Iteration Head: A Mechanistic Study of Chain-of-Thought
Vivien Cabannes · Charles Arnal · Wassim Bouaziz · Xingyu Yang · Francois Charton · Julia Kempe

Workshop

A Cognitive Framework for Learning Debiased and Interpretable Representations via Debiasing Global Workspace
Jinyung Hong · Eun Som Jeon · Changhoon Kim · Keun Hee Park · Utkarsh Nath · 'YZ' Yezhou Yang · Pavan Turaga · Theodore P. Pavlic
