Foundation Model Interventions

Workshop

Pau Rodriguez · Arno Blaas · Desi R Ivanova · Sahra Ghalebikesabi · Yuki M Asano · Katherine Metcalf · Xavier Suau

West Meeting Room 121, 122

Sun 15 Dec, 8:45 a.m. PST

The increasing capabilities of foundation models have raised concerns about their potential to generate undesirable content, perpetuate biases, and promote harmful behaviors. To address these issues, we propose a workshop that focuses on understanding the inner workings of foundation models and identifying actionable mechanisms involved in generation. Recent studies have shown promise in intervening directly on model activations, or on a low-rank subset of the weights, to gain fine-grained control over generation and mitigate harmful and toxic outputs. This workshop aims to bring together researchers to explore methods for improving the controllability of foundation models and to develop a better understanding of their behavior and potential misuse.

Timezone: America/Los_Angeles

Schedule

Sun 8:45 a.m. - 9:00 a.m.	Welcome and Opening Remarks (Intro)
Sun 9:00 a.m. - 9:45 a.m.	Atticus Geiger: The Current State of Interpretability and Ideas for Scaling Up (Invited Talk)	Atticus Geiger
Sun 9:45 a.m. - 10:15 a.m.	Spotlight Talks (Spotlight Talks)
Sun 9:45 a.m. - 9:51 a.m.	LoFiT: Localized Fine-tuning on LLM Representations (Spotlight Talk)	Fangcong Yin · Xi Ye · Greg Durrett
Sun 9:51 a.m. - 9:57 a.m.	Decomposing and Editing Predictions by Modeling Model Computation (Spotlight Talk)	Harshay Shah · Andrew Ilyas · Aleksander Madry
Sun 9:57 a.m. - 10:03 a.m.	Analyzing (In)Abilities of SAEs via Formal Languages (Spotlight Talk)	Abhinav Menon · Manish Shrivastava · David Krueger · Ekdeep S Lubana
Sun 10:03 a.m. - 10:09 a.m.	Towards Reliable Evaluation of Behavior Steering Interventions in LLMs (Spotlight Talk)	Itamar Pres · Laura Ruis · Ekdeep S Lubana · David Krueger
Sun 10:09 a.m. - 10:15 a.m.	Probing the Decision Boundaries of In-context Learning in Large Language Models (Spotlight Talk)	Siyan Zhao
Sun 10:15 a.m. - 10:45 a.m.	Coffee Break
Sun 10:45 a.m. - 11:30 a.m.	Fernanda Viégas: AI Dashboard Design: A User-Centered Approach to Interpretability (Invited Talk)	Fernanda Viégas
Sun 11:30 a.m. - 12:00 p.m.	Junior Panel Discussion (Panel Discussion)
Sun 12:00 p.m. - 1:00 p.m.	Lunch Break
Sun 1:00 p.m. - 2:00 p.m.	Poster Session (Poster Session)
Sun 2:00 p.m. - 2:45 p.m.	David Ha: The Future of Collective Intelligence and Meta Evolution for Foundation Models (Invited Talk)	David Ha
Sun 2:45 p.m. - 3:15 p.m.	Coffee Break
Sun 3:15 p.m. - 4:00 p.m.	Jacob Steinhardt: Scalably Understanding AI with AI (Invited Talk)	Jacob Steinhardt
Sun 4:00 p.m. - 4:55 p.m.	Panel Discussion (Panel Discussion)	Fernanda Viégas · Neel Nanda · Atticus Geiger · Jacob Steinhardt
Sun 4:55 p.m. - 5:00 p.m.	Closing Remarks and Award Ceremony (Outro)
-	Overcoming Limitations of Steering Vectors with Low-Rank Representation Steering (Poster)	Dmitrii Krasheninnikov · David Krueger
-	Do LLMs internally "know" when they follow instructions? (Poster)	Juyeon Heo · Christina Heinze-Deml · Shirley Ren · Oussama Elachqar · Udhyakumar Nallasamy · Andy Miller · Jaya Narain
-	LoFiT: Localized Fine-tuning on LLM Representations (Poster)	Fangcong Yin · Xi Ye · Greg Durrett
-	Ablation is Not Enough to Emulate DPO: A Mechanistic Analysis of Toxicity Reduction (Poster)	Yushi Yang · Filip Sondej · Harry Mayne · Adam Mahdi
-	Is Free Self-Alignment Possible? (Poster)	Dyah Adila · Changho Shin · Yijing Zhang · Frederic Sala
-	Steering semantic search with interpretable features from sparse autoencoders (Poster)	Christine Ye · Charles O'Neill · John Wu · Kartheik Iyer
-	Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering (Poster)	Ido Sobol · Chenfeng Xu · Or Litany
-	Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering (Poster)	Joris Postmus · Steven Abreu
-	Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions (Poster)	Marc Canby · Adam Davies · Chirag Rastogi · Julia C Hockenmaier
-	Uncovering Uncertainty in Transformer Inference (Poster)	Greyson Brothers · Willa Mannering · John Winder · Amber Tien
-	Algorithmic Oversight for Deceptive Reasoning (Poster)	Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
-	Probing the Decision Boundaries of In-context Learning in Large Language Models (Poster)	Siyan Zhao · Tung Nguyen · Aditya Grover
-	Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks (Poster)	Madeline Brumley · Joe Kwon · David Krueger · Dmitrii Krasheninnikov · Usman Anwar
-	Linearly Controlled Language Generation with Performative Guarantees (Poster)	Emily Cheng · Marco Baroni · Carmen Amo Alonso
-	Entropy-Based Decoding for Retrieval-Augmented Large Language Models (Poster)	Zexuan Qiu · Zijing Ou · Bin Wu · Jingjing Li · Aiwei Liu · Irwin King
-	Toward Explanation Bottleneck Models (Poster)	Shin'ya Yamaguchi · Kosuke Nishida
-	Can sparse autoencoders be used to decompose and interpret steering vectors? (Poster)	Harry Mayne · Yushi Yang · Adam Mahdi
-	WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models (Poster)	Peng Wang · Zexi Li · Ningyu Zhang · Ziwen Xu · Yunzhi Yao · Yong Jiang · Pengjun Xie · Fei Huang · Huajun Chen
-	Representation Tuning (Poster)	Christopher Ackerman
-	SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models (Poster)	Carter Teplica · Yixin Liu · Arman Cohan · Tim G. J. Rudner
-	Understanding Visual Concepts Across Models (Poster)	Brandon Trabucco · Max Gurinas · Kyle Doherty · Ruslan Salakhutdinov
-	Secret Seeds in Text-to-Image Diffusion Models (Poster)	Katherine Xu · Lingzhi Zhang · Jianbo Shi
-	Analyzing (In)Abilities of SAEs via Formal Languages (Poster)	Abhinav Menon · Manish Shrivastava · Ekdeep S Lubana · David Krueger
-	Pay Attention to What Matters (Poster)	Pedro Silva · Fadhel Ayed · Antonio De Domenico · Ali Maatouk
-	Decomposing and Editing Predictions by Modeling Model Computation (Poster)	Harshay Shah · Andrew Ilyas · Aleksander Madry
-	Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models (Poster)	Xinyu Zhou · Delong Chen · Samuel Cahyawijaya · Xufeng Duan · Zhenguang Cai
-	Semantic Entropy Neurons: Encoding Semantic Uncertainty in the Latent Space of LLMs (Poster)	Jiatong Han · Jannik Kossen · Muhammed Razzak · Yarin Gal
-	Towards Reliable Evaluation of Behavior Steering Interventions in LLMs (Poster)	Itamar Pres · Laura Ruis · Ekdeep S Lubana · David Krueger
-	Unveiling and Manipulating Concepts in Time Series Foundation Models (Poster)	Michal Wilinski · Mononito Goswami · Nina Żukowska · Willa Potosnak · Artur Dubrawski
-	GPT-2 Small Fine-Tuned on Logical Reasoning Summarizes Information on Punctuation Tokens (Poster)	Sonakshi Chauhan · Atticus Geiger
-	Extracting Paragraphs from LLM Token Activations (Poster)	Nicky Pochinkov · Angelo Benoit · Lovkush Agarwal · Zainab Ali Majid · Lucile Ter-Minassian
-	Analysing the Residual Stream of Language Models Under Knowledge Conflicts (Poster)	Yu Zhao · Xiaotang Du · Giwon Hong · Aryo Gema · Alessio Devoto · Hongru WANG · Xuanli He · Kam-Fai Wong · Pasquale Minervini
-	Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks (Poster)	Gregory Kang Ruey Lau · Wenyang Hu · Liu Diwen · Chen Jizhuo · See-Kiong Ng · Bryan Kian Hsiang Low