Workshop

Towards Safe & Trustworthy Agents

Alexander Pan ⋅ Kimin Lee ⋅ Bo Li ⋅ Karthik Narasimhan ⋅ Dawn Song ⋅ Isabelle Barrass

Project Page [OpenReview]

Abstract

Foundation models are increasingly being augmented with new modalities and with access to a variety of tools and software. Systems that can act more autonomously have been created by assembling agent architectures or scaffolds that include basic forms of planning and memory, or multi-agent architectures. Making these systems more agentic could unlock a wider range of beneficial use cases, but it also introduces new challenges in ensuring that such systems are trustworthy. Interactions between different autonomous systems create a further set of issues around multi-agent safety. The scope and complexity of potential impacts from agentic systems mean that proactive approaches to identifying and managing their risks are needed. Our workshop will surface these questions and operationalize them into concrete research agendas.

Schedule

Timezone: America/Los_Angeles

9:00 AM

Opening Remarks

9:10 AM

Invited Talk 1: João F. Henriques (Research Fellow, Royal Academy of Engineering)

9:40 AM

Invited Talk 2: David Bau (Assistant Professor, Northeastern)

10:10 AM

Contributed Talks

10:50 AM

Coffee Break

11:15 AM

Invited Talk 3: Been Kim (Senior Staff Research Scientist, Google DeepMind)

Zaina Shaik

11:45 AM

Live Poster Session 1

12:30 PM

Lunch

1:30 PM

Invited Talk 4: David Krueger (Associate Professor, Cambridge)

2:00 PM

Invited Talk 5: Daniel Kang (Associate Professor, UIUC)

2:30 PM

Invited Talk 6: Yu Su (Distinguished Associate Professor, Ohio State)

3:00 PM

Live Poster Session 2

3:45 PM

Coffee Break

4:00 PM

Panel Discussion and Reflection

4:55 PM

Closing Remarks

Characterizing Context Memorization and Hallucination of Language Models

James Flemings ⋅ Wanrong Zhang ⋅ Bo Jiang ⋅ Zafar Takhirov ⋅ Murali Annavaram

Position: AI Agents & Liability -- Mapping Insights from ML and HCI Research to Policy

Weiwei Pan ⋅ Siddharth Swaroop ⋅ Julia Smakman ⋅ Connor Dunlop ⋅ Lisa Soder

Towards Measuring Goal-Directedness in AI Systems

Dylan Xu ⋅ Juan-Pablo Rivera

Levels of Autonomy: Liability in the age of AI Agents

Julia Smakman ⋅ Lisa Soder ⋅ Connor Dunlop ⋅ Weiwei Pan ⋅ Siddharth Swaroop

Getting By Goal Misgeneralization With a Little Help From a Mentor

Tu Trinh ⋅ Mohamad Hosein Danesh ⋅ Khanh Nguyen ⋅ Benjamin Plaut

AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing

Ana Nunez ⋅ Nafis Tanveer Islam ⋅ Sumit Jha ⋅ Peyman Najafirad

Towards Deliberating Agents: Evaluating the Ability of Large Language Models to Deliberate

Arjun Karanam ⋅ Farnaz Jahanbakhsh ⋅ Sanmi Koyejo

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents

Giorgio Piatti ⋅ Zhijing Jin ⋅ Max Kleiman-Weiner ⋅ Bernhard Schölkopf ⋅ Mrinmaya Sachan ⋅ Rada Mihalcea

Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents

Samuel Brown ⋅ Basil Labib ⋅ Codruta Lugoj ⋅ Sai Sasank Y

Semantically Safe Robot Manipulation: From Semantic Scene Understanding to Motion Safeguards

Lukas Brunke ⋅ Yanni Zhang ⋅ Ralf Römer ⋅ Jack Naimer ⋅ Nikola Staykov ⋅ SiQi Zhou ⋅ Angela Schoellig

Sandbag Detection through Model Impairment

Cameron Tice ⋅ Philipp Kreer ⋅ Nathan Helm-Burger ⋅ Prithviraj Singh Shahani ⋅ Fedor Ryzhenkov ⋅ Teun van der Weij ⋅ Felix Hofstätter ⋅ Jacob Haimes

Measuring Implicit Bias in Explicitly Unbiased Large Language Models

Xuechunzi Bai ⋅ Angelina Wang ⋅ Ilia Sucholutsky ⋅ Tom Griffiths

Modelling the oversight of deceptive interpretability agents

Simon Lermen ⋅ Mateusz Dziemian

AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents

Zhe Su ⋅ Xuhui Zhou ⋅ Sanketh Rangreji ⋅ Anubha Kabra ⋅ Julia Mendelsohn ⋅ Faeze Brahman ⋅ Maarten Sap

Trustworthy Conceptual Explanations for Neural Networks in Robot Decision-Making

Som Sagar ⋅ Aditya Taparia ⋅ Harsh Mankodiya ⋅ Pranav Bidare ⋅ Yifan Zhou ⋅ Ransalu Senanayake

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Rylan Schaeffer ⋅ Dan Valentine ⋅ Luke Bailey ⋅ James Chua ⋅ Zane Durante ⋅ Cristobal Eyzaguirre ⋅ Joe Benton ⋅ Brando Miranda ⋅ Henry Sleight ⋅ Tony Wang ⋅ John Hughes ⋅ Rajashree Agrawal ⋅ Mrinank Sharma ⋅ Scott Emmons ⋅ Sanmi Koyejo ⋅ Ethan Perez

RED – Robust Environmental Design

Jinghan Yang

Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System

Julian Collado ⋅ Kevin Stangl

Simulation System Towards Solving Societal-Scale Manipulation

Sneheel Sarangi ⋅ Maximilian Puelma Touzel ⋅ Austin Welch ⋅ Gayatri K ⋅ Dan Zhao ⋅ Zachary Yang ⋅ Hao Yu ⋅ Ethan Kosak-Hine ⋅ Tom Gibbs ⋅ Andreea Musulan ⋅ Camille Thibault ⋅ Reihaneh Rabbany ⋅ Jean-François Godbout ⋅ Kellin Pelrine

Emergence of Steganography Between Large Language Models

Yohan Mathew ⋅ Joan Velja ⋅ Ollie Matthews ⋅ Robert McCarthy ⋅ Dylan Cope ⋅ Nandi Schoots

Strategic Collusion of LLM Agents: Market Division in Multi-Commodity Competitions

Ryan Lin ⋅ Siddhartha Ojha ⋅ Kevin Cai ⋅ Maxwell Chen

Targeted Manipulation and Deception Emerge in LLMs Trained on User Feedback

Marcus Williams ⋅ Micah Carroll ⋅ Constantin Weisser ⋅ Adhyyan Narang ⋅ Brendan Murphy ⋅ Anca Dragan

Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Fatemeh Haji ⋅ Mazal Bethany ⋅ Maryam Tabar ⋅ Cho-Yu Chiang ⋅ Anthony Rios ⋅ Peyman Najafirad

Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Anton Xue ⋅ Avishree Khare ⋅ Rajeev Alur ⋅ Surbhi Goel ⋅ Eric Wong

AI Sandbagging: Language Models can Selectively Underperform on Evaluations

Teun van der Weij ⋅ Felix Hofstätter ⋅ Oliver Jaffe ⋅ Samuel Brown ⋅ Francis Ward

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

Xuhui Zhou ⋅ Hyunwoo Kim ⋅ Faeze Brahman ⋅ Liwei Jiang ⋅ Hao Zhu ⋅ Ximing Lu ⋅ Frank F. Xu ⋅ Bill Yuchen Lin ⋅ Niloofar Mireshghallah ⋅ Ronan Le Bras ⋅ Maarten Sap

Neural Interactive Proofs

Lewis Hammond ⋅ Sam Adam-Day

Lost in Translation: Jail Breaking Gemini and Revealing Biases in Large Language Models via Translation

Ezgi Korkmaz

PolicyLR: An LLM compiler for Logic-based Representation for Privacy Policies

Ashish Hooda ⋅ Rishabh Khandelwal ⋅ Prasad Chalasani ⋅ Kassem Fawaz ⋅ Somesh Jha

INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness

Hung Le ⋅ Yingbo Zhou ⋅ Caiming Xiong ⋅ Silvio Savarese ⋅ Doyen Sahoo

Algorithmic Oversight for Deceptive Reasoning

Ege Onur Taga ⋅ Mingchen Li ⋅ Yongqi Chen ⋅ Samet Oymak

C-MCTS: Safe Planning with Monte Carlo Tree Search

Dinesh Parthasarathy ⋅ Georgios Kontes ⋅ Axel Plinge ⋅ Christopher Mutschler

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Rogerio Bonatti ⋅ Dan Zhao ⋅ Sara Abdali ⋅ Yinheng Li ⋅ Yadong Lu ⋅ Justin Wagle ⋅ Kazuhito Koishida ⋅ Arthur Bucker ⋅ Lawrence Jang ⋅ Dillon Dupont ⋅ Zheng Hui

The Elicitation Game: Stress-Testing Capability Elicitation Techniques

Felix Hofstätter ⋅ Jayden Teoh ⋅ Teun van der Weij ⋅ Francis Ward