Invited Talk 4: Guarding the Age of Agents: Advancing Risk Assessment, Guardrails, and Security Certification
Abstract
Autonomous agents built on foundation models are increasingly deployed in dynamic, high-stakes real-world environments, from web automation to AI operating systems. Despite their promise, these agents remain highly susceptible to adversarial instructions and manipulation, posing serious risks such as policy violations, data leakage, and financial harm. In this talk, I will present a comprehensive framework for assessing and strengthening the safety of AI agents. We begin by examining principles and methodologies for robust agent evaluation, focusing on red-teaming-based stress testing across diverse adversarial scenarios, ranging from agent poisoning to WebAgent manipulation to general black-box agent attacks. Building on these foundations, I will introduce ShieldAgent, the first guardrail agent explicitly designed to enforce policy-aligned behavior in autonomous agents through structured reasoning and verification. ShieldAgent constructs a verifiable safety model by extracting actionable safety rules from formal AI security policy documents and encoding them as probabilistic graphical rule circuits. Given a target agent's action trajectory, ShieldAgent dynamically retrieves the relevant safety rules and synthesizes shielding strategies from a rich tool library and formally executable code. To support comprehensive AI agent evaluation, I will also introduce a novel benchmark comprising 3,000 safety-critical instruction-action pairs derived from state-of-the-art attack scenarios across six web environments and seven distinct risk categories. This talk highlights both the urgent challenges and the emerging solutions in building trustworthy, resilient AI agents, laying the groundwork for a new generation of safety-aligned autonomous systems.
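The shielding loop the abstract describes, retrieving the safety rules relevant to a proposed action and verifying the action against them before it executes, can be sketched roughly as follows. This is a minimal illustration only, not the ShieldAgent implementation: the rule set, the keyword-overlap retrieval, and all names (`SafetyRule`, `shield`) are hypothetical simplifications of the paper's rule circuits and tool library.

```python
# Hypothetical sketch of a guardrail step: retrieve relevant safety
# rules for one agent action, verify each rule's predicate, and block
# the action on any violation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class SafetyRule:
    rule_id: str
    keywords: set[str]                  # crude relevance retrieval signal
    predicate: Callable[[dict], bool]   # True if the action complies

# Toy stand-ins for rules extracted from a policy document.
RULES = [
    SafetyRule("no-credential-exfil", {"password", "token"},
               lambda a: "send" not in a["op"]
                         or not a["args"] & {"password", "token"}),
    SafetyRule("no-payment", {"pay", "purchase"},
               lambda a: a["op"] not in {"pay", "purchase"}),
]

def shield(action: dict) -> tuple[bool, list[str]]:
    """Return (allowed, violated_rule_ids) for one step of a trajectory."""
    tokens = {action["op"], *action["args"]}
    relevant = [r for r in RULES if r.keywords & tokens]   # retrieval
    violated = [r.rule_id for r in relevant
                if not r.predicate(action)]                # verification
    return (not violated, violated)

# A benign navigation step passes; a credential-sending step is blocked.
print(shield({"op": "click", "args": {"submit"}}))   # (True, [])
print(shield({"op": "send", "args": {"password"}}))  # (False, ['no-credential-exfil'])
```

In the real system the rules form probabilistic graphical circuits and the verification step runs formally executable code, but the control flow, per-action retrieval followed by rule checking, is the same shape as this sketch.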