Agentic Full-Stack Benchmarking for Knowledge Work
Abstract
In less than a year, AI agents have evolved from a research curiosity into the foundation of some of the largest software platform updates in decades. These systems promise to automate substantial portions of knowledge work, and their progress has been rapid: early 2025 reports by METR suggest that the length of tasks agents can reliably complete doubles roughly every seven months. In this talk, we take a closer empirical look at this claim by examining what it truly takes to benchmark agentic performance on long-running, open-ended knowledge work tasks. We review recent contributions from ServiceNow AI Research and others across domains such as browser use, multimodal understanding, data analytics, and deep research. We also discuss benchmarks that evaluate agentic safety and security, arguing that these dimensions cannot be meaningfully separated from primary task performance. Our analysis leads to a more nuanced picture of the field, highlighting both genuine advances and persistent challenges that frontier agents have yet to overcome.
Goals: 1) motivate the need to benchmark AI agents in realistic enterprise settings; 2) give an overview of recent research in this direction at ServiceNow.
Audience: academic and industry researchers interested in measuring the capabilities and reliability of AI agents.