"It Doesn’t Know Anything About my Work": Participatory Benchmarking and AI Evaluation in Applied Settings
Elizabeth Watkins ⋅ Emanuel Moss ⋅ Ramesh Manuvinakurike ⋅ Christopher Persaud ⋅ Giuseppe Raffa ⋅ Lama Nachman
Abstract
This empirical paper investigates the benefits of socially embedded approaches to model evaluation. We present findings from a participatory benchmarking evaluation of an AI assistant deployed in a manufacturing setting, demonstrating how evaluation practices that incorporate end-users’ situated expertise enable more nuanced assessments of model performance. By foregrounding context-specific knowledge, these practices more accurately capture real-world functionality and inform iterative system improvement. We conclude by outlining implications for the design of context-aware AI evaluation frameworks.