Poster in Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

Evaluating Language Models' Evaluations of Games

Katie Collins ⋅ Cedegao (Ced) Zhang ⋅ Graham Todd ⋅ Lance Ying ⋅ Mauricio Barba ⋅ Ryan Liu ⋅ Adrian Weller ⋅ Ionatan Kuperwajs ⋅ Catherine Wong ⋅ Josh Tenenbaum ⋅ Tom Griffiths

Project Page [OpenReview]

Abstract

Reasoning is not just about solving problems; it is also about evaluating which problems are worth solving at all. To date, evaluation of artificial intelligence (AI) systems has focused primarily on how they solve problems, often by studying how models play games. In this paper, we advocate for a new paradigm that assesses AI systems' evaluations of games. We leverage a large-scale dataset of over 100 novel board games and hundreds of human judgments to compare evaluations produced by language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) of games and assessing their funness. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult it is to quantify. We find that reasoning models are generally more aligned with people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to the game-theoretic optimum, their fit to human data weakens. We also observe more "jaggedness" across models when assessing funness, in line with the greater difficulty of quantifying this query.
