Poster

Evaluating the World Model Implicit in a Generative Model

Keyon Vafa · Justin Chen · Jon Kleinberg · Sendhil Mullainathan · Ashesh Rambachan


Abstract:

Recent work suggests that large language models may implicitly learn world models. How should we assess this possibility? We formalize this question for the case where the underlying reality is governed by a deterministic finite automaton, a scenario that includes problems as diverse as simple logical reasoning, geographic navigation, game playing, and chemistry. Our framework suggests that existing diagnostics can be misleading: generative models can look near-perfect on these tests while possessing incoherent world models. We propose new evaluation metrics for world model recovery inspired by the classic Myhill-Nerode theorem from language theory. We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics, but our evaluation metrics reveal their world models to be far less coherent than they appear. Such incoherence creates fragility: using a generative model to solve related but subtly different tasks can lead it to fail badly. Building generative models that meaningfully capture the underlying logic of the domains they model would be immensely valuable; our results suggest new ways to assess how close any given model is to that goal.
