Workshop: Socially Responsible Language Modelling Research (SoLaR)

Evaluating Superhuman Models with Consistency Checks

Lukas Fluri · Daniel Paleka · Florian Tramer


If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this paper, we propose a framework for evaluating superhuman models via consistency checks. Our premise is that while the correctness of superhuman decisions may be impossible to evaluate, we can still surface mistakes if the model's decisions fail to satisfy certain logical, human-interpretable rules. We investigate two tasks where the correctness of decisions is hard to verify, either because of superhuman model abilities or because ground truth is otherwise missing: evaluating chess positions and forecasting future events. Regardless of a model's (possibly superhuman) performance on these tasks, we can discover logical inconsistencies in its decision making: a chess engine assigning opposing valuations to semantically identical boards, or GPT-4 forecasting that sports records will evolve non-monotonically over time.
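To make the idea of a consistency check concrete, here is a minimal Python sketch of a monotonicity check on forecasts. It is an illustration under stated assumptions, not the authors' implementation: the function name `monotonicity_violations`, its parameters, and the forecast values are all hypothetical. The rule it encodes is that the record in a timed event can only stay the same or improve, so forecasts of the record time for later years should never exceed forecasts for earlier years.

```python
from typing import List, Sequence, Tuple


def monotonicity_violations(
    years: Sequence[int],
    forecasts: Sequence[float],
    direction: str = "non_increasing",
    tol: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return adjacent (earlier_year, later_year) pairs that break the expected trend.

    Hypothetical helper for illustration; no ground truth is needed,
    only the model's own forecasts.
    """
    if len(years) != len(forecasts):
        raise ValueError("years and forecasts must have the same length")
    if direction not in ("non_increasing", "non_decreasing"):
        raise ValueError("direction must be 'non_increasing' or 'non_decreasing'")

    # Sort by year so that adjacent entries are compared chronologically.
    ordered = sorted(zip(years, forecasts))

    violations = []
    for (y0, f0), (y1, f1) in zip(ordered, ordered[1:]):
        if direction == "non_increasing":
            broken = f1 > f0 + tol  # later forecast is worse than earlier one
        else:
            broken = f1 < f0 - tol
        if broken:
            violations.append((y0, y1))
    return violations


# Made-up forecasts of a sprint record time in seconds (not data from the paper).
years = [2025, 2030, 2035, 2040]
forecasts = [9.58, 9.55, 9.57, 9.52]

print(monotonicity_violations(years, forecasts))  # prints [(2030, 2035)]
```

The check flags the 2030 to 2035 pair because the forecast record time rises from 9.55 to 9.57, which a record cannot do. Note that passing such a check does not show the forecasts are correct; a violation only shows that at least one of them must be wrong.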
