Poster
On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton · Noah Siegel · Janos Kramar · Jonah Brown-Cohen · Samuel Albanie · Jannis Bulian · Rishabh Agarwal · David Lindner · Yunhao Tang · Noah Goodman · Rohin Shah
East Exhibit Hall A-C #4403
Scalable oversight protocols are alignment techniques that aim to allow the accuracy of human supervision to scale with the capabilities of superhuman AI. Debate, in which two capable AI agents compete to convince a less capable human judge, is the primary approach to scalable oversight. However, empirical research on debate has thus far focused on a single task where the capabilities of the human judge are artificially limited by withholding information required to accurately solve the task. This artificial limitation fails to capture the target scenario of generally capable AI agents, for whom the gaps with less capable humans are likely to span a broad variety of reasoning capabilities. To better understand the power and limitations of debate, in this paper we study natural language debates between frontier LLMs judged by less capable LLMs across a variety of tasks, including both extractive QA and closed QA tasks involving mathematical, coding, logical, and multimodal reasoning. By varying the LLM used as the judge, we can systematically vary the capability gap in each task we study. We find that in tasks with no artificial judge limitations, LLM judges achieve higher accuracy in debate than in consultancy, in which a single capable LLM makes an argument to convince the judge. This provides empirical evidence that debate can effectively bridge a broader variety of capability gaps. We further show that stronger debaters (as measured by Elo scores) have a larger advantage when arguing for the correct answer than for the incorrect answer, and lead to higher judge accuracy, providing additional evidence for the utility of debate for scalable oversight.
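To make the two protocols compared in the abstract concrete, the sketch below shows one way a debate (two strong models arguing opposite answers before a weaker judge) and a consultancy (a single strong model arguing an assigned answer) could be set up. This is a minimal illustrative sketch, not the authors' implementation: the function names, prompts, and stub models are all hypothetical placeholders.

```python
"""Illustrative sketch of debate vs. consultancy with a weaker LLM judge.

A "model" here is any callable mapping a prompt string to a response string
(e.g. a wrapper around an LLM API). Stubs are used so the sketch runs.
"""
from typing import Callable

Model = Callable[[str], str]


def debate(question: str, answer_a: str, answer_b: str,
           debater_a: Model, debater_b: Model, judge: Model,
           n_rounds: int = 2) -> str:
    """Two strong debaters argue for opposing answers; a weaker judge decides."""
    transcript = f"Question: {question}\nA claims: {answer_a}\nB claims: {answer_b}\n"
    for _ in range(n_rounds):
        transcript += "A: " + debater_a(f"Argue that '{answer_a}' is correct.\n{transcript}") + "\n"
        transcript += "B: " + debater_b(f"Argue that '{answer_b}' is correct.\n{transcript}") + "\n"
    verdict = judge(f"{transcript}\nWhich answer is correct, A or B? Reply 'A' or 'B'.")
    return answer_a if verdict.strip().upper().startswith("A") else answer_b


def consultancy(question: str, assigned_answer: str,
                consultant: Model, judge: Model) -> bool:
    """A single strong consultant argues for its assigned answer (which may be
    incorrect); the weaker judge decides whether to accept it."""
    argument = consultant(f"Question: {question}\nArgue that '{assigned_answer}' is correct.")
    verdict = judge(f"Question: {question}\nArgument: {argument}\nAccept this answer? yes/no.")
    return verdict.strip().lower().startswith("y")


if __name__ == "__main__":
    # Toy stand-ins for the strong debaters/consultant and the weaker judge.
    def strong(prompt: str) -> str:
        return "Here is a persuasive argument."

    def weak_judge(prompt: str) -> str:
        return "A"

    print(debate("2+2=?", "4", "5", strong, strong, weak_judge))
    print(consultancy("2+2=?", "5", strong, weak_judge))
```

Judge accuracy under each protocol can then be estimated by running these loops over a labeled question set and comparing how often the judge's verdict matches the ground-truth answer.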