Smarter Sampling for LLM Judges: Reliable Evaluation on a Budget
Alyssa Unell · Natalie Dullerud · Nigam Shah · Sanmi Koyejo
Abstract
LLM-as-a-judge is increasingly dominant as a framework for scalable evaluation of artificial intelligence (AI) systems and agents. The technique involves prompting a large language model (LLM) to assess the capabilities of another AI model. Although this framework reduces human annotation requirements, human oversight is still needed to gauge the performance of the judge LLM itself. Such human annotations can be expensive to obtain, particularly in domains that require expert annotation, such as clinical text generation. This motivates two questions: (1) Can we bound the number of human annotations necessary to gauge the performance of a judge LLM? and (2) Can we curate the subset of data for human annotation in a principled way? In this paper, we answer (1) through a Chernoff bound for the intraclass correlation coefficient (ICC), the primary metric for measuring LLM-as-judge performance relative to human labels. To address (2), we propose $7$ sampling methods and demonstrate their utility relative to random sampling on simulated and real-world data. We show tighter bounds on sampling requirements and up to a 41\% relative improvement in ICC precision compared to random baselines.
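For readers unfamiliar with the metric, the sketch below computes one common ICC variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), between an LLM judge's scores and a human annotator's scores. The abstract does not specify which ICC variant or implementation the paper uses; the `icc2_1` helper and the toy scores here are illustrative assumptions only.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_items, n_raters) matrix of scores, e.g. column 0 = LLM judge,
    remaining columns = human annotators.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-item means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = ((ratings - grand_mean) ** 2).sum()
    ss_rows = k * ((row_means - grand_mean) ** 2).sum()   # between items
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()   # between raters
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Toy example (hypothetical data): LLM-judge vs. one human annotator on 6 items.
judge = np.array([4, 2, 5, 3, 1, 4], dtype=float)
human = np.array([5, 2, 4, 3, 1, 5], dtype=float)
print(f"ICC(2,1) = {icc2_1(np.column_stack([judge, human])):.3f}")
```

An ICC near 1 indicates the judge's scores agree closely with the human labels in absolute terms; the paper's sampling question is how few human-labeled items are needed to estimate this quantity to a desired precision.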