Skip to yearly menu bar Skip to main content

Workshop: I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models

How Many Raters Do You Need? Power Analysis for Foundation Models

Christopher Homan · Shira Wein · Chris Welty · Lora Aroyo


Due to their highly stochastic nature, as well as the complexity of the tasks they can perform, foundation models (large machine learning models) are poorly suited for conventional machine learning evaluation methods. This is because machine learning evaluation methods typically assume behavior to be deterministic and simple enough to be measured against gold standard data with unitary, authoritative, "correct" answers using straightforward metrics such as accuracy, precision, and recall. In this work, we propose an evaluation framework suitable for foundation models, which takes into account variance in the responses of both machine model and human rater. Utilizing recent advances in p-value estimation, we investigate the trade-offs between the number of items in a test set, the number of responses per item, the sampling method, and the metric, when measuring the comparative differences between two hypothetical foundation models at various degrees of similarity. When two models are very far apart in their predictive performance, fewer raters are needed to confidently compare them, as expected. However, as the models draw closer, we find that a larger number of annotators than are currently typical in annotation collection are needed to ensure the power analysis correctly reflects the difference in performance.

Chat is not available.