
Workshop: I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models

A Study on the Calibration of In-context Learning


Modern auto-regressive language models are trained to minimize log loss on next-token prediction, so in principle they should produce calibrated probabilities when a problem is framed as a next-token prediction task. We study this formulation in the setting of in-context learning, a widely used way to adapt frozen large language models (LLMs), and find trade-offs between performance and calibration across a wide range of natural language understanding and reasoning tasks. Human evaluation shows that hallucination rates align well with the miscalibrated results. Furthermore, we find that selecting in-context examples from test datasets, as well as common and widely effective recalibration techniques such as temperature scaling, may provide limited gains in calibration error, suggesting that new methods may be required for settings where models are expected to be reliable.
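As background for the abstract's terminology, the sketch below shows how calibration error is typically measured and what temperature scaling does; it is a generic illustration, not the paper's evaluation code, and the function names and bin count are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap
    |accuracy - confidence| over bins, weighted by bin size."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean()
                                     - confidences[mask].mean())
    return ece

def temperature_scale(logits, T):
    """Softmax over logits divided by temperature T.
    T > 1 softens (less confident), T < 1 sharpens the distribution."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```

In standard post-hoc recalibration, T is fit on a held-out validation set to minimize log loss, then applied at test time; the abstract's observation is that this yields limited ECE improvement in the in-context learning setting.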
