Measuring LLM Generation Spaces with EigenScore
Abstract
An LLM's generation space for a given prompt (the range of semantically distinct outputs it could produce) provides a window into the model's implicit task representation. We currently lack a metric for characterizing this space. In this work, we argue that the EigenScore metric, originally developed for hallucination detection, captures the size of this generation space. To develop this understanding, we construct synthetic datasets of prompt pairs with known generation space relationships (complement, subset, etc.). We show that EigenScore reliably predicts a prompt's generation space size, outperforming alternative metrics such as perplexity and entropy. We provide further evidence for this interpretation by showing a strong connection between a prompt's EigenScore and the number of reasoning tokens the model spends on that prompt. Our work uses EigenScore to contribute a cognitive understanding of a model's generation space size and how it relates to the reasoning abilities of LLMs.
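As a reference point for how the metric is computed, the following is a minimal sketch of an EigenScore-style computation, assuming hidden-state sentence embeddings for K sampled generations are already available; the centering step and the regularization constant `alpha` follow the standard log-determinant formulation from the hallucination-detection literature and may differ in detail from the exact setup used in our experiments.

```python
# Minimal sketch of an EigenScore-style computation (assumption: sentence
# embeddings for K sampled generations are already extracted from the model).
import numpy as np

def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Log-determinant of the regularized covariance of K generation embeddings.

    embeddings: array of shape (K, d), one embedding per sampled output.
    Higher values indicate a more semantically diverse (larger) generation space.
    """
    K, _ = embeddings.shape
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)  # center in embedding space
    # K x K Gram matrix of centered embeddings; its nonzero eigenvalues match
    # those of the d x d covariance, but K << d keeps the computation cheap.
    gram = centered @ centered.T
    eigvals = np.linalg.eigvalsh(gram + alpha * np.eye(K))  # regularize for numerical stability
    return float(np.sum(np.log(eigvals)) / K)

# Toy usage: 10 sampled generations with 4096-dimensional embeddings (random data).
rng = np.random.default_rng(0)
diverse = rng.normal(size=(10, 4096))  # spread-out embeddings -> higher score
repetitive = np.tile(rng.normal(size=(1, 4096)), (10, 1)) + 0.01 * rng.normal(size=(10, 4096))
print(eigenscore(diverse), eigenscore(repetitive))
```

Intuitively, when sampled generations cluster tightly in embedding space the covariance eigenvalues shrink and the score drops, while widely spread generations yield a larger log-volume; this is the sense in which EigenScore can be read as tracking generation space size.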