Automated Capability Evaluation of Foundation Models
Abstract
Current evaluation frameworks for foundation models rely on fixed, manually curated benchmarks, which limits coverage of model capabilities. We propose Active learning for Capability Evaluation, a scalable framework for automated, fine-grained evaluation. Our framework leverages language models to decompose a domain into semantically meaningful capabilities and to generate diverse tasks for each, substantially reducing human curation effort. It models a subject model’s performance as a capability function over a latent semantic space and applies active learning to prioritize the most informative evaluations. This adaptive strategy enables cost-efficient discovery of strengths, weaknesses, and failure modes that static benchmarks may overlook. Our results show that this adaptive evaluation yields a more complete picture of model capabilities than fixed benchmarks alone.
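As a rough illustration of the evaluation loop summarized above, the sketch below models a subject model's per-capability score with a Gaussian-process surrogate over capability embeddings and uses uncertainty sampling to choose the next capability to evaluate. The embedding source, the `evaluate_subject_model` stub, and the choice of surrogate and acquisition rule are illustrative assumptions, not the exact method of the framework.

```python
# Illustrative sketch of an active-learning capability-evaluation loop.
# Assumptions (not from the paper): capabilities are represented by fixed
# latent embeddings, the surrogate is a GP with an RBF kernel, and the
# acquisition rule is simple uncertainty sampling.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Hypothetical latent embeddings for 200 candidate capabilities
# (e.g., obtained from a language model's decomposition of a domain).
capability_embeddings = rng.normal(size=(200, 8))

def evaluate_subject_model(capability_idx: int) -> float:
    """Placeholder: run generated tasks for this capability and return a score in [0, 1]."""
    return float(rng.uniform())  # replace with real task execution and scoring

evaluated: dict[int, float] = {}

# Seed with a few randomly chosen capabilities.
for idx in rng.choice(len(capability_embeddings), size=5, replace=False):
    evaluated[int(idx)] = evaluate_subject_model(int(idx))

for _ in range(20):  # evaluation budget
    X = capability_embeddings[list(evaluated)]
    y = np.array(list(evaluated.values()))
    surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    surrogate.fit(X, y)

    # Uncertainty sampling: query the capability where the surrogate is least certain.
    _, std = surrogate.predict(capability_embeddings, return_std=True)
    std[list(evaluated)] = -np.inf  # skip already-evaluated capabilities
    next_idx = int(np.argmax(std))
    evaluated[next_idx] = evaluate_subject_model(next_idx)

# `evaluated` now holds scores for the capabilities judged most informative
# under the budget, and the fitted surrogate approximates the capability
# function over the rest of the latent space.
```

In this sketch the acquisition rule is pure uncertainty reduction; in practice it could be swapped for any acquisition function that trades off uncertainty against evaluation cost.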