Faults in our Formal Benchmarks
Pawan Sasanka Ammanmanchi · Siddharth Bhat
Abstract
Recent focus on LLM-assisted theorem proving has driven the rapid introduction of several benchmarks. Benchmarks in Lean are considered highly reliable because of the guarantees that proof assistants and formal languages provide. However, we question the quality of popular datasets and find that multiple widely used benchmarks contain material defects. Through examples, we discuss common issues that recur across datasets, such as omitted side conditions, misformalization, and incorrect or incomplete translations. We aim to establish that such defects occur in every dataset we examined, despite their prevalent use. Defective datasets threaten the quality of evaluations in the field, and we propose stronger dataset standards for future releases.
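To make the notion of an omitted side condition concrete, here is a minimal hypothetical sketch in Lean 4 (illustrative only, not an example drawn from the paper). Because natural-number division in Lean is total, with 0 / 0 defined as 0, a benchmark statement that drops the hypothesis n ≠ 0 becomes outright false, and the "faithful" version needs the side condition restored:

```lean
-- Hypothetical illustration of an omitted side condition (not taken from
-- any benchmark discussed in the paper). Lean's natural-number division
-- is total, with 0 / 0 = 0, so the unguarded statement is false:
example : ¬ ∀ n : Nat, n / n = 1 := by
  intro h
  exact absurd (h 0) (by decide)  -- counterexample: 0 / 0 = 0 ≠ 1

-- The faithful formalization carries the side condition n ≠ 0:
example (n : Nat) (h : n ≠ 0) : n / n = 1 :=
  Nat.div_self (Nat.pos_of_ne_zero h)
```

A model asked to prove the unguarded statement can only fail (or "succeed" by exploiting an unrelated soundness gap), so a defect like this silently corrupts any pass-rate computed over the benchmark.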