Reliable Fine-Grained Evaluation of Natural Language Math Proofs
Wenjie Ma · Andrei Cojocaru · Neel Kolhe · Robin Sharif · Haihan Zhang · Vincent Zhuang · Matei A Zaharia · Sewon Min
Abstract
Recent advances in large language models (LLMs) for math reasoning have largely focused on tasks with easily verifiable final answers; however, generating natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0–7 scale to model-generated math proofs. We first introduce **ProofBench**, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions and 435 LLM-generated solutions from Gemini-2.5-Pro, o3, and DeepSeek-R1. With ProofBench, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions, and evaluation workflow. Our analysis delivers **ProofGrader**, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
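As an illustration of the two headline metrics in the abstract, the minimal Python sketch below computes MAE on the 0–7 scale and the "gap closed" fraction for best-of-$n$ selection. The function names and the example score lists are hypothetical placeholders, not the paper's code or data; only the 4.14, 2.48, and 4.62 figures are taken from the abstract.

```python
def mean_absolute_error(predicted, expert):
    """MAE between evaluator scores and expert scores on the 0-7 scale."""
    assert len(predicted) == len(expert)
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(predicted)


def gap_closed(evaluator_score, baseline_score, oracle_score):
    """Fraction of the baseline-to-oracle gap recovered by the evaluator."""
    return (evaluator_score - baseline_score) / (oracle_score - baseline_score)


# Hypothetical evaluator vs. expert scores for three proofs (placeholder values).
predicted = [5.0, 2.0, 7.0]
expert = [6.0, 1.0, 7.0]
print(mean_absolute_error(predicted, expert))  # -> 0.666...

# Reproduces the abstract's best-of-16 figure:
# (4.14 - 2.48) / (4.62 - 2.48) ≈ 0.78, i.e. 78% of the gap closed.
print(gap_closed(4.14, 2.48, 4.62))
```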