Poster in Workshop: MATH-AI: The 5th Workshop on Mathematical Reasoning and AI

BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs

Ivo Petrov ⋅ Jasper Dekoninck ⋅ Martin Vechev

Abstract

Large language models (LLMs) have shown strong performance on mathematical benchmarks. However, they are also prone to sycophancy, producing convincing but flawed proofs of incorrect theorems supplied by users. Existing benchmarks for mathematical sycophancy are limited because they rely on simple and often contaminated final-answer problems rather than more difficult proof-based tasks. To address this, we introduce BrokenMath, the first benchmark for evaluating LLMs' sycophancy in natural language theorem proving. BrokenMath is built from advanced 2025 competition problems, which are perturbed with an LLM to produce false statements and subsequently refined through expert review. Using an LLM-as-a-judge, we evaluate state-of-the-art LLMs and find that sycophancy is widespread: even the best model, GPT-5, produces sycophantic answers 29% of the time. We further investigate several mitigation strategies and find that these approaches reduce, but do not eliminate, sycophancy.
