Poster

ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

Shulin Huang ⋅ Linyi Yang ⋅ Yan Song ⋅ Shawn Chen ⋅ Leyang Cui ⋅ Ziyu Wan ⋅ Qingcheng Zeng ⋅ Ying Wen ⋅ Kun Shao ⋅ Weinan Zhang ⋅ Jun Wang ⋅ Yue Zhang

2025 Poster


Abstract

Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly evaluate the reasoning capability of LLMs. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 process reward models (PRMs) under identical experimental conditions and show that the performance of most LLMs is far from robust and that they exhibit some degree of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination. Our data and code are available at https://github.com/huangshulin123/ThinkBench.
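The idea of comparing a model on original items and on their OOD rewrites can be illustrated with a small sketch. This is not the ThinkBench code or its metric; the `Result` record, its field names, and the `robustness_report` helper are hypothetical, and the only assumption is that each question has been graded once in its original form and once in its OOD form.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Outcome of one model on one question (hypothetical record layout)."""
    question_id: str
    correct_original: bool  # graded on the original benchmark item
    correct_ood: bool       # graded on the rewritten (OOD) variant


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def robustness_report(results):
    """Summarize original vs. OOD accuracy and the gap between them."""
    acc_original = accuracy(r.correct_original for r in results)
    acc_ood = accuracy(r.correct_ood for r in results)
    # Items solved only in their original form are candidates for
    # memorization; this is a heuristic signal, not proof of leakage.
    lost = [r.question_id for r in results
            if r.correct_original and not r.correct_ood]
    return {
        "acc_original": acc_original,
        "acc_ood": acc_ood,
        "gap": acc_original - acc_ood,
        "lost_on_ood": lost,
    }


if __name__ == "__main__":
    demo = [
        Result("q1", True, True),
        Result("q2", True, False),
        Result("q3", True, False),
        Result("q4", False, False),
    ]
    print(robustness_report(demo))
```

A large positive `gap` under this sketch would suggest that a model's score on the original items overstates its reasoning ability, which is the kind of non-robust behavior the abstract reports.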
