CauSciBench: Assessing LLM Causal Reasoning for Scientific Research
Abstract
While large language models (LLMs) are increasingly integrated into scientific research, their capability to perform causal inference, a cornerstone of scientific induction, remains under-evaluated. Existing benchmarks either focus narrowly on verifying method execution or pose open-ended tasks that leave the causal estimand, methodological choices, and variable selection underspecified. To address this gap, we introduce CauSciBench, a comprehensive benchmark that combines expert-curated problems from published research papers with diverse synthetic scenarios. Our benchmark spans both the potential outcomes framework and Pearl's structural causal model (SCM) framework, enabling systematic evaluation of LLM causal reasoning capabilities.