Realizing LLMs’ Causal Potential Requires Science-Grounded, Novel Benchmarks
Abstract
Recent claims of strong performance by Large Language Models (LLMs) on causal discovery tasks are undermined by a critical flaw: many evaluations rely on benchmarks likely included in LLMs' pretraining data, raising concerns that apparent success reflects memorization rather than genuine reasoning. This risks creating a misleading narrative that LLM-only methods, which ignore observational data, outperform classical statistical approaches. We challenge this view by asking whether LLMs truly reason about causal structure, how such reasoning can be measured reliably without leakage, and whether LLMs can be trusted for causal discovery in real scientific domains. We argue that realizing their potential for accelerating scientific discovery requires two shifts: developing robust evaluation protocols based on recent, unseen scientific studies to avoid dataset leakage, and designing hybrid methods that combine LLM-derived world knowledge with statistical approaches. To this end, we outline a practical recipe for constructing causal graphs from post-training scientific publications, ensuring evaluations remain leakage-free while encompassing both established and novel causal relationships.
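To make the leakage-free evaluation idea concrete, the following is a minimal Python sketch, not the paper's actual pipeline: it assumes a hypothetical training cutoff date, a toy list of studies with hand-curated causal edges, and a placeholder set of LLM-predicted edges, then filters out any study published before the cutoff and scores predictions edge by edge against the remaining reference graph.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical training cutoff for the LLM under evaluation; set per model.
TRAINING_CUTOFF = date(2024, 1, 1)


@dataclass
class Study:
    """A scientific publication annotated with its reported causal edges."""
    title: str
    published: date
    # Directed causal edges (cause, effect) curated from the paper's findings.
    edges: set[tuple[str, str]] = field(default_factory=set)


def leakage_free(studies: list[Study], cutoff: date = TRAINING_CUTOFF) -> list[Study]:
    """Keep only studies published after the model's training cutoff."""
    return [s for s in studies if s.published > cutoff]


def score_predictions(reference: set[tuple[str, str]],
                      predicted: set[tuple[str, str]]) -> dict[str, float]:
    """Edge-level precision and recall of predicted edges against the reference graph."""
    true_positives = len(reference & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return {"precision": precision, "recall": recall}


# Toy example: one post-cutoff study forms the benchmark; the older study is dropped.
studies = [
    Study("Example post-cutoff trial", date(2024, 6, 1), {("drug_A", "biomarker_B")}),
    Study("Older survey", date(2021, 3, 1), {("smoking", "cancer")}),
]
benchmark = leakage_free(studies)
reference_edges = set().union(*(s.edges for s in benchmark))
llm_edges = {("drug_A", "biomarker_B"), ("biomarker_B", "drug_A")}  # placeholder LLM output
print(score_predictions(reference_edges, llm_edges))
```

The cutoff filter is the key step: only relationships published after the model's training data was collected can distinguish genuine causal reasoning from memorization.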