Winning Solution: Early Signals of Scientific Knowledge: Separating Scientific Truth from Coherent Noise
Abstract
The resource-intensive nature of Large Language Model (LLM) development makes it necessary to identify promising architectures empirically at early training stages, before large costs are incurred. Addressing this need, the NeurIPS 2025 E2LM Competition challenged participants to develop an evaluation task capable of discerning intrinsic scientific knowledge across different models and architectures within the first 250,000 training steps. Our solution rests on the hypothesis that, while LLMs rapidly acquire the capacity to generate grammatically coherent text (fluency), a truly promising architecture should assign a significantly higher probability to scientific truths than to coherent but factually empty propositions (fiction). Additionally, to mitigate noise in this early-stage evaluation, we introduce a domain-filtering approach that removes questions falling outside the models' training distribution, ensuring that the metric accurately reflects each model's ability to learn scientific knowledge.
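The core comparison described above can be illustrated with a minimal sketch: score the log-likelihood a partially trained causal LM assigns to a scientific truth versus a matched, fluent but factually empty statement, and check which one the model prefers. The checkpoint name, helper function, and statement pair below are placeholders for illustration, not the actual competition submission or its evaluation data.

```python
# Minimal sketch (assumptions: Hugging Face transformers + torch; "gpt2" stands in
# for an early-training checkpoint; the truth/fiction pair is a toy example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sequence_log_likelihood(text: str) -> float:
    """Total log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token,
    # so multiply by the number of predicted tokens and negate.
    return -out.loss.item() * (ids.shape[1] - 1)

# Hypothetical truth/fiction pair; the real task uses curated scientific items.
truth = "Water boils at 100 degrees Celsius at standard atmospheric pressure."
fiction = "Water boils at 100 degrees Celsius when observed by a silent crowd."

# A promising architecture should prefer the scientific truth over coherent noise.
prefers_truth = sequence_log_likelihood(truth) > sequence_log_likelihood(fiction)
print("Model prefers the true statement:", prefers_truth)
```

Aggregating this binary preference over many truth/fiction pairs (after the domain filtering step) yields a score that, under our hypothesis, separates architectures that are actually acquiring scientific knowledge from those that are merely becoming fluent.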