CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning
Abstract
Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding, Reasoning, and Inference Evaluations benchmark to measure the potential of Large Language Models (LLMs) in assisting scientists in realistic workflows. The benchmark comprises ten challenging tasks curated by experts in six disciplines: materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins, covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on the CURIE tasks, which require domain expertise, comprehension of long in-context information, and multi-step reasoning. While Claude-3 shows consistently high comprehension across domains, the popular GPT-4o and Command-R+ fail dramatically on protein sequencing tasks. Overall, there is considerable room for improvement for all models. We hope that insights from this work can guide the future development of LLMs in the sciences.