CURE-Bench: Competition on Reasoning Models for Drug Decision-Making in Precision Therapeutics
Abstract
Precision therapeutics require models that can reason over complex relationships between patients, diseases, and drugs. Large language models and large reasoning models, especially when combined with external tool use and multi-agent coordination, have demonstrated the potential to perform structured, multi-step reasoning in clinical settings. However, existing benchmarks, most of which are question-answering (QA) benchmarks, do not evaluate these capabilities in the context of real-world therapeutic decision-making. We present CURE-Bench, a competition and benchmark for evaluating AI models in drug decision-making and treatment planning. CURE-Bench includes clinically grounded tasks such as recommending treatments, assessing drug safety and efficacy, designing treatment plans, and identifying repurposing opportunities for diseases with limited therapeutic options. The competition has two tracks: one for models that reason using internal knowledge alone, and the other for agentic reasoning that integrates external tools and real-time information. Evaluation data are generated using a validated multi-agent pipeline that produces realistic questions, reasoning traces, and tool-based solutions. Participants will have access to baselines spanning both open-weight and API-based models, along with standardized metrics for correctness, factuality, interpretability, and robustness. Human expert evaluation provides an additional layer of validation. CURE-Bench provides a rigorous, reproducible competition framework for assessing the performance, robustness, and interpretability of reasoning models in high-stakes clinical applications. It will accelerate the development of therapeutic AI and foster collaboration between the AI and therapeutics communities.
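To make the two-track design concrete, the sketch below shows how an internal-knowledge submission and a tool-using agentic submission might be wired to a simple exact-match correctness score. This is a minimal illustration only: the abstract does not specify a submission interface, so every name here (Question, answer_internal, answer_agentic, lookup_drug_label, score_accuracy) is a hypothetical stand-in, not the CURE-Bench API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Question:
    """One benchmark item: a clinical prompt and its reference answer."""
    prompt: str
    reference: str


def answer_internal(q: Question) -> str:
    """Track 1 stand-in: answer from internal model knowledge only."""
    # A real submission would call an open-weight or API-based model here.
    return "metformin"


def lookup_drug_label(drug: str) -> str:
    """Hypothetical external tool for the agentic track (e.g., label lookup)."""
    labels = {"metformin": "first-line therapy for type 2 diabetes"}
    return labels.get(drug, "no label found")


def answer_agentic(q: Question) -> str:
    """Track 2 stand-in: draft an answer, then verify it against a tool."""
    candidate = answer_internal(q)
    evidence = lookup_drug_label(candidate)
    return candidate if "no label" not in evidence else "unknown"


def score_accuracy(system: Callable[[Question], str],
                   items: list[Question]) -> float:
    """Exact-match correctness; the full benchmark also scores factuality,
    interpretability, and robustness, per the abstract."""
    return sum(system(q).lower() == q.reference.lower() for q in items) / len(items)


if __name__ == "__main__":
    items = [Question("First-line drug for type 2 diabetes?", "metformin")]
    print("internal track:", score_accuracy(answer_internal, items))
    print("agentic track: ", score_accuracy(answer_agentic, items))
```

The key structural point the sketch illustrates is that both tracks share one scoring harness; only the answering system differs, with the agentic track free to consult external tools before committing to an answer.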
Schedule