DMPKBench: A Multi-Modal Benchmark for Evaluating LLMs and Agents in Drug Discovery DMPK Tasks
Abstract
With the rapid progress of large language models (LLMs) and multi-agent systems, there is a growing need for fair and comprehensive evaluation of their capacity to address complex tasks in specialized scientific domains. Drug metabolism and pharmacokinetics (DMPK) constitutes a critical stage in drug discovery, requiring interdisciplinary reasoning and the integration of diverse knowledge. To meet this challenge, we constructed DMPKBench, a comprehensive benchmark designed to evaluate LLM and multi-agent performance on DMPK-related tasks. Grounded in real-world drug pipeline requirements, DMPKBench covers five core competencies essential to domain experts: experimental design and troubleshooting; interpretation of experimental results; Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) multi-parameter optimization; pharmacokinetic (PK) modeling and simulation; and preclinical-to-clinical PK translation. The benchmark comprises a large-scale knowledge base of more than 100,000 question–answer pairs across multiple modalities, including multi-step quantitative reasoning (e.g., human dose prediction), troubleshooting of real-world DMPK data tables and PK curves, and integrated problem-solving scenarios. The dataset is rigorously curated: four dimensions are quality-controlled by experts and one is supported by experimental evidence. In comparative evaluations, LLM accuracy ranged from 11 to 89 percent, revealing substantial gaps across dimensions. Overall, DMPKBench offers a high-quality, domain-specific foundation for advancing LLMs and multi-agent systems in drug discovery, with a subset to be openly released at https://anonymous.4open.science/r/DMPKBench-93B3/.