Workshop: AI for Science: Mind the Gaps

A Fresh Look at De Novo Molecular Design Benchmarks

Austin Tripp · Gregor Simm · José Miguel Hernández-Lobato


De novo molecular design is a thriving research area in machine learning (ML) that lacks widely adopted, high-quality, standardized benchmark tasks. Many existing benchmark tasks do not precisely specify a training dataset or an evaluation budget, which is problematic because both can significantly affect the performance of ML algorithms. This work elucidates the effect of dataset size and experimental budget on established molecular optimization methods through a comprehensive evaluation on 11 selected benchmark tasks. We observe that dataset size and budget significantly impact every method's performance and relative ranking, suggesting that a meaningful comparison requires more than a single benchmark setup. Our results also highlight the relative difficulty of the benchmarks, implying in particular that logP and QED are poor objectives. We end by offering guidance to researchers on their choice of experiments.