Smiles2Dock: a large-scale dataset for ML-based docking score prediction using AlphaFold structures
Abstract
Docking is a crucial component of drug discovery that aims to predict the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remain scarce. Moreover, existing datasets rely on proteins with experimentally determined structures and known ligand binding pockets, making them unusable for the growing number of proteins with only predicted structures. We introduce Smiles2Dock, an open large-scale dataset for molecular docking that addresses this gap. We created a framework that combines P2Rank for binding pocket prediction with AutoDock Vina for docking, enabling us to dock 1.7 million ligands from the ChEMBL database against 11 genetically validated proteins with structures from AlphaFold, resulting in over 17 million protein-ligand binding scores. Because AlphaFold-predicted structures do not include known ligand binding sites, using P2Rank to locate pockets allows docking to be performed without any experimental structure information, a first at this scale. The dataset encompasses a diverse set of biologically relevant compounds and enables researchers to benchmark all major approaches to ML-based docking, such as graph-, Transformer-, and CNN-based methods. We also introduce a novel Transformer-based architecture for docking score prediction and set it as an initial benchmark for our dataset.
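To illustrate how a predicted pocket can stand in for an experimentally known binding site, the sketch below turns a pocket-prediction table into a docking search box. It is a minimal illustration rather than the pipeline used to build the dataset: the helper name `top_pocket_box`, the 20 Å cubic box, and the example values are hypothetical, and the column names (`rank`, `center_x`, `center_y`, `center_z`) are assumed to follow the padded CSV layout of a P2Rank predictions file.

```python
import csv
import io


def top_pocket_box(predictions_csv: str, box_size: float = 20.0) -> dict:
    """Return a cubic docking box centered on the highest-ranked pocket.

    Assumes a CSV with 'rank' and 'center_x/y/z' columns (assumed layout),
    where headers and values may be padded with spaces.
    """
    reader = csv.DictReader(io.StringIO(predictions_csv), skipinitialspace=True)
    # Strip the alignment padding from both headers and values.
    rows = [{k.strip(): v.strip() for k, v in row.items()} for row in reader]
    if not rows:
        raise ValueError("no pockets were predicted for this structure")
    # Rank 1 is taken to be the most confident pocket.
    best = min(rows, key=lambda r: int(r["rank"]))
    center = [float(best[f"center_{axis}"]) for axis in "xyz"]
    # box_size (in angstroms) is an illustrative default, not a value from the paper.
    return {"center": center, "box_size": [box_size] * 3}


# Made-up example table with two predicted pockets.
example = """name     ,  rank,   score, probability, center_x, center_y, center_z
pocket1  ,     1,   12.50,       0.710,   10.5,  -3.25,    7.0
pocket2  ,     2,    4.10,       0.180,   -1.0,   22.0,   15.5
"""

box = top_pocket_box(example)
print(box)
```

The resulting center and box size are the two quantities a docking program such as AutoDock Vina needs to define its search space, which is what makes docking possible without a co-crystallized ligand.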