SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset
Pablo Lemos · Zane Beckwith · Sasaank Bandi · Jordan Crivelli-Decker · Benjamin Shields · Thomas Merth · Punit Jha · Nicola De Mitri · Tiffany Callahan · Romelia Salomon-Ferrer · Martin Ganahl
Abstract
Predicting protein-ligand binding affinities is crucial for drug discovery but is limited by the scarcity of high-quality 3D structures with measured activity. We present the \textbf{Structurally Augmented IC50 Repository (SAIR)}, the largest public dataset of protein-ligand 3D structures with activity data, containing \textbf{$5,244,285$ million structures across $1,048,857$ protein-ligand systems} from ChEMBL and BindingDB, computationally folded using Boltz-1x. The PoseBusters algorithm shows that approximately $97\%$ of structures are phisically valid. Benchmarking traditional scoring functions (Vina, Vinardo) and machine learning models (OnionNet-2, AEV-PLIG) indicates that ML models outperform classical methods but still correlate poorly with true affinities, emphasizing the need for models adapted to synthetic structures. SAIR provides a foundation for developing next-generation binding-affinity prediction methods.
Chat is not available.
Successful Page Load