OligoGym: Curated Datasets and Benchmarks for Oligonucleotide Drug Discovery
Abstract
Oligonucleotide therapeutics offer great potential to address previously undruggable targets and enable personalized medicine. However, their progress is often hindered by insufficient safety and efficacy profiles. Predictive modeling and machine learning could significantly accelerate oligonucleotide drug discovery by identifying suboptimal compounds early on, but the application of these methods to oligonucleotides lags behind that for other modalities. A key obstacle to the adoption of machine learning in the field is the scarcity of readily accessible and standardized datasets for model development, as data are often scattered across diverse experiments with inconsistent molecular representations. To overcome this challenge, we introduce OligoGym, a curated collection of standardized, machine learning-ready datasets encompassing various oligonucleotide therapeutic modalities and endpoints. We used OligoGym to benchmark diverse classical and deep learning methods, establishing performance baselines for each dataset across different featurization techniques, model configurations, and splitting strategies. Our work represents a crucial first step in creating a more unified framework for oligonucleotide therapeutic dataset generation and model training.