First Comprehensive Benchmark for Tailored Small Molecule-Binding Aptamer Design
Abstract
Aptamers are emerging as robust recognition elements for diagnostics and therapeutics, yet computational discovery pipelines remain limited to protein targets, leaving small-molecule binding largely unexplored. To fill this gap, we present the first unified benchmark for aptamer–small molecule interactions, built from seven curated sources and comprising 2,200 annotated pairs, 1,430 unique aptamers (DNA and RNA), and 496 ligands spanning a broad chemical space. Over half of the entries include quantitative binding affinities, enabling both classification and regression tasks, while synthetic negatives generated via cross-pair sampling support more realistic model training. Using this dataset, we conduct a systematic benchmarking study across multiple representation strategies for both aptamers and ligands. Our experiments cover discrete encodings, pretrained embeddings, and hybrid fusion schemes, each evaluated with both shallow and deep learning models. This analysis establishes stable baselines for binding prediction and reveals complementary strengths of sequence-based and embedding-based features. Beyond classification, we also provide the first regression baselines for continuous affinity prediction. Building on these results, we outline a reinforcement learning pipeline in which supervised models guide the conditional generation of aptamer sequences for arbitrary small molecules. This framework represents the next step toward scalable, data-driven aptamer discovery beyond SELEX and molecular docking.
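As an illustration of the cross-pair sampling mentioned above, the following is a minimal sketch of how synthetic negatives could be drawn from a set of annotated positive pairs. It is not the benchmark's actual procedure: the function name `sample_cross_pair_negatives`, the toy sequences and ligand names, and the working assumption that any aptamer–ligand combination absent from the positive set is non-binding are all hypothetical choices made for this example.

```python
import random


def sample_cross_pair_negatives(positives, n_negatives, seed=0):
    """Draw aptamer-ligand pairs that do not appear among the positives.

    Assumption (not from the paper): an unobserved combination of a known
    aptamer and a known ligand is treated as a non-binding pair.
    """
    positive_set = set(positives)
    aptamers = sorted({aptamer for aptamer, _ in positives})
    ligands = sorted({ligand for _, ligand in positives})

    # Every cross combination that is not an annotated positive is a candidate.
    candidates = [
        (aptamer, ligand)
        for aptamer in aptamers
        for ligand in ligands
        if (aptamer, ligand) not in positive_set
    ]

    # Shuffle with a fixed seed so the sampled negatives are reproducible.
    rng = random.Random(seed)
    rng.shuffle(candidates)
    return candidates[:n_negatives]


# Toy data: placeholder sequences and ligand names, purely illustrative.
positives = [
    ("GGTTGGTGTGGTTGG", "ligand_A"),
    ("ACGUACGUACGU", "ligand_B"),
    ("TTAGGGTTAGGG", "ligand_C"),
]
negatives = sample_cross_pair_negatives(positives, n_negatives=4)
for aptamer, ligand in negatives:
    print(aptamer, ligand)
```

Because unobserved pairs are only presumed to be non-binders, negatives produced this way may contain some true binders. In practice one might additionally exclude ligands that are chemically similar to an aptamer's known targets.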