PrimerCast: Predictive Modeling of PCR Amplification with an AI-Ready Experimental Dataset
Abstract
Polymerase chain reaction (PCR) is the foundational technology for detecting pathogens from their genomic sequences. Rapidly selecting effective primers is critical in outbreak settings, yet whether a candidate primer pair will amplify a target sequence depends on a complex mix of sequence–sequence interactions, nonlinear thermodynamics, and cross-genome variability. As a result, existing primer design tools rely on hand-crafted heuristics and qualitative rules. From the candidates these tools propose, high-performing primers are typically identified only through slow, trial-and-error laboratory testing. To overcome this bottleneck, we introduce the first AI-ready, large-scale, experimentally measured dataset of PCR amplification outcomes. The dataset comprises 50,760 unique primer–target reactions spanning 141 viral detection targets and 360 primer sets, each with a quantitative label of amplification efficiency. Leveraging this resource, we train PrimerCast, a predictive model that forecasts both amplification success and amplification efficiency. PrimerCast consistently outperforms both widely used heuristic tools and prior baselines, enabling reliable, data-driven primer evaluation. By framing primer performance as a predictive modeling challenge and releasing an AI-ready dataset, we provide a foundation for faster, more reliable diagnostic test development and open the door to applying machine learning to broader nucleic acid technologies.