Revolutionize drug discovery with dense PPI data
Abstract
Drug development faces persistent tradeoffs between efficacy, safety, and developability, yet existing foundation models cannot reliably predict binding affinity, the central challenge for therapeutic design. This limitation stems from sparse protein–protein interaction (PPI) datasets, which largely reflect natural protein pairs and encourage memorization rather than generalization. We propose dense PPI datasets that systematically sample mutational neighborhoods, compelling models to learn transferable interaction principles. Using scalable fluorescence-activated cell sorting (FACS) and sequencing, billions of labeled data points can be generated at reasonable cost. These datasets would enable PPI-specific foundation models with accurate affinity prediction, improved structure modeling, and efficient exploration of interaction-aware sequence landscapes, with transformative impact on drug discovery, diagnostics, synthetic biology, and the broader life sciences.