Predicting Kinase-Specific Phosphorylation Sites with Pretrained Protein Language Models
Abstract
Accurately predicting kinase-specific phosphorylation sites remains difficult because of the diversity of kinases and the context-dependent nature of substrate recognition. Aberrant kinase overactivation is a hallmark of many cancers, including colorectal, gastric, liver, and breast tumors, where dysregulated kinase signaling promotes malignant transformation, tumor progression, and therapy resistance. This underscores the clinical importance of understanding kinase-substrate relationships and precisely mapping phosphorylation events. In this paper, we introduce two complementary sequence-based architectures, presented as two stages, that operate directly on full-length substrate and kinase sequences. Stage 1 extends a task-agnostic prediction method, Prot2Token, to jointly support three tasks: kinase-group classification from substrate sequences alone, kinase-substrate interaction prediction, and kinase-specific phosphorylation-site prediction. It also incorporates a self-supervised decoder pretraining task that predicts amino-acid positions from encoder embeddings; this pretraining substantially strengthens site prediction. Stage 2 specializes the architecture for phosphorylation-site prediction by replacing the causal decoding of Prot2Token with bidirectional decoding, yielding further gains. On standard benchmarks, the specialized model consistently outperforms widely used baselines. In evaluations on understudied dark kinases, in both in-distribution and zero-shot settings, the model shows signs of zero-shot kinase-specific phosphorylation-site prediction capability. Together, these results indicate that jointly modeling substrate and kinase sequences provides a straightforward, scalable approach to state-of-the-art, zero-shot-capable phosphorylation-site prediction.