Zero-Shot Protein–Ligand Binding-Residue Prediction from Sequence and SMILES
Mahdi Pourmirzaei · Salhuldin Alqarghuli · Kai Chen · Mohammadreza Pourmirzaeioliaei · Dong Xu
Abstract
Accurate identification of protein--ligand binding residues is critical for mechanistic biology and drug discovery, yet performance varies widely across ligand families and data regimes. We present a systematic evaluation framework that stratifies ligands into three settings: \emph{overrepresented} (many examples), \emph{underrepresented} (tens of examples), and \emph{zero-shot} (unseen during training). Using this framework, we develop and assess a three-stage, sequence-based modeling suite that progressively adds ligand conditioning and \emph{zero-shot} capability. Stage 1 trains per-ligand predictors on top of a pretrained protein language model (PLM). Stage 2 introduces ligand-aware conditioning via an embedding table, enabling a single multi-ligand model. Stage 3 replaces the table with a pretrained chemical language model (CLM) operating on SMILES, enabling \emph{zero-shot} generalization. Stage 2 improves Macro~$F_1$ on the \emph{overrepresented} test set from 0.4769 (Stage 1) to 0.5832 and outperforms sequence- and structure-based baselines. Stage 3 attains \emph{zero-shot} performance ($F_1 = 0.3109$) on 5\,612 previously unseen ligands while remaining competitive on represented ligands. Ablations across five PLM scales and multiple CLMs show that larger PLM backbones consistently increase Macro~$F_1$ across all regimes, whereas scaling the CLM yields modest or inconsistent gains that warrant further investigation. Our results demonstrate that \emph{zero-shot} residue-level prediction from sequence and SMILES is feasible and identify PLM scale as the dominant lever for further advances.
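To make the Stage 2 vs. Stage 3 conditioning concrete, the following is a minimal PyTorch sketch of the general idea described in the abstract: per-residue PLM features fused with a ligand vector that comes either from a learned embedding table (Stage 2, known ligands only) or from a chemical-language-model embedding of a SMILES string (Stage 3, enabling zero-shot ligands). All module names, dimensions, and the fusion-by-concatenation choice are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of ligand-conditioned binding-residue prediction (not the paper's code).
import torch
import torch.nn as nn


class LigandConditionedResiduePredictor(nn.Module):
    """Per-residue binary classifier conditioned on a ligand representation.

    Stage 2: the ligand vector is looked up in a learned embedding table (known ligands).
    Stage 3: the ligand vector is produced by a CLM over SMILES, allowing unseen ligands.
    """

    def __init__(self, plm_dim: int = 1280, ligand_dim: int = 256,
                 num_known_ligands: int = 100, hidden_dim: int = 512):
        super().__init__()
        # Stage 2 option: one learned vector per known ligand (dimensions are assumptions).
        self.ligand_table = nn.Embedding(num_known_ligands, ligand_dim)
        # Shared head: fuse residue and ligand features, then score each residue.
        self.head = nn.Sequential(
            nn.Linear(plm_dim + ligand_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, residue_embeddings: torch.Tensor,
                ligand_vector: torch.Tensor) -> torch.Tensor:
        # residue_embeddings: (L, plm_dim) per-residue PLM features for one protein.
        # ligand_vector:      (ligand_dim,) from the table (Stage 2) or a CLM (Stage 3).
        L = residue_embeddings.size(0)
        fused = torch.cat(
            [residue_embeddings, ligand_vector.unsqueeze(0).expand(L, -1)], dim=-1
        )
        return self.head(fused).squeeze(-1)  # (L,) binding-residue logits


if __name__ == "__main__":
    model = LigandConditionedResiduePredictor()
    plm_feats = torch.randn(350, 1280)       # placeholder PLM embeddings for a 350-residue protein
    # Stage 2: condition on a known ligand id via the embedding table.
    lig_vec = model.ligand_table(torch.tensor(7))
    logits_stage2 = model(plm_feats, lig_vec)
    # Stage 3: a CLM embedding of a SMILES string (stand-in below) is passed the same way,
    # which is what permits zero-shot prediction for ligands unseen during training.
    clm_vec = torch.randn(256)
    logits_stage3 = model(plm_feats, clm_vec)
    print(logits_stage2.shape, logits_stage3.shape)
```

In this sketch, swapping the source of `ligand_vector` is the only change between stages; the residue-scoring head is shared, which mirrors the abstract's point that Stage 3 keeps competitive performance on represented ligands while adding zero-shot coverage.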