Active site sequence representation of human kinases outperforms full sequence for affinity prediction
Jannis Born · Tien Huynh · Astrid Stroobants · Wendy Cornell · Matteo Manica

Focusing on the human kinome, we challenge a standard practice in proteochemometric, sequence-based affinity prediction models: instead of leveraging the full primary structure of proteins, each target is represented only by a sequence of 29 residues defining the ATP binding site. In kinase-ligand binding prediction, our results show that the reduced active site sequence representation is not only computationally more efficient but also consistently yields significantly higher performance than the full primary structure. This trend persists across different models (a k-NN baseline and a multimodal deep neural network), datasets (BindingDB, IDG-DREAM), and performance metrics (RMSE, Pearson correlation), and it holds true when predicting affinity for both unseen ligands and unseen kinases. For example, the RMSE on pIC50 can be reduced by 5% and 9% for unseen kinases and kinase inhibitors, respectively. This trend is robust across kinase families and classes of inhibitors, with a few exceptions where the necessity of the full sequence is explained by the drug's mechanism of action. Our interpretability analysis further demonstrates that, even without supervision, the full sequence model can learn to focus on the active site residues to a higher extent. Overall, this work challenges the assumption that full primary structure is indispensable for virtual screening of human kinases.
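
To make the reduced representation and the two reported metrics concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy sequence, and the residue positions are hypothetical placeholders, and the actual 29 ATP binding site positions are not given in this abstract.

```python
import math


def active_site_sequence(full_sequence, positions):
    """Build a reduced representation by concatenating the residues at the
    given 0-based positions (hypothetical positions, for illustration only)."""
    return "".join(full_sequence[i] for i in positions)


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted affinities."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy data: a made-up 10-residue sequence and three made-up positions.
toy_full = "MKVLAGERTY"
toy_positions = [1, 3, 6]
reduced = active_site_sequence(toy_full, toy_positions)

# Made-up pIC50 values, used only to exercise the metric functions.
measured = [6.0, 7.0, 8.0]
predicted = [6.5, 7.0, 7.5]
error = rmse(measured, predicted)
correlation = pearson(measured, predicted)
```

In a comparison like the one described above, the same model would be trained once on full sequences and once on the reduced sequences, and both metrics would then be computed on held-out ligands or kinases.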

Author Information

Jannis Born (IBM Research)
Tien Huynh (IBM Research)
Astrid Stroobants (Imperial College London)
Wendy Cornell (IBM Research)
Matteo Manica (IBM Research Zürich)

Matteo is a Pre-Doc in the Cognitive Health Care and Life Sciences Department at IBM Zürich Research Laboratory. He is enrolled in a joint PhD programme with the Institute of Molecular Systems Biology, ETH Zürich. His research focuses on developing predictive computational technologies and learning frameworks that exploit and integrate multiple molecular and clinical data in the context of cancer medicine, in order to improve patient stratification and inform clinicians with personalized therapeutic interventions. He is currently applying machine learning and deep learning methods to analyze the progression and development of prostate cancer in the context of an H2020 EU project, PrECISE. Before joining IBM, Matteo worked as a consultant in data science and software development, with specific applications in biological fluid dynamics, digital and biological signal processing, and data analysis. The main focus was the analysis of CT angiography and MR angiography scans of abdominal aortic aneurysms (AAA). Through image analysis, segmentation, and 3D volume rendering of the abdominal aorta, he contributed to creating patient-specific models to simulate blood flow in the vessels and to assess the rupture risk of the aneurysm. He obtained his BSc and MSc at Politecnico di Milano in Applied Mathematics and Computer Science, a course with a strong focus on numerical simulations and data analysis. In his master's thesis, he developed an original model, based on partial differential equations for flow in porous media, to describe medulloblastoma growth. By analysing MRIs of a given patient at different time points, it was possible to fit the model through segmentation and 3D volume rendering of the brain and the tumor mass, enabling an accurate estimate of the disease's course over time.
