The first decade of genome sequencing saw a surge in the characterisation of proteins with unknown functionality. Even so, more than 20% of proteins in well-studied model animals have yet to be identified, making the discovery of their active site one of biology's greatest difficulties. Herein, we apply a transformer architecture to a language representation of bio-catalyzed chemical reactions to learn the signal underlying the atomic interactions between substrate and active site. The language representation comprises a reaction simplified molecular-input line-entry system (SMILES) for substrate and products, complemented with amino acid (AA) sequence information for the enzyme. Defining a custom tokenizer and a score based on attention values, we show that we can capture the substrate-active site interaction signal and use it to detect the location of the active site in unknown protein sequences, hence elucidating complex 3D interactions relying solely on 1D representations. We consider a Transformer-based model, BERT, trained with different losses, and analyse its performance in comparison with a statistical baseline and methods based on sequence alignments. Our approach exhibits remarkable results and is able to recover, with no supervision, 31.51% of the active site when considering co-crystallized substrate-enzyme structures as a ground truth, largely outperforming sequence alignment-based approaches. Our findings are further corroborated by docking simulations on the 3D structures of a few enzymes. This work confirms the unprecedented impact of natural language processing, and more specifically of the transformer architecture, on domain-specific languages, paving the way to effective solutions for protein functional characterisation and bio-catalysis engineering.
Author Information
Loïc Kwate Dassi (National School of Computer Science and Applied Mathematics of Grenobl)
Matteo Manica (IBM Research Zürich)
Matteo is a Pre-Doc in the Cognitive Health Care and Life Sciences Department at IBM Zürich Research Laboratory. He is enrolled in a joint PhD programme with the Institute of Molecular Systems Biology, ETH Zürich. His research is focused on the development of predictive computational technologies and learning frameworks to exploit and integrate multiple molecular and clinical data in the context of cancer medicine, in order to improve patient stratification and inform clinicians with personalized therapeutic interventions. He is currently working on the application of machine and deep learning methods to analyze the progression and development of prostate cancer in the context of an H2020 EU project, PrECISE. Before joining IBM, Matteo worked as a consultant in data science and software development with specific applications in biological fluid dynamics, digital and biological signal processing, and data analysis. The main focus was on the analysis of CT angiography and MR angiography scans of abdominal aortic aneurysms (AAA). Through image analysis, segmentation and 3D volume rendering of the abdominal aorta, he contributed to creating patient-specific models to simulate blood flows in the vessels and to assess the rupture risk of the aneurysm. He obtained his BSc and MSc at Politecnico di Milano in Applied Mathematics and Computer Science, a course with a strong focus on numerical simulations and data analysis. In his master's thesis work he developed an original model, based on partial differential equations for flow in porous media, to describe Medulloblastoma growth. By analysing MRIs of a given patient at different time points, it was possible to fit the model through segmentation and 3D volume rendering of the brain and the tumor mass, enabling an accurate estimate of the disease’s course over time.
Daniel Probst (IBM Research Europe)
Philippe Schwaller (IBM Research Europe)
Yves Gaetan Nana Teukam (Sapienza University of Rome)
Teodoro Laino (IBM Research Zurich)
More from the Same Authors
- 2021 : Active site sequence representation of human kinases outperforms full sequence for affinity prediction
  Jannis Born · Tien Huynh · Astrid Stroobants · Wendy Cornell · Matteo Manica
- 2021 : Human-in-the-loop for a Disconnection Aware Retrosynthesis
  Andrea Byekwaso · Philippe Schwaller · Alain C. Vaucher · Teodoro Laino
- 2022 : Standardization of chemical compounds using language modeling
  Miruna Cretu · Alessandra Toniato · Alain C. Vaucher · Amol Thakkar · Amin Debabeche · Teodoro Laino
- 2020 : Spotlight Talk: Data augmentation strategies to improve reaction yield predictions and estimate uncertainty - Philippe Schwaller, Alain Vaucher, Teodoro Laino and Jean-Louis Reymond
  Philippe Schwaller
- 2020 Poster: CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models
  Vijil Chenthamarakshan · Payel Das · Samuel Hoffman · Hendrik Strobelt · Inkit Padhi · Kar Wai Lim · Benjamin Hoover · Matteo Manica · Jannis Born · Teodoro Laino · Aleksandra Mojsilovic
- 2020 : Contributed Talk 1: Question Generation With Deep Reinforcement Learning for Education
  Loïc Kwate Dassi
- 2019 : Afternoon Coffee Break & Poster Session
  Heidi Komkov · Stanislav Fort · Zhaoyou Wang · Rose Yu · Ji Hwan Park · Samuel Schoenholz · Taoli Cheng · Ryan-Rhys Griffiths · Chase Shimmin · Surya Karthik Mukkavili · Philippe Schwaller · Christian Knoll · Yangzesheng Sun · Keiichi Kisamori · Gavin Graham · Gavin Portwood · Hsin-Yuan Huang · Paul Novello · Moritz Munchmeyer · Anna Jungbluth · Daniel Levine · Ibrahim Ayed · Steven Atkinson · Jan Hermann · Peter Grönquist · Priyabrata Saha · Yannik Glaser · Lingge Li · Yutaro Iiyama · Rushil Anirudh · Maciej Koch-Janusz · Vikram Sundar · Francois Lanusse · Auralee Edelen · Jonas Köhler · Jacky H. T. Yip · jiadong guo · Xiangyang Ju · Adi Hanuka · Adrian Albert · Valentina Salvatelli · Mauro Verzetti · Javier Duarte · Eric Moreno · Emmanuel de Bézenac · Athanasios Vlontzos · Alok Singh · Thomas Klijnsma · Brad Neuberg · Paul Wright · Mustafa Mustafa · David Schmidt · Steven Farrell · Hao Sun
- 2018 : Contributed Work
  Thaer Moustafa Dieb · Aditya Balu · Amir H. Khasahmadi · Viraj Shah · Boris Knyazev · Payel Das · Garrett Goh · Georgy Derevyanko · Gianni De Fabritiis · Reiko Hagawa · John Ingraham · David Belanger · Jialin Song · Kim Nicoli · Miha Skalic · Michelle Wu · Niklas Gebauer · Peter Bjørn Jørgensen · Ryan-Rhys Griffiths · Shengchao Liu · Sheshera Mysore · Hai Leong Chieu · Philippe Schwaller · Bart Olsthoorn · Bianca-Cristina Cristescu · Wei-Cheng Tseng · Seongok Ryu · Iddo Drori · Kevin Yang · Soumya Sanyal · Zois Boukouvalas · Rishi Bedi · Arindam Paul · Sambuddha Ghosal · Daniil Bash · Clyde Fare · Zekun Ren · Ali Oskooei · Minn Xuan Wong · Paul Sinz · Théophile Gaudin · Wengong Jin · Paul Leu
- 2017 : Poster session 1
  Van-Doan Nguyen · Stephan Eismann · Haozhen Wu · Garrett Goh · Kristina Preuer · Thomas Unterthiner · Matthew Ragoza · Tien-Lam PHAM · Günter Klambauer · Andrea Rocchetto · Maxwell Hutchinson · Qian Yang · Rafael Gomez-Bombarelli · Sheshera Mysore · Brooke Husic · Ryan-Rhys Griffiths · Masashi Tsubaki · Emma Strubell · Philippe Schwaller · Théophile Gaudin · Michael Brenner · Li Li
- 2017 : Poster spotlights
  Emma Strubell · Garrett Goh · Masashi Tsubaki · Théophile Gaudin · Philippe Schwaller · Matthew Ragoza · Rafael Gomez-Bombarelli