Workshop
Machine Learning in Structural Biology
Ellen Zhong · Raphael Townshend · Stephan Eismann · Namrata Anand · Roshan Rao · John Ingraham · Wouter Boomsma · Sergey Ovchinnikov · Bonnie Berger

Mon Dec 13 06:00 AM -- 04:00 PM (PST)
Event URL: https://www.mlsb.io/

Structural biology, the study of proteins and other biomolecules through their 3D structures, is a field on the cusp of transformation. While measuring and interpreting biomolecular structures has traditionally been an expensive and difficult endeavor, recent machine-learning based modeling approaches have shown that it will become routine to predict and reason about structure at proteome scales with unprecedented atomic resolution. This broad liberation of 3D structure within bioscience and biomedicine will likely have transformative impacts on our ability to create effective medicines, to understand and engineer biology, and to design new molecular materials and machinery. Machine learning also shows great promise to continue to revolutionize many core technical problems in structural biology, including protein design, modeling protein dynamics, predicting higher order complexes, and integrating learning with experimental structure determination.

At this inflection point, we hope that the Machine Learning in Structural Biology (MLSB) workshop will help bring community and direction to this rising field. To achieve these goals, this workshop will bring together researchers from a unique and diverse set of domains, including core machine learning, computational biology, experimental structural biology, geometric deep learning, and natural language processing.

Mon 6:00 a.m. - 6:10 a.m.
Opening remarks
Mon 6:10 a.m. - 6:30 a.m.
Invited Talk 1: Michael Bronstein (Invited talk)
Michael Bronstein
Mon 6:30 a.m. - 6:50 a.m.
Invited Talk 2: Cecilia Clementi (Invited talk)
Cecilia Clementi
Mon 6:50 a.m. - 7:10 a.m.
Invited Talk 3: Lucy Colwell (Invited talk)   
Lucy Colwell
Mon 7:10 a.m. - 7:20 a.m.
(Oral)   

Structure-based drug design involves finding ligand molecules that exhibit structural and chemical complementarity to protein pockets. Deep generative methods have shown promise in proposing novel molecules from scratch (de-novo design), avoiding exhaustive virtual screening of chemical space. Most generative de-novo models fail to incorporate detailed ligand-protein interactions and 3D pocket structures. We propose a novel supervised model that generates molecular graphs jointly with 3D pose in a discretised molecular space. Molecules are built atom-by-atom inside pockets, guided by structural information from crystallographic data. We evaluate our model using a docking benchmark and find that guided generation improves predicted binding affinities by 8% and drug-likeness scores by 10% over the baseline. Furthermore, our model proposes molecules with binding scores exceeding some known ligands, which could be useful in future wet-lab studies.

Pavol Drotar · Arian Jamasb · Ben Day · Catalina Cangea · Pietro Lió
Mon 7:20 a.m. - 7:30 a.m.
(Oral)   

In drug discovery, structure-based high-throughput virtual screening (vHTS) campaigns aim to identify bioactive ligands or 'hits' for therapeutic protein targets from docked poses at specific binding sites. However, while generally successful at this task, many deep learning methods are known to be insensitive to protein-ligand interactions, decreasing the reliability of hit detection and hindering discovery at novel binding sites. Here, we overcome this limitation by introducing a class of models with two key features: 1) we condition bioactivity on pose quality score, and 2) we present poor poses of true binders to the model as negative examples. The conditioning forces the model to learn details of physical interactions. We evaluate these models on a new benchmark designed to detect pose-sensitivity.

Pawel Gniewek · Kate Stafford
Mon 7:30 a.m. - 7:40 a.m.
(Oral)   

Cryo-electron microscopy (cryo-EM) has revolutionized experimental protein structure determination. Despite advances in high resolution reconstruction, a majority of cryo-EM experiments provide either a single state of the studied macromolecule, or a relatively small number of its conformations. This reduces the effectiveness of the technique for proteins with flexible regions, which are known to play a key role in protein function. Recent methods for capturing conformational heterogeneity in cryo-EM data model the volume space, making recovery of continuous atomic structures challenging. Here we present a fully deep-learning-based approach using variational auto-encoders (VAEs) to recover a continuous distribution of atomic protein structures and poses directly from picked particle images and demonstrate its efficacy on realistic simulated data. We hope that methods built on this work will allow incorporation of stronger prior information about protein structure and enable better understanding of non-rigid protein structures.

Dan Rosenbaum · Marta Garnelo · Charles Beattie · Andrea Huber · Pushmeet Kohli · Andrew Senior · John Jumper · Carl Doersch · Ali Eslami · Olaf Ronneberger · Jonas Adler
Mon 7:40 a.m. - 7:50 a.m.
(Oral)   

Deep learning-based object detection methods have shown promising results in various fields ranging from autonomous driving to video surveillance, where input images have relatively high signal-to-noise ratios (SNR). On low SNR images such as biological electron microscopy (EM) data, however, the performance of these algorithms is significantly lower. Moreover, biological data typically lacks standardized annotations, further complicating the training of detection algorithms. Accurate identification of proteins from EM images is a critical task, as the detected positions serve as inputs for the downstream 3D structure determination process. To overcome the low SNR and lack of annotations, we propose a joint weakly-supervised learning framework that performs image denoising while detecting objects of interest. Our framework denoises images without the need for clean images and is able to detect particles of interest even when less than 0.5% of the data are labeled. We validate our approach on three extremely low SNR cryo-EM datasets and show that our strategy outperforms existing state-of-the-art (SofA) methods used in the cryo-EM field by a significant margin.

Wendy Huang · Alberto Bartesaghi · Ye Zhou
Mon 7:50 a.m. - 8:30 a.m.
Keynote 1: John Jumper (Keynote speaker)   
John Jumper
Mon 8:30 a.m. - 9:30 a.m.
Poster Session 1 (Poster Session)
Mon 9:30 a.m. - 10:30 a.m.
Panel Discussion
Mon 10:30 a.m. - 11:10 a.m.
Keynote 2: Jane Richardson (Keynote speaker)
Jane Richardson
Mon 11:10 a.m. - 11:20 a.m.
Break
Mon 11:20 a.m. - 11:30 a.m.
(Oral)   

Proteins undergo structural fluctuations in vivo which can lead to the formation of pockets unseen in the native, folded structural state (i.e. “cryptic pockets”). Inferring cryptic pockets from experimentally determined protein structures is valuable when developing a drug since ligands typically require a pocket for tight binding. Toward this end, many studies employ molecular dynamics simulations to model protein structural fluctuations, but these simulations often require hundreds of GPU hours. We hypothesized that machine learning algorithms that predict sites of cryptic pockets directly from folded structures can speed this up. Here, we adapt a graph neural network architecture, which previously achieved state-of-the-art performance on protein structure learning tasks, to predict sites of cryptic pocket formation from experimental protein structures. We trained this model by re-purposing an existing molecular simulation dataset that was generated to identify cryptic pockets in SARS-CoV-2 proteins. Our model achieves good performance (AUC=0.78) on a held-out test set of protein structures with ligands bound to cryptic sites and requires <1 second of compute on a single GPU.

Artur Meller · Michael Ward · Juan Lavista Ferres · Greg Bowman
Mon 11:30 a.m. - 11:40 a.m.
(Oral)   

Multiple Sequence Alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF mildly improves contact prediction on a diverse set of protein and RNA families. In another application, we demonstrate that connecting our differentiable alignment module to AlphaFold2 and optimizing the learnable alignments leads to improved structure predictions with a higher confidence. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment.

Samantha Petti · Nicholas Bhattacharya · Roshan Rao · Neil Thomas · Alexander Rush · Peter Koo · Sergey Ovchinnikov
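
The idea of a smooth, differentiable Smith-Waterman in the abstract above can be illustrated with a small sketch. It assumes one common relaxation, in which every max in the dynamic program is replaced by a temperature-scaled log-sum-exp; this is not necessarily the authors' exact formulation. The function names, the match/mismatch/gap scores, and the linear gap penalty are all illustrative assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def hard_smith_waterman(a, b, match=1.0, mismatch=-1.0, gap=1.0):
    """Classic local alignment score with a linear gap penalty."""
    h = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0.0, h[i - 1][j - 1] + s,
                          h[i - 1][j] - gap, h[i][j - 1] - gap)
            best = max(best, h[i][j])
    return best


def smooth_smith_waterman(a, b, match=1.0, mismatch=-1.0, gap=1.0,
                          temperature=1.0):
    """Relaxed score: each max becomes temperature * logsumexp(x / temperature).

    The result is a smooth function of the scores, so gradients could flow
    through it in an autodiff framework (assumed here; plain Python has none).
    """
    def smooth_max(values):
        return temperature * logsumexp([v / temperature for v in values])

    h = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    cells = []
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = smooth_max([0.0, h[i - 1][j - 1] + s,
                                  h[i - 1][j] - gap, h[i][j - 1] - gap])
            cells.append(h[i][j])
    # The final max over all cells is smoothed as well.
    return smooth_max(cells)
```

As the temperature approaches zero the smooth score approaches the classic score from above; larger temperatures spread weight over many near-optimal alignments.
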
Mon 11:40 a.m. - 11:50 a.m.
(Oral)   

Protein design is challenging because it requires searching through a vast combinatorial space that is only sparsely functional. Self-supervised learning approaches offer the potential to navigate through this space more effectively and thereby accelerate protein engineering. We introduce a sequence denoising autoencoder (DAE) that learns the manifold of protein sequences from a large amount of potentially unlabelled proteins. This DAE is combined with a function predictor that guides sampling towards sequences with higher levels of desired functions. We train the sequence DAE on more than 20M unlabeled protein sequences spanning many evolutionarily diverse protein families and train the function predictor on approximately 0.5M sequences with known function labels. At test time, we sample from the model by iteratively denoising a sequence while exploiting the gradients from the function predictor. We present a few preliminary case studies of protein design that demonstrate the effectiveness of this proposed approach, which we refer to as “deep manifold sampling”, including metal binding site addition, function-preserving diversification, and global fold change.

Vladimir Gligorijevic · Stephen Ra
Mon 11:50 a.m. - 12:00 p.m.
(Oral)   

In response to pathogens, the adaptive immune system generates specific antibodies that bind and neutralize foreign antigens. Understanding the composition of an individual's immune repertoire can provide insights into this process and reveal potential therapeutic antibodies. In this work, we explore the application of antibody-specific language models to aid understanding of immune repertoires. We introduce AntiBERTy, a language model trained on 558M natural antibody sequences. We find that within repertoires, our model clusters antibodies into trajectories resembling affinity maturation. Importantly, we show that models trained to predict highly redundant sequences under a multiple instance learning framework identify key binding residues in the process. With further development, the methods presented here will provide new insights into antigen binding from repertoire sequences alone.

Jeffrey Ruffolo · Jeremias Sulam
Mon 12:00 p.m. - 12:10 p.m.
(Oral)   

We explore the use of modern variational autoencoders for generating protein structures. Models are trained across a diverse set of natural protein domains. Three-dimensional structures are encoded implicitly in the form of an energy function that expresses constraints on pairwise distances and angles. Atomic coordinates are recovered by optimizing the parameters of a rigid body representation of the protein chain to fit the constraints. The model generates diverse structures across a variety of folds, and exhibits local coherence at the level of secondary structure, generating alpha helices and beta sheets, as well as globally coherent tertiary structure. A number of generated protein sequences have high confidence predictions by AlphaFold that agree with their designs. The majority of these have no significant sequence homology to natural proteins.

Zeming Lin · Tom Sercu · Yann LeCun · Alex Rives
Mon 12:10 p.m. - 1:10 p.m.
Poster Session 2 (Poster Session)
Mon 1:10 p.m. - 1:30 p.m.
Invited Talk 4: Derek Lowe (Invited talk)
Derek Lowe
Mon 1:30 p.m. - 1:50 p.m.
Invited Talk 5: Regina Barzilay (Invited talk)   
Regina Barzilay
Mon 1:50 p.m. - 2:10 p.m.
Invited Talk 6: Amy Keating (Invited talk)   
Amy Keating
Mon 2:10 p.m. - 2:20 p.m.
Closing remarks
Mon 2:20 p.m. - 4:00 p.m.
Social hour
-
(Poster) [ Visit Poster at Spot H0 in Virtual World ]   

Predicting the structure of an antibody from its sequence is important because it enables a better design process for synthetic antibodies, which play a vital role in the health industry. Most of the structure of an antibody is conserved. The most variable and hard-to-predict part is the third complementarity-determining region of the antibody heavy chain (CDR H3). Lately, deep learning has been employed to solve the task of CDR H3 prediction. However, current state-of-the-art methods are not end-to-end; rather, they output inter-residue distances and orientations to the RosettaAntibody package, which uses this additional information alongside statistical and physics-based methods to predict the 3D structure. This does not allow a fast screening process and therefore inhibits the development of targeted synthetic antibodies. In this work, we present an end-to-end model to predict CDR H3 loop structure that performs on par with state-of-the-art methods in terms of accuracy but is an order of magnitude faster. We also raise an issue with a commonly used RosettaAntibody benchmark that leads to data leaks, i.e., the presence of identical sequences in the train and test datasets.

Ekaterina Sedykh · Aleksei Shpilman
-
(Poster) [ Visit Poster at Spot G3 in Virtual World ]   

Recent advances in machine learning have enabled generative models for both optimization and de novo generation of drug candidates with desired properties. Previous generative models have focused on producing SMILES strings or 2D molecular graphs, while attempts at producing molecules in 3D have focused on reinforcement learning (RL), distance matrices, and pure atom density grids. Here we present MOLUCINATE (MOLecUlar ConvolutIoNal generATive modEl), a novel architecture that simultaneously generates topological and 3D atom position information. We demonstrate the utility of this method by using it to optimize molecules for desired radius of gyration. In the future, this model can be used for more useful optimization such as binding affinity against a protein target.

Michael Arcidiacono · Dave Koes
-
(Poster) [ Visit Poster at Spot G2 in Virtual World ]   

Focusing on the human kinome, we challenge a standard practice in proteochemometric, sequence-based affinity prediction models: instead of leveraging the full primary structure of proteins, each target is represented only by a sequence of 29 residues defining the ATP binding site. In kinase-ligand binding prediction, our results show that the reduced active site sequence representation is not only computationally more efficient but consistently yields significantly higher performance than the full primary structure. This trend persists across different models (a k-NN baseline and a multimodal deep neural network), datasets (BindingDB, IDG-DREAM), performance metrics (RMSE, Pearson correlation) and holds true when predicting affinity for both unseen ligands and kinases. For example, the RMSE on pIC50 can be reduced by 5% and 9% respectively for unseen kinases and kinase inhibitors. This trend is robust across kinase families and classes of inhibitors, with a few exceptions where the necessity of the full sequence is explained by the drug's mechanism of action. Our interpretability analysis further demonstrates that, even without supervision, the full sequence model can learn to focus on the active site residues to a higher extent. Overall, this work challenges the assumption that full primary structure is indispensable for virtual screening of human kinases.

Jannis Born · Matteo Manica
-
(Poster) [ Visit Poster at Spot G1 in Virtual World ]

Many different types of generative models for protein sequences have been proposed in the literature. Their uses include the prediction of mutational effects, protein design and the prediction of structural properties. Neural network (NN) architectures have shown strong performance, commonly attributed to the capacity to extract non-trivial higher-order interactions from the data. In this work, we analyze three different NN models and assess how close they are to simple pairwise distributions, which have been used in the past for similar problems. We present an approach for extracting pairwise models from more complex ones using an energy-based modeling framework. We show that for the tested models the extracted pairwise models can replicate the energies of the original models and are also close in performance in tasks like mutational effect prediction.

Christoph Feinauer · Barthélémy Meynard · Carlo Lucibello
-
(Poster) [ Visit Poster at Spot G0 in Virtual World ]   

Predicting the physical interaction of proteins is a cornerstone problem in computational biology. New learning-based algorithms are typically trained end-to-end on protein structures extracted from the Protein Data Bank. However, these training datasets tend to be large and difficult to use for prototyping and, unlike image or natural language datasets, they are not easily interpretable by non-experts. In this paper we propose Dock2D-IP and Dock2D-FI, two toy datasets that can be used to select algorithms predicting protein-protein interactions (or any other type of molecular interactions). Using two-dimensional shapes as input, each example from Dock2D-FI describes the fact of interaction (FI) between two shapes and each example from Dock2D-IP describes the interaction pose (IP) of two shapes known to interact. We propose baselines that represent different approaches to the problem and demonstrate the potential for transfer learning across the IP prediction and FI prediction tasks.

Georgy Derevyanko · Guillaume Lamoureux
-
(Poster) [ Visit Poster at Spot F3 in Virtual World ]   

Signal peptides are essential for protein sorting and processing. Evaluating signal peptides experimentally is difficult and prone to errors, so the exact cleavage sites are often misannotated. Here, we describe a novel explainable method to identify signal peptides and predict the cleavage site, with a performance similar to state-of-the-art methods. We treat each amino acid sequence as a sentence and its annotation as a translation problem. We utilise attention neural networks in a transformer model using a simple one-hot representation of each amino acid, without including any evolutionary information. By analysing the encoder-decoder attention of the trained network, we are able to explain what information in the peptide is used to annotate the cleavage site. We find the most common signal peptide motifs and characteristics and confirm that the most informative amino acid sites vary greatly between kingdoms and signal peptide types, as previous studies have shown. Our findings open up the possibility to gain biological insight using transformer neural networks on small sets of labelled information.

Patrick Bryant · Arne Elofsson
-
(Poster) [ Visit Poster at Spot F2 in Virtual World ]   

The increasing integration between protein engineering and machine learning has led to many interesting results. A problem still to be solved is evaluating the likelihood that a sequence will fold into a target structure. This problem can also be viewed as sequence prediction from a known structure.

In the current work, we propose improvements in the recent architecture of Geometric Vector Perceptrons in order to optimize the sampling of sequences from a known backbone structure. The proposed model differs from the original in that there is: (i) no updating in the vectorial embedding, only in the scalar one, (ii) only one layer of decoding. The first aspect improves the accuracy of the model and reduces the use of memory, the second allows for training of the model with several tasks without incurring data leakage.

We treat the trained classifier as an Energy-Based Model and sample sequences by sampling amino acids in a non-autoregressive manner in the empty positions of the sequence using energy-guided criteria, followed by random mutation optimization. We improve the median identity of samples from 40.2% to 44.7%.

An additional question worth investigating is whether sampled and original sequences fold into similar structures independent of their identity. We chose proteins in our test set whose sampled sequences show low identity (under 30%) but for which our model predicted favorable energies. We used AlphaFold and observed that the predicted structures for sampled sequences highly resemble the predicted structures for original sequences, with an average TM score of 0.848.

Gabriel Orellana · Javier Caceres-Delpiano · Roberto Ibanez · Leo Alvarez
-
(Poster) [ Visit Poster at Spot F1 in Virtual World ]   

A common challenge in drug design pertains to finding chemical modifications to a ligand that increase its affinity to the target protein. An untapped advance is the increase in structural biology throughput, which has progressed from an artisanal endeavour to a monthly throughput of up to 100 different ligands against a protein in modern synchrotrons. However, the missing piece is a framework that turns high throughput crystallography data into predictive ligand design. Here, we design a simple machine learning approach that predicts protein-ligand affinity from experimental structures of diverse ligands against a single protein paired with biochemical measurements. Our key insight is using physics-based energy descriptors to represent protein-ligand complexes, and a learning-to-rank approach that infers the relevant differences between binding modes. We ran a high throughput crystallography campaign against the SARS-CoV-2 Main Protease (Mpro) involving over 200 protein-ligand complexes and developed models that could predict with high accuracy the relative binding strength of the ligands through the timecourse of the campaign. Crucially, our approach successfully extends ligands to unexplored regions of the binding pocket, executing large and fruitful moves in chemical space with simple chemistry. We have used this approach to design compounds that improved the potency of two different micromolar hits by over 10-fold, arriving at a lead compound with 80 nM antiviral efficacy -- amongst the highest reported to date for non-covalent inhibitors.

Kadi Saar · John Chodera · Alpha Lee
-
(Poster) [ Visit Poster at Spot F0 in Virtual World ]   

Proteins ensure their biological functions by interacting with each other, and with other molecules. Determining the relative position and orientation of protein partners in a complex remains challenging. Here, we address the problem of ranking candidate complex conformations toward identifying near-native conformations. We propose a deep learning approach relying on a local representation of the protein interface with an explicit account of its geometry. We show that the method is able to recognise certain pattern distributions in specific locations of the interface. We compare and combine it with a physics-based scoring function and a statistical pair potential.

Yasser Mohseni Behbahani
-
(Poster) [ Visit Poster at Spot E3 in Virtual World ]   

Engineering a protein’s stability improves its shelf life and expands its application environment. Current studies of protein stability often involve predicting stability change from single-point mutations. However, the prediction model must be able to resolve a single-character difference in a protein sequence that is typically several hundred amino acids long.

In this study, we predicted the effect of single-point mutations on protein stability and compared sequence-only and geometric learning approaches. Despite the inclusion of structural information, we showed that geometric learning does not outperform non-geometric models. Surprisingly, a simple MLP incorporating only the embedding at the mutation site performs the best. The finding could be attributed to the limited non-local mutational effect in the embedding.

Simon Chu
-
(Poster) [ Visit Poster at Spot E2 in Virtual World ]   

During lead optimization, lead molecules are refined for potency via slight modifications of their chemical structure. Relative binding free energy (RBFE) methods allow comparisons of molecular potency during this optimization. We utilize a Siamese Convolutional Neural Network (CNN) to directly estimate the RBFE with higher throughput than simulation-based methods. Our models show improved performance over a previously published Siamese RBFE predictor. We observe decreased performance on out-of-domain RBFE predictions.

Andrew McNutt · Dave Koes
-
(Poster) [ Visit Poster at Spot E1 in Virtual World ]   

Three-dimensional structure prediction tools offer a rapid means to approximate the topology of a protein structure for any protein sequence. Recent progress in deep learning-based structure prediction has led to highly accurate predictions that have recently been used to systematically predict 20 whole proteomes by DeepMind’s AlphaFold and the EMBL-EBI. While highly convenient, structure prediction tools lack much of the functional context presented by experimental studies, such as binding sites or post-translational modifications. Here, we introduce a machine learning framework to rapidly model any residue-based classification using AlphaFold2 structure-augmented protein representations. Specifically, graphs describing the 3D structure of each protein in the AlphaFold2 human proteome are generated and used as input representations to a Graph Convolutional Network (GCN), which annotates specific regions of interest based on the structural attributes of the amino acid residues, including their local neighbors. We demonstrate the approach using six varied amino acid classification tasks.

Nasim Abdollahi
-
(Poster) [ Visit Poster at Spot E0 in Virtual World ]

Protein sequences follow a discrete alphabet, rendering gradient-based techniques a poor choice for optimization-driven protein design. Contemporary approaches instead perform optimization in a continuous latent representation, but unfortunately the representation metric is generally a poor measure of similarity between the represented proteins. This makes (global) Bayesian optimization over such latent representations inefficient, as commonly applied covariance functions are strongly dependent on the representation metric. Here we argue in favor of using the Jensen-Shannon divergence between the represented protein sequences to define a covariance function over the latent representation. Our exploratory experiments indicate that this kernel is worth further investigation.

Yevgen Zainchkovskyy · Simon Bartels · Søren Hauberg · Jes Frellsen · Wouter Boomsma
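
To make the Jensen-Shannon idea in the abstract above concrete, here is a minimal sketch. It assumes each latent point decodes to one categorical distribution over amino acids per sequence position, and it builds a covariance value as an exponential of the summed per-position divergence. That kernel form, the `lengthscale` parameter, and all function names are illustrative assumptions, not the authors' construction.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence: 0 <= JSD <= ln(2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jsd_kernel(dists_a, dists_b, lengthscale=1.0):
    """Hypothetical covariance between two decoded sequence distributions.

    dists_a and dists_b are lists with one categorical distribution per
    sequence position; divergences are summed over positions.
    """
    total = sum(jensen_shannon_divergence(p, q)
                for p, q in zip(dists_a, dists_b))
    return math.exp(-total / lengthscale)
```

Because the divergence is computed on the decoded distributions rather than on latent coordinates, two latent points that decode to the same sequences get covariance 1 regardless of how far apart they sit in the latent space.
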
-
(Poster) [ Visit Poster at Spot D3 in Virtual World ]   

Protein engineering has become an important field in biomedicine with applications in therapeutics, diagnostics and synthetic biology. Due to the complexity of protein structure, de novo computational design remains a difficult problem. As helices are an abundant structural feature and play a vital role in determining protein structure, full atom de novo computational design for helices would be an important step. Here, we apply Wasserstein bi-directional Generative Adversarial Networks to generate full atom helical structures. To design for structure or function, we allow design according to structural constraints and introduce a novel Markov Chain Monte Carlo search mechanism with the encoder such that the generated helices match target "hotspot" residue structures. Our model generates helices that match the target hotspots well (within 3 Å RMSD) and have near-native geometries for a large fraction of the test cases. We demonstrate that our approach is able to quickly generate structurally plausible solutions, bringing us closer to the final goal of full atom computational protein design.

Xuezhi Xie · Philip Kim
-
(Poster) [ Visit Poster at Spot D2 in Virtual World ]   

We consider the problem of sequence-based drug-target interaction (DTI) prediction, showing that a straightforward deep learning architecture that leverages an appropriately pre-trained protein embedding outperforms state-of-the-art approaches, achieving higher accuracy and an order of magnitude faster training. The protein embedding we use is constructed from language models, trained first on the entire corpus of protein sequences and then on the corpus of protein-protein interactions. This multi-tier pre-training customizes the embedding with implicit protein structure and binding information that is especially useful in few-shot (small training data set) and zero-shot instances (unseen proteins or drugs) and can be extended with additional neural network layers when the training data size allows for greater model complexity. We anticipate such transfer learning approaches will facilitate rapid prototyping of DTI models, especially in low-N scenarios.

Samuel Sledzieski · Bonnie Berger
-
(Poster) [ Visit Poster at Spot D1 in Virtual World ]

Successful development of monoclonal antibodies (mAbs) for therapeutic applications is hindered by developability issues such as low solubility, low thermal stability, high aggregation, and high immunogenicity. The discovery of more developable mAb candidates relies on high-quality antibody libraries for isolating candidates with desirable properties. We present Immunoglobulin Language Model (IgLM), a deep generative language model for generating synthetic libraries by re-designing variable-length spans of antibody sequences. IgLM formulates antibody design as an autoregressive sequence generation task based on text-infilling in natural language. We trained IgLM on approximately 558M antibody heavy- and light-chain variable sequences, conditioning on each sequence’s chain type and species-of-origin. We demonstrate that IgLM can be applied to generate synthetic libraries that may accelerate the discovery of therapeutic antibody candidates.

Richard Shuai · Jeffrey Ruffolo
-
(Poster) [ Visit Poster at Spot D0 in Virtual World ]   

Computational protein design has the potential to deliver novel molecular structures, binders, and catalysts for myriad applications. Recent neural graph-based models that use backbone coordinate-derived features show exceptional performance on native sequence recovery tasks and are promising frameworks for design. A statistical framework for modeling protein sequence landscapes using Tertiary Motifs (TERMs), compact units of recurring structure in proteins, has also demonstrated good performance on protein design tasks. In this work, we investigate the use of TERM-derived data as features in neural protein design frameworks. Our graph-based architecture, TERMinator, incorporates TERM-based and coordinate-based information and outputs a Potts model over sequence space. TERMinator outperforms state-of-the-art models on native sequence recovery tasks, suggesting that utilizing TERM-based and coordinate-based features together is beneficial for protein design.

Alex Li · Amy Keating
-
(Poster) [ Visit Poster at Spot C3 in Virtual World ]   

Recently a number of works have demonstrated successful applications of a fully data-driven approach to protein design, based on learning generative models of the distribution of a family of evolutionarily related sequences. Language modelling techniques promise to generalise this design paradigm across protein space, but have for the most part neglected the rich evolutionary signal in multiple sequence alignments and relied on fine-tuning to adapt the learned distribution to a particular family. Inspired by the recent development of alignment-based language models, exemplified by the MSA Transformer, we propose a novel alignment-based generative model which combines an input MSA encoder with an autoregressive sequence decoder, yielding a generative sequence model which can be explicitly conditioned on evolutionary context. To test the benefits of this generative MSA-based approach in design-relevant settings, we focus on the problem of unsupervised fitness landscape modelling. Across three unusually diverse fitness landscapes, we find evidence that directly modelling the distribution over full sequence space leads to improved unsupervised prediction of variant fitness compared to scores computed with non-generative masked language models. We believe that combining explicit encoding of evolutionary information with a generative decoder's representation of a distribution over sequence space provides a powerful framework generalising traditional family-based generative models.

Alex Hawkins · Brooks Paige
-
(Poster) [ Visit Poster at Spot C2 in Virtual World ]   

Modeling of protein side-chain conformations is a long-standing subproblem in protein structure prediction. It helps to refine experimental structures with poor resolution, and is used for sampling side chains in computational protein design. Related studies date back to the 1980s, beginning with statistical analysis of side-chain conformations and proceeding to the development of energy functions and of algorithms that decompose the side-chain interaction graph into subgraphs, as in SCWRL4. Here, we apply a geometric deep-learning method, Relation-Shape Convolution (RSConv), originally developed for point clouds, to the side-chain problem. With features consisting of the backbone atom Cartesian coordinates (in a local frame), backbone dihedral angles, and residue types of neighbors, we achieve a favorable test-set accuracy of 89% for the chi1 dihedral angles (within 40° of the native structure) and a chi2 accuracy of 83% given correct chi1 angles. Our prediction accuracy strongly correlates with the experimental atomic displacement B-factors of the side chains. The chi1 dihedrals with B-factor less than 30°, representing about 53% of all side chains, have a prediction accuracy of 93%. The 93% rate is comparable to the chi1 accuracy in AlphaFold2 when it achieves high backbone structure recovery (100 lDDT-Cα).
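The "within 40° of the native structure" criterion depends on comparing angles on a circle, so that 170° and -170° count as 20° apart rather than 340°. The following is a minimal sketch of such a metric, assuming angles are given in degrees and that the threshold is inclusive; the function names are hypothetical and this is not taken from the authors' evaluation code.

```python
import numpy as np

def angular_difference(pred_deg, native_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = (np.asarray(pred_deg, dtype=float)
            - np.asarray(native_deg, dtype=float)) % 360.0
    # Going the other way around the circle may be shorter.
    return np.minimum(diff, 360.0 - diff)

def dihedral_accuracy(pred_deg, native_deg, tolerance_deg=40.0):
    """Fraction of predicted dihedrals within the tolerance of the native value."""
    within = angular_difference(pred_deg, native_deg) <= tolerance_deg
    return float(np.mean(within))
```

For example, `dihedral_accuracy([170, -175, 60, 0], [-170, 175, 120, 30])` counts the first, second, and fourth angles as correct (differences of 20°, 10°, and 30°) and the third as incorrect (60°).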

Xiyao Long
-
(Poster) [ Visit Poster at Spot C1 in Virtual World ]   

Although a rapidly growing number of protein structures are revealed each year, the number of basic protein architectures (protein folds) stays stable. Because structure and function are so tightly linked, scientists remain highly interested in exploring protein structure space and thereby enriching the diversity of protein function space. Current protein structure exploration approaches either rely on sampling of natural protein fragments or require human-crafted constraints. To facilitate less constrained probing of structure space, we present AutoFoldFinder, an automated adaptive optimization toolkit for de novo protein fold design. We also introduce CM-align to better quantify structure map similarity in the optimization process. Our results indicate higher efficiency in producing novel yet biologically and physically meaningful folds compared with state-of-the-art methods, increasing the novel fold reconstruction rate by 27.3%.

Shuhao Zhang · Youjun Xu
-
(Poster) [ Visit Poster at Spot C0 in Virtual World ]

Proteins undergo structural fluctuations in vivo which can lead to the formation of pockets unseen in the native, folded structural state (i.e., “cryptic pockets”). Inferring cryptic pockets from experimentally determined protein structures is valuable when developing a drug, since ligands typically require a pocket for tight binding. Toward this end, many studies employ molecular dynamics simulations to model protein structural fluctuations, but these simulations often require hundreds of GPU hours. We hypothesized that machine learning algorithms that predict sites of cryptic pockets directly from folded structures can speed this up. Here, we adapt a graph neural network architecture, which previously achieved state-of-the-art performance on protein structure learning tasks, to predict sites of cryptic pocket formation from experimental protein structures. We trained this model by re-purposing an existing molecular simulation dataset that was generated to identify cryptic pockets in SARS-CoV-2 proteins. Our model achieves good performance (AUC=0.78) on a held-out test set of protein structures with ligands bound to cryptic sites and requires less than 1 second of compute on a single GPU.

Artur Meller · Michael Ward · Juan Lavista Ferres · Greg Bowman
-
(Poster) [ Visit Poster at Spot B3 in Virtual World ]

Cryo-electron microscopy (cryo-EM) has revolutionized experimental protein structure determination. Despite advances in high resolution reconstruction, a majority of cryo-EM experiments provide either a single state of the studied macromolecule, or a relatively small number of its conformations. This reduces the effectiveness of the technique for proteins with flexible regions, which are known to play a key role in protein function. Recent methods for capturing conformational heterogeneity in cryo-EM data model the volume space, making recovery of continuous atomic structures challenging. Here we present a fully deep-learning-based approach using variational auto-encoders (VAEs) to recover a continuous distribution of atomic protein structures and poses directly from picked particle images and demonstrate its efficacy on realistic simulated data. We hope that methods built on this work will allow incorporation of stronger prior information about protein structure and enable better understanding of non-rigid protein structures.

Dan Rosenbaum · Marta Garnelo · Charles Beattie · Andrea Huber · Pushmeet Kohli · Andrew Senior · John Jumper · Carl Doersch · Ali Eslami · Olaf Ronneberger · Jonas Adler
-
(Poster) [ Visit Poster at Spot B2 in Virtual World ]

Structure-based drug design involves finding ligand molecules that exhibit structural and chemical complementarity to protein pockets. Deep generative methods have shown promise in proposing novel molecules from scratch (de-novo design), avoiding exhaustive virtual screening of chemical space. Most generative de-novo models fail to incorporate detailed ligand-protein interactions and 3D pocket structures. We propose a novel supervised model that generates molecular graphs jointly with 3D pose in a discretised molecular space. Molecules are built atom-by-atom inside pockets, guided by structural information from crystallographic data. We evaluate our model using a docking benchmark and find that guided generation improves predicted binding affinities by 8% and drug-likeness scores by 10% over the baseline. Furthermore, our model proposes molecules with binding scores exceeding some known ligands, which could be useful in future wet-lab studies.

Pavol Drotar · Arian Jamasb · Ben Day · Catalina Cangea · Pietro Lió
-
(Poster) [ Visit Poster at Spot B1 in Virtual World ]

In response to pathogens, the adaptive immune system generates specific antibodies that bind and neutralize foreign antigens. Understanding the composition of an individual's immune repertoire can provide insights into this process and reveal potential therapeutic antibodies. In this work, we explore the application of antibody-specific language models to aid understanding of immune repertoires. We introduce AntiBERTy, a language model trained on 558M natural antibody sequences. We find that within repertoires, our model clusters antibodies into trajectories resembling affinity maturation. Importantly, we show that models trained to predict highly redundant sequences under a multiple instance learning framework identify key binding residues in the process. With further development, the methods presented here will provide new insights into antigen binding from repertoire sequences alone.

Jeffrey Ruffolo · Jeremias Sulam
-
(Poster) [ Visit Poster at Spot B0 in Virtual World ]

Deep learning-based object detection methods have shown promising results in various fields ranging from autonomous driving to video surveillance, where input images have relatively high signal-to-noise ratios (SNR). On low-SNR images such as biological electron microscopy (EM) data, however, the performance of these algorithms is significantly lower. Moreover, biological data typically lack standardized annotations, further complicating the training of detection algorithms. Accurate identification of proteins from EM images is a critical task, as the detected positions serve as inputs for the downstream 3D structure determination process. To overcome the low SNR and lack of annotations, we propose a joint weakly-supervised learning framework that performs image denoising while detecting objects of interest. Our framework denoises images without the need for clean images and is able to detect particles of interest even when less than 0.5% of the data are labeled. We validate our approach on three extremely low-SNR cryo-EM datasets and show that our strategy outperforms existing state-of-the-art (SOTA) methods used in the cryo-EM field by a significant margin.

Wendy Huang · Alberto Bartesaghi · Ye Zhou
-
(Poster) [ Visit Poster at Spot A3 in Virtual World ]

We explore the use of modern variational autoencoders for generating protein structures. Models are trained across a diverse set of natural protein domains. Three-dimensional structures are encoded implicitly in the form of an energy function that expresses constraints on pairwise distances and angles. Atomic coordinates are recovered by optimizing the parameters of a rigid body representation of the protein chain to fit the constraints. The model generates diverse structures across a variety of folds, and exhibits local coherence at the level of secondary structure, generating alpha helices and beta sheets, as well as globally coherent tertiary structure. A number of generated protein sequences have high confidence predictions by AlphaFold that agree with their designs. The majority of these have no significant sequence homology to natural proteins.

Zeming Lin · Tom Sercu · Yann LeCun · Alex Rives
-
(Poster) [ Visit Poster at Spot A2 in Virtual World ]

Multiple Sequence Alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF mildly improves contact prediction on a diverse set of protein and RNA families. In another application, we demonstrate that connecting our differentiable alignment module to AlphaFold2 and optimizing the learnable alignments leads to improved structure predictions with higher confidence. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment.
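One common way to make a dynamic program such as Smith-Waterman smooth is to replace each hard maximum with a temperature-scaled log-sum-exp. The sketch below illustrates that idea in plain NumPy (forward pass only) and is not the authors' implementation: the linear gap penalty, the function names, and the `temperature` parameter are assumptions made for illustration.

```python
import numpy as np

def soft_max(values, temperature):
    """Smooth maximum: temperature * logsumexp(values / temperature)."""
    v = np.asarray(values, dtype=float) / temperature
    m = v.max()  # subtract the max for numerical stability
    return temperature * (m + np.log(np.exp(v - m).sum()))

def smooth_smith_waterman(sim, gap=1.0, temperature=1.0):
    """Smoothed local alignment score for a similarity matrix `sim` (n x m).

    Assumes a linear gap penalty `gap`; as temperature approaches 0 the
    result approaches the standard (hard) Smith-Waterman score.
    """
    n, m = sim.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = soft_max(
                [0.0,                                 # start a new local alignment
                 H[i - 1, j - 1] + sim[i - 1, j - 1], # align position i with j
                 H[i - 1, j] - gap,                   # gap in the second sequence
                 H[i, j - 1] - gap],                  # gap in the first sequence
                temperature,
            )
    # Smooth maximum over all cells replaces the hard max over end points.
    return soft_max(H[1:, 1:].ravel(), temperature)
```

Because every operation in the recursion is smooth, the same recursion written in an automatic-differentiation framework would give gradients of the score with respect to `sim`, which is what allows an alignment to be trained jointly with a downstream model.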

Samantha Petti · Nicholas Bhattacharya · Roshan Rao · Neil Thomas · Alexander Rush · Peter Koo · Sergey Ovchinnikov
-
(Poster) [ Visit Poster at Spot A1 in Virtual World ]

In drug discovery, structure-based high-throughput virtual screening (vHTS) campaigns aim to identify bioactive ligands or 'hits' for therapeutic protein targets from docked poses at specific binding sites. However, while generally successful at this task, many deep learning methods are known to be insensitive to protein-ligand interactions, decreasing the reliability of hit detection and hindering discovery at novel binding sites. Here, we overcome this limitation by introducing a class of models with two key features: 1) we condition bioactivity on pose quality score, and 2) we present poor poses of true binders to the model as negative examples. The conditioning forces the model to learn details of physical interactions. We evaluate these models on a new benchmark designed to detect pose-sensitivity.

Pawel Gniewek · Kate Stafford
-
(Poster) [ Visit Poster at Spot A0 in Virtual World ]

Protein design is challenging because it requires searching through a vast combinatorial space that is only sparsely functional. Self-supervised learning approaches offer the potential to navigate through this space more effectively and thereby accelerate protein engineering. We introduce a sequence denoising autoencoder (DAE) that learns the manifold of protein sequences from a large amount of potentially unlabelled proteins. This DAE is combined with a function predictor that guides sampling towards sequences with higher levels of desired functions. We train the sequence DAE on more than 20M unlabeled protein sequences spanning many evolutionarily diverse protein families and train the function predictor on approximately 0.5M sequences with known function labels. At test time, we sample from the model by iteratively denoising a sequence while exploiting the gradients from the function predictor. We present a few preliminary case studies of protein design that demonstrate the effectiveness of this proposed approach, which we refer to as “deep manifold sampling”, including metal binding site addition, function-preserving diversification, and global fold change.

Vladimir Gligorijevic · Stephen Ra

Author Information

Ellen Zhong (Massachusetts Institute of Technology)
Raphael Townshend (Stanford University)
Stephan Eismann (Stanford University)
Namrata Anand (Stanford University)
Roshan Rao (UC Berkeley)
John Ingraham (Generate Biomedicines)
Wouter Boomsma (University of Copenhagen)
Sergey Ovchinnikov (Harvard)
Bonnie Berger (MIT)

More from the Same Authors