Skip to yearly menu bar Skip to main content

Workshop: Machine Learning in Structural Biology

MSA-Conditioned Generative Protein Language Models for Fitness Landscape Modelling and Design

Alex Hawkins-Hooker · David Jones · Brooks Paige


Recently a number of works have demonstrated successful applications of a fully data-driven approach to protein design, based on learning generative models of the distribution of a family of evolutionarily related sequences. Language modelling techniques promise to generalise this design paradigm across protein space, however have for the most part neglected the rich evolutionary signal in multiple sequence alignments and relied on fine-tuning to adapt the learned distribution to a particular family. Inspired by the recent development of alignment-based language models, exemplified by the MSA Transformer, we propose a novel alignment-based generative model which combines an input MSA encoder with an autoregressive sequence decoder, yielding a generative sequence model which can be explicitly conditioned on evolutionary context. To test the benefits of this generative MSA-based approach in design-relevant settings we focus on the problem of unsupervised fitness landscape modelling. Across three unusually diverse fitness landscapes, we find evidence that directly modelling the distribution over full sequence space leads to improved unsupervised prediction of variant fitness compared to scores computed with non-generative masked language models. We believe that combining explicit encoding of evolutionary information with a generative decoder's representation of a distribution over sequence space provides a powerful framework generalising traditional family-based generative models.

Chat is not available.