Skip to yearly menu bar Skip to main content

Workshop: Machine Learning in Structural Biology Workshop

Seq2MSA: A Language Model for Protein Sequence Diversification

Pascal Sturmfels · Roshan Rao · Robert Verkuil · Zeming Lin · Tom Sercu · Adam Lerer · Alex Rives


Diversification libraries of protein sequences that contain a similar set of structures over a variety of sequences can help protein design pipelines by introducing flexibility into the starting structures and providing a range of starting points for directed evolution. However, exploring the sequence space is computationally challenging: the vast majority of sequence space is non-viable, and even of those sequences that do fold to well-formed protein structures, it is challenging to find the fraction that maintain a similar fold class to a given protein. In this work, we propose to use an encoder-decoder language model, trained on a novel Seq2MSA task, that can create diversification libraries of any input protein. In particular, using our model, we are able to generate sequences that maintain structural similarity to a target sequence while pushing below 40% sequence identity to any protein in UniRef. Our diversification pipeline has the potential to aid in computational protein design by providing a diverse set of starting points in sequence space for a given functional or structural target.

Chat is not available.