
Workshop: Machine Learning in Structural Biology Workshop

Frame2seq: structure-conditioned masked language modeling for protein sequence design

Deniz Akpinaroglu · Kosuke Seki · Eleanor Zhu · Tanja Kortemme


Machine learning has revolutionized computational protein design, enabling significant progress in protein backbone generation and sequence design. For protein sequence design, encoder-decoder models have achieved state-of-the-art accuracy, which has translated to experimental success. Here, we introduce Frame2seq, a structure-conditioned masked language model for protein sequence design that, unlike autoregressive methods, generates sequences in a single pass. On the CATH 4.2 test dataset, Frame2seq outperforms the state-of-the-art autoregressive method, ProteinMPNN, achieving 49.1% sequence recovery (2.0% improvement) with over six times faster inference. In addition, Frame2seq accurately estimates the error in its own predictions across diverse backbones. To expand design tasks beyond native-like sequence space, we use Frame2seq to generate low-sequence-identity designs for de novo backbones. Through experimental characterization, we show that Frame2seq successfully designs soluble, monomeric, stable proteins with low sequence identity to native sequences. The speed and accuracy of Frame2seq will accelerate exploration of novel sequence space across diverse design tasks, including challenging applications such as multi-objective optimization.
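For readers unfamiliar with the sequence recovery metric quoted above, the following is a minimal sketch of how it is commonly computed: the fraction of positions at which a designed sequence matches the native one. The function name `sequence_recovery` and the example sequences are hypothetical, and the exact evaluation protocol behind the reported numbers (for example, how per-protein values are averaged over the test set) is not stated in this abstract, so this should be read as an illustration under those assumptions rather than as the authors' evaluation code.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native one.

    Assumes both sequences describe the same backbone, so they have equal
    length and position i in one corresponds to position i in the other.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    # Count identical residues position by position.
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Hypothetical 8-residue example: 6 of 8 positions agree.
recovery = sequence_recovery("MKTAYIAK", "MKSAYLAK")
print(f"{recovery:.1%}")
```

Under this definition, a recovery of 49.1% would mean that roughly half of the designed residues coincide with the native residue at the same backbone position.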
