
Machine Learning in Structural Biology Workshop

Agile Language Transformers for Recombinant Protein Expression Optimization

Jeliazko Jeliazkov · Maxim Shapovalov · Diego del Alamo · Matt Sternke · Joel Karpiak


Language Transformers (LaTs) have achieved state-of-the-art performance in a range of challenging protein modeling tasks, including structure prediction, design, and mutation effect prediction. The lion's share of these improvements derives from exponential increases in the size and depth of these neural networks, which now routinely exceed billions of trainable parameters, rather than from fundamental architectural innovations. This explosive growth in model size poses an obstacle to integration into design-build-test cycles, wherein models are iteratively evaluated, retrained, and improved as data are collected. As a result, large LaTs do not meet the need for lightweight, rapid-to-train models that excel at problems with tight data-model feedback loops. Here, we present a small, 10 million-parameter BERT model with linearly scaling attention that can be trained from scratch on four Nvidia V100 GPUs in under a week and fine-tuned with full back-propagation in hours to days. We demonstrate that this model excels at two challenging active-learning problems that require interfacing with experiments: recombinant protein expression prediction and codon optimization. Our approach highlights the size-cost tradeoff inherent to LaTs and demonstrates the utility of small, custom-designed models in practical settings.
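The abstract does not specify which linear-attention variant the model uses. As a minimal sketch of the general idea, the snippet below implements one common choice, kernelized attention with the feature map phi(x) = elu(x) + 1 (in the style of Katharopoulos et al., 2020); this particular formulation is an assumption for illustration, not a description of the authors' implementation. By associating the matrix products as phi(Q) (phi(K)^T V), the L x L attention matrix is never materialized, so cost scales linearly in sequence length L rather than quadratically.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized attention with cost linear in sequence length.

    Computes phi(Q) @ (phi(K).T @ V), normalized by phi(Q) @ sum_j phi(K)_j,
    which equals softmax-free attention with similarity phi(q)·phi(k).
    Q, K: (L, d) arrays; V: (L, d_v) array.
    """
    def phi(x):
        # elu(x) + 1: a strictly positive feature map
        return np.where(x > 0, x + 1.0, np.exp(x))

    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                 # (d, d_v), built in O(L * d * d_v)
    z = Qp @ Kp.sum(axis=0)       # (L,) per-query normalizer
    return (Qp @ kv) / z[:, None]

# Toy check on random projections (hypothetical shapes)
rng = np.random.default_rng(0)
L, d = 8, 4
Q, K, V = rng.standard_normal((3, L, d))
out = linear_attention(Q, K, V)   # shape (8, 4)
```

The output is identical (up to floating-point error) to forming the full L x L similarity matrix phi(Q) phi(K)^T, row-normalizing it, and multiplying by V; only the order of operations changes, which is what makes training a small model on modest hardware tractable for long protein or codon sequences.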
