Workshop: Machine Learning in Structural Biology Workshop

Sampling Protein Language Models for Functional Protein Design

Jeremie Theddy Darmawan · Yarin Gal · Pascal Notin


Protein language models have emerged as a powerful way to learn rich representations of proteins, improving performance on downstream tasks ranging from structure prediction to fitness prediction, property prediction, and homology detection. Because they learn a distribution over protein sequences, they are also very promising tools for designing novel, functional proteins, with broad applications in healthcare, new materials, and sustainability. Given the vastness of the corresponding sample space, efficient exploration methods are critical to the success of protein engineering efforts. However, methodologies for sampling these models in ways that achieve core protein design objectives remain underexplored and have predominantly relied on techniques developed for Natural Language Processing. In this work, we first develop a holistic in silico protein design evaluation framework to comprehensively compare different sampling methods. After a thorough review of sampling methods for language models, we introduce several sampling strategies tailored to protein design. Lastly, we compare the various strategies on our in silico benchmark, investigating the effects of key hyperparameters and offering practical guidance on the relative strengths of different methods.
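
To make "sampling a protein language model" concrete, the sketch below shows two decoding techniques commonly borrowed from Natural Language Processing, temperature scaling and nucleus (top-p) filtering, applied to per-residue logits. It is an illustration only and not the strategies introduced in this work. The `toy_logits` function is a hypothetical stand-in for a real autoregressive model, and the names `sample_residue` and `sample_sequence` and all default hyperparameter values are assumptions made for the example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be > 0.
    Values below 1 sharpen the distribution, values above 1 flatten it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_residue(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one amino acid from temperature-scaled, top-p-filtered logits."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]


def toy_logits(prefix):
    """Hypothetical stand-in for a model's next-residue logits.
    A real protein language model would condition on the prefix content."""
    rng = random.Random(len(prefix))
    return [rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS]


def sample_sequence(length, temperature=1.0, top_p=1.0, seed=0):
    """Autoregressively sample a sequence one residue at a time."""
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        seq += sample_residue(toy_logits(seq), temperature, top_p, rng)
    return seq


if __name__ == "__main__":
    print(sample_sequence(30, temperature=0.8, top_p=0.9))
```

In this sketch, `temperature` and `top_p` are examples of the kind of hyperparameters a sampling method exposes: lowering either concentrates sampling on high-probability residues, while raising them increases sequence diversity.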
