
Workshop: Machine Learning in Structural Biology Workshop

A Benchmark Framework for Evaluating Structure-to-Sequence Models for Protein Design

Jeffrey Chan · Seyone Chithrananda · David Brookes · Sam Sinai


Structure-based \textit{de novo} protein design methods such as ESM-IF1 \cite{pmlr-v162-hsu22a} and ProteinMPNN \cite{dauparas2022robust} have recently shown impressive results in zero-shot fitness prediction, protein sequence recovery, and experimentally validated protein design. The prospect of using such methods to design better proteins is tantalizing and has already driven experimental work \cite{wicky2022hallucinating}. However, current understanding of when and why these methods perform well or poorly is limited by a paucity of comprehensive ground-truth experimental data. This makes \textit{in silico} benchmarking and ablation difficult and forces reliance on expensive experimental validation, which hampers fast feedback loops and rapid methodological development. In this work, we evaluate the capabilities of structure-based methods for protein design against a combinatorially complete fitness landscape measuring stability and binding of the protein G domain B1 \cite{wu2016adaptation}. We develop a framework for protein design that divides the problem into two tasks: generation and ranking. In the case of ESM-IF1, we significantly improve its generation capabilities via distilled conditional language modeling. We find that both methods show impressive generation and ranking results for small experimental budgets but scale poorly to larger budgets. Finally, we demonstrate that modeling protein complexes yields minor design improvements for binding affinity tasks.
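As an illustration of the ranking task under an experimental budget, the sketch below scores every variant in a small, combinatorially complete toy landscape and asks how many of the truly best variants fall within the top-scored budget. It is a minimal sketch under stated assumptions, not the benchmark's actual implementation: the recall metric, the reduced alphabet, the noise model standing in for a structure-based model's scores, and the names `build_toy_landscape`, `top_k_by_score`, and `ranking_recall` are all hypothetical.

```python
import itertools
import random

ALPHABET = "ACDE"  # reduced alphabet so the toy landscape stays small (assumption)
N_SITES = 4        # number of varied positions in the toy landscape (assumption)


def build_toy_landscape(seed=0):
    """Build a complete toy landscape: every variant gets a fitness and a model score."""
    rng = random.Random(seed)
    variants = ["".join(p) for p in itertools.product(ALPHABET, repeat=N_SITES)]
    # Random fitness values stand in for measured stability/binding (assumption).
    fitness = {v: rng.random() for v in variants}
    # Noisy copies of the fitness stand in for a model's likelihood scores (assumption).
    scores = {v: fitness[v] + rng.gauss(0.0, 0.3) for v in variants}
    return fitness, scores


def top_k_by_score(scores, budget):
    """Return the `budget` variants with the highest model score."""
    return sorted(scores, key=scores.get, reverse=True)[:budget]


def ranking_recall(scores, fitness, budget, top_fraction=0.01):
    """Fraction of the truly best variants that appear among the top-`budget` scored ones."""
    n_top = max(1, int(len(fitness) * top_fraction))
    true_top = set(sorted(fitness, key=fitness.get, reverse=True)[:n_top])
    picked = top_k_by_score(scores, budget)
    return len(true_top.intersection(picked)) / n_top


if __name__ == "__main__":
    fitness, scores = build_toy_landscape()
    for budget in (8, 32, 128):
        recall = ranking_recall(scores, fitness, budget)
        print(f"budget={budget:4d}  recall of top 1%: {recall:.2f}")
```

Because the landscape is complete, the true top set is known exactly, so the metric can be computed for any budget without further experiments. Other budget-aware metrics (for example, the best fitness found within the budget) could be substituted in the same loop.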
