Demystifying Protein Generation with Hierarchical Conditional Diffusion Models
Abstract
Generating novel and functional protein sequences is critical to a wide range of applications in biology. Recent advances in conditional diffusion models have shown impressive empirical performance on protein generation tasks. However, reliable protein generation remains an open research question in de novo protein design, especially for conditional diffusion models. Because the biological function of a protein is determined by structures at multiple levels, we propose a novel multi-level conditional diffusion model that integrates both sequence-based and structure-based information for efficient end-to-end protein design guided by specified functions. By generating representations at different levels simultaneously, our framework can effectively model the inherent hierarchical relations between levels, resulting in an informative and discriminative representation of the generated protein. We also propose Protein-MMD (Maximum Mean Discrepancy), a new and reliable metric for evaluating the quality of proteins generated by conditional diffusion models. The metric captures both distributional and functional similarities between real and generated protein sequences while ensuring conditional consistency. Using conditional protein generation tasks on benchmark datasets, we demonstrate the efficacy of the proposed generation framework and evaluation metric.
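For readers unfamiliar with Maximum Mean Discrepancy, the following is a minimal sketch of a generic (biased) squared-MMD estimate between two sets of vectors under an RBF kernel. It is not the Protein-MMD metric itself: the abstract does not specify the embedding, the kernel, or how conditional consistency is enforced, so the kernel choice, the `gamma` bandwidth, the helper names, and the random vectors standing in for protein embeddings are all assumptions made for illustration.

```python
import math
import random


def rbf(u, v, gamma):
    # RBF kernel value between two equal-length vectors (assumed kernel choice).
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_kernel(xs, ys, gamma):
    # Average kernel value over all pairs (one vector from xs, one from ys).
    return sum(rbf(u, v, gamma) for u in xs for v in ys) / (len(xs) * len(ys))


def mmd2(xs, ys, gamma=1.0):
    # Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )


def sample(n, dim, shift, rng):
    # Hypothetical stand-in for embeddings of real or generated proteins.
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = sample(50, 4, 0.0, rng)
close = sample(50, 4, 0.0, rng)    # drawn from the same distribution as `real`
shifted = sample(50, 4, 2.0, rng)  # drawn from a shifted distribution

print("same distribution:   ", mmd2(real, close, gamma=0.25))
print("shifted distribution:", mmd2(real, shifted, gamma=0.25))
```

A lower value indicates that the two samples are harder to tell apart under the chosen kernel, so the shifted sample should score noticeably higher than the sample drawn from the same distribution.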