Timezone: »

Standards, tooling and benchmarks to probe representation learning on proteins
Joaquin Gomez Sanchez · Sebastian Franz · Michael Heinzinger · Burkhard Rost · Christian Dallago
Event URL: https://openreview.net/forum?id=adODyN-eeJ8 »

With the advent of novel foundational approaches to represent proteins, a race to evaluate and assess their effectiveness to embed biological data for a variety of downstream tasks, from structure prediction to protein engineering, has gained tremendous traction. While tasks like protein 3D structure prediction from sequence have well characterized datasets and methodological approaches, many others, for instance probing the ability to encode protein function from sequence, lack standardization. This becomes particularly relevant when employing experimental biological datasets for machine learning, as curating biologically meaningful data splits requires biological intuition, whilst engineering appropriate machine learning models requires data science expertise. Gold standard experimental datasets annotated with machine learning relevant metadata are thus scarce and often scattered in different file formats in the literature, using a variety of metrics to measure success, hindering rapid evaluation of new foundational representation techniques or machine learning models built on top of them. To address these challenges, we propose a suite of solutions including a) standards for sequence datasets and embedding interfaces, b) curated and machine learning metadata annotated protein sequence datasets, c) machine learning architectures and training scripts, and d) an extensible, automatic evaluation pipeline connecting all these components. In practice, we described new, broad data standards for machine learning protein sequence datasets, including definitions for predictions of a categorical attribute for a residue in a sequence (e.g., secondary structure), or predicting a single value for the entire sequence (e.g., protein fitness). We expanded a previous collection of datasets for protein engineering (FLIP) by adding five traditional tasks from the literature, like residue secondary structure, residue conservation, and protein subcellular location prediction. We created a novel software solution (biotrainer) that collects machine learning architectures used for protein predictions and exposes a reproducible training pipeline that can consume any dataset adhering to the newly proposed data standards. Lastly, we connected all components in a new software solution (autoeval), which collects definitions for embedding methods, datasets and downstream machine learning models to automatically evaluate them. With these solutions, biological experimentalists can contribute new datasets and even train standard models using popular embedding methods, while machine learning researchers can easily plug in new foundational models or architectures in a common interface and test them on a variety of tasks against other solutions. In turn, the combination of solutions presented here unlocks the ability of interest groups to create challenges around new biological datasets, new machine learning architectures, new foundational models, or a combination thereof.

Author Information

Joaquin Gomez Sanchez (Technische Universität München)
Sebastian Franz (Technische Universität München)
Michael Heinzinger (Technical University of Munich)
Burkhard Rost (TU Munich)
Christian Dallago (NVIDIA & Technical University of Munich)

More from the Same Authors