Skip to yearly menu bar Skip to main content

Workshop: Machine Learning in Structural Biology Workshop

Structure-based and leakage-free data splits for rigorous protein function evaluation

Charlotte Rochereau · Mohammed AlQuraishi · Arthur Valentin · Gergo Nikolenyi


Datasets for protein machine learning tasks are typically constructed by splitting protein sequences between train, validation and test sets based on protein sequence similarity. For tasks largely determined by protein structure, such as protein function prediction, we hypothesize that such data splitting may cause data leakage between the sets, since proteins can be structurally and functionally similar but still have dissimilar sequences. As a result, model performance on low sequence similarity levels could be overestimated. We demonstrate that this is the case on a commonly used enzyme dataset, by introducing a novel dataset construction methodology designed to prevent leakage between sets based on i) using structure similarity instead of sequence similarity to cluster proteins, and ii) generating tight protein clusters using community detection. Additionally, we demonstrate that simple models based on protein language model representations provide powerful baselines for the task of protein function prediction.

Chat is not available.