Spotlight in Workshop: UniReps: Unifying Representations in Neural Models

Characterizing pre-trained and task-adapted molecular representations

Celia Cintas ⋅ Payel Das ⋅ Jarret Ross ⋅ Brian Belgodere ⋅ Girmaw Abebe Tadesse ⋅ Vijil Chenthamarakshan ⋅ Jannis Born ⋅ Skyler D. Speakman


Abstract

Pre-trained deep learning models are rapidly emerging as tools for enhancing scientific workflows and accelerating scientific discovery. Representation learning is fundamental to studying the molecular structure–property relationship, which is then leveraged to predict molecular properties or to design new molecules with desired attributes. However, evaluating the emerging "zoo" of pre-trained models for various downstream tasks remains challenging. We propose an unsupervised method to characterize embeddings of pre-trained models through the lens of non-parametric group property-driven subset scanning (SS). We assess its detection capabilities with extensive experiments on diverse molecular benchmarks (ZINC-250K, MOSES, MoleculeNet) across predictive chemical language models (MoLFormer, ChemBERTa) and molecular graph generative models (GraphAF, GCPN). We further evaluate how representations evolve as a result of domain adaptation by fine-tuning or low-dimensional projection. Experiments reveal notable information condensation in the pre-trained embeddings upon task-specific fine-tuning as well as under projection techniques. For example, among the top-$120$ most common elements in the embedding (out of $\approx 700$), only $11$ property-driven elements are shared between the three tasks (BACE, BBBP, and HIV), while $\approx 70$-$80$ of those are unique to each task. This work provides a post-hoc quality evaluation method for representation learning models and domain adaptation methods that is task- and modality-agnostic.
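
The shared-versus-unique comparison in the abstract can be illustrated with a small set computation. The sketch below is not the authors' code and does not implement subset scanning itself. It assumes that some scanning step has already returned, for each task, a collection of subsets of embedding indices. The function names `top_k_elements` and `shared_and_unique` are hypothetical, and the toy data is invented for illustration rather than taken from the paper.

```python
from collections import Counter


def top_k_elements(subsets, k):
    """Return the k embedding indices that occur most often across `subsets`."""
    counts = Counter(idx for subset in subsets for idx in subset)
    return {idx for idx, _ in counts.most_common(k)}


def shared_and_unique(top_by_task):
    """Split each task's top elements into those shared by every task
    and those that appear in no other task."""
    shared = set.intersection(*top_by_task.values())
    unique = {}
    for task, elems in top_by_task.items():
        # Union of the top elements of all *other* tasks.
        others = set().union(*(e for t, e in top_by_task.items() if t != task))
        unique[task] = elems - others
    return shared, unique


# Toy input (assumption): per-task subsets of embedding indices.
toy_subsets = {
    "BACE": [{0, 1, 2}, {0, 1, 3}],
    "BBBP": [{0, 4, 5}, {0, 4, 6}],
    "HIV": [{0, 7, 8}, {0, 7, 9}],
}

top = {task: top_k_elements(s, k=2) for task, s in toy_subsets.items()}
shared, unique = shared_and_unique(top)
```

In the setting described in the abstract, `k` would be 120 and the indices would range over the roughly 700 embedding elements. The same two functions would then report how many elements are common to BACE, BBBP, and HIV and how many are specific to each task.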
