Mol-SGCL: Molecular Substructure-Guided Contrastive Learning for Out-of-Distribution Generalization
Abstract
Datasets for molecular property prediction are small relative to the vast chemical space, making generalization from limited experiments a central challenge. We present Mol-SGCL -- Molecular Substructure-Guided Contrastive Learning -- a method that shapes the latent space of molecular property prediction models to align with science-based priors. We hypothesize that engineering inductive biases directly into the representation space encourages models to learn chemical principles rather than overfit to spurious correlations. Concretely, Mol-SGCL employs a triplet loss that pulls a molecule’s representation toward representations of plausibly causal substructures and pushes it away from implausibly causal ones. Plausibility is determined by querying a large language model with the list of extracted substructures. To stress-test out-of-distribution (OOD) generalization under data scarcity, we construct modified Therapeutics Data Commons tasks that minimize train–test similarity and cap the training set at 150 molecules. On these OOD splits, Mol-SGCL outperforms baselines, indicating that it promotes invariant feature learning and enhances generalizability in data-limited regimes. We further demonstrate that Mol-SGCL transfers successfully to a D-MPNN backbone, highlighting that the approach is not tied to a specific architecture. We anticipate that Mol-SGCL may enhance AI-driven drug discovery by improving molecular property prediction for novel candidate molecules that are out-of-distribution with respect to existing data. Beyond molecular property prediction, we envision extending the approach to diverse therapeutics tasks, as long as the inputs can be decomposed into substructures whose presence, absence, or configuration influences the target label.
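The abstract describes the triplet objective only at a high level. As an illustrative sketch (not the paper's exact formulation), a standard triplet margin loss over an anchor molecule embedding $z$, a plausibly causal substructure embedding $z^{+}$, and an implausibly causal substructure embedding $z^{-}$, with a distance $d$ and margin $\alpha$ that are assumptions here, takes the form

$$\mathcal{L}_{\text{triplet}} = \max\bigl(0,\; d(z, z^{+}) - d(z, z^{-}) + \alpha\bigr),$$

which pulls the molecule toward its plausibly causal substructure and pushes it away from the implausibly causal one by at least the margin.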