Mol-SGCL: Molecular Substructure-Guided Contrastive Learning for Out-of-Distribution Generalization
Abstract
Datasets for molecular property prediction are small relative to the vast chemical space, making generalization from limited experiments a central challenge. We present Mol-SGCL, Molecular Substructure-Guided Contrastive Learning, a method that shapes the latent space of molecular property prediction models to align with science-based priors. We hypothesize that engineering inductive biases directly into the representation space encourages models to learn chemical principles rather than overfit to spurious correlations. Concretely, Mol-SGCL employs a triplet loss that pulls a molecule’s representation toward representations of plausibly causal substructures and pushes it away from implausibly causal ones. Plausibility is defined either by domain-specific rules in Mol-SGCLRules or by a large language model in Mol-SGCLLLM. To stress-test out-of-distribution (OOD) generalization under data scarcity, we construct modified Therapeutics Data Commons tasks that minimize train–test similarity and cap the training set at 150 molecules. On these OOD splits, both Mol-SGCLLLM and Mol-SGCLRules consistently outperform baselines, indicating that Mol-SGCL promotes invariant feature learning and enhances model generalizability in data-limited regimes. We further demonstrate that Mol-SGCL transfers successfully to Minimol, a state-of-the-art molecular property prediction model, highlighting that the approach is not tied to a specific architecture. We envision that Mol-SGCL could be extended beyond molecular property prediction to any setting where inputs can be decomposed into substructures whose presence, absence, or configuration has a causal influence on the target label.
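For concreteness, the sketch below illustrates the kind of substructure-guided triplet objective the abstract describes, assuming precomputed encoder embeddings for the molecule (anchor), a plausibly causal substructure (positive), and an implausibly causal substructure (negative). The function name, cosine-distance choice, margin, and weighting term are illustrative assumptions, not the paper's stated implementation.

```python
import torch
import torch.nn.functional as F

def substructure_triplet_loss(mol_emb: torch.Tensor,
                              plausible_emb: torch.Tensor,
                              implausible_emb: torch.Tensor,
                              margin: float = 1.0) -> torch.Tensor:
    """Illustrative triplet loss: pull the molecule embedding toward the
    plausibly causal substructure and push it away from the implausibly
    causal one. All inputs are (batch, dim) embeddings from the property
    prediction encoder; the margin is a placeholder hyperparameter."""
    d_pos = 1.0 - F.cosine_similarity(mol_emb, plausible_emb, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(mol_emb, implausible_emb, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# Hypothetical usage: add the contrastive term to the supervised
# property-prediction loss with an assumed weighting coefficient.
# total_loss = property_loss + lam * substructure_triplet_loss(z_mol, z_pos, z_neg)
```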