Fake It Till You Make It: Multi-Physics Synthesis Breaks the Data Barrier in Chemical Language Models
Abstract
Chemical language models (CLMs) promise transformative capabilities in polymer property prediction and design, yet their potential is hindered by the scarcity of experimental data. We present Poly4mer, a novel CLM framework based on multi-physics synthesis that unites neural polymer representation learning with theoretical foundations in physics. Poly4mer integrates a comprehensive group contribution method for hypothetical polymer generation with physics-based simulations that faithfully emulate experimental protocols, fabricating a rich dataset of synthetic polymer structures and structure–property relationships to establish strong priors for CLM training. This synthetic data enables training of two critical components: an encoder-decoder architecture that captures polymer semantics, and a two-phase property prediction strategy comprising supervised pretraining on synthetic data for physically consistent alignment, followed by fine-tuning on experimental measurements to enhance predictive accuracy. We then architect an autoencoding system that couples predictive capability with latent decoding, enabling inverse design of polymers optimized for downstream applications through latent space exploration and structure reconstruction. By faking data with physics before reality catches up, we demonstrate that multi-physics synthesis can break the data barrier in CLMs, establishing a new paradigm for physics-grounded neural polymer discovery.