MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization
Abstract
Large language models show strong potential for molecular editing, but progress has been constrained by the limited scale and quality of available training data. To address this, we introduce MEGA, a large-scale dataset of 31.4 million molecule pairs, where each pair represents a single property-improving chemical edit annotated with an explicit action: Replace, Insert, or Delete. We demonstrate MEGA’s utility in a controlled supervised fine-tuning (SFT) setting, where a model trained on MEGA outperforms models trained on existing datasets by up to +21.47 percentage points in hit ratio. Furthermore, we show that Group Relative Policy Optimization (GRPO) post-training with a similarity-aware reward achieves state-of-the-art performance and a remarkable ∼36× improvement in data efficiency, while also preserving edit locality. We openly release MEGA to the community to enable data-centric benchmarks and accelerate progress in molecular editing with generative models.