Mechanistic Reaction Data for Interpretable Deep Learning in Chemistry
Abstract
The lack of openly accessible, well-curated reaction databases remains a major obstacle to data-driven research in chemistry. Many existing chemical datasets are proprietary and/or limited to unbalanced overall transformations that map reactants directly to products without revealing underlying mechanisms, intermediates, or byproducts. As a result, machine learning models trained on such data often act as “black boxes,” predicting products without explaining how or why they form. To address this gap, we present the largest and most comprehensive publicly available dataset of manually curated elementary reaction steps, integrated into a platform that supports continuous curation, search, and community contribution at scale. The data cover polar and radical elementary steps, complete mechanistic pathways, and combinatorially generated mechanisms, with each reaction represented as a balanced, canonicalized SMIRKS string with reactive atom mapping and mechanistic annotations. By making mechanistic reaction data widely available, we aim to enable the development of interpretable and more accurate machine learning models for reaction and pathway prediction. The platform will be made available at [URL omitted for anonymity] upon acceptance.
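To illustrate what "balanced" and "atom-mapped" mean for an elementary step, the following is a minimal sketch and not the platform's actual validation code. It assumes a simplified encoding in which every atom is written in brackets with an explicit hydrogen count, an optional charge, and a map number, and in which the two sides are separated by `>>`. The regular expression, the function names `summarize_side` and `is_balanced`, and the example S_N2 step are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumed simplified atom syntax: [Element][H count][charge]:[map number],
# e.g. [CH3:1], [OH-:3], [Br-:2]. Real SMIRKS allows far more than this.
ATOM = re.compile(r"\[([A-Z][a-z]?)(H\d*)?(?:([+-])(\d*))?:(\d+)\]")


def summarize_side(side):
    """Return (element counts, net charge, sorted map numbers) for one side."""
    matches = ATOM.findall(side)
    if len(matches) != side.count("["):
        raise ValueError("side contains atoms outside the assumed syntax")
    elements = Counter()
    charge = 0
    maps = []
    for element, hydrogens, sign, magnitude, map_num in matches:
        elements[element] += 1
        if hydrogens:
            # "H" alone means one hydrogen; "H3" means three.
            elements["H"] += int(hydrogens[1:] or 1)
        if sign:
            charge += int(magnitude or 1) * (1 if sign == "+" else -1)
        maps.append(int(map_num))
    return elements, charge, sorted(maps)


def is_balanced(reaction):
    """True if atoms, net charge, and atom-map numbers agree on both sides."""
    reactants, products = reaction.split(">>")
    left = summarize_side(reactants)
    right = summarize_side(products)
    maps_unique = len(set(left[2])) == len(left[2])
    return maps_unique and left == right


# Hypothetical example: hydroxide displaces bromide from bromomethane.
step = "[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3].[Br-:2]"
print(is_balanced(step))                                        # True
# Dropping the bromide byproduct leaves the step unbalanced.
print(is_balanced("[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3]"))      # False
```

The second call shows why unbalanced overall transformations are harder to interpret: once the leaving group is omitted, atoms and charge no longer match across the arrow, so the fate of each mapped atom cannot be followed.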