DMRG Quantum Chemistry Dataset for Multi-Reference Machine Learning
Abstract
The multireference (MR) regime, characterized by near-degenerate orbitals and competing spin states, is central to catalysis but poorly represented in existing datasets, which focus on closed-shell organic molecules. To address this gap, we propose a 50,000-point dataset of density matrix renormalization group (DMRG) calculations targeting MR systems such as transition-metal diatomics, organic radicals, and metal--ligand complexes. The dataset includes ground- and excited-state energies, spin densities, reduced density matrices, and entanglement metrics, enabling machine-learning (ML) models to learn structure--property relationships and the correlation-sensitive observables critical for catalyst design. Generating the dataset with an automated DMRG-SCF pipeline built on open-source tools (PySCF and block2) requires approximately one million CPU-hours (about $20k) and single-digit terabytes of storage. By providing high-fidelity MR data, this corpus will enhance ML-driven discovery of efficient catalysts for greener chemical processes, such as ammonia synthesis, supporting a carbon-neutral future.
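The resource figures quoted above imply simple per-point averages. The following is a minimal sketch of that arithmetic, using only the numbers stated in the abstract. The function names are hypothetical, and the 1 to 9 TB range is our reading of "single-digit terabytes", not a figure from the proposal. These are averages only: the cost of an individual DMRG calculation varies strongly with active-space size and bond dimension.

```python
# Back-of-the-envelope budget check using only the figures quoted in the abstract.
N_POINTS = 50_000        # dataset size (from the abstract)
CPU_HOURS = 1_000_000    # total compute (from the abstract)
BUDGET_USD = 20_000      # total cost (from the abstract)


def per_point_budget(n_points, cpu_hours, budget_usd):
    """Return average CPU-hours and cost per point, plus the implied hourly rate."""
    return {
        "cpu_hours_per_point": cpu_hours / n_points,
        "usd_per_cpu_hour": budget_usd / cpu_hours,
        "usd_per_point": budget_usd / n_points,
    }


def storage_per_point_mb(total_tb, n_points):
    """Average storage per point in MB, assuming decimal units (1 TB = 1e6 MB)."""
    return total_tb * 1_000_000 / n_points


budget = per_point_budget(N_POINTS, CPU_HOURS, BUDGET_USD)

if __name__ == "__main__":
    for key, value in budget.items():
        print(f"{key}: {value:g}")
    # Assumed range: "single-digit terabytes" read as 1 to 9 TB.
    for tb in (1, 9):
        print(f"{tb} TB total -> {storage_per_point_mb(tb, N_POINTS):g} MB per point")
```

Under these assumptions the averages come out to 20 CPU-hours and $0.40 per point, at an implied rate of $0.02 per CPU-hour.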