ColliderML: A High-Luminosity Detector Simulation Dataset for Machine Learning Benchmarks
Abstract
We present ColliderML, a large-scale, fully simulated benchmark dataset of proton–proton collisions at HL-LHC conditions, designed to bridge the gap between realistic detector-level data and the needs of modern machine learning. Built on the validated OpenDataDetector geometry, ColliderML provides one million events at a pileup of μ = 200, spanning ten Standard Model and Beyond Standard Model processes, together with extensive single-particle samples. All events are digitised with realistic detector response and reconstructed using standard toolkits. The release delivers both low-level detector outputs and higher-level physics objects, enabling studies that range from track finding and calorimeter clustering to jet reconstruction and end-to-end foundation models. Unlike previous public datasets, which are typically limited to fast simulations or small-scale low-level challenges, ColliderML combines scale, realism, and breadth, while preserving truth information and pileup structure to support down-sampling and scaling studies of ML training and inference. Two complementary releases are provided: one with tracker hits, calorimeter hits, and reconstructed tracks, and one extended with calorimeter clusters, particle-flow objects, and jets, allowing direct comparison between traditional baselines and ML-driven approaches. The data are available in both ROOT/EDM4hep and lightweight HDF5 formats, together with a Python library for download, manipulation, and reproducibility. With this dataset we aim to establish a new standard for detector-level ML benchmarks, opening the path toward realistic, large-scale, and interpretable studies in collider physics.