Electron-Proton Scattering Event Generation using Structured Tokenization
Steven Goldenberg · Diana McSpadden · Nobuo Sato · Kostas Orginos · Robert Edwards
Abstract
Recent work such as OmniJet-$\alpha$ has demonstrated that well-designed tokenization combined with transformer-based architectures can produce effective foundation models for jet physics. While tokenization may help models capture generalizable event characteristics, it also introduces discretization errors that may compromise the precision required for downstream physics analyses. As the number and complexity of particle features grow, these errors are likely to grow accordingly. In this study, we investigate new tokenization strategies to improve the application of generative transformer models to Pythia8 simulations of electron-proton scattering at the Electron-Ion Collider. Specifically, we propose a feature-based structured tokenization approach that uses multiple tokens per particle, improving expressivity while reducing the total number of unique tokens needed. We evaluate this method against grid-based binning, K-means clustering, and vector-quantized variational autoencoders on simulated events. Our results show that feature-based structured tokenization reduces discretization error, leading to more accurate generative modeling of particle-level events.
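To make the vocabulary argument concrete, the following is a minimal sketch of one possible feature-wise tokenizer, not the implementation used in this study. It assumes uniform bins, one token per feature with a separate vocabulary block for each feature, and hypothetical names (`tokenize_per_feature`, `detokenize_per_feature`, `lows`, `highs`, `n_bins`). Under these assumptions, a particle with F features and B bins per feature needs F * B unique tokens, whereas a joint grid at the same resolution needs B ** F.

```python
import numpy as np


def tokenize_per_feature(x, lows, highs, n_bins):
    """Map each feature of each particle to its own token (illustrative only).

    x has shape (n_particles, n_features). Feature j uses the token range
    [j * n_bins, (j + 1) * n_bins), so the vocabulary is n_features * n_bins.
    """
    x = np.asarray(x, dtype=float)
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    scaled = (x - lows) / (highs - lows)  # rescale each feature to [0, 1]
    idx = np.clip(np.floor(scaled * n_bins).astype(int), 0, n_bins - 1)
    offsets = np.arange(x.shape[1]) * n_bins  # separate block per feature
    return idx + offsets


def detokenize_per_feature(tokens, lows, highs, n_bins):
    """Return the bin-center value for each token."""
    tokens = np.asarray(tokens)
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    idx = tokens - np.arange(tokens.shape[1]) * n_bins
    width = (highs - lows) / n_bins
    return lows + (idx + 0.5) * width


# Hypothetical ranges for three generic particle features.
lows, highs, n_bins = [0.0, -2.0, 0.0], [1.0, 2.0, 4.0], 4
particles = np.array([[0.25, -1.0, 3.0]])

tokens = tokenize_per_feature(particles, lows, highs, n_bins)
recon = detokenize_per_feature(tokens, lows, highs, n_bins)

n_features = particles.shape[1]
vocab_structured = n_features * n_bins  # one block of bins per feature
vocab_grid = n_bins ** n_features       # one token per joint grid cell
```

In this sketch the reconstruction error per feature is bounded by half a bin width, so for a fixed vocabulary budget the feature-wise scheme can afford much finer bins than a joint grid, at the cost of a longer token sequence per particle.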