Optimized Statistical Ranking is All You Need for Robust Coreset Selection in Efficient Transformer-Based Spam Detection
Abstract
Efficient spam detection in resource-constrained environments is challenging because of class imbalance, noisy text, and the computational demands of large-scale Transformer models. We introduce a coreset selection strategy based on a unified Uncertainty-Diversity Ranking framework, which prioritizes highly informative, uncertain samples while maintaining diversity and class balance within the selected subset. The method supports multiple coreset strategies, including Top-K, Bottom-K, and adaptive class-wise selection, and remains robust even when training on a small fraction of the data. Extensive experiments on benchmark datasets, including UCI SMS, UTKML Twitter, and LingSpam, show that our ranking scheme achieves competitive accuracy, precision, and recall while reducing the training data by up to 95\%, which substantially lowers computational cost. These results highlight the potential of our approach for practical deployment on low-power devices, mobile platforms, and other resource-limited settings.
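To make the ranking idea concrete, the following Python sketch shows one way an uncertainty-diversity score and the Top-K, Bottom-K, and class-wise selection modes could be wired together. It is an illustration under stated assumptions, not the formulation used in this work: uncertainty is taken to be predictive entropy, diversity is taken to be the distance from a sample's embedding to its class centroid, and the two are mixed with a hypothetical weight `alpha`. The function names `ud_scores` and `select_coreset`, and the per-class quota proportional to class size, are likewise illustrative choices.

```python
import numpy as np


def _minmax(x):
    """Scale a 1-D array to [0, 1]; a constant array maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def ud_scores(probs, embeddings, labels, alpha=0.5):
    """Assumed score: alpha * entropy + (1 - alpha) * distance to class centroid."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Uncertainty proxy: entropy of the predicted class distribution.
    uncertainty = -(probs * np.log(probs)).sum(axis=1)

    # Diversity proxy: how far each sample lies from its own class centroid.
    diversity = np.zeros(len(labels))
    for c in np.unique(labels):
        mask = labels == c
        centroid = embeddings[mask].mean(axis=0)
        diversity[mask] = np.linalg.norm(embeddings[mask] - centroid, axis=1)

    return alpha * _minmax(uncertainty) + (1 - alpha) * _minmax(diversity)


def select_coreset(scores, labels, fraction=0.05, mode="top", classwise=True):
    """Return sorted indices of the selected subset.

    mode="top" keeps the highest-scoring samples, mode="bottom" the lowest.
    With classwise=True the quota is applied inside each class, so every
    class stays represented in the subset.
    """
    if mode not in ("top", "bottom"):
        raise ValueError("mode must be 'top' or 'bottom'")
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    sign = -1.0 if mode == "top" else 1.0

    def pick(idx, k):
        # Rank the candidate indices by score and keep the first k.
        return idx[np.argsort(sign * scores[idx], kind="stable")][:k]

    if not classwise:
        k = max(1, int(round(fraction * len(scores))))
        return np.sort(pick(np.arange(len(scores)), k))

    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        k = max(1, int(round(fraction * len(idx))))
        chosen.append(pick(idx, k))
    return np.sort(np.concatenate(chosen))
```

With `fraction=0.05`, the sketch keeps 5\% of each class, which corresponds to the 95\% data reduction regime mentioned above. In practice the probabilities and embeddings would come from a model trained or fine-tuned on the full or a warm-up subset of the data; how they are obtained is outside the scope of this sketch.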