Few-Label SetFit for C2C Online Marketplace Listings: Multi-Label Classification and Entity Extraction for Identifying Potentially Illicit Ads
Abstract
The proliferation of illicit online advertisements poses a growing challenge for law enforcement and public safety, demanding detection methods that are both accurate and data-efficient. This paper investigates \texttt{SetFit}, a prompt-free few-shot framework for fine-tuning sentence transformers, for detecting risk signals in online advertisements. We study two domains of high societal impact: commercial sex advertisements (CSAs) associated with suspected human trafficking (HT) activity, and suspected stolen-parts marketplace listings (SCP). For CSAs, we cast the problem as a multi-label post classification task and evaluate several sentence-aggregation strategies; our best approach, averaging across all sentences in a post, achieves an F1 of \textbf{0.783}, surpassing a GPT-2 baseline that was trained on more data and has far more parameters. In contrast, when applied to token-level named-entity recognition on SCP listings, SetFit underperforms GPT-2 (F1 = 0.242 versus F1 = 0.337), exposing a structural limitation in adapting sentence-level embeddings to fine-grained extraction. These findings indicate that SetFit is highly effective for low-resource post classification in sensitive domains, and they motivate new research on extending few-shot methods to entity-level modeling.
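To make the "averaging across all sentences" aggregation concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that aggregation operates on per-sentence label probabilities (the abstract does not say whether scores or embeddings are averaged), that a naive punctuation-based splitter is adequate, and that a 0.5 decision threshold is used. The names \texttt{split\_sentences}, \texttt{classify\_post}, and \texttt{sentence\_scorer} are hypothetical; \texttt{sentence\_scorer} stands in for any trained sentence-level multi-label model.

```python
import re
from statistics import fmean
from typing import Callable, List, Sequence


def split_sentences(post: str) -> List[str]:
    """Naive splitter on ., ! or ? followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", post.strip())
    return [p for p in parts if p]


def classify_post(
    post: str,
    sentence_scorer: Callable[[str], Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[int]:
    """Average per-sentence label scores, then threshold each label."""
    sentences = split_sentences(post)
    if not sentences:
        raise ValueError("post contains no sentences")
    # One score vector per sentence; every vector has one entry per label.
    scores = [sentence_scorer(s) for s in sentences]
    # Column-wise mean gives a single post-level score for each label.
    mean_scores = [fmean(column) for column in zip(*scores)]
    return [int(m >= threshold) for m in mean_scores]
```

Under these assumptions, a label is assigned to a post only when its average score over all sentences reaches the threshold, so a single high-scoring sentence in a long post does not trigger the label on its own.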