Poster in Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

An Evaluation Study of Hybrid Methods for Multilingual PII Detection

Harshit Rajgarhia · Suryam Gupta · Asif Shaik · Praveen Kumar Gulipalli · Y Santhoshraj · Sanka Nishitha · Abhishek Mukherji


Abstract

The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP’s modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
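To make the hybrid idea in the abstract concrete, the following is a minimal sketch of regex candidate generation followed by a context-based filter. It is not the RECAP implementation: the two patterns, the keyword list, and every function name are hypothetical, and the context-aware LLM step is replaced here by a simple keyword check so the example runs without any model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    entity_type: str
    text: str
    start: int
    end: int


# Hypothetical patterns for illustration only; the abstract does not list
# the expressions RECAP uses.
PATTERNS: Dict[str, re.Pattern] = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_candidates(text: str) -> List[Candidate]:
    """Deterministic step: collect every pattern match as a candidate span."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Candidate(entity_type, m.group(), m.start(), m.end()))
    return sorted(found, key=lambda c: c.start)


def keyword_context_check(text: str, cand: Candidate, window: int = 30) -> bool:
    """Assumed stand-in for the LLM: keep a PHONE candidate only if a
    phone-related cue word appears shortly before it."""
    if cand.entity_type != "PHONE":
        return True
    left = text[max(0, cand.start - window):cand.start].lower()
    return any(cue in left for cue in ("phone", "call", "tel", "mobile"))


def detect_pii(
    text: str,
    context_check: Callable[[str, Candidate], bool] = keyword_context_check,
) -> List[Candidate]:
    """Hybrid detection: regex proposes spans, the context check filters them."""
    return [c for c in regex_candidates(text) if context_check(text, c)]


sample = "Order 2023-1104-5567 shipped. Call +91 98765 43210 or mail a.kumar@example.org."
for c in detect_pii(sample):
    print(c.entity_type, c.text)
```

On the sample string, the order number matches the loose phone pattern but has no cue word before it, so the context step drops it, while the phone number and email address are kept. In a real system the `context_check` argument would presumably wrap a model call; it is passed in as a function so that it can be swapped without changing the regex stage.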
