OpenGovCorpus: Evaluating LLMs on Citizen Query Tasks
Neil Majithia · Rajat Shinde · Manil Maskey · Elena Simperl
Abstract
"Citizen queries" are questions about government policies, guidance, and services relevant to an individual's circumstances. LLM-powered chatbots have a number of strengths that make them the obvious future for citizen query-answering, but hallucinated or outdated answers can cause significant harm to askers in such a sensitive context. We introduce OpenGovCorpus-UK and OpenGovCorpus-eval: a 7.5k-Q\&A-pair benchmark synthesized from $gov.uk$, and its use in an evaluation framework for LLMs in citizen-query tasks. The protocol spans three evaluator classes ((1) open-weights models, (2) GPT-family models, and (3) human judgment) combining a persona-aware Metadata Grader, embedding- and token-level Semantic Similarity, and \LLM-as-a-Judge with pass-rate aggregation. Results show strong few-shot gains, context and persona mismatches not captured by similarity metrics alone, and variation across families of closed/open models. We provide a reproducible procedure and thresholds suitable for lifecycle monitoring as policies and models evolve, supporting evidence-based public sector deployment for the future of trustworthy LLMs in government services.