CiteGuard: Retrieval-Augmented Citation Verification for LLM-Powered Peer Review
Abstract
Accurate citations are essential for reproducibility and cumulative scientific progress, yet citation errors remain common and rarely receive systematic scrutiny in automated reviewing workflows. We introduce CiteGuard, a fast and auditable citation verifier that combines high-coverage retrieval with scientific-domain embeddings and lightweight LLM adjudication. CiteGuard extracts every in-text citation, retrieves candidate sources via a BM25+SPECTER2 fusion, and computes an interpretable alignment score that aggregates DOI agreement, robust title similarity, SPECTER2 semantic similarity, and venue/year compatibility. The score is calibrated to a probability with isotonic regression, and only uncertain cases are escalated to a small language model for a deterministic judgment. Evaluated on RealCitationErrors-500 (500 arXiv/PMC papers; 7,221 citations; 813 errors), CiteGuard achieves paper-level F1=0.95 and citation-level P=0.82, R=0.97, F1=0.89±0.02 (95% cluster bootstrap over papers), outperforming strong retrieval and LLM baselines while maintaining high precision. Median end-to-end latency is 11.7 s per paper with 18% of citations escalated; median per-review cost is USD 0.0028 under July 2025 small-LLM pricing. In a within-subject user study (n=28), participants preferred reviews augmented with CiteGuard in 72% of blinded comparisons (Wilcoxon signed-rank p=0.007, Cliff’s δ=0.62). An ablation analysis indicates that SPECTER2 and multi-hit retrieval primarily drive recall, while calibrated escalation improves precision. Performance declines on low-resource humanities texts (F1=0.76), motivating domain adaptation. We provide an anonymized artifact for reproduction in the supplement; implementation details will be released upon acceptance, and code will be shared upon reasonable request thereafter.
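To make the score-calibrate-escalate flow concrete, the following is a minimal sketch, not the CiteGuard implementation. It assumes the alignment score is a weighted sum of four signals in [0, 1], uses a small pure-Python pool-adjacent-violators routine as a stand-in for isotonic regression, and escalates when the calibrated probability falls inside an uncertainty band. The weights, the band limits (0.2 and 0.8), the decision labels, and every function name here are hypothetical; the abstract does not report them.

```python
from bisect import bisect_left
from dataclasses import dataclass

# Hypothetical weights (sum to 1); the actual values are not given in the abstract.
WEIGHTS = {"doi": 0.4, "title": 0.3, "semantic": 0.2, "venue_year": 0.1}


@dataclass
class Signals:
    doi: float         # 1.0 if DOIs agree, 0.0 otherwise
    title: float       # title similarity in [0, 1]
    semantic: float    # embedding similarity in [0, 1]
    venue_year: float  # venue/year compatibility in [0, 1]


def alignment_score(s: Signals) -> float:
    """Weighted sum of the four signals; stays in [0, 1] for inputs in [0, 1]."""
    return (WEIGHTS["doi"] * s.doi + WEIGHTS["title"] * s.title
            + WEIGHTS["semantic"] * s.semantic
            + WEIGHTS["venue_year"] * s.venue_year)


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing step function
    mapping raw score to the empirical probability that a citation is correct."""
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for x, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, x])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, x_max = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = x_max
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Look up the step value for a score; scores above the last threshold
    reuse the final step."""
    i = min(bisect_left(thresholds, score), len(values) - 1)
    return values[i]


def decide(prob, low=0.2, high=0.8):
    """Accept or flag confident cases; hand everything in between to the
    adjudicating language model (not implemented in this sketch)."""
    if prob >= high:
        return "verified"
    if prob <= low:
        return "error"
    return "escalate"


if __name__ == "__main__":
    # Toy held-out data: raw alignment scores with 1 = correct citation.
    raw = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    gold = [0, 0, 1, 0, 1, 0, 1, 1]
    thresholds, values = fit_isotonic(raw, gold)
    for score in (0.15, 0.5, 0.85):
        print(score, decide(calibrate(score, thresholds, values)))
```

In this arrangement only the middle band reaches the language model, which is how a setup of this kind can keep the escalation rate, and with it latency and cost, low; narrowing or widening the band trades adjudication cost against the number of automatic decisions.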