MonitorLLM: Real-Time Structural and Bias Evaluation of Generative AI through Knowledge Graphs
Abstract
Large Language Models (LLMs) provide remarkable generative capabilities but remain vulnerable to hallucinations, semantic drift, and biased framings. Existing evaluation methods are largely static and dataset-bound, offering limited insight into how model behavior evolves under real-world conditions. We present \textbf{MonitorLLM}, a knowledge graph–based framework for continuous and interpretable evaluation of generative AI. MonitorLLM compares deterministic, ontology-driven graphs with LLM-generated graphs built from live news streams, quantifying deviations through schema-aware structural metrics and hallucination checks. To extend beyond factual reliability, we introduce a \textit{Generalized Prompt Framework} (GPF) that conditions prompts on diverse demographic, socioeconomic, and political groups, enabling the construction of bias-aware knowledge graphs and corresponding dissimilarity metrics. An adaptive anomaly detector integrates both the structural and bias dimensions, capturing temporal drift and reliability shifts. Experiments across nine LLMs demonstrate that MonitorLLM characterizes model stability, surfaces hallucinations, and reveals disparities in group-conditioned framings, offering a vendor-agnostic and auditable path toward trustworthy deployment of generative AI.