STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability
Guanghui Wang · Jinze Yu · Xing Zhang · Dayuan Jiang · Yin Song · Peiyang He · Xuefeng Liu · Tomal Deb
Abstract
Large Language Models (LLMs) are increasingly deployed for structured data generation tasks, yet their output consistency remains a critical challenge for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines two key contributions: (1) STED (Semantic Tree Edit Distance), a novel similarity metric that balances semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework that aggregates multiple STED measurements across repeated generations to quantify output reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate that STED achieves superior performance ($0.86$–$0.90$ similarity for semantic equivalents, $0.0$ for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our consistency scoring framework to benchmark six LLMs reveals significant performance variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures ($T=0.9$), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful parameter tuning. Our framework enables practical applications including targeted model selection for structured output tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify root causes of inconsistency. This work provides both theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.
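To make the aggregation idea concrete, the sketch below shows one way a consistency score could be computed from repeated generations: parse each JSON output, score every pair with a similarity function, and average the pairwise scores. This is a minimal illustration, not the paper's implementation; the names `toy_sted` and `consistency_score` are hypothetical, `toy_sted` is only a stand-in for the actual STED metric, and mean pairwise aggregation is an assumed choice of aggregator.

```python
import itertools
import json

_MISSING = object()  # sentinel for unmatched list positions


def toy_sted(a, b):
    """Toy stand-in for a STED-like similarity on parsed JSON values.

    Structural mismatches (different types, missing keys, length gaps)
    contribute 0 for that subtree; matching structure with differing
    leaf values gets partial credit. The real STED metric is richer.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        keys = set(a) | set(b)
        if not keys:
            return 1.0
        return sum(
            toy_sted(a[k], b[k]) if k in a and k in b else 0.0
            for k in keys
        ) / len(keys)
    if isinstance(a, list) and isinstance(b, list):
        if not a and not b:
            return 1.0
        pairs = list(itertools.zip_longest(a, b, fillvalue=_MISSING))
        return sum(
            toy_sted(x, y) if x is not _MISSING and y is not _MISSING else 0.0
            for x, y in pairs
        ) / len(pairs)
    if type(a) is not type(b):
        return 0.0  # structural break
    return 1.0 if a == b else 0.5  # same type, different leaf value


def consistency_score(outputs, similarity=toy_sted):
    """Mean pairwise similarity over repeated generations (JSON strings)."""
    trees = [json.loads(o) for o in outputs]
    pairs = list(itertools.combinations(trees, 2))
    if not pairs:
        return 1.0
    return sum(similarity(x, y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    generations = [
        '{"name": "Ada", "skills": ["math", "code"]}',
        '{"name": "Ada", "skills": ["math", "coding"]}',
        '{"name": "Ada", "skills": ["math", "code"], "age": 36}',
    ]
    print(f"consistency: {consistency_score(generations):.3f}")
```

Because the aggregator only needs a pairwise similarity in $[0, 1]$, swapping `toy_sted` for the actual STED metric would leave the scoring loop unchanged.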