PSMBench: A Benchmark and Dataset for Evaluating LLMs Extraction of Protocol State Machines from RFC Specifications
Zilin Shen · Xinyu Luo · Imtiaz Karim · Elisa Bertino
Abstract
Accurately extracting protocol-state machines (PSMs) from the long, densely written Request-for-Comments (RFC) standards that govern Internetāscale communication remains a bottleneck for automated security analysis and protocol testing. In this paper, we introduce RFC2PSM, the first large-scale dataset that pairs 1,580 pages of cleaned RFC text with 108 manually validated states and 297 transitions covering 14 widely deployed protocols spanning the data-link, transport, session, and application layers. Built on this corpus, we propose PsmBench, a benchmark that (i) feeds chunked RFC to an LLM, (ii) prompts the model to emit a machine-readable PSM, and (iii) scores the output with structure-aware, semantic fuzzy-matching metrics that reward partially correct graphs.A comprehensive baseline study of nine state-of-the-art open and commercial LLMs reveals a persistent stateātransition gap: models identify many individual states (up to $0.82$ F1) but struggle to assemble coherent transition graphs ($\leq 0.38$ F1), highlighting challenges in long-context reasoning, alias resolution, and action/event disambiguation. We release the dataset, evaluation code, and all model outputs as open-sourced, providing a fully reproducible starting point for future work on reasoning over technical prose and generating executable graph structures. RFC2PSM and PsmBench aim to catalyze cross-disciplinary progress toward LLMs that can interpret and verify the protocols that keep the Internet safe.
Successful Page Load