LaTeXBench: Judge-Only Evaluation of LaTeX Generation, Minimal-Edit Compliance, and Blind Contrast Errors
Abstract
Large language models are increasingly used to author scientific documents in LaTeX, where success depends on structural validity, precise constraint adherence, and fault awareness beyond surface fluency. We present LaTeXBench, a compact, judge-only benchmark targeting three structure-aware abilities: (1) Generation: produce syntactically valid LaTeX that satisfies explicit structural requirements; (2) Edit-Compliance: apply only the requested edits while preserving all unrelated content byte-for-byte; and (3) Blind Contrast: detect and classify a single seeded fault drawn from a closed taxonomy. A single deterministic workbook provides 150 items (50 per task family). Scoring is fully automatic via strict JSON outputs from an LLM judge, with Wilson score intervals to quantify small-n uncertainty. We release prompts, runners, seeds, and plotting scripts to support transparent replication. Evaluations of production models show high detection and specificity on the contrast family but markedly lower compliance on minimal-edit tasks, underscoring structure-preserving editing as a key bottleneck. LaTeXBench offers an inexpensive, auditable base layer for measuring code-like behaviors in document tooling and for guiding future models specialized in LaTeX structure and editing.
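For concreteness, the per-family uncertainty mentioned above can be computed with a standard Wilson score interval. The sketch below is illustrative only: the helper name and the example counts (41 of 50 items passed) are assumptions, not values taken from the released runners.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default ~95% coverage)."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical example: 41 of 50 items passed in one task family.
low, high = wilson_interval(41, 50)
print(f"accuracy 0.82, 95% Wilson interval [{low:.3f}, {high:.3f}]")  # ~[0.692, 0.902]
```

At n = 50 per family the interval is roughly +/-0.10 around mid-range accuracies, which is why the benchmark reports intervals rather than point estimates alone.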