LLMs as Judges for Domain-Specific Text: Evidence from Drilling Reports
Abstract
Large language models are increasingly evaluated by other models in many workflows. This scales well, but it is risky in domains where facts, numbers, and terminology matter. We study this question in an industrial data-to-text setting: short, structured reports generated from time-series sensor data. The task is Daily Drilling Report (DDR) sentence generation, but the lessons apply to any domain-grounded pipeline. We evaluate LLMs used as judges under three protocols: a minimal single score, a weighted multi-criteria score, and a multi-criteria scheme with external aggregation. We compare model sizes and prompt designs using agreement metrics against human experts. Larger judges improve consistency, yet prompt and aggregation choices still cause large shifts in reliability and calibration. Smaller judges fail to track numeric and terminology constraints even when the prompt enforces structure. The takeaways are practical: reliable evaluation needs domain knowledge in the rubric, transparent aggregation, and stress tests that expose the failure modes of LLM-as-judge. Our study offers a blueprint for building such evaluations in data-to-text applications and a caution against treating general-purpose judges as drop-in replacements for expert assessment.
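To make the third protocol concrete, the sketch below shows one way a weighted multi-criteria score with external aggregation and expert agreement could be computed. It is an illustrative assumption, not the paper's implementation: the criterion names, weights, score scale, and thresholds are hypothetical, and agreement is measured here with quadratic-weighted Cohen's kappa as one possible choice.

    # Minimal sketch (assumed, not the authors' code) of a weighted
    # multi-criteria protocol with external aggregation: the judge model
    # returns per-criterion scores; the final score and the comparison
    # with human experts are computed outside the model.
    from sklearn.metrics import cohen_kappa_score

    CRITERIA_WEIGHTS = {            # hypothetical rubric weights
        "factual_accuracy": 0.4,
        "numeric_correctness": 0.3,
        "terminology": 0.2,
        "fluency": 0.1,
    }

    def aggregate(judge_scores: dict[str, int]) -> float:
        """Combine per-criterion judge scores (e.g. 1-5) into one weighted score."""
        return sum(CRITERIA_WEIGHTS[c] * judge_scores[c] for c in CRITERIA_WEIGHTS)

    def discretize(score: float, thresholds=(2.0, 3.0, 4.0)) -> int:
        """Map the aggregated score onto the ordinal scale used by human raters."""
        return sum(score >= t for t in thresholds) + 1

    # Example: compare judge-derived labels with expert labels on two sentences.
    judge_raw = [
        {"factual_accuracy": 5, "numeric_correctness": 4, "terminology": 5, "fluency": 4},
        {"factual_accuracy": 2, "numeric_correctness": 1, "terminology": 3, "fluency": 5},
    ]
    judge_labels = [discretize(aggregate(s)) for s in judge_raw]
    human_labels = [4, 2]  # placeholder expert ratings

    # Quadratic-weighted kappa credits near-misses on an ordinal scale.
    print(cohen_kappa_score(judge_labels, human_labels, weights="quadratic"))

Keeping aggregation outside the judge, as above, makes the rubric weights auditable and lets the same per-criterion outputs be re-aggregated or re-calibrated without re-prompting the model.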