MedVAL: Toward Expert-Level Medical Text Validation with Language Models
Asad Aali · Vasiliki Bikia · Maya Varma · Nicole Chiou · Sophie Ostmeier · Arnav Singhvi · Magdalini Paschali · Ashwin Kumar · Andrew Johnston · Karimar Amador-Martinez · Eduardo Guerrero · Paola Rivera · Sergios Gatidis · Christian Bluethgen · Eduardo Reis · Eddy van Rilland · Poonam Hosamani · Kevin Keet · Minjoung Go · Evelyn Ling · David Larson · Curtis Langlotz · Roxana Daneshjou · Jason Hom · Sanmi Koyejo · Emily Alsentzer · Akshay Chaudhari
Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the “LM-as-judge” paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges, including a multilingual task reviewed by bilingual physicians. Each output is reviewed following a physician-defined taxonomy of risk levels and error categories, enabling evaluation of LMs in making safety decisions for deployment. Across 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
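As a rough illustration of how agreement between an evaluator LM and physicians might be scored with F1, the sketch below computes a macro-averaged F1 over risk-level labels. The integer risk levels, the sample labels, and the choice of macro averaging are all assumptions made for illustration; the abstract does not specify the number of risk levels or the averaging scheme, and this is not the authors' evaluation code.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical risk levels (1 = lowest risk, 4 = highest); not real data.
physician = [1, 1, 2, 3, 4, 4]
evaluator = [1, 2, 2, 3, 4, 3]

score = macro_f1(physician, evaluator)
print(f"macro F1: {score:.3f}")
```

Macro averaging weights every risk level equally, which can matter when high-risk outputs are rare; a different averaging scheme would give different numbers on the same labels.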