Poster
in
Workshop: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance

Demo: Statistically Significant Results on Biases and Errors of LLMs Do Not Guarantee Generalizable Results

Jonathan Liu ⋅ Damianos Karakos ⋅ Mark Dredze ⋅ Jonathan Lasko ⋅ Haoling Qiu ⋅ Mahsa Yarmohammadi

2025 Poster


Abstract

Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use cases of LLMs. Chatbots used in medical contexts, however, should give consistent advice regardless of non-medical factors, such as demographic information that is not clinically relevant to the question. We study the conditions under which medical chatbots fail to perform as expected by building an infrastructure that 1) automatically creates prompts to probe LLMs and 2) evaluates their answers using multiple steps and subsystems, including LLM-as-a-judge. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions, which we then use to prompt LLMs. For 2), our evaluation pipeline detects hallucinations and omissions using LLM-as-a-judge and agentic workflows, and detects treatment categories using LLM-as-a-judge. Finally, using a subset of our dataset of 3.7M prompts, we find that only specific pairs of answering and evaluation LLMs produce statistically significant differences in treatment categorization across genders and races. We recommend that studies relying on LLM evaluation use multiple LLMs as evaluators to avoid arriving at statistically significant but non-generalizable results, especially when ground-truth data is not readily available.
