Can You Spot the Virtual Patient (VP)? Expert Evaluation, Turing Test, Linguistic Analysis, and Semantic Similarity Analysis
Reyhaneh Hosseinpourkhoshkbari · Wei-chen Huang · Suvel Muttreja · Richard M Golden
Abstract
Communication is a critical clinical skill, yet scalable, realistic training tools remain limited. Large language model (LLM)-based virtual patients (VPs) offer a promising alternative to traditional tools, but their conversational realism remains underexplored. In this study, we evaluate the realism of GPT-4o-generated VPs using a multi-method approach: expert review, Turing-style testing, linguistic analysis, and semantic similarity. We generated 44 VPs based on real doctor–patient dialogues. Expert annotations of hallucinations, omissions, and repetitions showed high interrater reliability ($ICC > 0.77$). In a Turing test, participants struggled to distinguish VPs from real patients—classification accuracy fell below chance. Linguistic analysis of 2,000+ dialogue turns revealed that VPs produced formal, lexically consistent responses, while human patients showed more emotional and stylistic variability. Semantic similarity scores averaged 0.871 (response-level) and 0.842 (transcript-level), indicating strong alignment. These findings support the use of LLM-based VPs in communication training and offer insights into realism, trust, and refinement, contributing to the safe and responsible deployment of generative AI in healthcare.
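As a rough illustration of how response-level and transcript-level semantic similarity scores of this kind are commonly computed, the sketch below uses cosine similarity over sentence embeddings. The embedding model, function names, and pairing scheme are illustrative assumptions, not the authors' reported pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical encoder choice; the abstract does not specify which embedding model was used.
model = SentenceTransformer("all-MiniLM-L6-v2")


def response_similarity(vp_response: str, human_response: str) -> float:
    """Response-level score: cosine similarity between a VP turn and the matched human patient turn."""
    emb = model.encode([vp_response, human_response], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def transcript_similarity(vp_turns: list[str], human_turns: list[str]) -> float:
    """Transcript-level score: embed each full transcript as one document and compare."""
    emb = model.encode([" ".join(vp_turns), " ".join(human_turns)], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


if __name__ == "__main__":
    # Toy example with invented dialogue turns.
    print(response_similarity(
        "The chest pain started two days ago and gets worse when I climb the stairs.",
        "It began a couple of days back, and climbing stairs makes the pain worse.",
    ))
```

Averaging such scores across matched VP/human pairs would yield summary figures comparable in form to the 0.871 (response-level) and 0.842 (transcript-level) values reported above.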