Poster

MedJourney: Benchmark and Evaluation of Large Language Models over Patient Clinical Journey

Xian Wu ⋅ Yutian Zhao ⋅ Yunyan Zhang ⋅ Jiageng Wu ⋅ Zhihong Zhu ⋅ Yingying Zhang ⋅ Yi Ouyang ⋅ Ziheng Zhang ⋅ Huimin Wang ⋅ Zhenxi Lin ⋅ Jie Yang ⋅ Shuang Zhao ⋅ Yefeng Zheng

2024 Poster

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in language understanding and generation, leading to their widespread adoption across various fields. Among these, the medical field is particularly well-suited for LLM applications, as many medical tasks can be enhanced by LLMs. Despite the existence of benchmarks for evaluating LLMs on medical question answering and exams, there remains a notable gap in assessing LLMs' performance in supporting patients throughout their entire hospital visit journey in real-world clinical practice. In this paper, we address this gap by dividing a typical patient's clinical journey into four stages: planning, access, delivery, and ongoing care. For each stage, we introduce multiple tasks and corresponding datasets, resulting in a comprehensive benchmark comprising 12 datasets, of which five are newly introduced and seven are constructed from existing datasets. The proposed benchmark facilitates a thorough evaluation of LLMs' effectiveness across the entire patient journey, providing insights into their practical application in clinical settings. Additionally, we evaluate three categories of LLMs against this benchmark: 1) proprietary LLM services such as GPT-4; 2) public LLMs such as QWen; and 3) specialized medical LLMs such as HuatuoGPT2. Through this extensive evaluation, we aim to provide a better understanding of LLMs' performance in the medical domain, ultimately contributing to their more effective deployment in healthcare settings.
