
Workshop: Synthetic Data Generation with Generative AI

Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models

Nicholas Kuo · Louisa Jorm · Sebastiano Barbieri

Keywords: [ synthetic data; generative adversarial networks; diffusion models; electronic health records; mixed-typed dataset; time-series dataset ]


This paper introduces a novel method for simulating Electronic Health Records (EHRs) using Diffusion Probabilistic Models (DPMs). We showcase the ability of DPMs to generate longitudinal EHRs with mixed-type variables – numeric, binary, and categorical. Our approach is benchmarked against existing Generative Adversarial Network (GAN)-based methods in two clinical scenarios: management of acute hypotension in the intensive care unit and antiretroviral therapy for people with human immunodeficiency virus. Our DPM-simulated datasets not only minimise patient disclosure risk but also outperform GAN-generated datasets in terms of realism. These datasets also prove effective for training downstream machine learning algorithms, including reinforcement learning and Cox proportional hazards models for survival analysis.
