Workshop: Synthetic Data for Empowering ML Research

Improving dermatology classifiers across populations using images generated by large diffusion models

Luke Sagers · James Diao · Matt Groh · Pranav Rajpurkar · Adewole Adamson · Arjun Manrai


Dermatological classification algorithms developed without sufficiently diverse training data may generalize poorly across populations. While more intentional data collection and annotation offer the best way to increase representation, new computational approaches for generating training data may also help reduce representation bias. In this paper, we show that DALL·E 2, a large text-to-image diffusion model, can generate photorealistic synthetic images of skin disease across skin types. Using the Fitzpatrick 17k dataset as a benchmark, we demonstrate that including DALL·E 2-generated synthetic images improves the classification accuracy of skin disease models, both overall and particularly for underrepresented groups.
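
The abstract reports accuracy both overall and for underrepresented groups. As a minimal sketch of how such a breakdown might be computed, the Python snippet below groups predictions by a per-image skin type label and reports accuracy for each group. The function name `accuracy_by_group`, the condition labels, and the toy data are illustrative assumptions, not the authors' evaluation code or results.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall_accuracy, {group: accuracy}) for parallel sequences.

    Hypothetical helper for illustration; `groups` could hold, for
    example, the Fitzpatrick skin type (1 to 6) annotated for each image.
    """
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred and groups must have equal length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    correct = defaultdict(int)  # correct predictions per group
    total = defaultdict(int)    # examples seen per group
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)

    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Toy data (made up for this sketch): four images, two skin type groups.
y_true = ["psoriasis", "eczema", "psoriasis", "eczema"]
y_pred = ["psoriasis", "eczema", "eczema", "eczema"]
skin_type = [1, 1, 6, 6]

overall, per_group = accuracy_by_group(y_true, y_pred, skin_type)
print(overall)    # 0.75
print(per_group)  # {1: 1.0, 6: 0.5}
```

Running the same breakdown on a model trained with and without synthetic images would show whether any gain is concentrated in particular groups, which a single overall accuracy figure can hide.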