Streaming k-Medoids for Fair and Scalable Patient Phenotyping under Memory Constraints
Abstract
Clustering offers a powerful route to identifying disease phenotypes, but applying distance-based methods at population scale remains challenging. Standard k-medoids with Gower distance, a natural choice for mixed-type clinical data, has quadratic time and memory complexity, which renders it infeasible for modern electronic health record (EHR) datasets with hundreds of thousands of patients. We address this barrier with a streaming, coreset-based k-medoids framework whose runtime scales linearly with the number of patients and whose memory use is bounded, enabling clustering under modest hardware limits. Our approach combines chunk-wise distance computation, Hungarian alignment of medoids across chunks, and a coreset-based refinement, with optional feature weighting to incorporate domain knowledge. Experiments on a synthetic 200,000-patient asthma dataset informed by the literature show that the method (i) matches the accuracy of full-distance clustering, (ii) scales to population-level datasets using under 10 GB of RAM, and (iii) recovers minority-dominated phenotypes when ethnicity is appropriately weighted. This work demonstrates a practical and broadly applicable framework for large-scale, mixed-type healthcare clustering, motivated by the needs of precision medicine.
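As a rough illustration of the chunk-wise distance computation with optional feature weighting, the sketch below assigns mixed-type records to a fixed set of medoids using a weighted Gower distance, holding only a chunk-by-k block of distances in memory at a time. It is a minimal sketch under stated assumptions, not the implementation evaluated here: the function names (`gower_distance`, `assign_in_chunks`), the toy features (age, smoker), and the chunk size are all hypothetical, and the Hungarian alignment and coreset refinement steps are not shown.

```python
from typing import Iterator, List, Optional, Sequence, Union

Value = Union[float, str]
Record = Sequence[Value]


def gower_distance(a: Record, b: Record, is_numeric: Sequence[bool],
                   ranges: Sequence[float],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Weighted Gower distance between two mixed-type records (hypothetical helper)."""
    if weights is None:
        weights = [1.0] * len(a)
    total = 0.0
    for x, y, num, rng, w in zip(a, b, is_numeric, ranges, weights):
        if num:
            # Numeric feature: absolute difference scaled by the feature range.
            d = abs(x - y) / rng if rng > 0 else 0.0
        else:
            # Categorical feature: simple mismatch indicator.
            d = 0.0 if x == y else 1.0
        total += w * d
    return total / sum(weights)


def chunks(rows: Sequence[Record], size: int) -> Iterator[Sequence[Record]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def assign_in_chunks(rows: Sequence[Record], medoids: Sequence[Record],
                     is_numeric: Sequence[bool], ranges: Sequence[float],
                     weights: Optional[Sequence[float]] = None,
                     chunk_size: int = 2) -> List[int]:
    """Label each record with its nearest medoid, one chunk at a time."""
    labels: List[int] = []
    for chunk in chunks(rows, chunk_size):
        # Only a (chunk_size x k) block of distances exists at any moment,
        # never the full n x n matrix.
        block = [[gower_distance(r, m, is_numeric, ranges, weights)
                  for m in medoids] for r in chunk]
        for dists in block:
            labels.append(min(range(len(medoids)), key=dists.__getitem__))
    return labels


# Toy data (assumed for illustration): (age, smoker).
rows = [(30.0, "no"), (32.0, "no"), (70.0, "yes"), (68.0, "yes")]
medoids = [(30.0, "no"), (70.0, "yes")]
is_numeric = [True, False]
ranges = [40.0, 1.0]  # age range across the toy data; unused for categoricals

labels = assign_in_chunks(rows, medoids, is_numeric, ranges)
```

Passing `weights=[1.0, 3.0]` would make the categorical feature count three times as much as age. This mirrors, in miniature, how up-weighting a feature such as ethnicity can change which records end up closest to which medoid.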