Workshop: Synthetic Data for Empowering ML Research

PRISIM: Privacy Preserving Synthetic Data Simulator

Subhrajit Samanta · Shantanu Chandra · PKS Prakash · Srinivas Chilukuri · Srinivas Alva

Fri 2 Dec 7:54 a.m. PST — 7:56 a.m. PST

Abstract: Data sharing in collaborative environments is instrumental in propelling innovation; however, sharing data carries the risk of leaking sensitive information, which poses a serious privacy threat. Analytical utility is another key factor, since shared data must remain usable. This research therefore focuses on assessing and preserving both privacy and utility in centralized tabular data, one of the most common types of data used across industries (e.g., HR, CRM, healthcare). State-of-the-art (SOTA) centralized privacy-preservation techniques, such as statistical anonymization (using generalization, binning, suppression, etc.) and differential privacy (DP) methods, focus heavily on data privacy and largely ignore analytical utility. Hence, in this paper we propose a novel synthetic data generation-based approach with a statistical distance-based privacy-preserving mechanism (the framework is referred to as PRISIM) to produce analytically useful private synthetic data. A new distance metric is also proposed, combining the Jaccard similarity index (JSI) and Mahalanobis distance (MD), to simulate a re-identification attack on mixed-type data. PRISIM is validated on five open-source data sets and compared against SOTA differentially private GANs; on average, we observe $>20\%$ higher retention of utility while maintaining a similar level of privacy.
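
The abstract mentions a distance that combines the Jaccard similarity index with the Mahalanobis distance to simulate a re-identification attack on mixed-type data, but it does not say how the two are combined. The following is only a minimal sketch of one plausible reading, not the authors' method: Mahalanobis distance is applied to numeric columns, Jaccard distance to categorical columns, and the two are blended with a weighted sum. The function names, the `weight` parameter, the `md / (1 + md)` rescaling, and the nearest-record attack are all assumptions made for illustration.

```python
import numpy as np

def jaccard_distance(a, b):
    """Jaccard distance (1 - JSI) between two categorical records,
    each treated as a set of (column index, value) pairs."""
    set_a, set_b = set(enumerate(a)), set(enumerate(b))
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

def mahalanobis_distance(x, y, inv_cov):
    """Mahalanobis distance between two numeric records."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ inv_cov @ diff))

def mixed_distance(num_x, cat_x, num_y, cat_y, inv_cov, weight=0.5):
    """ASSUMPTION: weighted sum of a rescaled Mahalanobis distance and
    the Jaccard distance. The source does not specify the combination."""
    md = mahalanobis_distance(num_x, num_y, inv_cov)
    md_scaled = md / (1.0 + md)  # assumed rescaling into [0, 1)
    jd = jaccard_distance(cat_x, cat_y)
    return weight * md_scaled + (1.0 - weight) * jd

def nearest_real_record(syn_num, syn_cat, real_num, real_cat, weight=0.5):
    """Hypothetical attack step: find the real record closest to one
    synthetic record. A very small distance suggests the synthetic
    record may expose a real individual."""
    cov = np.atleast_2d(np.cov(real_num, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular covariance
    dists = [
        mixed_distance(syn_num, syn_cat, rn, rc, inv_cov, weight)
        for rn, rc in zip(real_num, real_cat)
    ]
    idx = int(np.argmin(dists))
    return idx, dists[idx]
```

Under these assumptions, a synthetic record that exactly copies a real one has distance 0 to it, so a privacy filter could reject synthetic records whose nearest real record falls below a chosen threshold.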
