Generating Differentially Private Synthetic Data

Vietri, Giuseppe
2024-01-19
2024-01-19
2023-09
https://hdl.handle.net/11299/260149
University of Minnesota Ph.D. dissertation. September 2023. Major: Computer Science. Advisors: Zhiwei Steven Wu, Maria Gini. 1 computer file (PDF); vii, 197 pages.

This thesis explores the generation of tabular synthetic data (structured data organized into rows and columns) as a strategy to protect data privacy. Tabular datasets in domains such as healthcare and finance hold great potential for public benefit but often remain underutilized because of stringent privacy regulations. The primary focus of this research is therefore to introduce methods that produce tabular differentially private synthetic data (DPSD) that approximates the statistical properties of an underlying sensitive dataset, permitting aggregate data analysis on DPSD without revealing actual data. Differential privacy is a mathematical framework proposed by Dwork et al. (2006c) with provable privacy guarantees, which ensures individual data privacy while preserving the ability to conduct statistical analysis on the data. The DPSD approach has gained popularity because it is a versatile strategy for private data analysis. One advantage is that DPSD can be reused for various data analysis tasks without degrading the privacy guarantees of the source data. However, generating DPSD faces many challenges, including utility for complex data analysis tasks and scalability. This thesis focuses on DPSD mechanisms that use advanced optimization techniques to find a DPSD. These techniques include integer programs, differentiable optimization, and genetic algorithms. Each chapter examines the contributions and limitations of these techniques. Comprehensive empirical evaluations are also undertaken with high-dimensional datasets, notably the Adult and American Community Survey (ACS) datasets.
These evaluations shed light on the evolution of DPSD techniques in recent years.

Finally, Chapter 7 presents a novel method named Private-GSD, based on genetic algorithms, which sidesteps the limitations found in other methods. The empirical evaluations of Private-GSD use the ACS datasets, which are high-dimensional and contain both high-cardinality categorical and real-valued features. Generating DPSD for the ACS datasets is a challenging problem for most methods because of the extremely large search space and the mixed-type composition of the features. However, the results reveal that Private-GSD surpasses all other benchmark methods, preserving more statistical properties of the data than previously possible and scaling to higher-dimensional domains.

en
Differential privacy
Synthetic data
Thesis or Dissertation