Generating Differentially Private Synthetic Data
2023-09
Authors
Vietri, Giuseppe
Type
Thesis or Dissertation
Abstract
This thesis explores the generation of tabular synthetic data (structured data organized into rows and columns) as a strategy to protect data privacy. Tabular datasets in domains such as healthcare and finance hold great potential for public benefit but often remain underutilized because of stringent privacy regulations. The primary focus of this research is therefore to introduce methods that produce tabular differentially private synthetic data (DPSD) approximating the statistical properties of an underlying sensitive dataset, permitting aggregate data analysis on the DPSD without revealing the actual data. Differential privacy, a mathematical framework proposed by Dwork et al. (2006c), provides provable privacy guarantees: it protects the privacy of individuals while preserving the ability to conduct statistical analysis on the data. The DPSD approach has gained popularity as a versatile strategy for private data analysis. One advantage is that DPSD can be reused for various data analysis tasks without degrading the privacy guarantees of the source data. However, generating DPSD faces many challenges, including utility for complex data analysis tasks and scalability. This thesis focuses on DPSD mechanisms that use advanced optimization techniques to find a DPSD. These techniques encompass integer programs, differentiable optimization, and genetic algorithms. Each chapter delves into the contributions and limitations of these techniques. Comprehensive empirical evaluations are also undertaken with high-dimensional datasets, notably the Adult and American Community Survey (ACS) datasets. These evaluations shed light on the evolution of DPSD techniques in recent years. Finally, in chapter 7, a novel method named Private-GSD, based on genetic algorithms, is presented, which sidesteps the limitations found in other methods.
The empirical evaluations of Private-GSD use the ACS datasets, which are high-dimensional and contain both high-cardinality categorical and real-valued features. Generating DPSD for the ACS datasets is challenging for most methods because of the extremely large search space and the mixed-type composition of features. Nevertheless, the results reveal that Private-GSD surpasses all other benchmark methods both in preserving more statistical properties of the data than previously possible and in scaling to higher-dimensional domains.
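To make the idea of DPSD concrete, the following is a minimal illustrative sketch and not a method from the thesis. It assumes a single categorical column, add/remove-one-record neighboring datasets (so each histogram count has sensitivity 1), and the standard Laplace mechanism. The function names dp_histogram and sample_synthetic are hypothetical. The synthetic records are drawn only from the noisy histogram, so by the post-processing property they can be reused for any number of analyses without further privacy cost.

```python
import random
from collections import Counter


def dp_histogram(records, categories, epsilon, rng):
    """Release a noisy histogram of one categorical column.

    Assumption: neighboring datasets differ by adding or removing one
    record, so the histogram has L1 sensitivity 1 and Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy.
    """
    counts = Counter(records)
    noisy = {}
    for c in categories:
        # The difference of two Exp(rate=epsilon) draws is Laplace(0, 1/epsilon).
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy[c] = counts.get(c, 0) + noise
    return noisy


def sample_synthetic(noisy_counts, n, rng):
    """Draw n synthetic records using only the noisy counts (post-processing)."""
    cats = list(noisy_counts)
    # Clip negative noisy counts to zero before using them as sampling weights.
    weights = [max(noisy_counts[c], 0.0) for c in cats]
    if sum(weights) <= 0:
        # Fall back to a uniform distribution if all counts were clipped.
        weights = [1.0] * len(cats)
    return rng.choices(cats, weights=weights, k=n)


if __name__ == "__main__":
    rng = random.Random(0)
    categories = ["A", "B", "C"]
    sensitive = ["A"] * 600 + ["B"] * 300 + ["C"] * 100  # toy sensitive column
    noisy = dp_histogram(sensitive, categories, epsilon=1.0, rng=rng)
    synthetic = sample_synthetic(noisy, n=1000, rng=rng)
    print(Counter(synthetic))
```

This toy preserves only one one-way marginal. The mechanisms studied in the thesis target many statistical properties of high-dimensional, mixed-type data, which is where optimization techniques such as genetic algorithms come in.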
Description
University of Minnesota Ph.D. dissertation. September 2023. Major: Computer Science. Advisors: Zhiwei Steven Wu, Maria Gini. 1 computer file (PDF); vii, 197 pages.
Suggested citation
Vietri, Giuseppe. (2023). Generating Differentially Private Synthetic Data. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/260149.