Generating Differentially Private Synthetic Data

Persistent link to this item

https://hdl.handle.net/11299/260149

Title

Generating Differentially Private Synthetic Data

Published Date

2023-09

Type

Thesis or Dissertation

Abstract

This thesis explores the generation of tabular synthetic data (structured data organized into rows and columns) as a strategy to protect data privacy. Tabular datasets in domains like healthcare and finance hold great potential for public benefit but often remain underutilized due to stringent privacy regulations. The primary focus of this research is therefore to introduce methods that produce tabular differentially private synthetic data (DPSD) that approximates the statistical properties of an underlying sensitive dataset, permitting aggregate data analysis on the DPSD without revealing the actual data. Differential privacy, a mathematical framework proposed by Dwork et al. (2006c), offers provable privacy guarantees: it ensures individual data privacy while preserving the ability to conduct statistical analysis on the data. The DPSD approach has gained popularity as a versatile strategy for private data analysis. One advantage is that DPSD can be reused for various data analysis tasks without degrading the privacy guarantees of the source data. However, generating DPSD faces many challenges, including utility for complex data analysis tasks and scalability. This thesis focuses on DPSD mechanisms that use advanced optimization techniques to find a DPSD. These techniques include integer programs, differentiable optimization, and genetic algorithms. Each chapter examines the contributions and limitations of these techniques. Comprehensive empirical evaluations are also undertaken with high-dimensional datasets, notably the Adult and American Community Survey (ACS) datasets. These evaluations shed light on the evolution of DPSD techniques in recent years. Finally, chapter 7 presents a novel method named Private-GSD, based on genetic algorithms, which sidesteps the limitations found in other methods.
The empirical evaluations of Private-GSD use the ACS datasets, which are high-dimensional and contain both high-cardinality categorical and real-valued features. Generating DPSD for the ACS datasets is challenging for most methods because of the extremely large search space and the mixed-type composition of the features. However, the results reveal that Private-GSD surpasses all other benchmark methods both in its ability to preserve more statistical properties of the data than previously possible and in its ability to scale to higher-dimensional domains.

Description

University of Minnesota Ph.D. dissertation. September 2023. Major: Computer Science. Advisors: Zhiwei Steven Wu, Maria Gini. 1 computer file (PDF); vii, 197 pages.

Suggested citation

Vietri, Giuseppe. (2023). Generating Differentially Private Synthetic Data. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/260149.