Differentially Private Data Generation with Missing Data

This repository contains the codebase and the full paper for Differentially Private Data Generation with Missing Data. All baselines except Kamino can be run through the config.py file. Kamino lives in a sub-repository inside this codebase and needs to be run from its own folder.

Setup

We use python==3.7 and torch>=1.7. The environment can be built with conda by running conda env create -f=environment.yml and then activated with conda activate missing_syn.

Generate Synthetic Datasets

We now describe how to run the different baselines from the paper on the Adult dataset.

1.) Prepare PrivBayes: First, build PrivBayes using cd PrivBayes && make clean && make. This compiles the C++ files for PrivBayes. Note that some of the build settings might need to be changed for your system.

2.) Use config:

config.py can be used to run different baselines.

params = {
    'orig_data_loc': './datasets/Original', #the location to find original datasets
    'dataset': 'adult', #ground truth dataset for synthetic data generation
    'epsilon': `, #Total privacy budget
    'runs': 1, #To repeat the experiment for multiple runs
    'missing_p': 0.2, #Fraction of values made missing (0.2 = 20%)
    'missing_type': 'MCAR', #Missing mechanism to choose from [MCAR, MAR, MNAR]
    'baselines' : ['misgan', 'DPautoGAN', 'PrivBayes2'], #Choose from ['misgan', 'DPautoGAN', 'DPCTGAN', 'PrivBayes', 'PrivBayes2']; PrivBayesE is called 'PrivBayes2'
    'bin_size' : 10 #Number of bins for continuous attributes
    }

The missing datasets will automatically be created and stored in the respective folder in datasets/.
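To make the missing_p and missing_type settings concrete, here is a minimal sketch of what an MCAR (missing completely at random) mask does: every cell is hidden independently with probability missing_p, regardless of its value. The helper apply_mcar and the toy rows are hypothetical and written only for illustration; the repository's own logic in missing_mechanisms.py may differ.

```python
import random
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def apply_mcar(rows: List[Row], missing_p: float, seed: int = 0) -> List[Row]:
    """Hypothetical helper: hide each cell independently with probability missing_p."""
    if not 0.0 <= missing_p < 1.0:
        raise ValueError("missing_p must be in [0, 1)")
    rng = random.Random(seed)
    masked = []
    for row in rows:
        # The decision to hide a cell ignores the cell's value and all other
        # cells, which is what makes the mechanism "completely at random".
        masked.append(
            {key: (None if rng.random() < missing_p else value)
             for key, value in row.items()}
        )
    return masked


# Toy data standing in for two numeric columns (not the real Adult dataset).
toy = [{"age": 20 + i % 50, "hours_per_week": 10 + i % 40} for i in range(500)]
masked = apply_mcar(toy, missing_p=0.2, seed=42)

n_cells = sum(len(row) for row in masked)
n_missing = sum(value is None for row in masked for value in row.values())
missing_rate = n_missing / n_cells
print(f"fraction of missing cells: {missing_rate:.2f}")
```

MAR and MNAR differ in that the probability of hiding a cell depends on other observed values (MAR) or on the hidden value itself (MNAR), so they cannot be expressed as a single value-independent coin flip like this.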

3.) Generate Dataset: Finally, generate a synthetic dataset by running python main.py. This script trains the baselines selected in config.py and automatically evaluates the metrics by running evaluation.py.
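Because a training run can take a long time, it may help to sanity check the params dictionary before launching main.py. The sketch below is not part of the repository: validate_params is a hypothetical helper, the allowed values are copied from the comments in config.py above, and the epsilon value in the example is a placeholder chosen only for illustration.

```python
from typing import Any, Dict, List

# Allowed values as listed in the config.py comments.
ALLOWED_MISSING_TYPES = {"MCAR", "MAR", "MNAR"}
ALLOWED_BASELINES = {"misgan", "DPautoGAN", "DPCTGAN", "PrivBayes", "PrivBayes2"}


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_params(params: Dict[str, Any]) -> List[str]:
    """Hypothetical check: return a list of problems (empty if none found)."""
    problems = []
    if not _is_number(params.get("epsilon")) or params["epsilon"] <= 0:
        problems.append("epsilon must be a positive number")
    if not _is_number(params.get("missing_p")) or not 0 <= params["missing_p"] < 1:
        problems.append("missing_p must be a number in [0, 1)")
    if params.get("missing_type") not in ALLOWED_MISSING_TYPES:
        problems.append("missing_type must be one of MCAR, MAR, MNAR")
    baselines = params.get("baselines")
    if not isinstance(baselines, list) or not baselines:
        problems.append("baselines must be a non-empty list")
    else:
        unknown = sorted(set(baselines) - ALLOWED_BASELINES)
        if unknown:
            problems.append(f"unknown baselines: {unknown}")
    for key in ("runs", "bin_size"):
        value = params.get(key)
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            problems.append(f"{key} must be a positive integer")
    return problems


example = {
    "orig_data_loc": "./datasets/Original",
    "dataset": "adult",
    "epsilon": 1.0,  # placeholder value for illustration only
    "runs": 1,
    "missing_p": 0.2,
    "missing_type": "MCAR",
    "baselines": ["misgan", "DPautoGAN", "PrivBayes2"],
    "bin_size": 10,
}
for problem in validate_params(example):
    print("config problem:", problem)
```

Returning a list of messages instead of raising on the first error lets all configuration mistakes be reported in one pass.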

Amplified privacy

The amplified privacy cost to the ground truth data can be calculated for PrivBayesE. First, the marginals from PrivBayesE need to be extracted and put in the ./PrivBayesE_marginals folder. Some examples are already included. The amplified cost can then be calculated by running python opt_amp_cost.py after setting, inside opt_amp_cost.py, which marginals need to be optimized.
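For intuition only, the general idea of privacy amplification can be illustrated with the standard subsampling bound: if each record is included independently with probability q before an epsilon-DP mechanism runs, the overall guarantee is ln(1 + q(e^epsilon - 1)). This is a textbook bound and not necessarily the quantity that opt_amp_cost.py optimizes; treating 1 - missing_p as the sampling rate is an assumption made here purely for illustration, and amplified_epsilon is a hypothetical helper.

```python
import math


def amplified_epsilon(epsilon: float, sampling_rate: float) -> float:
    """Standard amplification-by-subsampling bound: ln(1 + q * (e^eps - 1))."""
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in [0, 1]")
    # log1p and expm1 keep the computation accurate for small epsilon.
    return math.log1p(sampling_rate * math.expm1(epsilon))


epsilon = 1.0    # placeholder privacy budget, for illustration only
missing_p = 0.2  # same value as in the config example above
eps_amp = amplified_epsilon(epsilon, sampling_rate=1.0 - missing_p)
print(f"nominal epsilon: {epsilon:.3f}, amplified epsilon: {eps_amp:.3f}")
```

The bound never exceeds the nominal epsilon and equals it when nothing is dropped (sampling rate 1), which matches the intuition that hidden values leak less about the ground truth.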


mshubhankar/DP-DataGeneration-MissingData
