@InProceedings{Vu:CiCLing:2019b,
author = {Xuan-Son Vu and Son N. Tran and Lili Jiang},
title = {dpUGC: Learn Differentially Private Representation for User Generated Contents},
booktitle = {Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2019)},
month = {April},
year = {2019},
location = {La Rochelle, France}
}
- To train a differentially private embedding on new data (see the sketch after the commands below):
cd codes/
./10.run_train_dp_embedding.sh
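Differentially private embedding training typically builds on the DP-SGD recipe: clip each per-example gradient and add Gaussian noise before the update. The following is a minimal NumPy sketch of one common way such an update step can be written; the function name dp_noisy_update, the parameter values, and the toy shapes are illustrative assumptions, not the repository's actual training code in 10.run_train_dp_embedding.sh.

import numpy as np

def dp_noisy_update(embedding, per_example_grads, lr=0.1,
                    clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One differentially private gradient step on an embedding matrix.

    per_example_grads: array of shape (batch, vocab, dim), one gradient per example.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = per_example_grads.shape[0]

    # 1) Clip each per-example gradient to bound its L2 sensitivity.
    clipped = [g / max(1.0, np.linalg.norm(g) / clip_norm)
               for g in per_example_grads]

    # 2) Sum, add Gaussian noise scaled by the clipping bound, then average.
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=embedding.shape)
    noisy_mean = noisy_sum / batch

    # 3) Standard gradient-descent step with the privatized gradient.
    return embedding - lr * noisy_mean

# Toy usage with random numbers (purely illustrative).
vocab, dim, batch = 50, 16, 8
emb = np.random.default_rng(0).normal(size=(vocab, dim))
grads = np.random.default_rng(1).normal(size=(batch, vocab, dim))
emb = dp_noisy_update(emb, grads)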
- Create a Python environment using virtualenv or Anaconda, then run this command in that environment:
pip install -r requirements.txt
- How to run the evaluation of changes in semantic spaces:
cd codes/
./01.run_changes_in_semantic_spaces.sh
- Expected outputs: the evaluation results are printed to the console (see codes/images/results_fig2.png); they should be similar to Figure 2 in the paper.
- Note: training the embedding model takes time, so we have already extracted the top-similarity words needed for this evaluation. If needed, users can retrain from the text8 corpus, select the top similar words again, and rerun the evaluation (a sketch of the similar-word extraction is given below).
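If the top similar words are re-extracted from a retrained embedding, the selection boils down to a nearest-neighbour search by cosine similarity. The snippet below is a small illustrative sketch with a hypothetical toy vocabulary and random vectors standing in for an embedding trained on text8; it is not the repository's extraction code.

import numpy as np

def top_similar(word, vocab, vectors, k=5):
    """Return the k nearest words to `word` by cosine similarity."""
    idx = vocab.index(word)
    # Normalize rows so the dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    order = np.argsort(-sims)
    return [(vocab[j], float(sims[j])) for j in order if j != idx][:k]

# Toy usage (purely illustrative).
vocab = ["king", "queen", "man", "woman", "car"]
vectors = np.random.default_rng(0).normal(size=(len(vocab), 16))
print(top_similar("king", vocab, vectors, k=3))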
- How to run the regression-task evaluation:
cd codes/
./02.run_evaluation_regression_task.sh
- Expected outputs: printed to the console (see codes/images/results_table3.png); the evaluation results should be similar to Table 3 in the paper. Note that the privacy-budget column is reported after training the embedding with differential privacy, i.e., a corresponding privacy budget (\delta, \epsilon) is calculated at each checkpoint (an illustrative calculation is sketched below).
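To illustrate how a (\delta, \epsilon) budget can be tracked across checkpoints, the sketch below uses the classical Gaussian-mechanism bound together with basic (linear) composition. The function names and parameter values are illustrative assumptions, and this bound is deliberately loose; the values printed by the script come from the repository's own accounting, which is tighter.

import math

def epsilon_per_step(noise_multiplier, delta):
    """(epsilon, delta)-DP of one Gaussian-mechanism step with
    sigma = noise_multiplier * sensitivity (valid for epsilon < 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) / noise_multiplier

def budget_at_checkpoint(steps, noise_multiplier, delta_per_step):
    """Basic composition over `steps` noisy updates: epsilons and deltas add up."""
    eps_step = epsilon_per_step(noise_multiplier, delta_per_step)
    return steps * eps_step, steps * delta_per_step

# Example: loose budget after 100 noisy updates with sigma = 8 * sensitivity.
eps, delta = budget_at_checkpoint(steps=100, noise_multiplier=8.0, delta_per_step=1e-8)
print(f"epsilon ~= {eps:.2f}, delta ~= {delta:.1e}")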