Introduction

 

     є- Differential privacy (є-DP for short) [1][2] is emerging as a promising model for protecting sensitive data which needs to be realsed. It provides strong guarantee that any adversary is unable to identify any personnel included in the database, thus preventing the leakage on the sensitive attributes of such personnel.
      For example, assume we have a database of medical records from a specific hospital, where  'Y' or 'N' denotes whether a person has diabetes or not. If the hospital intends to directly release a statistical report on the age distribution of diabetic patients in the database (see the histogram below), the privacies of individuals may be leaked. For example, if someone (often termed as an adversary) happens to know Alice is the only person who is between 20 and 30 years old in the hospital, he can confirm that Alice has dibetes.
     To enable the release of sensitve data and protect the privacy of individules at the same time, Dwork et al. proposed є-differential privacy model[2] which requires the data to be published to be randomly perturbed in advance. The perturbation must ensure that the noisy output is not very sensitive to any particular record in the database. Intuitively speaking, a large parameter є indicates a high level of data confidentiality and thus great scale of noises are added to the statistical data.

                         

     Recently, researchers have designed several differentially private approaches which ensure the є-differential privacy and meanwhile improve the accuracy of the realsed data [1][3][4]. However, the existing studies on є-DP mostly focus on simple aggregation queries such as counts. Our work investigates the publication of DP-compliant histograms. Compared to simple aggregations whose results are purely numerical values, a histogram query is inherently more complex, since it must also determine its structure, i.e., the ranges of the bins. In our work, we observe that a DP-compliant histogram with finer bins may actually lead to significantly lower accuracy than a coarser one, since the former requires stronger perturbations in order to satisfy DP. Moreover, the histogram structure itself may reveal sensitive information about the underlying data, which further complicates the problem.
     Motivated by this, we propose two novel algorithms, namely StructureFirst and NoiseFirst for computing high-accuracy DP histograms. The former decides the histogram structure before injecting noises required by DP, while the latter reverses the order of these steps. In particular, both NoiseFirst and StructureFirst rely on the commonly used and well-studied histogram technique on data compressionand summarization[5]. While traditional histogram only intends to merge neighboring similar counts to save storage space, we show that it is also helpful to reduce the sensitivities of the synopsis with respect to privacy attacks. This advantage leads to effective cutdown on the random noises for privacy protection purpose and improves the accuracies of queries based on the histogram of published data. Despite of the benefits from the histogram synopsis, the construction of optimal histogram raises new challenges, because of the potential privacy information exposed with the public structure of the histogram. To avoid any privacy leakage, we propose some new strategies to protect the structural information of the histograms. To meet the requirements from more complicated queries, we also present how to enhance the effectiveness of histogram synopsis for range query processing. Extensive experiments, using several real data sets, confirm that the proposed methods output highly accurate query answers, and consistently outperform existing competitors[1][3][4].    
     In the following Figures, green lines show the original distributions of people's ages which is derived from the census records of individuals in Brazil[6]. Blue lines and red lines exhibit the perturbed distributions produced by PreDP and PostDP separately. A larger є implies bigger noises. Based on the observations of the following figures, NoiseFirst produces smaller deviation from the original distribution and thus performs outstanding for unit-lengh count queries (e.g., querying the number of people whose ages are within[20, 25]). StructureFirst , on the other hand, has advantages with most settings of the query lengths. The NoiseFirst, to the best of our knowledge, is the 1st approach that can outperform Dwork et al.'s method with respect to the unit-length queries. This provides us a possibility to make a better visualization of the statistic data in a user interface. On the other hand, a query executer can provide the users more precise results by querying the DP-complaint histogram published by the StructureFirst approach. Thus, our NoiseFirst and StructureFirst approaches is not exclusive with each other. Particularly, they can be simultaneously embedded into a mainstream DBMS to better serve the users in different modules.

      

Codes & data

 
Please refer to  [C++ Implementation]

References

 
[1] C. Dwork, F. McSherry, K. Nissim and A. Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography Conference (TCC), Springer, 2006.
[2]  Wikipedia: http://en.wikipedia.org/wiki/Differential_privacy
[3]  M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. PVLDB, 3(1):1021-1032.
[4]  X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. In ICDE, 2010.
[5]  H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, 1998.
[6]  Minnesota Population Center, "Integrated public use microdata series-international: Version 5.0",2009.

Contact

 

If you encounter any problem with the implementation of our algorithms, please feel free to contact (xujia AT ise.neu.edu.cn).