Data collection provides information for a wide variety of academic, industrial, business, and government purposes. For example, data collection is necessary for sociological studies, market research, and censuses. To maximize the utility of collected data, all data could be amassed and made available for analysis without any privacy controls. Of course, most people and organizations (“privacy principals”) are unwilling to disclose all data, especially when data are easily exchanged and could be accessed by unauthorized persons. Privacy guarantees can improve the willingness of privacy principals to contribute their data, and can reduce fraud, identity theft, extortion, and other problems that arise from sharing data without adequate privacy protection.
A method for preserving privacy is to compute collective results of queries performed over collected data, and to disclose such collective results without disclosing the inputs of the participating privacy principals. For example, a medical database might be queried to determine how many people in the database are HIV positive. The total number of people who are HIV positive can be disclosed without disclosing the names of the individuals who are HIV positive. Useful data are thus extracted while ostensibly preserving the privacy of the principals to some extent.
However, adversaries might apply a variety of techniques to predict or narrow down the set of individuals from the medical database who are likely to be HIV positive. For example, an adversary might run another query that asks how many people both have HIV and are not named John Smith. The adversary may then subtract the second query output from the first, and thereby learn the HIV status of John Smith without ever directly asking the database for the name of any privacy principal. With sensitive data, it is useful to provide verifiable privacy guarantees. For example, it would be useful to verifiably guarantee that nothing more can be gleaned about any specific privacy principal than was known at the outset.
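The subtraction attack described above can be illustrated with a minimal sketch. The toy records, the function names, and the assumption that the target's name is unique in the database are all hypothetical and introduced only for illustration.

```python
# Hypothetical toy database of (name, hiv_positive) records.
records = [
    ("John Smith", True),
    ("Alice Jones", False),
    ("Bob Brown", True),
    ("Carol White", False),
]

def count_positive(rows):
    """Aggregate query: how many records are HIV positive."""
    return sum(1 for _, positive in rows if positive)

def count_positive_excluding(rows, name):
    """Aggregate query: how many records are HIV positive and not named `name`."""
    return sum(1 for n, positive in rows if positive and n != name)

def differencing_attack(rows, target):
    """Infer the target's status from two aggregate answers only.

    Assumes exactly one record carries the target's name.
    """
    first = count_positive(rows)
    second = count_positive_excluding(rows, target)
    # The two counts differ by 1 exactly when the target is HIV positive.
    return first - second == 1

john_status = differencing_attack(records, "John Smith")
alice_status = differencing_attack(records, "Alice Jones")
```

Neither query returns a name, yet the difference of the two exact counts reveals one individual's status.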
Adding noise to a query output can enhance the privacy of the principals. Using the example above, some random number might be added to the disclosed number of HIV positive principals. The noise will decrease the accuracy of the disclosed output, but the corresponding gain in privacy may warrant this loss.
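As a rough sketch of this trade-off, the following adds random noise to a count before disclosure. The text above does not specify a noise distribution; Laplace noise scaled by a privacy parameter `epsilon` is one common choice and is assumed here. The function names and parameter values are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential variables, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Disclose a count perturbed by Laplace noise (assumed mechanism).

    Adding or removing one record changes a count by at most 1, so the
    noise scale is taken as 1 / epsilon. Smaller epsilon means more noise.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_count = 2          # exact number of HIV-positive principals (toy value)

# Each disclosure is individually inaccurate; only the average over many
# hypothetical disclosures is close to the exact count.
releases = [noisy_count(true_count, epsilon=0.5, rng=rng) for _ in range(10000)]
mean_release = sum(releases) / len(releases)
```

Any single disclosed value may deviate noticeably from the exact count, which is the accuracy cost referred to above. In exchange, the difference between two noisy answers no longer pins down one individual's status.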
The concept of adding noise to a query result to preserve the privacy of the principals is generally known. One method uses differentially private classifiers for protecting the privacy of individual data instances using added noise. A classifier evaluated over a database is said to satisfy differential privacy if the probability of the classifier producing a particular output is almost the same regardless of the presence or absence of any individual data instance in the database.
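The phrase "almost the same" is commonly formalized as ε-differential privacy. The statement below is that standard formulation rather than one taken from the text above. The symbols are introduced here for illustration: M denotes the randomized procedure producing the classifier, D and D' are databases differing in a single data instance, S is any set of possible outputs, and ε is a small positive privacy parameter.

```latex
\Pr\bigl[\,M(D) \in S\,\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\,M(D') \in S\,\bigr]
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S .
```

Because the neighboring relation is symmetric, the inequality holds in both directions, so the two probabilities differ by at most a factor of e^ε. For small ε this factor is close to 1.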
However, conventional differentially private classifiers are determined locally for each database and fail to provide privacy when those classifiers must be used over multiple databases. Accordingly, there is a need to determine such a classifier for a set of databases in a manner that preserves the differential data privacy of each database.