Big Data
It is estimated that 2.5 quintillion (10¹⁸) bytes of data are created each day. This means that 90% of all the data in the world today has been created in the last two years. These “big” data come from everywhere: social media, pictures and videos, financial transactions, telephones, governments, medical, academic, and financial institutions, and private companies. Needless to say, the data are highly distributed in what has become known as the “cloud.”
There is a need to statistically analyze these data. For many applications, the data are private and require the analysis to be secure. As used herein, secure means that privacy of the data is preserved, such as the identity of the sources of the data, and the detailed content of the raw data. Randomized response is one prior art way to do this. Randomized response does not unambiguously reveal the response of a particular respondent, but aggregate statistical measures, such as the mean or variance, can still be determined.
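The randomized response technique described above can be sketched as follows. This is a minimal illustration, not the method of any cited application; the function names, the reporting probability p, and the linear de-biasing step are assumptions chosen for clarity.

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true answer with probability p, otherwise report its negation.
    No single response unambiguously reveals the respondent's true answer."""
    return truth if random.random() < p else not truth

def estimate_mean(responses, p: float = 0.75) -> float:
    """Recover the aggregate proportion of 'yes' answers from noisy responses.
    E[observed] = p * t + (1 - p) * (1 - t), where t is the true proportion,
    so t can be recovered by inverting this linear relationship."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - p)) / (2 * p - 1)
```

With enough respondents, the de-biased estimate converges to the true proportion even though each individual response is deniable.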
Differential privacy (DP) is another way to preserve privacy by using a randomizing function, such as Laplacian noise. Informally, differential privacy means that the result of a function determined on a database of respondents is almost insensitive to the presence or absence of a particular respondent. Formally, if the function is evaluated on adjacent databases differing in only one respondent, then the probability of outputting the same result is almost unchanged.
Conventional mechanisms for privacy, such as k-anonymization, are not differentially private, because an adversary can link an arbitrary amount of helper (side) information to the anonymized data to defeat the anonymization.
Other mechanisms used to provide differential privacy typically involve output perturbation, e.g., noise is added to a function of the data. Nevertheless, it can be shown that the randomized response mechanism, where noise is added to the data itself, provides DP.
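The output perturbation described above can be illustrated with the Laplace mechanism applied to a mean query. This is a hedged sketch, not an implementation from the cited applications; the clamping range, sensitivity calculation, and inverse-CDF sampler are standard textbook choices assumed here for illustration.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise by inverse-CDF sampling."""
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_mean(values, epsilon: float, lower: float, upper: float) -> float:
    """Differentially private mean via output perturbation.
    Each value is clamped to [lower, upper], so changing one respondent
    moves the mean by at most (upper - lower) / n (the sensitivity).
    Adding Laplace noise of scale sensitivity / epsilon yields
    epsilon-differential privacy for the released mean."""
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clamped)
    true_mean = sum(clamped) / len(clamped)
    return true_mean + laplace_noise(sensitivity / epsilon)
```

Note the contrast with randomized response: here the noise is added once to the query output, rather than to each respondent's data.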
Unfortunately, while DP provides a rigorous and worst-case characterization of the privacy of the respondents, it is not sufficient to preserve the privacy of an empirical probability distribution or “type” of the data. In particular, if an adversary has accessed anonymized adjacent databases, then the DP mechanism ensures that the adversary cannot de-anonymize any respondent. However, by construction, possessing an anonymized database reveals the distribution of the data.
Therefore, there is a need to preserve privacy of the respondents, while also protecting an empirical probability distribution from adversaries.
In U.S. application Ser. No. 13/032,521, Applicants disclose a method for processing data by an untrusted third party server. The server can determine aggregate statistics on the data, and a client can retrieve the outsourced data exactly. In the process, individual entries in the database are not revealed to the server because the data are encoded. The method uses a combination of error correcting codes and randomized response, which enables the responses to be processed while maintaining their confidentiality.
In U.S. application Ser. No. 13/032,552, Applicants disclose a method for processing data securely by an untrusted third party. The method uses a cryptographically secure pseudorandom number generator that enables client data to be outsourced to an untrusted server to produce results. The results can include exact aggregate statistics on the data, and an audit report on the data. In both cases, the server processes modified data to produce exact results, while the underlying data and results are not revealed to the server.