One of the most noticeable technological trends is the emergence and proliferation of large-scale distributed databases. Public and private enterprises are collecting tremendous amounts of data on individuals, their activities, their preferences, their locations, spending habits, medical and financial histories, and so on. These enterprises include government organizations, health providers, financial institutions, Internet search engines, social networks, cloud service providers, and many others. Naturally, interested parties could potentially discern meaningful patterns and gain valuable insights if they were able to access and correlate the information across the databases.
For example, a researcher may want to determine the correlations between individual income and personal characteristics such as gender, race, age, and education, or a medical researcher may want to study the relationships between disease prevalence and individual environmental factors. In such applications, it is imperative to maintain the privacy of individuals, while ensuring that the useful aggregate statistical information is revealed only to the authorized parties. Indeed, unless the public is satisfied that their privacy is being preserved, they will not consent to the collection and use of their personal information. Additionally, the inherent distribution of this data across multiple databases presents a significant challenge, as privacy concerns and policy would likely prevent the direct sharing of data needed for centralized statistical analysis. Thus, tools must be developed for performing statistical analysis on large and distributed databases, while addressing these privacy and policy concerns.
It is known that conventional mechanisms for privacy, such as k-anonymization, do not provide adequate privacy. Specifically, an informed adversary can link an arbitrary amount of side information to an anonymized database and defeat the anonymization mechanism. In response to the vulnerabilities of simple anonymization mechanisms, a stricter notion of privacy, known as differential privacy, has been developed. Informally, differential privacy ensures that the result of a function computed on a database of respondents is almost insensitive to the presence or absence of a particular respondent. More formally, when the function is evaluated on two databases that differ in only one respondent, the probability of outputting the same result is almost unchanged.
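Stated as an inequality, and using symbols that do not appear in the text above (a randomized mechanism M, a privacy parameter ε ≥ 0, two databases D and D′ that differ in one respondent, and any set S of possible outputs), the standard ε-differential privacy condition reads:

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\epsilon}\,\Pr[\,M(D') \in S\,]
```

A small ε makes the two probabilities nearly equal, which is the sense in which the result is "almost unchanged".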
Mechanisms that provide differential privacy typically involve output perturbation. For example, when Laplacian noise is added to the result of a function computed on a database, the noise provides differential privacy to the individual respondents in the database. It can also be shown that input perturbation approaches, such as the randomized response mechanism, where noise is added to the data itself, provide differential privacy to the respondents.
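The two styles of perturbation can be contrasted in a minimal Python sketch. It is an illustration under stated assumptions, not a mechanism taken from this document: the function names, the counting query (sensitivity 1), the binary data, and the privacy parameter epsilon are all hypothetical choices made for the example.

```python
import math
import random


def laplace_noise(scale, rng):
    # A Laplace(0, scale) sample is the difference of two independent
    # exponential samples, each with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    # Output perturbation: compute the true count, then add noise.
    # Assumption: a counting query changes by at most 1 when one
    # respondent is added or removed, so the noise scale is 1 / epsilon.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def keep_probability(epsilon):
    # Probability of reporting the true bit in binary randomized response.
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bit, epsilon, rng):
    # Input perturbation: each respondent's bit is flipped before it
    # ever reaches the database.
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit


def estimate_fraction(reports, epsilon):
    # Undo the known flipping bias to estimate the true fraction of ones.
    p = keep_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
ages = [rng.randint(18, 90) for _ in range(1000)]
noisy_over_65 = private_count(ages, lambda a: a > 65, epsilon=0.5, rng=rng)

true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(40000)]
reports = [randomized_response(b, 1.0, rng) for b in true_bits]
estimated_fraction = estimate_fraction(reports, 1.0)
```

In the first case only the released count is noisy. In the second case every stored record is noisy, yet the aggregate fraction can still be estimated because the flipping probability is known.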
It is desired to protect the privacy of individual respondents in a database, to prevent unauthorized parties from computing joint or marginal empirical probability distributions of the data, and to achieve a superior tradeoff between privacy and utility compared to simply performing post randomization (PRAM) on the database.
Sampling can be used for crowd-blending privacy. This is a strictly relaxed version of differential privacy, but it is known that a pre-sampling step applied to a crowd-blending privacy mechanism can achieve a desired amount of differential privacy.
The related application Ser. No. 13/676,528 first independently randomizes data X and Y to obtain randomized data X̂ and Ŷ. This first randomization preserves the privacy of the data X and Y. Then, the randomized data X̂ and Ŷ are randomized a second time to obtain randomized data X̃ and Ỹ for a server, and helper information T_{X̃|X̂} and T_{Ỹ|Ŷ} for a client, where T represents an empirical distribution, and where the second randomization preserves the privacy of the aggregate statistics of the data X and Y. The server then determines the statistics T_{X̃,Ỹ}. Last, the client applies the helper information T_{X̃|X̂} and T_{Ỹ|Ŷ} to T_{X̃,Ỹ} to obtain an estimate Ṫ_{X,Y}, where "|" and "," in a subscript denote a conditional and a joint distribution, respectively.
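How helper information of this kind might be applied can be sketched in Python for the second randomization only. This is a rough illustration, not the procedure of the related application: it assumes binary data, independent bit flips with a hypothetical probability of 0.2, and a product approximation in which the server's joint statistics factor through the two conditional distributions. All function and variable names are invented for the example.

```python
import random


def flip(bits, p, rng):
    # Randomize by flipping each bit independently with probability p.
    return [b ^ 1 if rng.random() < p else b for b in bits]


def conditional_type(out_bits, in_bits):
    # Empirical conditional distribution: entry [o][i] is the fraction of
    # positions with input value i whose output value is o.
    counts = [[0, 0], [0, 0]]
    for o, i in zip(out_bits, in_bits):
        counts[o][i] += 1
    col = [counts[0][i] + counts[1][i] for i in (0, 1)]
    return [[counts[o][i] / col[i] for i in (0, 1)] for o in (0, 1)]


def joint_type(xs, ys):
    # Empirical joint distribution of two binary sequences.
    counts = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    return [[c / len(xs) for c in row] for row in counts]


def inv2(m):
    # Inverse of a 2x2 matrix.
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in (0, 1)) for j in (0, 1)]
            for i in (0, 1)]


def transpose(m):
    return [[m[j][i] for j in (0, 1)] for i in (0, 1)]


def estimate_joint(server_joint, helper_x, helper_y):
    # Assumed model: server_joint ~= helper_x * J * helper_y^T, so the
    # client recovers J ~= helper_x^-1 * server_joint * (helper_y^-1)^T.
    return matmul(matmul(inv2(helper_x), server_joint),
                  transpose(inv2(helper_y)))


rng = random.Random(0)
n = 50000
x_hat = [rng.randint(0, 1) for _ in range(n)]
y_hat = [x if rng.random() < 0.8 else 1 - x for x in x_hat]  # correlated pair

x_tilde = flip(x_hat, 0.2, rng)  # data sent to the server
y_tilde = flip(y_hat, 0.2, rng)
helper_x = conditional_type(x_tilde, x_hat)  # helper information for the client
helper_y = conditional_type(y_tilde, y_hat)

server_joint = joint_type(x_tilde, y_tilde)
estimate = estimate_joint(server_joint, helper_x, helper_y)
reference = joint_type(x_hat, y_hat)
```

Under these assumptions the server sees only the twice-randomized data, while a client holding the two conditional distributions can approximately recover the joint distribution of the once-randomized data. Undoing the first randomization as well would need its parameters, which this sketch does not model.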