In recent years, there has been an abundance of rich and fine-grained data about individuals in domains such as healthcare, finance, retail, web search, and social networks. It is desirable for data collectors to enable third parties to perform complex data mining applications over such data. However, privacy is an obstacle that arises when sharing data about individuals with third parties, since the data about each individual may contain private and sensitive information.
One solution to the privacy problem is to add noise to the data. The addition of the noise may prevent a malicious third party from determining the identity of a user whose personal information is part of the data or from establishing with certainty any previously unknown attributes of a given user. However, while such methods are effective in providing privacy protection, they may overly distort the data, reducing the value of the data to third parties for data mining applications.
A system is said to provide differential privacy if the presence or absence of a particular record or value cannot be determined based on an output of the system, or can only be determined with a very low probability. For example, in the case of medical data, a system may be provided that outputs answers to queries supplied such as the number of users with diabetes. While the output of such a system may be anonymous in that it does not reveal the identity of the patients associated with the data, a curious user may attempt to make inferences about the presence or absence of patients by varying the queries made to the system and observing the changes in output. For example, a user may have preexisting knowledge about a rare condition associated with a patient and may infer other information about the patient by restricting queries to users having the condition. Such a system may not provide differential privacy because the presence or absence of a patient in the medical data (i.e., a record) may be inferred from the answers returned to the queries (i.e., output).
Typically, systems provide differential privacy (for protecting the privacy of user data stored in a database) by introducing some amount of error or noise to the data or to the results of operations or queries performed on the data to hide specific information of any individual user. For example, noise may be added to each query using a distribution such as a Laplacian distribution. At the same time, one would like the noise to be as small as possible so that the answers are still meaningful. Existing methods may add more error or noise than is necessary or optimal to provide differential privacy protection (i.e., ensuring the privacy goal be met).