Data privacy has become a growing concern as information spreads rapidly through social networks such as Facebook, Blogger, and Twitter, through powerful search engines such as Google and Bing, and through increasingly sophisticated data mining algorithms. The ability of these systems to track the actions of individuals and reveal individual identities is reminiscent of the omnipresence of the Big Brother character in George Orwell's novel “1984,” and it gives big corporations an unassailable advantage from knowing so much about individuals. As a consequence, minor lapses can wind up affecting people's lives and careers. On the other hand, the strong demand for data mining and knowledge discovery in databases finds market relevance in contexts ranging from marketing analytics to retail merchandising, and yields valuable statistical information for resource-conscious decision-making, thereby benefiting society at large. Reconciling privacy and confidentiality with accurate statistics therefore presents special challenges. A central theme of privacy-preserving data mining is the interplay between privacy and data utility.
There have been numerous attempts to preserve the privacy of data about individuals in statistical databases that may be publicly accessible (e.g., over the Internet). A frequent objective of protecting data against “data mining” is to hide private information about individuals while still providing information that is statistically accurate about a group. This problem has become extremely important and has been discussed in the database and cryptography communities, mainly for two reasons. One reason is the widespread proliferation and accessibility of individual information in statistical databases that may be produced by government organizations or corporations, and the other is the increasing sophistication of data-mining algorithms. While the objectives of the processes introduced herein address the privacy issues associated with a statistical database, the underlying techniques are quite different when applied in networks such as the Internet, due to the need for a method that combines data compression with privacy protection.
In general, research on privacy-preserving methods has been grouped into anonymity, data swapping, and data perturbation, depending on how privacy is defined. Anonymity refers to replacing a true identifier in a database with an anonymous identity. For example, if the name of an employee is replaced with a quasi-identifier of compound attributes, it would be difficult at a later time to associate the employee's salary with the employee's true identity. Data swapping refers to a disclosure-limitation process for databases containing categorical variables. The values of sensitive variables are swapped among records in such a way that t-order frequency counts (i.e., entries in a t-way marginal table) are preserved. Data perturbation refers to techniques for adding noise (e.g., white noise) to each original entry in a database table. Cell suppression methods provide limited statistical content. Controlled rounding methods modify existing statistical correlations and thus alter proper statistical inference. The release of partial information ensures confidentiality by restricting the level of detail at which data are released, which often allows for proper statistical inference.
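As a rough illustration of two of the strategies just described, the following minimal Python sketch swaps a sensitive categorical value among records and adds zero-mean Gaussian noise to a numeric column. The function names (swap_sensitive, perturb), the field names, and the toy table are all hypothetical and chosen only for illustration. The swap shown is an unconstrained permutation, which preserves only the one-way frequency counts (the t = 1 case); preserving higher-order marginals would require a more constrained swap that this sketch does not attempt.

```python
import random
from collections import Counter

def swap_sensitive(records, field, seed=0):
    """Permute the values of `field` among the records.

    A permutation leaves the one-way frequency counts of `field`
    unchanged (t = 1 only); it makes no attempt to preserve
    higher-order marginal tables.
    """
    rng = random.Random(seed)
    values = [r[field] for r in records]
    rng.shuffle(values)
    # Build new records so the original table is left untouched.
    return [{**r, field: v} for r, v in zip(records, values)]

def perturb(records, field, sigma, seed=0):
    """Add independent zero-mean Gaussian noise to every value of `field`."""
    rng = random.Random(seed)
    return [{**r, field: r[field] + rng.gauss(0.0, sigma)} for r in records]

# Hypothetical toy table; not data from any real system.
records = [
    {"dept": "sales", "diagnosis": "flu", "salary": 52000.0},
    {"dept": "sales", "diagnosis": "asthma", "salary": 48000.0},
    {"dept": "legal", "diagnosis": "flu", "salary": 91000.0},
    {"dept": "legal", "diagnosis": "none", "salary": 87000.0},
    {"dept": "ops", "diagnosis": "none", "salary": 61000.0},
]

swapped = swap_sensitive(records, "diagnosis")
noisy = perturb(records, "salary", sigma=500.0)

# The one-way frequency table of the swapped column matches the original.
print(Counter(r["diagnosis"] for r in records))
print(Counter(r["diagnosis"] for r in swapped))
```

In this sketch the non-sensitive columns are left in place, so only the association between a record and its sensitive value is disturbed.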
Current processes for preserving private data against data mining use a system component called a curator that is positioned between the original database system and the client. The primary goal of a curator is to answer questions while protecting privacy. Upon receipt of a query, the curator analyzes the query for a possible privacy leak. In response to a very specific query, the curator either adds a certain amount of distortion to the response or simply ignores the query. On-line analysis of a query is not a trivial task. An attacker may figure out detailed information about individual records by carefully crafting a sequence of seemingly unrelated queries and then using sophisticated mathematical tools to analyze the responses to these queries.
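The difficulty of on-line query analysis can be illustrated with a small Python sketch. The Curator class below is hypothetical and is not taken from any particular system: it ignores a query that matches fewer than min_group_size records and otherwise returns the sum with optional Gaussian distortion. The threshold rule, the parameter names, and the toy rows are assumptions made for illustration. The noise level is set to zero here so that the effect of combining two individually acceptable queries is easy to see.

```python
import random

class Curator:
    """Hypothetical curator: ignores narrow queries, distorts the rest."""

    def __init__(self, rows, min_group_size=3, noise_sigma=0.0, seed=0):
        self._rows = list(rows)
        self._min_group_size = min_group_size
        self._noise_sigma = noise_sigma
        self._rng = random.Random(seed)

    def total(self, field, predicate):
        """Return a (possibly distorted) sum of `field` over matching rows."""
        matched = [r[field] for r in self._rows if predicate(r)]
        if len(matched) < self._min_group_size:
            return None  # query is too specific, so it is ignored
        return sum(matched) + self._rng.gauss(0.0, self._noise_sigma)

# Hypothetical toy table (salaries in arbitrary units).
rows = [
    {"name": "alice", "salary": 70},
    {"name": "bob", "salary": 55},
    {"name": "carol", "salary": 62},
    {"name": "dave", "salary": 48},
    {"name": "erin", "salary": 90},
]

curator = Curator(rows, min_group_size=3, noise_sigma=0.0)

# A direct question about one person is ignored.
direct = curator.total("salary", lambda r: r["name"] == "alice")

# Two broad queries each pass the size check on their own,
# yet their difference isolates a single record.
everyone = curator.total("salary", lambda r: True)
all_but_alice = curator.total("salary", lambda r: r["name"] != "alice")
leaked = everyone - all_but_alice

print(direct, leaked)
```

A curator that inspects each query in isolation would accept both broad queries, which suggests why the analysis has to account for the history of queries and not only the current one.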
A strategy of simply adding random noise to a query response can be vulnerable to a collaborative attack in which a group of attackers coordinate their efforts by issuing the same query to the curator. The privacy of individual records can be compromised by averaging the responses. On-line analysis also requires a substantial amount of computational resources, which could have a significant impact on scalability and query performance. A strategy of creating a perturbed database by adding white noise cannot produce an accurate estimation of aggregated information.
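The averaging attack described above can be simulated in a few lines of Python. The sketch assumes that the curator adds independent zero-mean Gaussian noise to each response; the function names, the true value, and the noise level are hypothetical. Under that assumption the error of the averaged estimate shrinks roughly in proportion to the square root of the number of responses collected.

```python
import random

def noisy_response(true_value, sigma, rng):
    """One curator response: the true answer plus fresh zero-mean noise."""
    return true_value + rng.gauss(0.0, sigma)

def averaging_attack(true_value, sigma, n_attackers, seed=0):
    """Each attacker issues the same query; the group averages the replies."""
    rng = random.Random(seed)
    responses = [noisy_response(true_value, sigma, rng)
                 for _ in range(n_attackers)]
    return sum(responses) / len(responses)

# Hypothetical values chosen for illustration only.
TRUE_SALARY = 50_000.0
SIGMA = 1_000.0

single = averaging_attack(TRUE_SALARY, SIGMA, n_attackers=1)
estimate = averaging_attack(TRUE_SALARY, SIGMA, n_attackers=10_000)

print(abs(single - TRUE_SALARY), abs(estimate - TRUE_SALARY))
```

The simulation covers only independent noise drawn per response; a curator that returned the same distorted answer for a repeated query would not be exposed in this particular way.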
Thus, prior solutions for privacy protection in databases suffer from significant limitations in their ability to preserve privacy in data mining and have become substantial hindrances to constructing such databases and making them generally accessible. Accordingly, what is needed in the art is a new approach that overcomes the deficiencies of the current solutions.