Today, governments and corporations collect massive amounts of data about people. Such “user data” can be mined for insights or used to create useful computer systems, such as recommendation engines. For example, e-commerce sites often track a user's shopping history and analyze it to recommend new products in which the user may be interested. Similarly, online movie streaming applications may track a user's viewing history and/or self-reported ratings to suggest additional movies that the user may be interested in viewing.
As the amount of valuable data being collected has increased, so has the demand for exchange of such information. For example, the Netflix™ online DVD rental service recently published a user dataset of 100M ratings of over 17K movies by 500K entities and offered a cash prize for new algorithms for mining that data. The release of user data to the public or among private parties is inevitable given the value and uses of such data.
Given the trend toward release of user data, user privacy has become an important concern. Users are uncomfortable with the prospect of so much of their personal information being shared with various, often unidentified, third parties.
Privacy preserving data publishing (PPDP) is a field of research that focuses on manipulating a user dataset to create greater user anonymity while still maintaining the value of the dataset. Using PPDP techniques, a data publisher might “anonymize” a dataset and release the anonymized dataset to a third party rather than the original dataset. Thus, the recipient of the data may be able to use the data for meaningful data mining activities but cannot learn particularly private information about each user.
Various PPDP techniques have been developed. For example, one simple technique is to replace entities' names with anonymous identifiers (e.g., random numbers) or to remove such names altogether. More complex techniques may be aimed at preventing malicious actors from reverse-engineering personal user information from the data when considered as a whole. Such techniques include approaches such as perturbation and k-anonymity.
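The simple identifier-replacement technique described above can be illustrated with a minimal Python sketch. The function name `pseudonymize`, the record layout, and the choice of random hexadecimal tokens are assumptions made for illustration only and are not drawn from any particular system.

```python
import secrets

def pseudonymize(records, name_field="name"):
    """Return copies of the records with each distinct name replaced by a
    random identifier. The same name always maps to the same identifier,
    so per-entity patterns remain usable for data mining."""
    mapping = {}  # original name -> random identifier (hypothetical scheme)
    anonymized = []
    for record in records:
        name = record[name_field]
        if name not in mapping:
            mapping[name] = secrets.token_hex(8)
        new_record = dict(record)  # leave the original record untouched
        new_record[name_field] = mapping[name]
        anonymized.append(new_record)
    return anonymized

# Illustrative data only.
ratings = [
    {"name": "alice", "movie": "Movie A", "rating": 5},
    {"name": "bob", "movie": "Movie A", "rating": 3},
    {"name": "alice", "movie": "Movie B", "rating": 4},
]
anonymized = pseudonymize(ratings)
```

Such a replacement hides names only; the remaining attributes of each record are left unchanged, which is why the more complex techniques discussed here exist.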
In perturbation, the data values themselves are perturbed such that some data is masked while other properties are preserved. Perturbation techniques that have been studied include randomization, rotation perturbation, geometric perturbation, and others.
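As a rough illustration of the randomization variant, the following Python sketch adds zero-mean Gaussian noise to each numeric value, so that individual values are masked while aggregate properties such as the mean are approximately preserved over large samples. The function name `perturb`, the noise distribution, and the `scale` parameter are assumptions for illustration, not a prescribed method.

```python
import random

def perturb(values, scale=1.0, seed=None):
    """Return a copy of `values` with independent zero-mean Gaussian noise
    (standard deviation `scale`) added to each element."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [v + rng.gauss(0.0, scale) for v in values]

# Illustrative data only.
ages = [23, 35, 41, 29, 52, 38, 47, 31]
noisy_ages = perturb(ages, scale=2.0, seed=0)
```

A larger `scale` masks individual values more strongly but also distorts the statistics that a recipient can recover, which is the basic trade-off in perturbation approaches.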
K-anonymity attempts to protect data by constructing groups of anonymous records such that every tuple in the original user data is indistinguishably related to no fewer than k users. Although several algorithms have been proposed for finding optimal (i.e., minimal) k-anonymous tables, the application of those algorithms is limited in practice because the k-anonymity problem is NP-hard (non-deterministic polynomial-time hard). Nevertheless, various approximation algorithms and heuristics have emerged.
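The k-anonymity property can be checked, though not efficiently optimized, with a short Python sketch. It assumes that the publisher has already chosen which attributes could be used to link a record to a person (commonly called quasi-identifiers, a term assumed here) and has generalized their values. The function name `is_k_anonymous` and the sample table are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values that
    occurs in `records` is shared by at least k records."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

# Illustrative table with generalized age ranges and truncated postal codes.
table = [
    {"age": "30-39", "zip": "130**", "condition": "flu"},
    {"age": "30-39", "zip": "130**", "condition": "asthma"},
    {"age": "40-49", "zip": "148**", "condition": "flu"},
    {"age": "40-49", "zip": "148**", "condition": "diabetes"},
]
two_anonymous = is_k_anonymous(table, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(table, ["age", "zip"], k=3)
```

Verifying the property takes a single pass over the table; the hard part noted above is choosing the generalization that achieves it with minimal loss of information.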