Many service providers, including e-commerce websites, collect massive amounts of data about people, such as the users of the services they provide. Such “user data” can be mined for insights or used to build useful computer systems, such as recommendation engines. For example, e-commerce sites often track a user's shopping history and analyze it to recommend new products in which the user may be interested. Similarly, online movie streaming providers may track a user's viewing history and/or self-reported ratings to suggest additional movies that the user may be interested in viewing.
As the amount of valuable data being collected has increased, so has the demand for exchange of such information. For example, the Netflix™ online DVD rental service recently published a user dataset of 100M ratings of over 17K movies by 500K entities and offered a cash prize for new algorithms for mining that data. The release of user data to the public or among private parties is inevitable given the value and uses of such data.
Given the trend towards release of user data, user privacy has become an important concern. Users are made uncomfortable by the prospect of so much of their personal information being shared with various, often unidentified, third parties.
Privacy preserving data publishing (PPDP) is a field of technical research that focuses on manipulating a user dataset to create greater user anonymity while still maintaining the value of the dataset. Using PPDP techniques, a data publisher might “anonymize” a dataset and release the anonymized dataset, rather than the original dataset, to a third party. Thus, the recipient of the data may be able to use it for meaningful data mining activities but cannot learn private information about any particular user.
Various PPDP techniques have been developed. For example, one simple technique is to replace entities' names with anonymous identifiers (e.g., random numbers) or to remove such names altogether. However, simply removing the names of the entities (e.g., users) is often not enough: the resulting “anonymous” information may be correlated with other information to uniquely identify a user. For example, by knowing when a particular user rented certain movies, it may be possible to identify that user in a movie rental company's dataset.
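The simple name-replacement technique described above can be sketched as follows. This is a minimal illustration, not an implementation from the source: the record layout and field names are hypothetical, and any consistent mapping to opaque identifiers would serve.

```python
import secrets

def pseudonymize(records, name_field="name"):
    """Replace each entity's name with a random opaque identifier.

    Hypothetical sketch: the record layout and the 'name' field are
    assumptions for illustration only.
    """
    alias = {}  # maps original name -> random identifier (kept private)
    out = []
    for rec in records:
        name = rec[name_field]
        if name not in alias:
            alias[name] = secrets.token_hex(8)
        anon = dict(rec)
        anon[name_field] = alias[name]
        out.append(anon)
    return out

records = [
    {"name": "alice", "movie": "M1", "rented": "2011-05-01"},
    {"name": "alice", "movie": "M2", "rented": "2011-05-03"},
    {"name": "bob",   "movie": "M1", "rented": "2011-05-02"},
]
anon = pseudonymize(records)
```

Note that the rental dates and movie titles survive pseudonymization unchanged, which is exactly the correlation risk described above: an adversary who knows when a particular user rented certain movies can still link the “anonymous” records back to that user.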
Approaches such as K-anonymity have been used to solve this problem. K-anonymity aims to modify the database so that, for any given user record, the database contains at least K−1 other records that are identical to it with respect to potentially identifying attributes. One method to achieve this K-anonymity is described in U.S. patent application Ser. No. 13/363,688, filed on Feb. 1, 2012 (the “'688 application”), which is incorporated by reference herein in its entirety. While this approach can effectively protect user privacy, the utility of the data may decrease, as each user record essentially represents an average person within a group of K people.