Valuable research, whether in an academic or business context, is often dependent on the availability of large quantities of structured data (e.g., data corresponding to various entities that are grouped together and that share common attributes, and where the data is organized according to a defined schema) for data mining and other analyses. However, such data often includes personal information about individuals that should not be disclosed. The tension between protecting privacy and preserving data utility is a fundamental problem for organizations that would like to share their data. Where this problem is not resolved, either data is not shared, preventing useful applications, or organizations adopt risky practices of disclosing private information, sometimes with unfortunate results. One approach to this problem is to “sanitize” the data by modifying any data that may be used to identify individuals in such a way that it becomes difficult for an adversary to associate any given record with a specific individual.
Most practical approaches to sanitizing data can be grouped into two large categories: algorithms based on so-called k-anonymity, and randomization (the latter often being referred to as noise perturbation). K-anonymity approaches modify any potentially identifying information in such a way that a given individual's record cannot be distinguished from at least k-1 other records in the structured data. While such techniques achieve a desired level of privacy, k-anonymity-based tradeoffs between privacy and distortion inevitably reduce to difficult combinatorial optimization problems. Additionally, due to the use of generalization and suppression operators, the output of k-anonymization is data with a changed representation (e.g., zip codes with digits removed or deleted attributes). This complicates the construction of models that will later be applied to clean data, and test code must be altered before it can run on the changed representation. Further still, the statistical effect of a k-anonymity anonymization process is not clear, thereby making data analysis challenging.
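By way of illustration only, the generalization operator and the k-anonymity test described above can be sketched in Python as follows. The sketch assumes the common convention that every combination of quasi-identifier values must be shared by at least k records. The records, the choice of zip code and age as quasi-identifiers, and the function names `generalize` and `is_k_anonymous` are hypothetical and are not taken from any particular implementation.

```python
from collections import Counter

def generalize(record):
    """Coarsen the quasi-identifiers: keep 3 zip digits, bucket age by decade."""
    zip_code, age = record
    decade = (age // 10) * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}")

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(records)
    return all(count >= k for count in counts.values())

# Hypothetical (zip code, age) records, invented for this sketch.
records = [
    ("12345", 34),
    ("12399", 37),
    ("12377", 31),
    ("54321", 52),
    ("54388", 58),
]

generalized = [generalize(r) for r in records]

print(is_k_anonymous(records, 2))      # raw records are all unique
print(is_k_anonymous(generalized, 2))  # generalized records form groups
print(generalized[0])                  # the representation has changed
```

Note that the generalized output no longer has the same representation as the input (an age becomes a range and a zip code loses digits), which is the drawback noted above for downstream models and test code.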
In randomization, the structured data is corrupted by noise in an effort to conceal specific data values. An advantage of randomization is that the noise can be chosen with statistical properties (which properties may be subsequently published) such that aggregate queries against the structured data can account for the added noise, thereby increasing the accuracy and reliability of the aggregate results without compromising individual privacy. Furthermore, the representation of the data is preserved (e.g., an age is mapped to a specific number, as opposed to an age range in the case of k-anonymity-based approaches). However, while randomization (using current techniques) preserves the utility of the data, it provides no assurance concerning the privacy level of the published data.
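As a minimal sketch of how an aggregate query might account for published noise properties, the following Python example perturbs a set of ages with zero-mean Gaussian noise and then corrects a variance estimate by subtracting the known noise variance. The Gaussian noise model, the value of `NOISE_SIGMA`, the synthetic ages, and the function names are all assumptions made for illustration; the text above does not prescribe a particular noise distribution.

```python
import random
import statistics

NOISE_SIGMA = 5.0  # assumed, published standard deviation of the noise

def perturb(values, sigma, rng):
    """Add independent zero-mean Gaussian noise to each value."""
    return [v + rng.gauss(0.0, sigma) for v in values]

def estimate_true_variance(noisy_values, sigma):
    """Independent noise adds sigma**2 to the variance, so subtract it."""
    return statistics.pvariance(noisy_values) - sigma ** 2

rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [rng.randint(20, 80) for _ in range(10_000)]  # synthetic data
noisy_ages = perturb(ages, NOISE_SIGMA, rng)

# Zero-mean noise leaves the mean approximately unchanged.
print(statistics.mean(ages), statistics.mean(noisy_ages))

# The raw variance of the noisy data is inflated; the corrected
# estimate is close to the variance of the original data.
print(statistics.pvariance(ages))
print(statistics.pvariance(noisy_ages))
print(estimate_true_variance(noisy_ages, NOISE_SIGMA))
```

Each perturbed age remains a single number, so the representation is preserved, but nothing in this procedure quantifies how well an individual record is concealed.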
Indeed, in some cases, it may be possible to attack randomized structured data based on publicly available information to associate specific records with specific individuals, i.e., a linking attack. An example of this is illustrated in FIGS. 6-8. FIG. 6 illustrates structured data comprising a name, age and procedure/prescription attribute for each record prior to anonymization. As shown in FIG. 7, in order to preserve anonymity of each individual listed in the records, the name attribute is removed completely whereas the age attribute is perturbed by random noise to provide sanitized data. FIG. 8 illustrates publicly available information about two of the individuals, namely their name and age. In the linking attack, the attacker attempts to associate some attributes from the publicly available information with the same attributes in the sanitized data. In the illustrated example, the attacker can note that Chris' age is 52, which matches well with the outlier age value in FIG. 7 of 49.3 when compared with the other sanitized age values. As a result, the attacker can infer that Chris is associated with the third record (i.e., the one listing chemotherapy as the procedure/prescription).
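The linking attack described above can be sketched, under simplifying assumptions, as a nearest-value match between a publicly known age and the perturbed ages. In the Python example below, only Chris' age of 52, the sanitized value of 49.3, and the chemotherapy entry come from the description above; the other two sanitized records, the function names `link` and `match_margin`, and the nearest-value matching rule are hypothetical stand-ins and do not reproduce the contents of FIGS. 6-8.

```python
# Sanitized records: names removed, ages perturbed by noise.
sanitized = [
    {"age": 24.1, "procedure": "procedure A"},   # hypothetical record
    {"age": 27.6, "procedure": "procedure B"},   # hypothetical record
    {"age": 49.3, "procedure": "chemotherapy"},  # outlier from the example
]

# Publicly available information known to the attacker.
public_ages = {"Chris": 52}

def link(public_age, records):
    """Return the index of the record whose perturbed age is closest."""
    return min(range(len(records)),
               key=lambda i: abs(records[i]["age"] - public_age))

def match_margin(public_age, records):
    """Gap between the best and second-best match; large means confident."""
    distances = sorted(abs(r["age"] - public_age) for r in records)
    return distances[1] - distances[0]

for name, age in public_ages.items():
    index = link(age, sanitized)
    print(name, "-> record", index + 1, sanitized[index]["procedure"],
          "margin", round(match_margin(age, sanitized), 1))
```

The large margin reflects why outliers are vulnerable: when the added noise is small relative to the gap between an outlier and the remaining values, the perturbed value still singles out one individual.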
Thus, it would be advantageous to provide techniques for balancing and controlling privacy versus distortion performance when anonymizing structured data, thereby preserving the utility of the data while simultaneously providing a known level of privacy.