1. Field of the Invention
The present invention relates to managing data, and more particularly to a system and method for transforming data in a manner that satisfies predetermined privacy constraints.
2. Discussion of the Related Art
Information is a cornerstone of fields as diverse as medical care and retail sales. For example, information about a hospital patient can include a date of birth, social security number, address, next of kin, and medical diagnosis. Consumer profiles collected by businesses and organizations can include identifying and transactional information pertinent to the organization. The amount of information, and in particular the sensitivity of portions of it, can be a concern for those represented by the information, e.g., consumers.
The information is frequently shared with different parts of the same organization or with other organizations. For example, some portions of medical data may be made part of a public record or shared with public health organizations or with research groups. The information can also be used as a commodity, where organizations are willing to pay for it. Dissemination of the data can be made more agreeable to the individuals and entities represented by the data if rules governing the dissemination are in place. For example, one proposed method of controlling dissemination constrains the released identifying information so that any combination of its values describes a set of at least k individuals or entities, where k represents the level of privacy. If k is 100, for example, this implies that 100 or more individuals or entities are described by the same identifying information. Such a privacy requirement can, of course, be met by removing all the identifying information, but that can also render the data useless. The problem is to satisfy the privacy constraint while retaining useful information. The task of transforming a table to satisfy a privacy constraint is also called anonymization.
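The k-anonymity constraint described above can be checked directly by counting how often each combination of identifying values occurs in the table. The following sketch assumes a table represented as a list of row dictionaries; the column names and values are hypothetical.

```python
from collections import Counter

def satisfies_k_anonymity(rows, identifying_columns, k):
    """Return True if every combination of values in the identifying
    columns appears in at least k rows of the table."""
    counts = Counter(
        tuple(row[col] for col in identifying_columns) for row in rows
    )
    return all(count >= k for count in counts.values())

# Hypothetical table: each row is a dictionary of column values.
table = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "cold"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
]

print(satisfies_k_anonymity(table, ["zip", "age"], 2))  # → False
```

Here the combination ("021**", "40-49") appears in only one row, so the table fails 2-anonymity even though the other combination appears twice.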
At least one proposed method satisfies the privacy constraint by abstracting or suppressing the data selectively. For example, in tabular data where each row represents an individual and one or more columns comprise explicitly identifying information (e.g., social security numbers), the identifying information can be suppressed or replaced by some randomized value as a placeholder.
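The replacement of explicitly identifying values by randomized placeholders can be sketched as follows; the use of random hexadecimal strings as placeholders is one illustrative choice of randomized value, not a requirement of the method described above.

```python
import uuid

def suppress_identifiers(rows, identifying_columns):
    """Replace explicitly identifying values (e.g., social security
    numbers) with randomized placeholder values, in place."""
    for row in rows:
        for col in identifying_columns:
            row[col] = uuid.uuid4().hex  # random placeholder value
    return rows

patients = [{"ssn": "123-45-6789", "diagnosis": "flu"}]
suppress_identifiers(patients, ["ssn"])
print(patients[0]["diagnosis"])  # → flu (non-identifying columns untouched)
```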
The notion of a privacy constraint called k-anonymity has been formally defined by P. Samarati in the paper “Protecting Respondents' Identities in Microdata Release,” IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, November/December 2001. A tabular data set satisfies k-anonymity if any combination of values appearing in the identifying columns of any row appears in at least k rows of the table. Samarati also defines the operation of abstraction for any potentially identifying column by using a taxonomy of the values for the column. For example, a date of birth can be specified exactly, specified down to the month and year only, or specified to the year of birth only. The information becomes less precise as the values are abstracted further. Outliers in the table can also be suppressed; the information in the potentially identifying columns of a suppressed row is completely masked out. Samarati notes that allowing both abstractions and suppressions provides more flexibility, leading to better solutions. The information loss due to abstraction is measured using the taxonomy tree for each of the potentially identifying columns. The allowed abstractions are those corresponding to nodes (and corresponding values) all of which are at some chosen level of the taxonomy tree, and the difference between the chosen abstraction level and the level of the leaf nodes of the tree is used to measure the loss of information. The anonymization task is treated as an optimization problem: achieve the privacy level with minimum loss of information (measured using the taxonomy tree as mentioned above) and with no more than a specified number of rows suppressed.
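The level-wise abstraction just described can be illustrated with the date-of-birth taxonomy. The following is a minimal sketch: the loss measure (distance of the chosen level from the leaf level) follows the description above, while the specific level numbering and the 'YYYY-MM-DD' date format are assumptions made for illustration.

```python
def abstract_dob(value, level):
    """Abstract a date of birth 'YYYY-MM-DD' to a taxonomy level:
    level 0: exact date; 1: year and month; 2: year only; 3: suppressed."""
    if level == 0:
        return value
    if level == 1:
        return value[:7]   # 'YYYY-MM'
    if level == 2:
        return value[:4]   # 'YYYY'
    return "*"             # fully masked out

def information_loss(chosen_level, leaf_level=0):
    """Loss for a column, measured as the difference between the chosen
    abstraction level and the level of the leaf nodes of the taxonomy."""
    return chosen_level - leaf_level

print(abstract_dob("1975-06-12", 1))  # → 1975-06
print(information_loss(2))            # → 2
```

Abstracting every value in a column to the same level corresponds to choosing one level of the taxonomy tree for that column, as in Samarati's formulation.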
The Datafly system, described by Latanya Sweeney in “Datafly: A System for Providing Anonymity in Medical Data,” Proceedings of Database Security XI: Status and Prospects (Chapman and Hall), is an example of a system that uses abstraction and suppression to perform anonymization to achieve a specified k-anonymity level of privacy. The Datafly system also uses the notion of taxonomy trees for the potentially identifying columns. The optimization problem is solved using a simple greedy heuristic that does not explore the full range of possible solutions.
Another system that uses abstractions and suppressions to anonymize is the μ-argus system, described by A. J. Hundepool and L. C. R. J. Willenborg in “Software for Statistical Disclosure Control,” presented at the 3rd International Seminar on Statistical Confidentiality, 1996. Because only 2- and 3-combinations of potentially identifying column values are considered, the solutions produced by this system can be less than optimal.
An alternative to the use of abstractions and suppressions is to perturb the data while ensuring that certain statistical properties are preserved. The work by Agrawal and Srikant, “Privacy-Preserving Data Mining,” Proceedings of the ACM SIGMOD Conference on Management of Data, May 2000, is of this type. The distortion method, however, compromises the content of each specific piece of data (each row in the table), which precludes deriving insights about relationships between the column values of any row once the values are distorted.
However, no known system or method applies a targeted transformation according to a desired implementation. Therefore, a need exists for a system and method for transforming data according to a predetermined privacy constraint.