During the normal course of business, companies accumulate large amounts of data. Recently, some companies have begun to monetize this data by sharing it with third parties, such as advertisers, researchers, or collaborative partners. The third parties pay a fee and, in exchange, receive relevant data from the data owner. A third party can then use the data to target advertising or conduct research. However, the data requested by third parties often includes information that is private to one or more of the individuals from whom the data is collected.
In one example, a hospital maintains patient records that include patient identification, age, residential address, social security number, and medical diagnosis. A third party conducting research on diabetes wants to identify the regions of the United States that have the highest and the lowest numbers of Type II diabetes diagnoses for patients below 40 years of age. Prior to sending the requested data, the data owner, here the hospital, must ensure that the data to be provided does not allow an untrusted third party to access an individual's private information or determine an individual's identity.
Data anonymization includes altering data to protect sensitive information while maintaining features that allow a requesting third party to use the data. The altering can include adding noise, reducing the precision of the data, or removing parts of the data itself. Generally, data owners do not have enough knowledge regarding anonymization and thus rely on outside parties to anonymize their data before providing the data to a requesting third party. One approach includes contacting an anonymization service provider that provides individual personnel to help with the data anonymization. The personnel assigned to the anonymization have access to the data despite being untrusted third parties. Currently, many companies ask the anonymization service to sign confidentiality agreements, such as a Memorandum of Understanding or a Non-Disclosure Agreement, to protect the data before and after the data is anonymized.
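The three kinds of alteration named above (adding noise, reducing precision, and removing parts of the data) can be illustrated with a minimal Python sketch. The record layout, the field names, the ten-year age bucket, and the function names are assumptions made for illustration only and do not come from any particular anonymization service.

```python
import random

def add_noise(value, scale, rng):
    """Perturb a numeric value with uniform noise drawn from [-scale, scale]."""
    return value + rng.uniform(-scale, scale)

def reduce_precision(value, bucket):
    """Round a numeric value down to the start of its bucket (e.g. 37 -> 30)."""
    return (value // bucket) * bucket

def remove_fields(record, fields):
    """Return a copy of the record without the listed fields."""
    return {key: val for key, val in record.items() if key not in fields}

def anonymize_record(record):
    """Drop direct identifiers and coarsen the age (hypothetical policy)."""
    out = remove_fields(record, {"patient_id", "ssn", "address"})
    out["age"] = reduce_precision(out["age"], 10)
    return out

# Hypothetical patient record with made-up values.
record = {
    "patient_id": "P-001",
    "age": 37,
    "address": "1 Example Street",
    "ssn": "000-00-0000",
    "diagnosis": "Type II diabetes",
}

rng = random.Random(0)  # seeded only so the sketch is repeatable
released = anonymize_record(record)
noisy_age = add_noise(record["age"], 2.0, rng)
```

In this sketch the released record keeps only a coarsened age and the diagnosis. Whoever runs such code still sees the original record, which is the exposure the paragraph above describes.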
Conventional methods for performing data anonymization exist, but fail to address the issue of anonymization by an untrusted third party. In U.S. Pat. No. 7,269,578, to Sweeney, entries of a table are altered based on user specifications, such as specific fields and records, a recipient profile, and a minimum anonymity level. A value for k is computed and quasi-identifiers, which are k tuples that have the same values assigned across a group of attributes, are identified for release. A sensitivity of each attribute is determined and a replacement strategy is determined for each sensitive attribute, such as equivalence class substitution, including one-way hashing, or generalized replacement. Generalized replacement includes identifying the attribute with the largest number of distinct values and generalizing each value for that attribute by reducing the amount of information provided for that value. For example, dates having a month, day, and year can be generalized to month and year, year only, or a range of years. However, Sweeney fails to consider that the anonymization may be performed by an untrusted party and thus provides no protection of the data to be anonymized. Further, Sweeney fails to identify a number of classes into which a data set to be anonymized should be divided and to anonymize each data value based on the class to which that data value belongs.
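As a rough illustration of the replacement strategies mentioned above, the following Python sketch shows date generalization, a salted one-way hash, and selection of the attribute with the most distinct values. It is not taken from the Sweeney patent. The function names, the four generalization levels, the ten-year range, the choice of SHA-256, and the salt are all assumptions made for this example.

```python
import hashlib
from datetime import date

def generalize_date(d, level):
    """Generalize a date: 0 = full date, 1 = month and year,
    2 = year only, 3 = ten-year range (levels are an assumption)."""
    if level == 0:
        return d.isoformat()
    if level == 1:
        return f"{d.year:04d}-{d.month:02d}"
    if level == 2:
        return f"{d.year:04d}"
    start = (d.year // 10) * 10
    return f"{start}-{start + 9}"

def one_way_hash(value, salt):
    """Replace a value with a salted SHA-256 digest (hypothetical choice)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def most_distinct_attribute(rows, attributes):
    """Pick the attribute with the largest number of distinct values."""
    return max(attributes, key=lambda a: len({row[a] for row in rows}))

# Made-up rows for illustration.
rows = [
    {"birth": date(1984, 7, 12), "state": "WA"},
    {"birth": date(1986, 3, 2), "state": "WA"},
    {"birth": date(1991, 11, 30), "state": "OR"},
]

target = most_distinct_attribute(rows, ["birth", "state"])
generalized = [generalize_date(row["birth"], 3) for row in rows]
```

Each higher level reveals less about the original value, so more rows end up sharing the same generalized value.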
Further, the paper titled “Mondrian Multidimensional K-Anonymity,” by LeFevre et al., describes partitioning a dataset, using single-dimensional or multidimensional partitions, such that each region includes k or more points. In one example, the partitioning can occur using median partitioning. However, the LeFevre paper fails to describe steps for protecting the data prior to anonymization, in the event an untrusted third party anonymizes the data. In addition, LeFevre fails to provide measures of data sensitivity to automatically identify attributes for anonymization and further fails to consider masking.
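The idea of median partitioning with at least k points per region can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm as published by LeFevre et al. It assumes numeric attributes, tries dimensions in order of raw value range, and accepts a cut only when both sides keep at least k points. The function name and sample points are hypothetical.

```python
def median_partition(points, k):
    """Recursively split points at the median of one dimension, keeping
    every resulting region at k or more points."""
    if len(points) < 2 * k:
        return [points]  # too small to split into two regions of size >= k

    def spread(dim):
        values = [p[dim] for p in points]
        return max(values) - min(values)

    # Try the dimension with the widest range first (an assumed heuristic).
    for dim in sorted(range(len(points[0])), key=spread, reverse=True):
        ordered = sorted(points, key=lambda p: p[dim])
        median = ordered[len(ordered) // 2][dim]
        left = [p for p in ordered if p[dim] < median]
        right = [p for p in ordered if p[dim] >= median]
        if len(left) >= k and len(right) >= k:
            return median_partition(left, k) + median_partition(right, k)

    return [points]  # no allowable cut on any dimension

# Made-up (age, postal code) points.
points = [
    (25, 98101), (27, 98102), (31, 98101), (33, 98102),
    (41, 98103), (45, 98103), (52, 98104), (60, 98105),
]
regions = median_partition(points, 2)
```

Each returned region could then be released with its values generalized to the region's ranges. As in the text above, nothing in the sketch protects the raw points from the party that runs it.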
Therefore, there is a need for an approach to making sensitive data available for third-party anonymization without compromising the privacy of the individuals from whom the data is collected.