1. Field of the Invention
The present invention relates generally to data anonymization. More specifically, the present invention relates to a system and method for data anonymization using hierarchical data clustering and perturbation.
2. Related Art
In today's digital society, record-level data has increasingly become a vital source of information for businesses and other entities. For example, many government agencies are required to release census and other record-level data to the public, to make decision-making more transparent. Although transparency can be a significant driver for economic activity, care must be taken to safeguard the privacy of individuals and to prevent sensitive information from falling into the wrong hands. To preserve privacy, record-level data must be anonymized so that no individual can be identified from the data.
Many methods have been proposed for anonymization of data. One method for the anonymization of census data, known as attribute suppression, involves not releasing attributes that may lead to identification. However, even if direct identifiers are removed, it is still possible to isolate individuals who have unique values for the combination of all released attributes. As such, it might be possible to identify certain individuals by linking the released data to externally available datasets.
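The linking risk described above can be illustrated with a minimal sketch. The datasets, the attribute names (birth_year, zip, sex, diagnosis), and the helper link_unique are hypothetical and chosen only for illustration; they are not taken from any particular release.

```python
def link_unique(released, external, quasi_identifiers):
    """Pair a name with a released record when the combination of
    quasi-identifier values is unique in both datasets."""
    def key(record):
        return tuple(record[a] for a in quasi_identifiers)

    released_by_key, external_by_key = {}, {}
    for r in released:
        released_by_key.setdefault(key(r), []).append(r)
    for e in external:
        external_by_key.setdefault(key(e), []).append(e)

    matches = []
    for k, rel in released_by_key.items():
        ext = external_by_key.get(k, [])
        if len(rel) == 1 and len(ext) == 1:
            matches.append((ext[0]["name"], rel[0]))
    return matches

# Released data: direct identifiers (names) have been suppressed.
released = [
    {"birth_year": 1975, "zip": "13053", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1980, "zip": "13068", "sex": "M", "diagnosis": "flu"},
    {"birth_year": 1980, "zip": "13068", "sex": "M", "diagnosis": "diabetes"},
]
# Hypothetical externally available dataset that still carries names.
external = [
    {"name": "Alice Example", "birth_year": 1975, "zip": "13053", "sex": "F"},
    {"name": "Bob Example", "birth_year": 1980, "zip": "13068", "sex": "M"},
]

matches = link_unique(released, external, ["birth_year", "zip", "sex"])
```

In this sketch only the first released record is re-identified, because its attribute combination is unique; the other two records share a combination and so cannot be told apart by this simple join.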
One common metric for anonymization is known as k-anonymity. K-anonymity requires that each record be indistinguishable from at least k−1 other records with respect to certain identifying attributes. One method for achieving k-anonymity, known as generalization, involves replacing the values of identifying attributes with more general values. Research groups have analyzed the computational complexity of achieving k-anonymity, and demonstrated that finding an optimal k-anonymization is NP-hard. Some advanced methods for attaining k-anonymity include approximation algorithms to achieve k-anonymity, optimal k-anonymity, privacy enhancing k-anonymity in distributed scenarios, personalized privacy preservation, and multi-dimensional k-anonymity.
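As an illustrative sketch only, the k-anonymity condition and a simple generalization step might be expressed as follows. The sample records, the attribute names, and the 10-year age bands in generalize_age are assumptions made for the example.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def generalize_age(record, width=10):
    """Replace an exact age with a band such as '30-39'
    (a hypothetical generalization rule)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

records = [
    {"age": 31, "zip": "130**", "diagnosis": "flu"},
    {"age": 36, "zip": "130**", "diagnosis": "asthma"},
    {"age": 42, "zip": "148**", "diagnosis": "flu"},
    {"age": 47, "zip": "148**", "diagnosis": "diabetes"},
]

raw_ok = is_k_anonymous(records, ["age", "zip"], k=2)            # False: every age is unique
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)  # True: two records per band
```

The price of generalization is visible in the output: exact ages are no longer available to an analyst, only the bands.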
However, achieving k-anonymity by generalization is not feasible for high-dimensional datasets, because many unique combinations of attribute values remain even after some attributes are generalized. Moreover, two simple attacks have been used to show that a k-anonymized dataset can have subtle, but severe, privacy problems. A powerful privacy criterion called l-diversity has been proposed that can defend against such attacks. However, research shows that l-diversity has a number of limitations and is neither necessary nor sufficient to prevent attribute disclosure. A privacy approach referred to as t-closeness has been proposed, and requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table.
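The following sketch shows how these two criteria might be checked on a small table. It makes two simplifying assumptions that should be read as such: l-diversity is taken in its simplest "distinct values" form, and closeness is measured with the variational distance between distributions rather than the distance measure used in the t-closeness literature. The sample data and function names are hypothetical.

```python
from collections import Counter, defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in quasi_identifiers)].append(r)
    return classes

def distribution(records, sensitive):
    """Relative frequency of each sensitive value."""
    counts = Counter(r[sensitive] for r in records)
    return {value: c / len(records) for value, c in counts.items()}

def min_distinct_sensitive(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any class (the 'l')."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in classes.values())

def max_variational_distance(records, quasi_identifiers, sensitive):
    """Largest distance between a class distribution and the overall one (the 't')."""
    overall = distribution(records, sensitive)
    worst = 0.0
    for group in equivalence_classes(records, quasi_identifiers).values():
        local = distribution(group, sensitive)
        dist = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, dist)
    return worst

# A 2-anonymous table in which one class is homogeneous in the sensitive value.
table = [
    {"age": "30-39", "zip": "130**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "130**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "148**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "148**", "diagnosis": "diabetes"},
]

l_value = min_distinct_sensitive(table, ["age", "zip"], "diagnosis")
t_value = max_variational_distance(table, ["age", "zip"], "diagnosis")
```

Here the table is 2-anonymous, yet l_value is 1: anyone known to fall in the first class is revealed to have the same diagnosis, which is the kind of attribute disclosure that k-anonymity alone does not prevent.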
Another approach for anonymization of data involves perturbation of an entire dataset by adding random noise or swapping the values of one record with another record. This ensures that even if a unique record is isolated, it may not correspond to any real person. However, this approach destroys the correlations among different attributes, which may cause statistical inferences from the data to no longer be valid.
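The effect of value swapping on correlations can be seen in a small sketch. The ages, incomes, and the fixed list of swap pairs are hypothetical; in practice the pairs would ordinarily be chosen at random, and they are fixed here only so that the example is reproducible.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def swap_values(values, pairs):
    """Return a copy of one attribute column with the values of each
    listed pair of records exchanged."""
    swapped = list(values)
    for i, j in pairs:
        swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

ages = [25, 30, 35, 40, 45, 50]
incomes = [20, 30, 40, 50, 60, 70]  # in thousands; perfectly correlated with age

# Hypothetical swap plan: exchange incomes between these record pairs.
perturbed_incomes = swap_values(incomes, [(0, 2), (1, 5), (3, 4)])

before = pearson(ages, incomes)
after = pearson(ages, perturbed_incomes)
```

The set of income values is unchanged, so single-attribute statistics such as the mean are preserved, but the relationship between age and income that was present in the original records is largely lost.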
Thus, a need exists for a system for data anonymization that can be applied to high-dimensional datasets while maintaining statistical information at different levels of the data.