Herebelow, designations presented in square brackets—[ ]—are keyed to the list of references found towards the close of the present disclosure.
“Privacy preserving” data mining has arisen as an important challenge in recent years because of the large amount of personal data available among corporations and individuals. An important method that has been developed for privacy preserving data mining is that of k-anonymity [samarati]. The k-anonymity approach has been extensively explored in recent years because of its intuitive significance in defining the level of privacy. A primary motivation behind the k-anonymity approach is that public databases can often be used by individuals to identify personal information about users. For example, a person's age and zip code can be used for identification to a very high degree of accuracy. Therefore, the k-anonymity method attempts to reduce the granularity of representation of the data in order to minimize the risk of disclosure.
To achieve such a goal, methods of generalization and suppression are employed. In the method of generalization, the multi-dimensional values are generalized to a range. In addition, some attributes or records may need to be suppressed in order to maintain k-anonymity. At the end of the process, the data is transformed in such a way that a given record cannot be distinguished from at least (k−1) other records in the data. In such cases, the data is said to be k-anonymous, since it is not possible to map a given record to fewer than k individuals in the public database. The concept of k-anonymity has been intuitively appealing because of its natural interpretability in terms of the degree of privacy.
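By way of illustration only, generalization and suppression as described above can be sketched as follows. This is a minimal sketch rather than the method of [samarati]: the quasi-identifiers (age and zip code), the bucket width `age_width`, the prefix length `zip_digits`, and the function names are all assumptions chosen for the example.

```python
from collections import Counter


def generalize(record, age_width=10, zip_digits=3):
    """Coarsen an (age, zip_code) pair into an age range and a zip prefix.

    The attribute choice and parameters are hypothetical, for illustration.
    """
    age, zip_code = record
    low = (age // age_width) * age_width
    age_range = (low, low + age_width - 1)
    zip_prefix = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (age_range, zip_prefix)


def is_k_anonymous(records, k):
    """True if every distinct record value occurs at least k times."""
    return all(count >= k for count in Counter(records).values())


def anonymize(records, k, age_width=10, zip_digits=3):
    """Generalize every record, then suppress records in groups smaller than k."""
    generalized = [generalize(r, age_width, zip_digits) for r in records]
    counts = Counter(generalized)
    return [g for g in generalized if counts[g] >= k]


if __name__ == "__main__":
    raw = [(34, "10001"), (36, "10002"), (31, "10009"),
           (52, "94110"), (58, "94117"), (45, "60601")]
    released = anonymize(raw, k=2)
    print(released)
    print(is_k_anonymous(released, 2))
```

In this sketch the record (45, "60601") is suppressed because no other record shares its generalized form, while the remaining five records fall into two groups of at least two indistinguishable records each.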
At the same time, a “condensation-based” technique [edbt04] has been proposed as an alternative to k-anonymity methods. A key difference between condensation and k-anonymity methods is that the former works with pseudo-data rather than with original records. Because of the use of pseudo-data, the identities of the records are even more secure from inference attacks; the idea is to utilize statistical summarization, which is then leveraged in order to create pseudo-data. As such, the condensation approach includes the following steps:
(1) Condensed groups of records are constructed. The number of records in each group is (at least) equal to the anonymity level k.
(2) The statistical information in the condensed groups can be utilized to synthetically generate pseudo-data which reflects the overall behavior of the original data.
(3) The condensed pseudo-groups can be utilized directly with minor modifications of existing data mining algorithms. Typically, such pseudo-data is useful in aggregation-based data mining algorithms which utilize the aggregate trends and patterns in the data rather than individual records.
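The condensation steps above can be sketched as follows, purely as an illustration and not as the procedure of [edbt04]. The greedy nearest-neighbor grouping, the per-attribute mean and standard deviation used as the group statistics, and the independent Gaussian sampling are all simplifying assumptions made for this example (in particular, correlations between attributes are ignored), and every function name is hypothetical.

```python
import random
import statistics


def sq_dist(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(group):
    """Per-attribute mean of a group of records."""
    return [statistics.fmean(col) for col in zip(*group)]


def build_groups(records, k):
    """Greedily form groups of k records; leftovers join the nearest group."""
    if len(records) < k:
        raise ValueError("need at least k records")
    remaining = list(records)
    groups = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # Take the k-1 records closest to the seed.
        remaining.sort(key=lambda r: sq_dist(r, seed))
        groups.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]
    for record in remaining:  # fewer than k records are left over
        nearest = min(groups, key=lambda g: sq_dist(record, centroid(g)))
        nearest.append(record)
    return groups


def summarize(group):
    """Keep only aggregate statistics of a group, not its records."""
    columns = list(zip(*group))
    return {
        "count": len(group),
        "mean": [statistics.fmean(c) for c in columns],
        "stdev": [statistics.pstdev(c) for c in columns],
    }


def generate_pseudo_data(summaries, rng):
    """Draw synthetic records from each group's summary statistics."""
    pseudo = []
    for s in summaries:
        for _ in range(s["count"]):
            pseudo.append(tuple(rng.gauss(m, sd)
                                for m, sd in zip(s["mean"], s["stdev"])))
    return pseudo


if __name__ == "__main__":
    data = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (8.0, 9.0),
            (8.3, 9.1), (7.9, 8.8), (8.1, 9.3)]
    summaries = [summarize(g) for g in build_groups(data, k=3)]
    print(generate_pseudo_data(summaries, random.Random(0)))
```

Only the summaries, not the grouped records, are passed to the generation step, so no pseudo-record corresponds one to one with an original record. Here each group contributes as many pseudo-records as it had originals, which is one simple way of keeping the groups proportionately represented.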
The condensation approach is very similar to the k-anonymity model since it guarantees that at least k records in the data cannot be distinguished from one another. At the same time, since a one-to-one matching does not exist between the original and condensed data, it is more resistant to inference attacks. It is noted that the new data set need not even contain the same number of records as the original data set, as long as the records in different condensed groups are proportionately represented in the pseudo-data.
It has further been noted that k-anonymity methods have been developed for the case of multi-dimensional data, and do not work for the case of strings. The string domain is particularly important because of its applicability to a number of crucial problems for privacy preserving data mining in the biological domain. Recent research has shown that the information about diseases in medical data can be used in order to make inferences about the identity of DNA fragments. Many diseases of a genetic nature show up as specific patterns in the DNA of the individual. A possible solution is to anonymize the medical records, but this can at best provide a partial solution. This is because the information about DNA segments can be obtained from a variety of sources other than medical records. For example, identifying information can be obtained from a number of defining characteristics which are public information about the individual. Similarly, if DNA string fragments from a relative of a target are available, they can be used to identify the target. In general, it can be assumed that partial or complete information about the individual fragments of the strings is available. It may also be possible to have strings which are structurally related to the base strings in a specific way. Therefore, it is recognized as important to anonymize the strings in such a way that it is no longer possible to use these individual fragments in order to make inferences about the identities of the original strings.
In view of the foregoing, needs have been recognized in connection with improving upon the shortcomings and disadvantages of conventional efforts.