The present invention relates to data mining, and, in particular, to preserving privacy in data mining.
Preserving privacy in data mining has become an important consideration in recent years because many new kinds of technology facilitate the collection of different kinds of data. Such large collections of data have led increasingly to a need to develop methods for protecting the privacy of the underlying data records. As a result, a considerable amount of research has been focused on this problem. However, most of this research has focused on preserving privacy for quantitative and categorical data.
The techniques proposed for preserving privacy for quantitative and categorical data have been useful for different scenarios of privacy. Though both techniques work well for low dimensional data, they are not very effective for preserving privacy for high dimensional data.
In the high dimensional data case, the concept of locality becomes ill-defined. Since the concept of anonymity depends deeply upon locality, it is not possible to make the data anonymous, i.e., “anonymize” the data, without losing an unacceptable amount of information. Furthermore, as the number of attributes increases, the problem of anonymity becomes increasingly difficult. Since this problem has been shown to be NP-hard, i.e., it is not known to be optimally solvable in a reasonable amount of time, it is also impractical to anonymize the data.
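The loss of locality in high dimensions can be illustrated with a minimal sketch. It is not part of any method described here; it assumes, purely for illustration, points drawn uniformly at random from the unit hypercube and Euclidean distance, and the function name relative_contrast is hypothetical. It measures how much farther the farthest point is from a query than the nearest point. As the dimensionality grows, this relative contrast tends to shrink, so the "nearest" records are barely closer than any others.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points random points in the unit hypercube.

    Assumption for illustration only: uniform data, Euclidean distance.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # Print the contrast for a few dimensionalities to see the trend.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions, a small contrast means that grouping each record with its nearest neighbors, as anonymization would require, gives groups that are nearly as spread out as the whole data set.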
In the method of perturbation, it is possible to compute maximum likelihood estimates for records matching a public database. With increasing dimensionality, however, these estimates become increasingly accurate, and therefore privacy is lost.
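The effect of dimensionality on perturbation can be sketched as follows. This is an illustrative simulation under stated assumptions, not a method taken from the text: the public records are assumed to be uniform random vectors, the perturbation is assumed to be independent Gaussian noise of equal variance on every attribute, and the name reidentification_rate is hypothetical. Under that noise model, the maximum likelihood match for a perturbed record is the public record nearest to it in Euclidean distance.

```python
import math
import random


def reidentification_rate(dim, n_records=50, sigma=1.0, seed=0):
    """Fraction of perturbed records whose maximum likelihood match in the
    public database is the record they were derived from.

    Assumptions for illustration only: uniform public data and independent
    Gaussian noise with standard deviation sigma on every attribute, so the
    maximum likelihood match is the nearest public record.
    """
    rng = random.Random(seed)
    public = [[rng.random() for _ in range(dim)] for _ in range(n_records)]
    hits = 0
    for i, record in enumerate(public):
        # Perturb the record by adding noise to each attribute.
        perturbed = [x + rng.gauss(0.0, sigma) for x in record]
        # Nearest public record = maximum likelihood match under this model.
        best = min(range(n_records),
                   key=lambda j: math.dist(perturbed, public[j]))
        if best == i:
            hits += 1
    return hits / n_records


if __name__ == "__main__":
    for dim in (1, 10, 100, 500):
        print(dim, reidentification_rate(dim))
```

In this sketch the noise level per attribute is held fixed while the number of attributes grows. The match rate rises with the dimensionality, which is the sense in which the estimates become more accurate and privacy is lost.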
Recently, research has been directed to preserving privacy via pseudo-random sketches. These techniques have been designed specifically for the problem of query resolution in quantitative data sets, not for high dimensional data sets. Such techniques do not work effectively for preserving privacy in high dimensional data sets.
There is thus a need for a technique for preserving privacy in data mining of high dimensional data sets.