Privacy preserving data mining has become important due to the large amount of personal and consumer data tracked by automated systems on the Internet. The proliferation of electronic commerce on the World Wide Web has resulted in the storage of large amounts of transactional and personal user information. In addition, advances in hardware technology have made it technologically and economically feasible to track information about individuals from transactions in everyday life. For example, a simple transaction, such as using a credit card, results in automated storage of information about a user's buying behavior. The underlying data may consist of demographic information and specific transactions. It may not be desirable to share such information publicly; therefore, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. In order to ensure effective data collection, it is important to design methods that can mine the necessary data with a guarantee of privacy.
The nature of privacy in the context of recent trends in information technology has been examined by many authors, see, e.g., C. Clifton et al., “Security and Privacy Implications of Data Mining,” ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 15-19, May 1996; L. F. Cranor, “Special Issue on Internet Privacy,” Communications of the ACM, 42(2), February 1999; “The End of Privacy,” The Economist, May 1999; K. Thearling, “Data Mining and Privacy; A Conflict in Making,” March 1998; “The Death of Privacy,” Time, August 1997; and J. M. Reagle Jr. et al., “P3P and Privacy on the Web,” The World Wide Web Consortium, http://www.w3.org/P3P/P3FAQ.html, April 2000. This interest has resulted in a considerable amount of focus on privacy preserving data collection and mining methods, see, e.g., R. Agrawal et al., “Privacy Preserving Data Mining,” Proceedings of the ACM SIGMOD Conference, 2000; P. Benassi, “Truste: An Online Privacy Seal Program,” Communications of the ACM, 42(2):56-59, 1999; V. Estivill-Castro et al., “Data Swapping: Balancing Privacy Against Precision in Mining for Logic Rules,” Data Warehousing and Knowledge Discovery DaWak99, pp. 389-398; A. Evfimievski et al., “Privacy Preserving Mining of Association Rules,” ACM KDD Conference, 2002; C. K. Liew et al., “A Data Distortion by Probability Distribution,” ACM TODS, 10(3):395-411, 1985; T. Lau et al., “Privacy Interfaces for Information Management,” Communications of the ACM, 42(10):88-94, October 1999; and J. Vaidya, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data,” ACM KDD Conference, 2002.
In order to preserve privacy in data mining operations, a perturbation approach has typically been utilized. This technique adds noise to each dimension of the data and then reconstructs the data distributions on which the mining is performed, treating each dimension independently. The technique therefore ignores the correlations between the different dimensions, making it impossible to reconstruct the inter-attribute correlations in the data set. In many cases, relevant information for data mining methodologies, such as classification, is hidden in the inter-attribute correlations, see, e.g., S. Murthy, “Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey,” Data Mining and Knowledge Discovery, pp. 345-389, 1998.
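The loss of inter-attribute correlation described above can be illustrated with a minimal Python sketch. It is not the method of any cited reference. The synthetic two-attribute data, the Gaussian noise, the noise level of 2.0, and the helper names (`pearson`, `make_correlated`, `perturb`) are all assumptions chosen for illustration. Shuffling one column stands in for knowing only the per-attribute distributions: each marginal is preserved exactly, but the pairing between attributes is discarded.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_correlated(n, rng):
    """Hypothetical data set: attribute y closely follows attribute x."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


def perturb(values, noise_sd, rng):
    """Add independent Gaussian noise to one attribute (one dimension)."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]


rng = random.Random(0)
xs, ys = make_correlated(5000, rng)

# Each dimension is perturbed on its own, with no regard to the other.
perturbed_xs = perturb(xs, 2.0, rng)
perturbed_ys = perturb(ys, 2.0, rng)

# Stand-in for per-attribute reconstruction: the marginal of y is kept
# exactly, but which y belongs to which x is no longer known.
shuffled_ys = ys[:]
rng.shuffle(shuffled_ys)

r_original = pearson(xs, ys)
r_perturbed = pearson(perturbed_xs, perturbed_ys)
r_marginals_only = pearson(xs, shuffled_ys)

print("original records:      ", round(r_original, 3))
print("perturbed records:     ", round(r_perturbed, 3))
print("marginals only (joint lost):", round(r_marginals_only, 3))
```

Under these assumptions, the strong correlation in the original records is heavily attenuated in the perturbed records and is essentially absent once only the separate per-attribute distributions are retained.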
An existing data mining technique uses a distribution-based analog of a single-attribute split methodology (see, e.g., R. Agrawal et al.). This technique does not use the multidimensional records themselves, but takes aggregate distributions of the data as input, which requires a fundamental redesign of data mining methodologies. Other techniques, such as multi-variate decision tree methodologies (see, e.g., S. Murthy), cannot be modified to work with the perturbation approach because of its independent treatment of the different attributes. Therefore, distribution-based data mining methodologies have an inherent disadvantage in the loss of implicit information available in multidimensional records. It is difficult to extend the technique to reconstruct multi-variate distributions, because the amount of data required to estimate multidimensional distributions (even without randomization) increases exponentially with data dimensionality, see, e.g., B. W. Silverman, “Density Estimation for Statistics and Data Analysis,” Chapman and Hall, 1986. Collecting that much data is not feasible in many practical problems because of the large number of dimensions in the data.
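The exponential growth in required data can be made concrete with a rough Python sketch. It uses a simple histogram as a stand-in density estimator, which is an assumption made here for illustration and not the analysis given by Silverman. The function names, the choice of 10 bins per dimension, and the target of 30 records per cell are likewise hypothetical.

```python
def histogram_cells(bins_per_dim, dims):
    """Number of cells in a grid histogram over `dims` attributes."""
    return bins_per_dim ** dims


def records_needed(bins_per_dim, dims, per_cell):
    """Records needed to average `per_cell` observations in every cell."""
    return per_cell * histogram_cells(bins_per_dim, dims)


# Assumed settings: 10 bins per attribute, about 30 records per cell.
for dims in (1, 2, 5, 10, 20):
    cells = histogram_cells(10, dims)
    needed = records_needed(10, dims, 30)
    print(f"{dims:2d} dimensions: {cells} cells, about {needed} records")
```

Each added attribute multiplies the number of cells, and hence the number of records needed to populate them, by the number of bins per attribute, so the requirement quickly exceeds the size of any realistic data set.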
Thus, a need exists for improved privacy preserving data mining techniques that overcome these and other limitations.