Privacy preserving data mining has become an important issue in recent years due to the large amount of consumer data tracked by automated systems on the Internet. The proliferation of electronic commerce on the World Wide Web has resulted in the storage of large amounts of transactional and personal information about users. In addition, advances in hardware technology have made it feasible to track information about individuals from transactions in everyday life.
For example, a simple transaction such as using a credit card results in automated storage of information about user buying behavior. In many cases, users are not willing to supply such personal data unless its privacy is guaranteed. Therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy. This has resulted in a considerable amount of focus on privacy preserving data collection and mining methods in recent years.
Privacy preserving data mining approaches may essentially be considered to fall into one of two types: (1) privacy determination using a single server; and (2) distributed privacy preserving data mining.
(1) Privacy Determination Using a Single Server.
In this approach, users are not willing to share their original data values with the server that stores the data. A recent approach to privacy preserving data mining of this kind of data has been a perturbation-based technique. Users are not equally protective of all values in their records. Thus, users may be willing to provide modified values of certain fields, obtained by adding noise drawn from a (publicly known) perturbing random distribution. The modified value may be generated using custom code or a browser plug-in. Many data mining problems do not require the individual records, but only the aggregate distributions of the data. Since the perturbing distribution is known, it can be used to reconstruct these aggregate distributions from the modified values. This aggregate information may then be used as the input to data mining algorithms.
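The perturbation and reconstruction steps described above can be illustrated with a minimal sketch. It is not taken from any particular system: the choice of a uniform perturbing distribution, the function names perturb and reconstruct_mean_and_variance, and the synthetic "age" field are all assumptions made for illustration. The sketch also recovers only two aggregate statistics (mean and variance) rather than a full distribution.

```python
import random
import statistics

def perturb(value, width, rng):
    """Client side: add noise drawn from the publicly known Uniform(-width, width)."""
    return value + rng.uniform(-width, width)

def reconstruct_mean_and_variance(perturbed, width):
    """Server side: estimate aggregate statistics of the original values.

    The noise has mean 0 and variance width**2 / 3 and is independent of the
    data, so mean(perturbed) estimates the true mean and
    var(perturbed) - width**2 / 3 estimates the true variance.
    """
    est_mean = statistics.fmean(perturbed)
    est_var = statistics.pvariance(perturbed) - width ** 2 / 3
    return est_mean, max(est_var, 0.0)

# Hypothetical data: 20,000 user ages that the server never sees directly.
rng = random.Random(0)
true_ages = [rng.gauss(40, 10) for _ in range(20000)]

width = 20.0  # assumed half-width of the public perturbing distribution
perturbed_ages = [perturb(a, width, rng) for a in true_ages]

# The server works only with perturbed_ages and the known noise parameters.
est_mean, est_var = reconstruct_mean_and_variance(perturbed_ages, width)
print(f"estimated mean {est_mean:.2f}, estimated variance {est_var:.2f}")
```

Each individual perturbed value may differ from the true value by up to 20 in either direction, yet the aggregate estimates stay close to the statistics of the original data because the noise averages out over many records.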
It is to be noted that the perturbation approach results in some amount of information loss. The greater the level of perturbation, the less accurately the data distributions can be estimated. On the other hand, larger perturbations also lead to a greater amount of privacy. Thus, there is a natural trade-off between accuracy and privacy.
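The trade-off noted above can be made concrete with a small simulation. This is a sketch under stated assumptions: the perturbing distribution is taken to be Uniform(-width, width), the width of that interval serves as a crude proxy for privacy, and rms_error_of_mean is a hypothetical helper that measures how far the mean estimated from perturbed values drifts from the mean of the original values.

```python
import random
import statistics

def rms_error_of_mean(width, n_records=2000, trials=50, seed=0):
    """Root-mean-square error of the mean estimated from perturbed records."""
    rng = random.Random(seed)
    sq_errors = []
    for _ in range(trials):
        # Synthetic original values; the server would never see these.
        data = [rng.gauss(40, 10) for _ in range(n_records)]
        # Values as submitted by users, with Uniform(-width, width) noise added.
        perturbed = [x + rng.uniform(-width, width) for x in data]
        err = statistics.fmean(perturbed) - statistics.fmean(data)
        sq_errors.append(err * err)
    return statistics.fmean(sq_errors) ** 0.5

# Larger widths hide individual values better but give noisier aggregates.
errors = {width: rms_error_of_mean(width) for width in (5.0, 20.0, 80.0)}
for width, err in errors.items():
    print(f"noise half-width {width:5.1f} -> RMS error of estimated mean {err:.3f}")
```

Under these assumptions the error of the estimate grows roughly in proportion to the noise width, so the level of perturbation has to be chosen with both accuracy and privacy in mind.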
(2) Distributed Privacy Preserving Data Mining.
In this kind of privacy preservation, the users are willing to share the records with their individual servers, but not with other servers. In many cases, it may be desirable to find a way to mine the aggregate data across the different servers. An example of such a case is the situation in which different competing businesses do not wish to share their competitive data, but they do wish to cooperate to the extent that aggregate data across different servers is shared. This situation can often arise in a retail environment in which different competing entities may desire to find aggregate information about market basket transactions. Unfortunately, existing techniques do not provide suitable ways to mine the aggregate data across the different servers.
Accordingly, it would be highly desirable to provide techniques for use in accordance with a distributed privacy preserving data mining approach.