1. Field of the Invention
The present invention relates to mining data from Internet users while preserving the privacy of the users.
2. Description of the Related Art
The explosive progress in computer networking, data storage, and processor speed has led to the creation of very large databases that record enormous amounts of transactional information, including Web-based transactional information. Data mining techniques can then be used to discover valuable, non-obvious information from these large databases.
Not surprisingly, many Web users do not wish to have every detail of every transaction recorded. Instead, many Web users prefer to maintain considerable privacy. Accordingly, a Web user might choose not to give certain information during a transaction, such as income, age, number of children, and so on.
Data mining of Web user information, however, is useful not only to, e.g., marketing companies, but also in better serving Web users. For instance, data mining might reveal that people of a certain age in a certain income bracket prefer particular types of vehicles, and generally do not prefer other types. Consequently, by knowing the age and income bracket of a particular user, an automobile sales Web page can be presented that lists the likely vehicles of choice to the user before other types of vehicles, thereby making the shopping experience more relevant and efficient for the user. Indeed, with the above in mind it will be appreciated that data mining makes it possible to filter data to weed out unwanted information, as well as to improve search results with less effort. Nonetheless, data mining used to improve Web service to a user requires information that the user might not want to share.
As recognized herein, the primary task of data mining is the development of models about aggregated data. Accordingly, the present invention understands that it is possible to develop accurate models without access to precise information in individual data records. Surveys of Web users indicate that the majority of users, while expressing concerns about privacy, would willingly divulge useful information about themselves if privacy measures were implemented, thereby facilitating the gathering of data and mining of useful information. Having carefully considered the above, the present invention addresses the noted problems.
The invention is a general purpose computer programmed according to the inventive steps herein to mine data from users of the Internet while preserving their privacy. The invention can also be embodied as an article of manufacture (a machine component) that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to undertake the present invention. This invention is realized in a critical machine component that causes a digital processing apparatus to perform the inventive method steps herein. The invention is also a computer-implemented method for undertaking the acts disclosed below.
Accordingly, a computer-implemented method is disclosed for generating a classification model based on data received from at least one user computer via the Internet while maintaining the privacy of a user of the computer. The method includes perturbing original data associated with the user computer to render perturbed data, and reconstructing an estimate of a distribution of the original data using the perturbed data. Then, using the estimate, at least one Naive Bayes classification model is generated.
Preferably, perturbed data is generated from plural original data associated with respective plural user computers. At least some of the data can be perturbed using a uniform probability distribution or using a Gaussian probability distribution. Furthermore, at least some of the data is perturbed by selectively replacing the data with other values based on a probability.
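By way of non-limiting illustration, the three perturbation options just mentioned (uniform noise, Gaussian noise, and probabilistic replacement) might be sketched as follows. The function names, the noise parameters, and the sample ages are hypothetical and are chosen only for illustration; they are not taken from the disclosure.

```python
import random

def perturb_uniform(value, half_width, rng):
    """Add noise drawn uniformly from [-half_width, +half_width]."""
    return value + rng.uniform(-half_width, half_width)

def perturb_gaussian(value, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return value + rng.gauss(0.0, sigma)

def perturb_replace(value, domain, keep_prob, rng):
    """Keep the value with probability keep_prob; otherwise
    substitute a value chosen at random from domain."""
    if rng.random() < keep_prob:
        return value
    return rng.choice(domain)

# Hypothetical example: each user perturbs an age locally before sending it.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ages = [23, 35, 41, 58, 62]
noisy_uniform = [perturb_uniform(a, 20.0, rng) for a in ages]
noisy_gaussian = [perturb_gaussian(a, 10.0, rng) for a in ages]
noisy_replaced = [perturb_replace(a, list(range(18, 91)), 0.7, rng) for a in ages]
```

In this sketch only the perturbed lists would leave the user computer; the original `ages` values stay local.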
As set forth in detail below, the reconstruction act includes iteratively determining a density function. The particularly preferred reconstructing act includes partitioning the perturbed data into intervals, and iteratively determining a partition probability using the intervals. Preferably, this is iteratively undertaken by determining classes of the data, and determining a class probability using training data. Also, a probability of a record given a class is determined using the training data. Then, a probability of a class given a record is determined based on the class probability and the probability of a record given a class. In one preferred embodiment, the training data is the estimate, and the reconstructing act is undertaken before the act of generating the classifier.
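For illustration only, one way the iterative determination of partition probabilities could look is sketched below. It assumes that the noise density is known to the server and that each interval is represented by its midpoint; the Bayesian-style update rule, the function names, and the sample data are assumptions made for this sketch and are not asserted to be the exact procedure of the detailed description.

```python
import math
import random

def gaussian_pdf(y, sigma):
    """Density of zero-mean Gaussian noise at y."""
    return math.exp(-0.5 * (y / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def reconstruct(perturbed, midpoints, noise_pdf, iterations=50):
    """Estimate the probability of each interval of the ORIGINAL data
    from perturbed values, given the (known) noise density."""
    k = len(midpoints)
    probs = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        updated = [0.0] * k
        for w in perturbed:
            # Weight of each interval as the source of the observed value w.
            weights = [noise_pdf(w - m) * p for m, p in zip(midpoints, probs)]
            total = sum(weights)
            if total == 0.0:
                continue  # w is implausible under every interval; skip it
            for j in range(k):
                updated[j] += weights[j] / total
        norm = sum(updated)
        if norm == 0.0:
            break
        probs = [u / norm for u in updated]
    return probs

# Hypothetical data: 300 users with value 30 and 100 users with value 70.
rng = random.Random(0)
sigma = 10.0
original = [30.0] * 300 + [70.0] * 100
perturbed = [x + rng.gauss(0.0, sigma) for x in original]
midpoints = [10.0, 30.0, 50.0, 70.0, 90.0]
estimate = reconstruct(perturbed, midpoints, lambda y: gaussian_pdf(y, sigma))
```

The server in this sketch sees only `perturbed`, yet `estimate` should place most of its mass on the intervals centered at 30 and 70, which is the aggregate information a classifier needs.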
In another aspect, a computer system includes a program of instructions that in turn includes structure to undertake system acts which include, at a user computer, randomizing at least some original values of at least some numeric attributes to render perturbed values. The program also sends the perturbed values to a server computer. At the server computer, the perturbed values are processed to generate at least one Naive Bayes classification model.
In still another aspect, a system storage device includes system readable code that is readable by a server system for generating at least one Naive Bayes classification model based on original data values stored at plural user systems without knowing the original values. The device includes logic means for randomizing at least some original values of at least some numeric attributes to render perturbed values. Also, the device includes logic means for sending the perturbed values to a server computer. Moreover, logic means are available to the server computer for processing the perturbed values to generate at least one Naive Bayes classification model.
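As a hedged sketch of the Naive Bayes steps summarized above (a class probability, a probability of a record given a class, and a probability of a class given a record), the following assumes that per-class interval probabilities for each attribute have already been estimated from the perturbed data. All names, interval boundaries, and probability values are hypothetical.

```python
def interval_index(value, edges):
    """Index of the interval [edges[i], edges[i+1]) containing value,
    clamped to the first or last interval."""
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2

def posterior(record, class_priors, cond_probs, edges):
    """P(class | record) under the Naive Bayes independence assumption.
    cond_probs[cls][attr] lists the interval probabilities of attr in cls."""
    scores = {}
    for cls, prior in class_priors.items():
        score = prior
        for attr, value in record.items():
            score *= cond_probs[cls][attr][interval_index(value, edges[attr])]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical model: two classes, two numeric attributes, two intervals each.
edges = {"age": [0, 40, 80], "income": [0, 50, 100]}
class_priors = {"buys": 0.4, "skips": 0.6}
cond_probs = {
    "buys": {"age": [0.8, 0.2], "income": [0.3, 0.7]},
    "skips": {"age": [0.3, 0.7], "income": [0.6, 0.4]},
}
result = posterior({"age": 30, "income": 75}, class_priors, cond_probs, edges)
predicted = max(result, key=result.get)
```

The sketch assumes that at least one class has a nonzero score for the record; a fuller implementation would smooth zero interval probabilities before normalizing.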
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which: