Commercial databases have become a source of information for users making decisions of various types. In predicting consumers' future buying habits, for example, it is useful to have access to data concerning their past buying behavior. As the size of these databases has grown, extracting useful information has become very difficult. An entire field known as data mining has emerged to enable users to access and interpret the data contained in large databases.
In many data mining problems, a goal is to make a rational decision given the information contained in a large amount of data. Presenting a visual depiction of the data to a human, to enable him or her to make such decisions, is one such problem. Automatically making many decisions is another. The large corpus of data can be the records of all customers' transactions in a grocery store chain with automated registers or in an online bookstore having a huge inventory. Another example is the record of all news stories read by the viewers of an online news site. The news site administrator might want to predict what stories would interest a viewer given what he or she has already read, and what advertisements to place on a web page given the advertisements the viewer has already clicked and the stories he or she has read. Similarly, a store manager might want to know which items are bought by customers in which demographic categories.
Although many algorithms for such problems are known and widely used (for example, decision trees and K-means clustering), training them can take a prohibitively long time when the data set is very large. It has been observed, however, that under certain circumstances it may not be necessary to use an entire database (which can have many millions of records) to create a useful model or predictor. Instead, a sample of a few tens of thousands of records might accurately represent the much larger data set of the entire database.
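The sampling idea can be illustrated with a minimal sketch. It is an illustration only, not a procedure taken from any cited reference. It assumes the records arrive as a stream of unknown length and uses reservoir sampling, a standard technique swapped in here for concreteness, to draw a uniform random sample. The names `reservoir_sample` and `synthetic_purchases`, the record count, and the synthetic purchase amounts are all hypothetical.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Draw k records uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                sample[j] = record
    return sample


def synthetic_purchases(n, seed=1):
    """Hypothetical stand-in for a large table of purchase amounts."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.gauss(50.0, 15.0)


# Assumed sizes: 200,000 records in the "database", 20,000 in the sample.
sample = reservoir_sample(synthetic_purchases(200_000), k=20_000)
sample_mean = sum(sample) / len(sample)
print(f"sample size = {len(sample)}, sample mean = {sample_mean:.2f}")
```

Under these assumptions the sample mean should fall close to the mean of the full stream, which is the sense in which a small sample may stand in for the whole database. Whether a sample is adequate for a particular model depends on the data and on the model being trained.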
U.S. Pat. No. 6,012,058 to Fayyad et al., which issued Jan. 4, 2000, discloses one data mining process for clustering data. The disclosure of this patent is incorporated herein by reference. The patent discloses a clustering process that extracts sufficient statistics from a large database to produce a data clustering model that takes up far less memory than the entire database.
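As a general illustration of why sufficient statistics save memory, the following minimal sketch keeps only a count, a per-dimension sum, and a per-dimension sum of squares for a group of records. It is not the procedure of the referenced patent. The class name `ClusterStats` and its methods are hypothetical, and the sketch assumes numeric records of fixed dimension.

```python
class ClusterStats:
    """Running summary of a group of points: count, sum, and sum of squares."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.square_sum = [0.0] * dim

    def add(self, point):
        """Fold one record into the summary; the record itself is not stored."""
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.square_sum[d] += x * x

    def mean(self):
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        """Per-dimension population variance, computed as E[x^2] - (E[x])^2."""
        return [
            sq / self.n - (s / self.n) ** 2
            for s, sq in zip(self.linear_sum, self.square_sum)
        ]


stats = ClusterStats(dim=2)
for point in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    stats.add(point)
print(stats.mean(), stats.variance())
```

The storage needed by such a summary depends on the number of dimensions rather than on the number of records folded into it, which is why a model built from summaries of this kind can occupy far less memory than the data it describes.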