1. Field of the Invention
The present invention is directed toward the field of computer implemented clustering techniques, and more particularly toward methods and apparatus for divide and conquer clustering.
2. Art Background
In general, clustering is the problem of grouping objects into categories such that the members of each category are similar in some interesting way. Literature in the field of clustering spans numerous application areas, including data mining, data compression, pattern recognition, and machine learning. The computational complexity of the clustering problem is very well understood. The general problem is known to be NP-hard.
The analysis of the clustering problem in the prior art has largely focused on the accuracy of the clustering results. For example, there exist methods that compute a clustering with maximum diameter at most twice as large as the maximum diameter of the optimum clustering. Although these prior art clustering techniques generate close to optimum results, they are not tuned for implementation in a computer, particularly when the dataset for clustering is large. Accordingly, it is desirable to develop a clustering technique that maximizes the efficiency of the computer implementation, even at some cost to the quality of the clustering results.
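As an illustration of the kind of accuracy-oriented method referred to above, the following is a minimal sketch of the farthest-first traversal heuristic, which is commonly credited with a factor-of-two bound on the maximum cluster diameter. The passage does not name a specific method, so the choice of this heuristic, the function names `farthest_first_centers` and `assign`, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math


def farthest_first_centers(points, k):
    """Pick k centers by repeatedly taking the point farthest from the centers so far."""
    centers = [points[0]]  # arbitrary starting center (an assumption)
    # dist[i] holds the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]
```

Note that each new center requires a full scan of the points, so the sketch makes k passes over the dataset, which is acceptable only when the dataset fits in main memory.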
In general, prior art clustering methods are not designed to work with massively large and dynamic datasets. Most computer implemented clustering methods require multiple passes through the entire dataset. Thus, if the dataset is too large to fit in a computer's main memory, the computer must repeatedly swap the dataset in and out of main memory (i.e., the computer must repeatedly access an external data source, such as a hard disk drive). The analysis of the clustering problem in the prior art has largely focused on its computational complexity, and not its input/output complexity. However, in implementing the method in a computer, there is a significant difference in access time between accessing internal main memory and accessing external memory, such as a hard disk drive. For example, loading a register requires approximately 10^-9 seconds while accessing data from the disk requires roughly 10^-3 seconds. Thus, there is about a factor of a million difference in the access time of internal vs. external memory. As a result, the performance bottleneck of clustering techniques that operate on massively large datasets is often due to the I/O communication and not the processing time (i.e., the CPU time). This impact of I/O communications is compounded by the fact that processor speeds are increasing at an annual rate of approximately 40 to 60 percent, compared to the increase of approximately 7 to 10 percent for disk transfer rates.
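The arithmetic behind the access-time comparison above can be checked with a short sketch. The two access times and the growth-rate ranges are taken from the passage; the function names `access_time_ratio` and `gap_growth`, and the simple compound-growth model, are assumptions introduced only for illustration.

```python
from fractions import Fraction

# Access times quoted in the passage, kept as exact fractions of a second.
REGISTER_ACCESS_S = Fraction(1, 10**9)
DISK_ACCESS_S = Fraction(1, 10**3)


def access_time_ratio():
    """How many times slower a disk access is than a register load."""
    return DISK_ACCESS_S / REGISTER_ACCESS_S


def gap_growth(cpu_rate, disk_rate, years):
    """Factor by which the CPU-to-disk speed gap widens after `years`,
    assuming both rates compound annually (a simplifying assumption)."""
    return ((1 + cpu_rate) / (1 + disk_rate)) ** years
```

For instance, `gap_growth(0.4, 0.1, 10)` and `gap_growth(0.6, 0.07, 10)` bracket how far the gap would widen over a decade under the quoted rates.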
The I/O efficiency of clustering methods under different definitions of clustering has been studied. Some approaches are based on representing the dataset in a compressed fashion based on how important a point is from a clustering perspective. For example, one prior art technique stores those points most important in main memory, compresses those that are less important, and discards the remaining points. Another common prior art technique to handle large datasets is sampling. For example, one technique illustrates how large a sample is needed to ensure that, with high probability, the sample contains at least a certain fraction of points from each cluster. The sampling approach applies a clustering technique to the sample points only. Moreover, generally speaking, these prior art approaches do not make guarantees regarding the quality of the clustering. Accordingly, it is desirable to develop a clustering technique with quality of clustering guarantees that operates on massively large datasets for efficient implementation in a computer.
A divide and conquer method significantly improves input/output (I/O) efficiency in a computer. The divide and conquer method clusters a set of points, S, to identify k centroids. The set of points, S, is assigned into "r" partitions, so as to uniquely assign each point into one partition. At least one of the subsets of points for a partition is stored into main memory of the computer. The computer processes the subset of points to generate a plurality of partition or divide centroids, Q, k for each of the r partitions. The divide centroids are merged into a set of partition centroids, and are stored in main memory of the computer. Thereafter, the partition centroids are processed by accessing the memory to generate a plurality of conquer centroids, c1, ..., ck. The divide and conquer method is a data incremental as well as a feature incremental method. Also, the divide and conquer method permits parallel processing for implementation in a multi-processor computer system.
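The divide, merge, and conquer steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the passage does not specify the base clustering routine, so plain Lloyd-style k-means is used as a stand-in, the partitions are formed by simple striding, the merged divide centroids are treated as unweighted points, and the names `kmeans` and `divide_and_conquer` are hypothetical.

```python
import random


def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, k, iters=20, seed=0):
    """Stand-in base clusterer (an assumption): plain Lloyd iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(list(points), k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        centroids = [_mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids


def divide_and_conquer(points, k, r, seed=0):
    """Cluster `points` into k centroids using r partitions."""
    if len(points) < r * k:
        raise ValueError("need at least r * k points")
    # Divide: striding places every point in exactly one of the r partitions.
    partitions = [points[i::r] for i in range(r)]
    # Process one partition at a time; only that subset must be resident.
    divide_centroids = []
    for part in partitions:
        divide_centroids.extend(kmeans(part, k, seed=seed))
    # Conquer: cluster the merged r * k divide centroids, which are few
    # enough to stay in main memory.
    return kmeans(divide_centroids, k, seed=seed)
```

In this sketch each partition is visited once, so the full dataset is read a single time, and the loop over partitions has no dependency between iterations, which is what would allow the partitions to be handed to separate processors.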