The present invention is directed toward the field of computer implemented clustering techniques, and more particularly toward methods and apparatus for divide and conquer clustering.
In general, clustering is the problem of grouping data objects into categories such that members of a category are similar in some interesting way. The field of clustering spans numerous application areas, including data mining, data compression, pattern recognition, and machine learning. More recently, with the explosion of the Internet and of information technology, "data stream" processing has also required the application of clustering. A "data stream" is an ordered sequence of data points that can only be read once or a small number of times. Applications producing data streams include customer clicks (on a web site, for example), telephone records, multimedia data, and web page retrievals; their data sets are too large to fit in a computer's main memory and conventionally must be stored externally before clustering can be applied.
The computational complexity of the clustering problem is very well understood. The existence of an efficient optimum clustering algorithm is unlikely, i.e., clustering is "NP-hard." Conventional clustering methods thus seek to find approximate solutions.
In general, conventional clustering techniques are not designed to work with massively large and dynamic datasets and thus do not operate well in the context of, say, data mining and data stream processing. Most computer implemented clustering methods are based upon reducing computational complexity and often require multiple passes through the entire dataset. Thus, if the dataset is too large to fit in a computer's main memory, the computer must repeatedly swap the dataset in and out of main memory (i.e., the computer must repeatedly access an external data source, such as a hard disk drive). Furthermore, for data stream applications, since the data exceeds the amount of main memory space available, clustering techniques should not have to track or remember the data that has been scanned. The analysis of the clustering problem in the prior art has largely focused on its computational complexity, and not on reducing the level of requisite input/output (I/O) activity. When implementing the method in a computer, there is a significant difference (often by a factor of 10^6) in access time between accessing internal main memory and accessing external memory, such as a hard disk drive. As a result, the performance bottleneck of clustering techniques that operate on massively large datasets is often due to the I/O latency and not the processing time (i.e., the CPU time).
The I/O efficiency of clustering techniques under different definitions of clustering has also been studied. Some techniques represent the dataset in a compressed fashion based on how important a point is from a clustering perspective. For example, one conventional technique stores the most important points in main memory, compresses those that are less important, and discards the remaining points. Another common conventional technique for handling large datasets is sampling. For example, one technique illustrates how large a sample is needed to ensure that, with high probability, the sample contains at least a certain fraction of points from each cluster. The sampling-based techniques apply a clustering technique to the sample points only. Other techniques compress the dataset in unique ways. One technique, known popularly as Birch, constructs a tree that summarizes the dataset. The dataset is broken into subclusters, and each subcluster is represented by the mean (or average value) of the data in the subcluster. The union of these means is the compressed dataset. However, Birch requires many parameters regarding the data that must be provided by a knowledgeable user, and is sensitive to the order of the data. Generally speaking, all of these typical approaches make no guarantees regarding the quality of the clustering.
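The mean-based summarization described above can be sketched in a few lines. This is an illustrative simplification of Birch-style compression, not the actual Birch tree construction: each subcluster is simply replaced by its mean, and the partition into subclusters is assumed given.

```python
# Illustrative sketch of mean-based subcluster compression: each subcluster
# is replaced by the mean of its points. The partition into subclusters is
# assumed to be given; Birch itself builds it with a summary tree.
def subcluster_mean(points):
    """Coordinate-wise mean of a list of points (tuples)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

subclusters = [[(0, 0), (0, 2)], [(10, 0), (12, 0)]]
compressed = [subcluster_mean(sc) for sc in subclusters]
# compressed == [(0.0, 1.0), (11.0, 0.0)]
```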
Clustering has many different definitions of quality, and for each definition a myriad of techniques exist to solve or approximately solve it. One definition of clustering quality is the so-called "k-median" definition. The k-median definition is as follows: find k centers in a set of n points so as to minimize the sum of distances from data points to their closest cluster centers. A popular variant of k-median finds centers that minimize the sum of the squared distances from each point to its nearest center. "k-center" clustering is defined as minimizing the maximum diameter of any cluster, where the diameter is the distance between the two farthest points within a cluster. Most techniques for implementing k-median and similar clustering have large space requirements and involve random access to the input data.
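The k-median cost defined above can be computed directly. The following is a minimal sketch assuming Euclidean distance; the sample points and centers are illustrative, not drawn from the disclosure:

```python
# Minimal sketch of the k-median quality metric: the sum of distances from
# each data point to its nearest center (Euclidean distance assumed).
import math

def kmedian_cost(points, centers):
    """Sum of distances from each point to its closest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = [(0.5, 0.0), (10.5, 0.0)]
# Each point lies 0.5 from its nearest center, so the cost is 4 * 0.5 = 2.0.
```

The squared-distance variant mentioned above would replace `math.dist(p, c)` with `math.dist(p, c) ** 2`.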
Accordingly, it is desirable to develop a clustering technique with quality of clustering guarantees that operates on massively large datasets for efficient implementation in a computer.
What is disclosed is a technique that uses a weighted divide and conquer approach for clustering a set S of n data points to find k final centers. The technique comprises 1) partitioning the set S into P disjoint pieces S1, . . . , SP; 2) for each piece Si, determining a set Di of k intermediate centers; 3) assigning each data point in each piece Si to the nearest one of the k intermediate centers; 4) weighting each of the k intermediate centers in each set Di by wi, the number of points in the corresponding piece Si assigned to that center; and 5) clustering the weighted intermediate centers together to find said k final centers, the clustering performed using a specific quality metric and a clustering method A.
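Steps 1) through 5) above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the inner clustering method A is a hypothetical stand-in (a weighted Lloyd-style k-means, i.e., the squared-distance variant), and the interleaved partition of S is arbitrary.

```python
# Sketch of the weighted divide-and-conquer technique. cluster_A is a
# stand-in for the generic clustering method A (here, weighted Lloyd-style
# k-means); any method meeting the chosen quality metric could be used.
import math

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

def cluster_A(points, weights, k, iters=20):
    """Stand-in clustering method A: weighted Lloyd-style k-means."""
    centers = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            groups[nearest(p, centers)].append((p, w))
        centers = [
            tuple(sum(p[d] * w for p, w in g) / sum(w for _, w in g)
                  for d in range(len(points[0]))) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def divide_and_conquer(S, k, P):
    pieces = [S[i::P] for i in range(P)]            # 1) partition S into P pieces
    all_centers, all_weights = [], []
    for piece in pieces:
        Di = cluster_A(piece, [1] * len(piece), k)  # 2) k intermediate centers
        counts = [0] * k
        for p in piece:                             # 3) assign points to centers
            counts[nearest(p, Di)] += 1
        all_centers.extend(Di)                      # 4) weight each center by wi
        all_weights.extend(counts)
    return cluster_A(all_centers, all_weights, k)   # 5) cluster weighted centers

# Two well-separated groups of points; one final center lands near each.
S = [(0, 0), (10, 0), (0, 1), (10, 1), (0, -1), (10, -1), (1, 0), (11, 0)]
final = divide_and_conquer(S, k=2, P=2)
```

Note that step 5 operates only on P·k weighted centers rather than all n points, which is what makes the approach attractive for datasets too large for main memory.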
In some embodiments of the invention, P is chosen such that each piece Si obeys the constraint |Si|&lt;M, where M is the size of a memory or portion thereof. In other embodiments, weighted intermediate centers are found for each piece Si in a parallel fashion. In some embodiments, sets of data points can be incrementally added to already clustered data sets by finding weighted intermediate centers for the incremental set, and then clustering those weighted intermediate centers with weighted intermediate centers found from the previously clustered data sets.
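The incremental embodiment can be sketched as follows: each batch is reduced to k weighted intermediate centers, and a new batch is merged by reclustering its weighted centers together with those retained from earlier batches. The helper `weighted_kmeans` is again a hypothetical stand-in for clustering method A, and the sample batches are illustrative.

```python
# Sketch of incremental clustering: weighted intermediate centers of a new
# batch are clustered together with the weighted centers retained from
# previously clustered batches. weighted_kmeans stands in for method A.
import math

def weighted_kmeans(points, weights, k, iters=20):
    """Stand-in for clustering method A (weighted Lloyd-style k-means)."""
    centers = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[i].append((p, w))
        centers = [
            tuple(sum(p[d] * w for p, w in g) / sum(w for _, w in g)
                  for d in range(len(points[0]))) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def summarize(batch, k):
    """Reduce one batch to k intermediate centers and their weights."""
    centers = weighted_kmeans(batch, [1] * len(batch), k)
    weights = [0] * k
    for p in batch:
        weights[min(range(k), key=lambda j: math.dist(p, centers[j]))] += 1
    return centers, weights

# Summary retained from previously clustered data, plus a new incremental batch:
old_centers, old_w = summarize([(0, 0), (0, 1), (1, 0)], 2)
new_centers, new_w = summarize([(10, 0), (10, 1), (11, 0)], 2)
final = weighted_kmeans(old_centers + new_centers, old_w + new_w, 2)
```

Only the weighted summaries of earlier batches need to be kept, so the original points never have to be rescanned, which matches the data stream setting described in the background.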