1. Field of the Invention
The present invention relates to systems and methods of data storage and retrieval, and in particular to a method and system for clustering data in a multiprocessor system.
2. Description of the Related Art
The ability to manage massive amounts of information in large scale databases has become increasingly important in recent years. Data analysts are faced with ever larger data sets, some of which measure in gigabytes or even terabytes. One way to use such databases more efficiently is data mining. Data mining is the process of processing masses of data to uncover patterns and relationships between data entries in the database. Data mining may be accomplished manually, by slicing and dicing the data until a data pattern emerges, or it can be accomplished by data mining programs.
Clustering is a commonly used procedure in data mining algorithms. Practical applications of clustering include unsupervised classification and taxonomy generation, nearest neighbor searching, scientific discovery, vector quantization, text analysis, and navigation.
The k-means algorithm is a popular procedure for clustering data sets. This procedure assumes that the data "objects" to be clustered are available as points (or vectors) in a d-dimensional Euclidean space. The k-means algorithm seeks a minimum variance grouping of the data, that is, a partition into k clusters that minimizes the sum of squared Euclidean distances from each data point to the centroid of its cluster. The popularity of the k-means algorithm can be attributed to its relative ease of interpretation, implementation simplicity, scalability, convergence speed, adaptability to sparse data, and ease of out-of-core (out of the local memory of a single processor) implementation.
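For illustration only, the conventional serial k-means procedure described above might be sketched as follows. This is a minimal sketch of the standard iteration (assign each point to its nearest centroid, then recompute each centroid as the mean of its members), not the method of the present invention. The names `kmeans` and `squared_distance`, the random choice of initial centroids, and the `max_iters` and `seed` parameters are assumptions introduced for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two d-dimensional points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iters=100, seed=0):
    """Serial k-means sketch (hypothetical helper, not from the source).

    points: list of equal-length tuples or lists of numbers.
    Returns (centroids, assignments, sse), where sse is the sum of
    squared distances from each point to its assigned centroid.
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None

    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the iteration has converged
        assignments = new_assignments

        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[j] = [
                    sum(m[i] for m in members) / len(members)
                    for i in range(dim)
                ]

    sse = sum(squared_distance(p, centroids[a])
              for p, a in zip(points, assignments))
    return centroids, assignments, sse


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, assignments, sse = kmeans(data, k=2)
    print(centroids, assignments, sse)
```

Each pass over the data computes k distances for every point, so a serial implementation of this kind scales with the product of the number of points, the number of clusters, and the dimension d, and it holds the entire data set in one processor's memory.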
While the k-means algorithm is effective, it is no panacea for large databases such as those of text documents and customer market data, which often include millions of data points. Applying the k-means algorithm in such cases can result in unacceptably long processing times and can exhaust the memory capacity of the processor implementing the algorithm. Using non-volatile memory devices such as hard disks as virtual memory relieves the memory constraint, but at a very high cost in throughput. What is needed is a clustering algorithm, and an apparatus for implementing that algorithm, that allows for the rapid processing of large databases. The present invention satisfies that need.