As the volume of information in databases grows, there are ongoing efforts to utilize this information more effectively. One such technique is referred to as clustering or segmentation. A prevalent clustering technique is the K-Means algorithm, which divides a data set into K clusters through an iterative process. The iterative process identifies the best centroids in the data to efficiently select the K clusters.
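For illustration only, the iterative process can be sketched as follows. The sketch assumes the conventional formulation of K-Means, in which each point is assigned to its nearest centroid by squared Euclidean distance and each centroid is then recomputed as the mean of its assigned points. The function name `kmeans`, its parameters, and the stopping rule are hypothetical and are not taken from any particular implementation.

```python
def kmeans(points, seeds, max_iter=100):
    """Illustrative K-Means sketch (assumed conventional formulation).

    points: list of equal-length numeric tuples.
    seeds:  list of K initial clustering points (the starting centroids).
    Returns (centroids, assignment), where assignment[i] is the cluster
    index of points[i].
    """
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignment = kmeans(points, seeds=[(0, 0), (10, 10)])
```

In this sketch the seeds are supplied by the caller, so the outcome and the number of iterations both depend on how those initial clustering points are chosen.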
There are ongoing efforts to improve the computational efficiency of computer-implemented K-Means modules. Most of these efforts are directed toward the execution of the K-Means algorithm after an initial set of clustering points has been selected. These approaches ignore an important factor in the overall efficiency of the K-Means technique: the results of a K-Means clustering analysis frequently depend upon the choice of initial clustering points for the K clusters. A poor selection of clustering points can therefore result in excessive computations and a non-robust solution.
Some techniques use the first K data points as the initial cluster points, or “seeds”. Other algorithms choose widely spaced records in case the records have a meaningful order, e.g., record numbers int(i*n/k), where i = 1, ..., k and n is the number of data records. Both methods have drawbacks.
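The two seeding rules described above can be sketched as follows. This is a minimal illustration that assumes record numbers in the formula int(i*n/k) are 1-based, so each is shifted by one to obtain a 0-based list index, and that k does not exceed n. The function names `first_k_seed_indices` and `spaced_seed_indices` are hypothetical.

```python
def first_k_seed_indices(n, k):
    """Indices (0-based) of the first k records, used directly as seeds."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return list(range(k))


def spaced_seed_indices(n, k):
    """Indices (0-based) of widely spaced records.

    Applies record number int(i*n/k) for i = 1, ..., k, assuming the
    record numbers are 1-based; subtracting 1 converts to a list index.
    Integer floor division gives the same result as int() for
    non-negative values and avoids floating-point rounding.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return [(i * n) // k - 1 for i in range(1, k + 1)]


# With n = 10 records and k = 3 clusters:
first = first_k_seed_indices(10, 3)    # [0, 1, 2]
spaced = spaced_seed_indices(10, 3)    # [2, 5, 9]
```

Neither rule inspects the data values: the first depends entirely on which records happen to come first, and the second depends only on record position.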
Accordingly, it would be highly desirable to provide an improved technique for clustering data. More particularly, it would be highly desirable to provide an improved clustering analysis through the efficient and intelligent selection of initial clustering points.