It has generally been recognized that data clustering can be performed in parallel by more than one computer to increase the efficiency of the data clustering. Parallel execution is particularly important as data sets grow in the number of data points that need to be clustered, and in the case of naturally distributed data (e.g., data for an international company with offices located in many different countries). Unfortunately, the known data clustering techniques were developed for execution by a single processing unit. Furthermore, although there have been attempts to convert these known data clustering techniques into parallel techniques, as described in greater detail hereinbelow, these prior art approaches offer only tolerable solutions, each with its own disadvantages, and leave much to be desired.
One prior art approach proposes a parallel version of the K-Means clustering algorithm. The publication entitled, “Parallel Implementation of Vision Algorithms on Workstation Clusters,” by D. Judd, N. K. Ratha, P. K. McKinley, J. Weng, and A. K. Jain, Proceedings of the 12th International Conference on Pattern Recognition, Jerusalem, Israel, October 1994 describes parallel implementations of two computer vision algorithms on distributed cluster platforms. The publication entitled, “Performance Evaluation of Large-Scale Parallel Clustering in NOW Environments,” by D. Judd, P. K. McKinley, and A. K. Jain, Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, Minnesota, March 1997 further presents the results of a performance study of parallel data clustering on Network of Workstations (NOW) platforms.
Unfortunately, these publications do not formalize the data clustering approach. Furthermore, the procedure for K-Means is described in a cursory fashion without explanation of how the procedure operates. Also, the publications are silent about whether the distributed clustering technique can be generalized to other clustering algorithms, and if so, how the generalization can be performed. Consequently, the applicability of the Judd approach is limited to K-Means data clustering.
Another prior art approach proposes non-approximated, parallel versions of K-Means. The publication, “Parallel K-means Clustering Algorithm on NOWs,” by Sanpawat Kantabutra and Alva L. Couch, NECTEC Technical Journal, Vol. 1, No. 1, March 1999 describes an example of this approach. Unfortunately, the Kantabutra and Couch algorithm requires re-broadcasting the entire data set to all computers for each iteration. Consequently, this approach may lead to heavy congestion in the network and may impose a communication overhead or penalty. Since the trend in technology is for the speed of processors to improve faster than the speed of networks, it is desirable for a distributed clustering method to reduce the amount of data that needs to be communicated between the computers in the network.
Furthermore, the number of slave computing units in this algorithm is limited to the number of clusters to be found. Also, an analytical and empirical analysis of this approach estimates only a 50% utilization of the processors. It would be desirable to have a distributed clustering method that achieves a greater percentage of utilization of the processors.
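By way of illustration only, the communication overhead of an approach that re-broadcasts the entire data set to all computers for each iteration can be estimated with a rough calculation. The following sketch is not taken from the cited publication. The function names, the 8-byte value size, and the assumption that the data set is sent separately to each computer (rather than by network multicast) are hypothetical choices made for this example.

```python
def broadcast_bytes_per_iteration(num_points, num_dims, num_computers,
                                  bytes_per_value=8):
    """Estimate bytes sent in one iteration when the full data set is
    re-sent to every computer.

    Assumption (hypothetical): one separate copy per computer, with each
    coordinate stored as a value of `bytes_per_value` bytes.
    """
    data_set_bytes = num_points * num_dims * bytes_per_value
    return data_set_bytes * num_computers


def total_broadcast_bytes(num_points, num_dims, num_computers,
                          num_iterations, bytes_per_value=8):
    """Estimate bytes sent over a whole clustering run of `num_iterations`."""
    per_iteration = broadcast_bytes_per_iteration(
        num_points, num_dims, num_computers, bytes_per_value)
    return per_iteration * num_iterations


if __name__ == "__main__":
    # Illustrative figures only: 1,000,000 points, 10 dimensions,
    # 4 computers, 20 iterations.
    per_iter = broadcast_bytes_per_iteration(1_000_000, 10, 4)
    total = total_broadcast_bytes(1_000_000, 10, 4, 20)
    print(f"per iteration: {per_iter:,} bytes")
    print(f"whole run:     {total:,} bytes")
```

Under these assumed figures, the estimate is 320,000,000 bytes per iteration and 6,400,000,000 bytes for the run. The traffic grows in proportion to the number of data points, the number of computers, and the number of iterations, which is why reducing the amount of data communicated is desirable.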
Accordingly, there remains a need for a method and system for data clustering that can utilize more than one computing unit for concurrently processing the clustering task and that overcomes the disadvantages set forth previously.