As the world is rapidly flooded with web documents, an existing information search system returns a long list of search results in response to a user query, and the user must spend considerable time and effort organizing the many pieces of information in order to acquire useful knowledge.
Accordingly, clustering techniques have emerged as one solution. They process search results according to the user's requirements and reveal relations among the search results as well as unexpected useful knowledge. The term “clustering” refers to grouping a large amount of data into groups of similar data, so that the data is automatically classified by subject. When the user requests a search for particular information, the clustering technique allows only the documents within the cluster whose subject is closest to the request to be searched, instead of all documents. The use of the clustering technique can therefore save the time required to search for information and can improve search efficiency.
The k-means algorithm, which is the most frequently used of the clustering techniques, operates as follows. First, the user sets the number of clusters to k. The algorithm then repeats a process of determining a center for each cluster, finding the points closest to each center, and updating each center by using the found points, until the centers no longer change.
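The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not an implementation taken from this document: the function names `assign` and `kmeans`, the choice of k input points as initial centers, the Euclidean distance, and the `max_iter` and `seed` parameters are all assumptions made for the example.

```python
import math
import random


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: start from k points picked at random from the input.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Find the points that are closest to each center.
        labels = assign(points, centers)
        # Renew each center as the mean of the points found for it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # the centers no longer change
        centers = new_centers
    return centers, assign(points, centers)
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` returns two centers and one label per point. In general the result depends on the initial centers, so different seeds may yield different clusterings on less clearly separated data.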
FIG. 1 illustrates two-dimensional data on X and Y coordinates. It can be intuitively determined that the data illustrated in FIG. 1 is ideally divided into three clusters having c1, c2 and c3 as their centers.
However, actual data typically has three or more dimensions, and a clustering result can seldom be determined intuitively as in the graph illustrated in FIG. 1. For the many clustering results that cannot be assessed intuitively, a silhouette coefficient can serve as an index for verifying the significance of a clustering result.
Calculating the silhouette coefficient requires the distance between target data and every other piece of data. Accordingly, when the number of data items is n, (n−1) distance calculations are required per data item. Because distances are symmetric, a total of n(n−1)/2 calculations are required. In other words, the computational complexity is proportional to the square of the data size.
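The quadratic cost can be made concrete with a short Python sketch. The text does not spell out the formula, so the sketch assumes the standard definition of the silhouette coefficient: for each data item, a is the average distance to the other members of its own cluster, b is the smallest average distance to any other cluster, and the item's score is (b − a) / max(a, b). The function name `silhouette`, the Euclidean distance, and the score of 0 for single-member clusters are likewise assumptions. The sketch expects at least two clusters and data items that are not all identical.

```python
import math
from itertools import combinations


def silhouette(points, labels):
    """Return (mean silhouette score, number of distance calculations)."""
    n = len(points)
    clusters = sorted(set(labels))
    sizes = {c: labels.count(c) for c in clusters}
    # dist_sum[i][c]: summed distance from item i to the other items in cluster c.
    dist_sum = [dict.fromkeys(clusters, 0.0) for _ in range(n)]
    n_distances = 0
    # Distances are symmetric, so each unordered pair is computed only once.
    for i, j in combinations(range(n), 2):
        d = math.dist(points[i], points[j])
        dist_sum[i][labels[j]] += d
        dist_sum[j][labels[i]] += d
        n_distances += 1
    scores = []
    for i in range(n):
        own = labels[i]
        if sizes[own] == 1:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        # a: average distance to the other members of the item's own cluster.
        a = dist_sum[i][own] / (sizes[own] - 1)
        # b: smallest average distance to a cluster the item does not belong to.
        b = min(dist_sum[i][c] / sizes[c] for c in clusters if c != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / n, n_distances
```

With four data items the second return value is 4·3/2 = 6, and it grows with the square of n, which is the scaling problem the passage describes. The sketch also holds a per-item table of summed distances in memory, so it is meant only for small inputs.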
Accordingly, in the case of massive data, the number of calculations becomes very large, and thus a typical methodology is inappropriate for verifying the significance of a clustering result.
In addition, for each piece of target data, an average distance must be calculated to the cluster to which the target data belongs and to each cluster to which it does not belong, so that as many average calculations are required as there are clusters. Moreover, when massive data has been clustered, a single computer cannot practically perform the task, because all of the data cannot be loaded into the memory of that one computer.