1. Field of this Invention
The present invention relates to processing including document and pattern clustering.
2. Description of the Related Art
Document and pattern clustering are techniques for dividing an input document or pattern set into groups according to the content or topics of the documents or patterns. Clustering techniques have been studied for a long time, and the methods devised so far are systematically introduced in “Foundations of Statistical Natural Language Processing” (The MIT Press, 1999) by C. D. Manning and H. Schütze. There are two clustering approaches. One, termed soft clustering, obtains the probability that each document or pattern belongs to each cluster. The other, termed hard clustering, determines whether or not each document or pattern belongs to each cluster. Hard clustering is further divided into hierarchical and non-hierarchical approaches.
The hierarchical approach is further divided into bottom-up and top-down approaches. In the initial state of the bottom-up approach, each document or pattern becomes the seed of a cluster, and the processing of merging the closest clusters is repeated. As a result of this merging, the document or pattern set is expressed in a tree structure. Known methods of measuring the degree of closeness (i.e., similarity) between clusters are the single link method, the complete link method, and the group average method. In each of these methods, the calculation is based on the similarity between two documents or patterns. In the top-down approach, processing starts from an initial state where all documents or patterns are in one cluster, and division of clusters is repeated. For example, if the lowest similarity among all document or pattern pairs in one cluster is less than a threshold value, the cluster is divided.
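The three closeness measures named above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes cosine similarity as the pairwise measure, takes the group average as the mean over cross-cluster pairs (other definitions average over all pairs in the merged cluster), and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pair_similarities(cluster_a, cluster_b):
    """Similarities of all pairs with one member from each cluster."""
    return [cosine(a, b) for a in cluster_a for b in cluster_b]

def single_link(cluster_a, cluster_b):
    # Similarity of the most similar cross-cluster pair.
    return max(pair_similarities(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    # Similarity of the least similar cross-cluster pair.
    return min(pair_similarities(cluster_a, cluster_b))

def group_average(cluster_a, cluster_b):
    # Mean similarity over cross-cluster pairs (an assumed convention).
    sims = pair_similarities(cluster_a, cluster_b)
    return sum(sims) / len(sims)

# Two toy clusters of 2-dimensional vectors.
cluster_a = [[1, 0], [1, 1]]
cluster_b = [[0, 1]]
print(single_link(cluster_a, cluster_b))
print(complete_link(cluster_a, cluster_b))
print(group_average(cluster_a, cluster_b))
```

For a given pair of clusters the complete link value never exceeds the group average, which in turn never exceeds the single link value, so the choice of measure affects which clusters are regarded as closest.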
In the non-hierarchical method, a previously determined number of clusters is constructed so as to satisfy some criterion. Typical processing steps in the non-hierarchical method are:
step 1: randomly select a specified number of the documents or patterns to be clustered and make them the centers of the respective clusters,
step 2: determine the distance between each document or pattern and the center of each cluster and make each document or pattern belong to the cluster closest to it,
step 3: determine the center of each cluster by averaging the document or pattern vectors belonging to each cluster, and
step 4: perform the processing of step 2; if the cluster to which each document or pattern belongs has not changed, end the procedure; if there has been a change, return to step 3.
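Steps 1 to 4 above can be sketched as a short loop. This is only an illustrative sketch under stated assumptions: squared Euclidean distance is used for step 2, a cluster that loses all its members keeps its previous center, and the names `kmeans`, `sq_dist`, `seed`, and `max_iter` are hypothetical and not taken from the text.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly pick k input vectors as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each vector to the cluster with the closest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(v, centers[c]))
            for v in vectors
        ]
        # Step 4: stop when no vector has changed its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its member vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Toy data: two well-separated groups of 2-dimensional vectors.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centers = kmeans(vectors, k=2)
print(assignment)
print(centers)
```

Note that `k` must be supplied by the caller, which is exactly the requirement that the number of clusters be determined in advance.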
The conventional document and pattern clustering techniques have three serious problems. The first problem concerns the number of clusters to be obtained. In document or pattern clustering, the number of clusters to be obtained must be the same as the number of topics stated in the documents or patterns of an input document or pattern set. As described above, in bottom-up hierarchical clustering processing, each cluster starts from a state including one document or pattern, and merging of the closest clusters is repeated until all documents or patterns are finally in one cluster. Accordingly, in order to obtain the same number of clusters as the number of topics, it is necessary to stop cluster merging partway. This can be realized by not merging cluster pairs having a similarity lower than a threshold value. However, it is difficult to determine the threshold value. If the threshold value is inadequate, the correct number of clusters cannot be obtained. Similarly, in top-down clustering processing, if a cluster is not divided when the lowest similarity among all document or pattern pairs in the cluster is higher than a threshold value, the same number of clusters as the number of topics should, in principle, be obtained. In this case, it is also difficult to determine the threshold value.
In non-hierarchical clustering, moreover, the user is required to input in advance the number of clusters into which a given document set is to be divided. However, it is impossible to input the number of clusters accurately without prior knowledge of the input document or pattern set. As stated above, obtaining the correct number of clusters from the input document or pattern set is a difficult problem. Although performance has been improved by the attempt of Liu et al. to correctly infer the number of clusters in non-hierarchical clustering, the result is not perfect (X. Liu, Y. Gong, W. Xu and S. Zhu, Document Clustering with Cluster Refinement and Model Selection Capabilities; Proceedings of the 25th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 191-198. Tampere, Finland, August, 2002).
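The sensitivity of the cluster count to the merging threshold can be illustrated with a small sketch of bottom-up clustering that stops when the closest pair of clusters falls below a threshold. The sketch assumes single link cosine similarity and toy 2-dimensional vectors; the function names and threshold values are hypothetical choices made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def agglomerate(docs, threshold):
    """Bottom-up clustering; returns clusters as lists of document indices."""
    clusters = [[i] for i in range(len(docs))]  # each document seeds a cluster
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under single link similarity.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(cosine(docs[a], docs[b])
                          for a in clusters[i] for b in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < threshold:
            break  # do not merge pairs whose similarity is below the threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

docs = [[1, 0], [0.9, 0.1], [0.6, 0.4], [0, 1], [0.1, 0.9]]
for threshold in (0.95, 0.7, 0.5):
    print(threshold, len(agglomerate(docs, threshold)))
```

The same five vectors yield a different number of clusters for each of the three thresholds, so the result depends entirely on a value that is hard to choose without knowing the data.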
The second problem is clustering accuracy, that is, whether the documents or patterns belonging to the same cluster describe the same topic or object. In clustering processing, a document is generally expressed by a vector. Each vector component depends on the presence of a term in the document or on the occurrence frequency of the term. The similarity between two clusters is determined on the basis of (1) the cosine similarity between two vectors of documents belonging to different clusters, and (2) the distance between a certain document and a cluster. The distance between a document or pattern and a cluster is determined by the distance (for example, the Euclidean distance) between the vector of the document or pattern and the average vector of the documents or patterns in the cluster. In conventional clustering processing, when the cosine similarity or the Euclidean distance is obtained, the vector obtained for each document or pattern is usually used as it is, without verifying which terms are important for the cluster. Thus, the existence of a term or object feature, or of a pair of terms or object features, that is not essential to a cluster can affect the accuracy of the clustering.
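The vector representation and the two measures mentioned above can be sketched as follows. The sketch assumes raw term-frequency vectors over a whitespace-tokenized vocabulary, and the two toy documents and all function names are hypothetical. It also shows how a term that is not essential to either topic (here "the") can dominate the cosine similarity.

```python
import math
from collections import Counter

def tf_vector(text, vocabulary):
    """Term-frequency vector of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Average vector of the documents in a cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

doc_a = "the the the war"
doc_b = "the the the football"

# Vectors over all terms, including the non-essential term "the".
full_vocab = ["football", "the", "war"]
a_full = tf_vector(doc_a, full_vocab)
b_full = tf_vector(doc_b, full_vocab)
print(cosine_similarity(a_full, b_full))  # high, driven by "the"

# Vectors over topic-bearing terms only.
topic_vocab = ["football", "war"]
a_topic = tf_vector(doc_a, topic_vocab)
b_topic = tf_vector(doc_b, topic_vocab)
print(cosine_similarity(a_topic, b_topic))  # no shared topic terms

# Distance between a document and a cluster via the cluster's average vector.
print(euclidean(a_full, centroid([a_full, b_full])))
```

With the full vocabulary the two documents look nearly identical even though they share no topic term, which is the kind of accuracy loss described above.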
The third problem is how to extract the hierarchy of a topic or object. Usually, there is a hierarchy in a topic or object. For example, consider the topic “Iraq war”. A subtopic in related news articles may be any one of: “Iraq war”, “Saddam Hussein”, “Inspection of weapons of mass destruction by United Nations”, “Opinion of President Bush” or “Opposition of France, Germany and Russia.” Consider the clustering results for such news articles. A user who wants to know about “Iraq war” would want to be shown the document groups corresponding to each subtopic, obtained by sub-clustering; the user would usually not want to be shown the original clustering results. Since, as mentioned above, it is difficult to determine exactly the clusters corresponding to individual topics and to assign each document to the correct cluster, sub-clustering remains a difficult problem.