A document clustering system takes a set of documents as input and sorts similar documents into the same group.
Non-patent literature 1 discloses an example of a document clustering system. The document clustering system disclosed in non-patent literature 1 treats a document as a collection of words, and expresses that document as a vector having each word as an element. When finding a similarity between two documents, such as the cosine similarity, this system finds the similarity based on the distance between the two documents as expressed in the vector space.
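The vector representation described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw word counts as the vector elements; the function names and sample sentences are hypothetical and are not taken from non-patent literature 1.

```python
import math
from collections import Counter


def term_frequency_vector(text):
    """Represent a document as a bag of words: word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * vec_b[word] for word, count in vec_a.items() if word in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc_a = term_frequency_vector("the cat sat on the mat")
doc_b = term_frequency_vector("the cat lay on the rug")
doc_c = term_frequency_vector("stock prices fell sharply today")

print(cosine_similarity(doc_a, doc_b))  # documents sharing words score higher
print(cosine_similarity(doc_a, doc_c))  # documents sharing no words score 0.0
```

Because the two documents about a cat share several words, their vectors point in similar directions and the cosine similarity is high, whereas the third document shares no words with the first and scores zero.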
Here, the value of a vector element is a statistical quantity such as the frequency with which a word appears in each document, or the TF-IDF (Term Frequency-Inverse Document Frequency) value that is based on that frequency of appearance. In the document clustering system disclosed in non-patent literature 1, after the similarity is found, the documents are grouped by a method such as k-means or hierarchical clustering. As a result, documents in which similar words appear form one group. Hereafter, a group of similar documents will be called a cluster.
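The weighting and grouping steps described above can be sketched as follows. This is only an illustration under stated assumptions: the TF-IDF weight is taken as count × log(N / document frequency), which is one common variant; the grouping is a k-means-style loop that uses cosine similarity rather than Euclidean distance; and the initial centroids are chosen by hand through the hypothetical parameter `seed_indices`. None of these choices is confirmed by non-patent literature 1.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {word: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of sparse vectors; used as the cluster centroid."""
    total = Counter()
    for vec in vectors:
        for w, v in vec.items():
            total[w] += v
    return {w: v / len(vectors) for w, v in total.items()}


def kmeans(vectors, seed_indices, iterations=10):
    """Assign each vector to its most similar centroid, then recompute centroids."""
    centroids = [dict(vectors[i]) for i in seed_indices]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
            for vec in vectors
        ]
        for k in range(len(centroids)):
            members = [vec for vec, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = mean_vector(members)
    return labels


# Hypothetical sample documents: two about pets, two about finance.
documents = [
    "cat dog pet animal",
    "dog pet animal vet",
    "stock market price trade",
    "market price trade bank",
]
labels = kmeans(tfidf_vectors(documents), seed_indices=[0, 2])
print(labels)
```

In this sketch the documents that share words end up with the same label, which corresponds to forming one cluster per group of similar documents.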
On the other hand, there is an ontology mapping system that takes as input two concept tree structures, each indicating the hierarchical relationship among a plurality of words, and finds the correspondence between them. Non-patent literature 2 discloses an example of an ontology mapping system. An ontology mapping system is a system that finds what kind of correspondence exists between two different concept tree structures. The similarity between concept tree structures is an index based on the similarity of character strings, or an index based on knowledge sources that use a concept tree structure graph.
Non-patent literature 2 discloses four methods that are based on knowledge sources that use concept tree structure graphs. The methods disclosed in non-patent literature 2 are: (1) a method that uses synonyms; (2) the method of Wu & Palmer et al.; (3) a method that uses explanations; and (4) the method of Lin et al. The method (1) that uses synonyms finds the similarity by using the lengths of the paths of two concepts in the concept tree structure. The method (2) of Wu & Palmer et al. finds the similarity according to the equation below, based on the depth and the least common superconcept (LCS).
Similarity(W1, W2) = 2 × depth(LCS) / (depth(W1) + depth(W2))
Here, W1 and W2 represent words, depth is the depth of a word in the concept tree structure, and LCS is the least common superconcept of W1 and W2.
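The equation above can be worked through with a small sketch. The concept tree, stored as a hypothetical child-to-parent mapping named `PARENT`, is invented for illustration, and the sketch assumes that the root has depth 1; non-patent literature 2 may use a different depth convention.

```python
# Hypothetical concept tree: each word maps to its parent concept (None for the root).
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}


def path_to_root(word, parent):
    """List of concepts from `word` up to the root, inclusive."""
    path = []
    while word is not None:
        path.append(word)
        word = parent[word]
    return path


def depth(word, parent):
    """Depth of a word in the tree; the root is assumed to have depth 1."""
    return len(path_to_root(word, parent))


def least_common_superconcept(w1, w2, parent):
    """Deepest concept that is an ancestor of (or equal to) both words."""
    ancestors = set(path_to_root(w1, parent))
    for node in path_to_root(w2, parent):  # walks upward, so the first hit is the deepest
        if node in ancestors:
            return node
    return None


def wu_palmer_similarity(w1, w2, parent):
    """Similarity(W1, W2) = 2 * depth(LCS) / (depth(W1) + depth(W2))."""
    lcs = least_common_superconcept(w1, w2, parent)
    if lcs is None:
        return 0.0
    return 2 * depth(lcs, parent) / (depth(w1, parent) + depth(w2, parent))


print(wu_palmer_similarity("dog", "cat", PARENT))      # LCS is "mammal" (depth 2)
print(wu_palmer_similarity("dog", "sparrow", PARENT))  # LCS is "animal" (depth 1)
```

With this tree, "dog" and "cat" both have depth 3 and share the superconcept "mammal" at depth 2, giving 2 × 2 / (3 + 3). "dog" and "sparrow" share only the root, giving 2 × 1 / (3 + 3), so the pair with the deeper common superconcept is judged more similar.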
The method (3) that uses explanations presumes that a written explanation is assigned to each word in the concept tree structure, and finds the similarity by using those written explanations. The similarity is found based on the square of the length of the words that are common to the written explanations of the two words. The method (4) of Lin et al. uses an equation similar to that of the method of Wu & Palmer et al., but uses the amount of information instead of the depth of the words in the concept tree structure.
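As a rough illustration of method (4), the sketch below replaces depth with the amount of information, taken here as log(1 / p) where p is the probability of encountering a concept. The tree `PARENT` and the values in `PROBABILITY` are hypothetical and chosen only for the example; how the probabilities are estimated in non-patent literature 2 is not stated in this text.

```python
import math

# Hypothetical concept tree (child -> parent) and hypothetical concept probabilities.
# A concept's probability is assumed to include all of its descendants,
# so the root has probability 1.0 and carries no information.
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}
PROBABILITY = {
    "animal": 1.0,
    "mammal": 0.7,
    "bird": 0.25,
    "dog": 0.3,
    "cat": 0.3,
    "sparrow": 0.2,
}


def path_to_root(word):
    """List of concepts from `word` up to the root, inclusive."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def least_common_superconcept(w1, w2):
    """Deepest concept that is an ancestor of (or equal to) both words."""
    ancestors = set(path_to_root(w1))
    for node in path_to_root(w2):
        if node in ancestors:
            return node
    return None


def information_content(word):
    """Amount of information of a concept: log(1 / p)."""
    return math.log(1.0 / PROBABILITY[word])


def lin_similarity(w1, w2):
    """Same shape as the Wu & Palmer equation, with information content in place of depth."""
    lcs = least_common_superconcept(w1, w2)
    denominator = information_content(w1) + information_content(w2)
    if lcs is None or denominator == 0:
        return 0.0  # no shared concept, or neither word carries information
    return 2 * information_content(lcs) / denominator


print(lin_similarity("dog", "cat"))      # shared superconcept "mammal" is informative
print(lin_similarity("dog", "sparrow"))  # shared superconcept is the root: 0.0
```

Compared with the depth-based equation, a common superconcept that is rare (low probability) contributes more to the similarity than one that covers almost every word.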
Furthermore, non-patent literature 3 discloses a technique of performing clustering by assigning constraints to pairs of documents that are targets of clustering. In the constrained clustering disclosed in non-patent literature 3, clustering is performed so that the assigned constraints are satisfied in addition to optimizing an objective function corresponding to similarity.
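One way to picture clustering with pairwise constraints is the assignment step sketched below. It assumes two kinds of constraints, "must-link" (two documents must share a cluster) and "cannot-link" (two documents must not share a cluster), and a greedy, in-order assignment to the most similar cluster that violates no constraint. These constraint types, the function names, and the similarity scores are assumptions made for illustration; the text does not state which constraints or which algorithm non-patent literature 3 uses.

```python
def violates(doc, cluster, labels, must_link, cannot_link):
    """True if placing `doc` in `cluster` breaks a constraint with an already-assigned document."""
    for a, b in must_link:
        other = b if a == doc else a if b == doc else None
        if other is not None and other in labels and labels[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == doc else a if b == doc else None
        if other is not None and labels.get(other) == cluster:
            return True
    return False


def constrained_assign(similarity, must_link=(), cannot_link=()):
    """similarity[doc][cluster] is a score; assign each document, in order,
    to the highest-scoring cluster that satisfies every constraint."""
    labels = {}
    for doc, scores in enumerate(similarity):
        ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
        for cluster in ranked:
            if not violates(doc, cluster, labels, must_link, cannot_link):
                labels[doc] = cluster
                break
        else:
            raise ValueError(f"no cluster satisfies the constraints for document {doc}")
    return [labels[d] for d in range(len(similarity))]


# Hypothetical similarity of four documents to two cluster centroids.
similarity = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.7],
    [0.6, 0.4],
]
print(constrained_assign(similarity))                         # similarity only
print(constrained_assign(similarity, cannot_link=[(0, 1)]))   # documents 0 and 1 kept apart
print(constrained_assign(similarity, must_link=[(2, 3)]))     # documents 2 and 3 kept together
```

Without constraints each document simply goes to its most similar cluster. With a constraint, a document may be placed in a less similar cluster so that the constraint is met, which is the trade-off between the similarity objective and the assigned constraints described above.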
In addition, patent literature 1 discloses a multi-dimensional space modeling apparatus that sorts documents obtained by a search. The multi-dimensional space modeling apparatus of patent literature 1 sorts a large quantity of technical documents into several clusters in a multi-dimensional space, and creates a cluster map by placing these clusters on a two-dimensional plane. With the multi-dimensional space modeling apparatus disclosed in patent literature 1, the closer the distance between clusters, the higher the error precision of the obtained cluster map, and the relationship between similar clusters can be grasped visually.