1. Field of the Invention
The present invention relates to natural language processing, including document summarization. More particularly, the present invention makes it possible to quantitatively evaluate the commonality of topics among a large number of documents, thereby enhancing the processing performance.
2. Description of the Related Art
When a document set consisting of a plurality of documents is provided, the quantitative evaluation of a topical commonality for the document set necessitates the following techniques:
(A) The degrees to which the topics of the individual documents are common are indicated by numerical values so that whether or not a common topic exists in the document set can be judged.
(B) The individual documents or individual sentences are scored in accordance with the degrees of closeness to a common topic so that the documents or sentences containing topics close to the common topic can be selected from within the document set, to thereby discern the common topics among all the documents.
(C) Even when a topic is not common to all the documents, any group of documents whose topics are common is extracted.
Regarding item (A) of these techniques, in the case of two documents, the score of the commonality of topics can be regarded as the similarity between the two documents, and various similarity measures have heretofore been proposed. The most typical measure is the cosine similarity, wherein each document is represented by a vector whose components are the frequencies of the individual terms occurring in the document, and the similarity between the two documents is defined as the cosine of the angle formed by the vectors of the respective documents.
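By way of illustration only, the cosine similarity described above may be sketched as follows. The whitespace tokenization, the lowercasing, and the function names term_frequencies and cosine_similarity are assumptions introduced for this sketch and are not taken from the prior art being described.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count term occurrences (assumption: lowercase, whitespace-separated terms)."""
    return Counter(text.lower().split())


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    tf_a = term_frequencies(doc_a)
    tf_b = term_frequencies(doc_b)
    # Only terms present in both documents contribute to the inner product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty document has no direction; treat the similarity as zero.
        return 0.0
    return dot / (norm_a * norm_b)
```

Under these assumptions, two documents with identical term frequencies yield a value of approximately 1, and two documents sharing no terms yield 0. The function accepts exactly two documents, which reflects the limitation of such a measure to document pairs.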
Items (B) and (C) are techniques which pertain to the extraction of common topics from within a document set. Such processing is important in multi-document summarization, TDT (Topic Detection and Tracking), etc. Heretofore, the extraction of the common topics has been implemented by clustering documents and thereafter selecting sentences or document titles that can typify the respective clusters. Recently, there has also been proposed a method in which common topics are extracted by forming clusters in sentence or passage units and selecting the important passages of the respective clusters. Clustering has heretofore been a technique that is indispensable to the extraction of the common topics. This clustering is broadly classified into a hierarchical technique and a non-hierarchical technique.
The hierarchical technique is subclassified into a bottom-up approach and a top-down approach. In the bottom-up approach, the individual documents are set as the seeds of clusters in an initial state, the closest clusters are then merged, and the process is iterated until the number of clusters becomes equal to 1 (one). Thus, a document set comes to be represented by a tree structure. The top-down approach starts from a state where all documents belong to a single cluster, and iterates a process in which a cluster is divided when the lowest similarity among all document pairs within that cluster is less than a threshold.
In the non-hierarchical technique, a predesignated number of clusters is created so as to satisfy some criterion. A well-known method includes step 1 at which as many documents as the designated number of clusters are selected at random and are set as the centers of the respective clusters, step 2 at which the degrees of closeness to the respective cluster centers are evaluated for every document, whereupon the respective documents are caused to belong to the closest clusters, step 3 at which the center of each of the resulting clusters is found on the basis of the average of the vectors of the documents belonging to the corresponding cluster, and step 4 at which the processing of the step 2 is executed again, and the routine is ended if the clusters to which the respective documents belong have not changed, or the routine is returned to the step 3 if they have changed.
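By way of illustration only, steps 1 through 4 of the non-hierarchical method described above may be sketched as follows. The use of squared Euclidean distance as the degree of closeness, the fixed random seed, the iteration limit, the handling of a cluster that becomes empty, and all function names are assumptions introduced for this sketch.

```python
import random


def squared_distance(u, v):
    """Closeness measure (assumption: squared Euclidean distance; smaller is closer)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(vectors, centers):
    """Step 2: cause each document vector to belong to its closest cluster center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(v, centers[k]))
        for v in vectors
    ]


def mean_vector(members):
    """Average of the vectors of the documents belonging to one cluster."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]


def cluster_documents(vectors, num_clusters, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: select as many documents as there are clusters, at random, as centers.
    centers = [list(v) for v in rng.sample(vectors, num_clusters)]
    # Step 2: initial assignment of every document to the closest center.
    labels = assign(vectors, centers)
    for _ in range(max_iter):
        # Step 3: recompute each center as the average of its member vectors.
        for k in range(num_clusters):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # assumption: an emptied cluster keeps its previous center
                centers[k] = mean_vector(members)
        # Step 4: repeat step 2; end if no document changed cluster, else loop.
        new_labels = assign(vectors, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers
```

Note that num_clusters must be supplied by the caller, so the number of clusters has to be designated before the routine can run.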
Regarding the technique (A), the similarity stated above is defined for two documents, and no corresponding measure has been known for a case of three or more documents. Therefore, when a group of three documents stating similar topics coexists with a group of four such documents, it has been impossible to answer a question such as “Which of the groups has the more closely matching content?” The present invention provides a measure that can answer even such a question.
In the extraction of the common topics in the techniques (B) and (C), the bottom-up hierarchical clustering process cannot guarantee that the clusters at each level are meaningful. To obtain a meaningful grouping, only pairs of clusters whose similarity exceeds a threshold can be merged, but how to determine the threshold is problematic. Also in the case of the top-down hierarchical clustering process, how to determine the threshold for deciding whether or not a cluster is divided is problematic. Moreover, the problem of processing complexity cannot be overlooked in the hierarchical technique. In the non-hierarchical technique, it is necessary to know in advance how many clusters a given document set includes. However, such prior knowledge is generally unobtainable, and it is difficult to accurately designate the number of clusters. In this manner, the clustering technique itself has not been completely established. Accordingly, even when the extraction of the common topics has been implemented using the prior-art clustering technique, the result is not guaranteed to be optimal. For such reasons, the present invention provides a common-topic extraction method that does not resort to the prior-art clustering technique.