1. Field of the Invention
The present invention relates to natural language processing, including document summarization. More particularly, the present invention relates to quantitatively evaluating the degree of distinctiveness of a constituent element (such as a sentence, term or phrase) of one of two documents or document sets that have been compared, thereby enhancing the performance of the natural language processing.
2. Description of the Related Art
A process in which two documents or document sets are compared so as to extract the parts that differ between them is important in multi-document summarization. In the following discussion, the document from which the different parts are extracted shall be called the “target document”, while the other document with which the target document is compared shall be called the “comparison document”. It has heretofore been a common practice to divide both the target document and the comparison document into small elements, to collate the resulting elements, and to identify the elements having no correspondence as the different parts. An element can be a sentence, a paragraph, or each individual domain obtained when the document has been divided at automatically extracted topic change points. Vector space models are often employed for the collation of the elements. In a case where each element is represented by a vector space model, the components of the vector correspond to the individual terms occurring in the document, and the value of each component is the frequency of the corresponding term in the element or a quantity associated with that frequency.
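As a minimal sketch of the vector space representation described above, the following Python snippet builds a term-frequency vector for each element. The tokenizer (lowercasing, whitespace splitting, punctuation stripping) and the names `tokenize` and `term_frequency_vector` are illustrative assumptions and are not specified by the prior-art methods discussed here.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    # Assumption: a term is a lowercased, whitespace-delimited word
    # with surrounding punctuation removed.
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]

def term_frequency_vector(element, vocabulary):
    # One component per vocabulary term; the value is the number of
    # times that term occurs in the element.
    counts = Counter(tokenize(element))
    return [counts[term] for term in vocabulary]

# Hypothetical example elements (e.g. sentences of one document).
elements = ["The cat sat on the mat.", "A dog sat on the log."]

# The vocabulary is the sorted set of all terms occurring in the document.
vocabulary = sorted({term for e in elements for term in tokenize(e)})
vectors = [term_frequency_vector(e, vocabulary) for e in elements]
```

Raw counts are used here for simplicity; a weighted quantity derived from the frequency could be substituted in `term_frequency_vector` without changing the rest of the sketch.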
The cosine similarity between the vectors can be employed for judging whether the correspondence between the elements is good or bad. The elements are judged to correspond to each other when the cosine similarity is higher than a predetermined threshold. Accordingly, an element of the target document whose similarities to all of the elements of the comparison document are less than the threshold is regarded as the different part. In another known method, after both documents have been represented by graphs, the corresponding relationships of graph elements are found so as to obtain the different parts from the graph elements having no correspondence.
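The cosine-threshold collation described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the whitespace tokenizer, the function names `tf`, `cosine` and `different_parts`, and the default threshold of 0.5 are all hypothetical choices made for the example.

```python
import math
from collections import Counter

def tf(text):
    # Assumption: terms are lowercased, whitespace-delimited words.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def different_parts(target_elements, comparison_elements, threshold=0.5):
    # An element of the target document is regarded as a different part
    # when its similarity to every comparison element is below the threshold.
    comparison_vectors = [tf(e) for e in comparison_elements]
    result = []
    for element in target_elements:
        vector = tf(element)
        if all(cosine(vector, c) < threshold for c in comparison_vectors):
            result.append(element)
    return result

# Hypothetical example documents, already divided into sentences.
target = ["the cat sat on the mat", "stock prices fell sharply today"]
comparison = ["the cat sat on a mat", "the weather was sunny"]
diff = different_parts(target, comparison)
```

In this example only the second target sentence shares no terms with the comparison document, so it alone is returned. Note that the sketch says nothing about how important that sentence is within the target document, which is the limitation discussed next.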
There are two techniques for the extraction of the different parts:
(A) Extracting any part in which the expressed information differs.
(B) Extracting any part that reflects a difference between the concepts expressed by the two documents.
Many prior-art methods of multi-document summarization are based on technique (A). The different parts between the two documents are extracted, but the importance of each different part within the target document is not evaluated. Consequently, a part that is not very important as information can be extracted as a different part merely because it differs from the comparison document. Based on technique (B), the present invention makes it possible to extract any different part that satisfies the following conditions:
The different part extracted from the target document is also an important part in the target document. That is, difference and importance are balanced. A different part satisfying this condition is more appropriately described as a “distinctive part” of the target document than as merely a different part, and shall hereinafter be called the “distinctive part”.
An evaluation value indicating the degree of distinctiveness can be calculated for each sentence of the target document.
An evaluation value indicating the degree of distinctiveness can be calculated for each term or term series in the extracted distinctive part, so as to identify which term or term series is a main factor.