The present invention relates to a text comparison apparatus for calculating the similarity and the discrepancy between a plurality of texts, such as patent documents.
Conventionally, the similarity between two documents is calculated by using keywords entered by a user. In contrast, JP H11-73422A (title: “Similar document retrieval system and storage medium used for same”) is an example of a system that calculates the similarity of two texts without keywords given by a user. This system has an internal index, and when a text is entered, words are extracted from the entered text to update the index. The index holds information about the frequency of the registered words. The significance of the word with the highest frequency is set to “1,” and the significance of every other word is defined as a proportion relative to that highest frequency. The similarity of two texts is calculated using the significance of the n words with the highest significance in the text serving as the reference in the comparison, wherein n is an integer that can be specified by a system parameter when calculating the similarity. In the similarity calculation, the denominator is the sum of the significances of the n words in the reference text, and the numerator is the sum, over the same n words, of the smaller of the two significance values that each word has in the two texts.
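The calculation described above can be illustrated with a minimal sketch. It is not taken from JP H11-73422A; the function names `significance` and `similarity` are hypothetical, tokenization is assumed to have been done already, and significance is assumed to be computed per text as a word's frequency divided by the highest frequency in that text.

```python
from collections import Counter


def significance(words):
    """Map each word to its frequency divided by the highest frequency.

    Assumption: frequencies are counted within a single text.
    """
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}


def similarity(reference_words, other_words, n):
    """Similarity of other_words to reference_words over n reference words."""
    ref_sig = significance(reference_words)
    other_sig = significance(other_words)
    # The n words with the highest significance in the reference text.
    top_words = sorted(ref_sig, key=ref_sig.get, reverse=True)[:n]
    # Denominator: sum of the reference significances of those n words.
    denominator = sum(ref_sig[w] for w in top_words)
    if denominator == 0:
        return 0.0
    # Numerator: for each word, the smaller of its two significance values.
    numerator = sum(min(ref_sig[w], other_sig.get(w, 0.0)) for w in top_words)
    return numerator / denominator


reference = "laser laser diode diode lens".split()
other = "laser diode diode mirror".split()
print(similarity(reference, other, n=2))
```

Under these assumptions, a text compared with itself yields a similarity of 1, and a word absent from the second text contributes nothing to the numerator.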
With this conventional system, if the word with the highest significance is an extremely frequent word that contributes nothing characteristic to the text comparison, then the significance values of the other n−1 words decrease, and the similarity may be judged to be low. Furthermore, this conventional system does not support multiple languages.
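The effect of such a dominant word can be seen in a small numeric sketch. The helper `top_n_significance` and the sample words are hypothetical, and significance is again assumed to be a word's frequency divided by the highest frequency in the text.

```python
from collections import Counter


def top_n_significance(words, n):
    """Significance of the n most frequent words, relative to the top frequency."""
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.most_common(n)}


# A text in which an uninformative word ("the") dominates the frequencies.
with_filler = ["the"] * 20 + ["laser"] * 2 + ["diode"]
# The same content words without the dominant word.
without_filler = ["laser"] * 2 + ["diode"]

print(top_n_significance(with_filler, 3))
print(top_n_significance(without_filler, 3))
```

In the first text the content words receive significances of about 0.1 and 0.05, whereas in the second they receive 1 and 0.5. The dominant word therefore accounts for almost all of the denominator, and the outcome of the comparison depends mainly on how that single word behaves in the other text.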