1. Field of the Invention
The present invention relates to a method for retrieving similar documents of a reference document from a plurality of retrieval target documents, and a recommended article notification service system utilizing the similar document retrieval method.
2. Description of the Background Art
The well-known retrieval models for retrieving similar documents include a vector space model such as tf·idf, and a probabilistic model in which the similarity with respect to a retrieval requested document is expressed by a ratio of a relevant document probability and a non-relevant document probability with respect to a retrieval request. An example of the probabilistic model is disclosed in Iwayama et al.: "Hierarchical Bayesian Clustering for Automatic Text Classification", Proceedings of IJCAI-95, pp. 1322-1327, 1995. When the vector space model and the probabilistic model are compared, the probabilistic model gives a clearer meaning to the value of the similarity (distance), and it is expected to have a superior precision at a time of clustering, as shown by Iwayama et al. mentioned above. For these reasons the probabilistic model is considered superior.
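By way of illustration only, the vector space model mentioned above can be sketched as follows. This is a minimal sketch and not the method of any cited reference: it assumes whitespace tokenization, the common weighting tf × log(N/df), and cosine similarity between vectors. The function names `tfidf_vectors` and `cosine` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents (hypothetical); the first one plays the role of the reference document.
docs = [
    "document retrieval by similarity".split(),
    "similarity based document retrieval system".split(),
    "semiconductor wafer polishing apparatus".split(),
]
vecs = tfidf_vectors(docs)
scores = [cosine(vecs[0], v) for v in vecs[1:]]
```

In this sketch the second document shares terms with the reference and receives a positive score, while the third shares none and scores zero. A probabilistic model would replace the cosine score with a ratio of relevance probabilities, which is not shown here.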
FIG. 1 is a graph showing a distribution of similar documents and dissimilar documents in the case where the similarities of a plurality of target documents are calculated in order to retrieve similar documents with respect to a given reference document, using the probabilistic model of Iwayama et al. In FIG. 1, the horizontal axis represents the similarity while the vertical axis represents a relative frequency. Black rectangle marks plot the similarities of similar documents, while white rectangle marks plot the similarities of dissimilar documents. Note that this distribution was calculated using 10,000 target documents extracted from the Published Japanese Patent Applications between 1993 and 1999, with respect to 21 retrieval requests. The comprehensive similarity judgment for these 10,000 Published Japanese Patent Applications with respect to each retrieval request was made by experts.
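A distribution of the kind described for FIG. 1 can be obtained by binning the similarities of each group and dividing each bin count by the group size. The following is a minimal sketch under stated assumptions: the function name `relative_frequency`, the bin width of 0.5, and the similarity values are hypothetical and are not taken from the experiment described above.

```python
import math
from collections import Counter

def relative_frequency(similarities, bin_width=0.5):
    """Map each bin's lower edge to the fraction of values that fall in that bin."""
    counts = Counter(math.floor(s / bin_width) * bin_width for s in similarities)
    total = len(similarities)
    return {edge: count / total for edge, count in sorted(counts.items())}

# Made-up similarity values, for illustration only.
similar = [-0.4, -0.8, -1.2, -2.1, -3.0]
dissimilar = [-2.2, -2.6, -2.9, -3.1, -3.3, -3.4]

similar_hist = relative_frequency(similar)        # series for the black marks
dissimilar_hist = relative_frequency(dissimilar)  # series for the white marks
```

Each returned dictionary gives one plotted series: the keys correspond to positions on the horizontal axis and the values to relative frequencies on the vertical axis.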
As shown in FIG. 1, in the high similarity region, such as a region with the similarity not greater than -1.0, there are hardly any dissimilar documents, so that the similar documents and the dissimilar documents can be separated almost completely. It can also be seen that the distribution of the similar documents is flatter and more widespread than the distribution of the dissimilar documents. For this reason, some similar documents have low similarities, and over a considerable range of similarities they are not well separated from the dissimilar documents.
In similar document retrieval using the probabilistic model of Iwayama et al., which is regarded as a superior probabilistic model, a detailed analysis of the results of retrieval experiments using the similarity measure of Iwayama et al. reveals the following, as can be seen in the graph of FIG. 5. The similar documents with relatively high similarities can be appropriately separated from the dissimilar documents. However, there are many dissimilar documents at somewhat lower similarities, so that the similar documents and the dissimilar documents coexist there. This coexistence lowers the overall retrieval precision, so that it is difficult to obtain a sufficient retrieval precision.
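The effect of this coexistence on precision can be illustrated with a small calculation. The sketch below is hypothetical throughout: the function name `precision_at_threshold` and the scored documents are invented for illustration, and it assumes that a larger value means a more similar document (the comparison would be reversed for a distance-like measure).

```python
def precision_at_threshold(scored, threshold):
    """Fraction of retrieved documents that are truly similar.

    scored: list of (similarity, is_similar) pairs.
    A document is retrieved when its similarity is >= threshold
    (assumption: larger values mean more similar).
    Returns None when nothing is retrieved.
    """
    retrieved = [is_similar for similarity, is_similar in scored
                 if similarity >= threshold]
    return sum(retrieved) / len(retrieved) if retrieved else None

# Made-up judgments, for illustration only.
scored = [
    (-0.5, True), (-0.9, True), (-1.8, True), (-2.0, False),
    (-2.4, True), (-2.6, False), (-3.0, False),
]

strict = precision_at_threshold(scored, -1.0)  # only the top region is retrieved
loose = precision_at_threshold(scored, -2.5)   # the mixed region is included
```

With these toy values, the strict threshold retrieves only similar documents, while the looser threshold also admits a dissimilar document, so the precision drops as the threshold is lowered to capture the remaining similar documents.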