1. Field of Invention
The present invention relates to a document delivery system that automatically delivers documents, such as news, in accordance with a user's tastes, and specifically to a document extracting device, a document extracting program, and a document extracting method that exclude documents having similar content from many candidate documents for delivery and extract only unique documents.
2. Description of Related Art
Generally, in a related art information delivery system that can be customized for each user, the user sets a filtering condition, and a computer automatically extracts only those documents that satisfy the filtering condition from various information segments, such as news delivered in real time, and delivers them to the user. Hereinafter, such information segments, which include character information as their major element, are referred to as documents.
In such a related art document delivery system, two problems occur: the documents to be delivered are excessively biased depending on the filtering condition, or documents having similar contents are delivered repeatedly. The latter problem is particularly serious. Because the contents of the documents are duplicated, the delivered information contains more and more useless information segments, and when the space available to carry the documents is limited, other important documents are cut off. This seriously damages the convenience and reliability of the document delivery system itself.
For this reason, in order to reduce or prevent such delivery of duplicated documents, a filtering or classification technology that efficiently extracts only the necessary documents is considered very important. As related art of this kind, for example, the technologies shown in Japanese Patent No. 3203203 and Japanese Unexamined Patent Application Publication No. 10-275160, described below, have been proposed.
First, Japanese Patent No. 3203203 discloses a technology of giving keywords to all documents, converting the documents into vectors by using the keywords, introducing a similarity-evaluating criterion that takes its maximum value when a document A is included in another document B, and recognizing representative documents, dependent documents, and independent documents so as to collect documents having proper relations together.
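As a rough illustration of a similarity criterion that peaks when one document is contained in another, the following Python sketch treats each document as a set of keywords, which is equivalent to a binary keyword vector. The function name `inclusion_similarity`, the formula, and the sample keywords are assumptions made for illustration only; they are not taken from Japanese Patent No. 3203203.

```python
def inclusion_similarity(keywords_a, keywords_b):
    """Fraction of A's keywords that also appear in B.

    Hypothetical criterion: reaches its maximum (1.0) when every
    keyword of document A is also a keyword of document B.
    """
    a, b = set(keywords_a), set(keywords_b)
    if not a:
        return 0.0  # an empty document is treated as similar to nothing
    return len(a & b) / len(a)


# Illustrative keyword lists (made up for this sketch).
doc_a = ["election", "budget"]
doc_b = ["election", "budget", "senate", "vote"]
doc_c = ["baseball", "playoffs"]

sim_ab = inclusion_similarity(doc_a, doc_b)  # A is contained in B
sim_ba = inclusion_similarity(doc_b, doc_a)  # B is only partly covered by A
sim_ac = inclusion_similarity(doc_a, doc_c)  # no shared keywords
```

The measure is deliberately asymmetric: a high value from A to B combined with a lower value from B to A suggests that A could be treated as dependent on B, while low values in both directions suggest independent documents.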
On the other hand, Japanese Unexamined Patent Application Publication No. 10-275160 discloses a technology of computing feature quantities of the documents to be classified, obtaining the degrees of similarity between those feature quantities, and then classifying the documents by using mathematical and statistical cluster analysis.
In the former related art, it is necessary to give characteristics, such as keywords, to all documents, and the task of giving the keywords to all documents is expensive. The cluster analysis used in the latter related art, on the other hand, is an analysis method suitable for hierarchical classification or grouping, but the amount of computation increases sharply as the number of documents increases, which creates a problem of a serious decrease in throughput.
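To make the computational cost concrete, the following Python sketch runs a naive agglomerative (single-link) clustering over keyword sets and counts how many pairwise distance evaluations it performs. The Jaccard distance, the single-link rule, the `threshold` parameter, and the sample data are all assumptions chosen for illustration; the cited publication's actual cluster analysis is not reproduced here.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two keyword sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def naive_single_link(docs, threshold):
    """Merge the closest pair of clusters until none is within threshold.

    Returns the clusters (lists of document indices) and the number of
    document-to-document distance evaluations that were performed.
    """
    clusters = [[i] for i in range(len(docs))]
    evaluations = 0
    while len(clusters) > 1:
        best = None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            # Single link: cluster distance is the closest member pair.
            d = min(jaccard_distance(docs[x], docs[y]) for x in ci for y in cj)
            evaluations += len(ci) * len(cj)
            if best is None or d < best[0]:
                best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters, evaluations


# Illustrative keyword sets (made up for this sketch).
docs = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
clusters, evaluations = naive_single_link(docs, threshold=0.5)
```

In this naive form, the first pass alone compares every pair of documents, which is n(n-1)/2 evaluations for n documents, and each later pass repeats much of that work. This is one way to see why the amount of computation grows so quickly with the number of documents.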