1. Field of the Invention
The present invention relates to a document detection system for detecting desired documents from a large number of documents stored in a document database. It is to be noted that the term "retrieval" is often used in the literature of the field instead of the term "detection" used in the following description. The present specification adheres to the use of the term "detection" throughout.
2. Description of the Background Art
In recent years, due to the significant progress and spread of computers, the electronic manipulation of documents is becoming increasingly popular, as in electronic news and electronic mail systems and in CD-ROM publications of data sources such as dictionaries and encyclopedias that had previously been available only on paper, and it is expected that this trend will continue at an increasing pace in the future.
In conjunction with such electronic manipulation of documents, much attention has been drawn to a document detection system for efficiently detecting desired documents from a large number of documents, so as to enable the effective utilization of the documents stored in a database system in advance.
A conventionally available document detection system uses keywords in combination with logic operators such as AND, OR, and NOT, or with proximity operators that specify the number of characters, sentences, or paragraphs that can exist between keywords, and detects a document by using a specified combination of keywords and operators as a detection key.
However, such a conventional document detection system cannot necessarily detect the document that is truly desired by a user. Namely, when a detection key using the logic operators is employed and the specified detection key is "computer AND designing", a document having a content of "designing by a computer" and a document having a content of "designing of a computer" can both be detected, so that at least one of them will have a content irrelevant to the desired content. On the other hand, when a detection key using the proximity operators is employed, the detection is based solely on the physical distance between keywords, so that there is no guarantee that the detected document has the desired content.
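The limitation described above can be illustrated with a minimal Python sketch. It is not taken from any actual conventional system; the function names `tokenize`, `matches_and`, and `matches_near`, the two sample documents, and the gap limit of two words are all assumptions introduced for illustration only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matches_and(text, keywords):
    """Logic operator AND: every keyword must occur somewhere in the text."""
    tokens = set(tokenize(text))
    return all(keyword in tokens for keyword in keywords)

def matches_near(text, first, second, max_gap):
    """Proximity operator: at most max_gap words lie between the two keywords."""
    tokens = tokenize(text)
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) - 1 <= max_gap for i in pos_first for j in pos_second)

# Hypothetical documents with different meanings but the same keywords.
docs = {
    "A": "designing by a computer",
    "B": "designing of a computer",
}

and_hits = [name for name, text in docs.items()
            if matches_and(text, ["computer", "designing"])]
near_hits = [name for name, text in docs.items()
             if matches_near(text, "designing", "computer", 2)]

print(and_hits)   # both documents satisfy "computer AND designing"
print(near_hits)  # both documents also satisfy the proximity condition
```

Under these assumptions, neither kind of detection key separates the two documents, because both keys test only the presence or the physical distance of the keywords and not the relationship expressed between them.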
Thus, in such a conventional document detection system, the detection result could contain many documents with contents actually irrelevant to the desired content, so that it has been necessary to form the detection key from as many keywords expected to be related to the desired content as possible. However, in practice, the detection result obtained by using such a detection key formed by a large number of disjunctive keywords ends up containing a considerable amount of detection noise and junk.
For this reason, the conventional document detection system requires an enormous amount of time for a user to single out the desired document by checking the detected documents one by one. On the other hand, if the detection key is formed by narrow keywords in order to reduce the detection noise, the probability of a detection error, in which a desired document fails to be detected, is increased.
As a result, in the conventional document detection system, it has been difficult to reduce the detection noise without causing detection errors unless the user has detailed knowledge of what kinds of keywords are contained in what kinds of documents, and consequently it has been extremely difficult for an ordinary user without such detailed knowledge to use the conventional document detection system effectively.
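The trade-off between detection noise and detection error can be sketched with a toy example in Python. Everything here is hypothetical: the five documents, the choice of which documents count as relevant, the helper names `detect` and `noise_and_misses`, and the reading of a "detection error" as a relevant document that is not detected.

```python
def detect(docs, keywords, mode):
    """Return names of documents containing any ("or") or all ("and") keywords."""
    combine = any if mode == "or" else all
    return {name for name, words in docs.items()
            if combine(keyword in words for keyword in keywords)}

def noise_and_misses(detected, relevant):
    """Count irrelevant detected documents and relevant undetected documents."""
    noise = detected - relevant
    misses = relevant - detected
    return len(noise), len(misses)

# Hypothetical database: each document is reduced to its set of words.
docs = {
    "d1": {"computer", "aided", "designing", "tool"},
    "d2": {"automated", "drafting", "software"},
    "d3": {"computer", "hardware", "designing"},
    "d4": {"computer", "sales", "report"},
    "d5": {"garden", "designing", "guide"},
}
# Assumption: only d1 and d2 concern designing by a computer.
relevant = {"d1", "d2"}

# Broad key: many disjunctive keywords.
broad = detect(docs, ["computer", "designing", "drafting"], "or")
# Narrow key: a few conjunctive keywords.
narrow = detect(docs, ["computer", "aided", "designing"], "and")

print(noise_and_misses(broad, relevant))   # (3, 0): no misses, three noise documents
print(noise_and_misses(narrow, relevant))  # (0, 1): no noise, one relevant document missed
```

In this sketch the broad key finds every relevant document but buries them in noise, while the narrow key removes the noise at the cost of missing d2, which uses none of the narrow keywords. Choosing a key that avoids both outcomes requires knowing in advance which words each relevant document contains.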
In addition, in the conventional document detection system, the detection result has been presented by displaying only the number of detected documents or the titles of the detected documents, so that in order to check whether each detected document is the desired document or not, the user has had to read the entire content of each detected document one by one, and this operation has been enormously time consuming.
Moreover, in the conventional document detection system, the titles of the detected documents are simply arranged in a prescribed order according to the user's query, such as an order of descending similarity to the keywords used in the detection key. For this reason, it has been impossible for the user to comprehend, from the displayed detection result, the relative relationships among the detected documents and the level of similarity of each detected document with respect to the detection command, and consequently it has been difficult for the user to form an immediate impression of the appropriateness of the displayed detection result.
Furthermore, in the conventional document detection system, the detection scheme is limited to one in which each document as a whole is treated as a single entity, so that a document containing the desired content in the background section and a document containing the desired content in the conclusion section will be detected together without distinction. In other words, the detection result contains a mixture of documents regardless of the viewpoints from which the desired contents appear in the documents. For example, if the user has no interest in what had been done in the past, a detected document which matches the given keywords only in the background section will be of no use. Yet, in the conventional document detection system, documents having different perspectives, such as the document containing the desired content in the background section and the document containing the desired content in the conclusion section, are not distinguished, and the mixed presence of documents with different perspectives makes it extremely difficult for the user to judge the appropriateness of the detection result.
In view of these problems, there has been a proposal for a scheme to reduce the burden on the user of reading the entire content of each detected document by displaying only a portion of each detected document. However, in such a scheme, it is often impossible to make a proper judgement as to whether a document is the desired document or not unless the relationship between the displayed portion and the remaining portion is apparent. For example, when the background section containing the desired content is displayed for one document while the conclusion section containing the desired content is displayed for another document, these documents cannot be comprehended from a unified viewpoint, so that it is difficult for the user to make a proper judgement as to which one of these documents is the necessary one. As a result, in order to fully comprehend the perspectives of the displayed portions in these documents, the user would be forced to read the entire contents of these documents after all, so that this scheme cannot contribute to the reduction of the burden on the user at all.
Also, there has been a proposal for a scheme to reduce the burden on the user of reading the entire content of each detected document by providing in advance a man-made document summary in correspondence to each stored document and displaying the document summary at a time of displaying the detection result. However, in such a scheme, an enormous amount of human effort is required for preparing the document summary for each document at a time of producing the database itself, which is not practically justifiable unless the database system has a remarkably high utilization rate. Moreover, there are many already existing database systems in which the document summary for each document is not provided, and an enormous amount of human effort is similarly required for preparing the document summary for each document in such an already existing database system. In addition, the man-made document summary is produced from a very general viewpoint alone, so that there is no guarantee that each document is summarized from a viewpoint suitable for the required detection. As a result, the document summary displayed as the detection result can be quite beside the point from the viewpoint of a user with a specific document detection objective, and in such a case, it is possible for the user to overlook the actually necessary document at a time of judging whether each detected document is the desired document or not.