1. Field of the Invention
The present invention relates to a document retrieving apparatus and a method thereof for inputting a specified character string that becomes a retrieval key and displaying a retrieval result, in an information processing system for retrieving a document including a specified character string from a database.
2. Description of the Related Art
In the present age, in which the number of computer-readable documents is increasing, a system is required that can easily retrieve documents having a high relevance level to a given keyword through a communication network and the like.
In a conventional document retrieving apparatus, the following systems (1), (2), (3) and (4) are designed as mechanisms for retrieving a document having a high relevance level to given keywords.
(1) Retrieving System for Inputting Any Number of Free Terms (Character Strings) or Logical Expressions.
A user specifies an arbitrary number of retrieval character strings or a logical expression combining them. The system retrieves documents in which a given character string is included or documents which satisfy a given logical expression. The relevance levels of the obtained documents are calculated by using statistical information in a database, and the documents are displayed in order of their respective relevance levels. Examples of such systems include WAIS (a retrieving system designed by the Wais company in the U.S.A.), TARGET (a retrieving system designed by the Dialog company in the U.S.A.) and the like.
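As an illustration only, the retrieve-then-rank flow of system (1) can be sketched as follows. The score used here (total occurrences of the retrieval character strings divided by the document length) is an assumption chosen for simplicity and is not the formula of WAIS or TARGET. The names `relevance` and `retrieve` are hypothetical.

```python
def relevance(document, terms):
    """Assumed toy score: occurrences of all terms per character of text."""
    if not document:
        return 0.0
    text = document.lower()
    hits = sum(text.count(term.lower()) for term in terms)
    return hits / len(text)


def retrieve(documents, terms):
    """Return (score, document) pairs for matching documents, best first."""
    scored = [(relevance(doc, terms), doc) for doc in documents]
    matching = [pair for pair in scored if pair[0] > 0]
    return sorted(matching, key=lambda pair: pair[0], reverse=True)


documents = [
    "document retrieval with a retrieval key",
    "weather report for the weekend",
    "a long text that mentions retrieval only once near the end",
]
ranked = retrieve(documents, ["retrieval"])
```

In this sketch every retrieval character string contributes equally to the score, since the user supplies no priority levels.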
For example, in WAIS, the relevance level of the document is calculated from the following four weight elements. However, the actual formula is not published.
(a) Word Weight: This is a weight added according to the occurrence of a retrieval word in a document. The highest word weight is given to a word occurring in a headline of the document. The second highest word weight is given to a word in which all characters, or the first character, are capital letters. The lowest word weight is given to a word occurring only in the text, and not in a title or the like.
(b) Term Weight: A word occurring frequently in a document has lower term weight than a word occurring rarely.
(c) Proximity Relationship: Higher weight is given to different retrieval words that occur adjacent to each other within a document.
(d) Word Density: This is a ratio of an occurrence number of a retrieval word to a document size.
The following elements are used in the relevance level judging algorithm of TARGET, and 50 titles are displayed in order of relevance level as a retrieval result. However, the actual equation of the relevance level is not published.
(e) Number of Respective Retrieval Phrases Included in Record
(f) Retrieval Phrase Occurring In Record
(g) Distance between Different Retrieval Phrases
(h) Number of Retrieval Phrases in Database
(2) Retrieving System in Which the User Specifies a Priority Level of Retrieval Phrase, in Addition to Above Mentioned System (1).
A user assigns a priority level, being a real number that typically belongs to the interval [0,1], to arbitrary retrieval character strings, or specifies a logical expression from which their priority levels are determined. This system is characterized in that the relevance level is calculated by using the information specified by the user, in addition to the information used in the above mentioned system (1). Such systems have been studied as so-called weighted Boolean or extended Boolean systems.
As a weighted Boolean system, the MMM (Mixed Min and Max) model was proposed by Fox and Sharat (Fox, E. A., and S. Sharat. 1986. "A Comparison of Two Methods for Soft Boolean Interpretation in Information Retrieval." Technical Report TR-86-1, Virginia Tech, Department of Computer Science). In the MMM model, a priority level of a character string wi (i=1, 2, . . . ) is di, a priority level of (w1 AND w2) is min (d1, d2), and a priority level of (w1 OR w2) is max (d1, d2). For the document including the character strings w1 and w2, the priority level is calculated by using the specified expression. Then, it is examined to what degree the document satisfies the logical expression. It is considered that the larger the calculated value, the higher the relevance level.
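The min/max evaluation described above can be sketched as follows. This is a minimal illustration of the rule as stated in this description, not a reproduction of the cited report. The tuple-based expression format and the function name `evaluate` are assumptions made for the example, and a character string that does not occur in the document is assumed to contribute 0.

```python
def evaluate(expr, weights):
    """Evaluate a logical expression with AND as min and OR as max.

    expr is either a character string or a tuple (op, left, right)
    with op being "AND" or "OR" (assumed format for this sketch).
    weights maps each character string found in the document to its
    priority level d_i; missing strings count as 0.0 (assumption).
    """
    if isinstance(expr, str):
        return weights.get(expr, 0.0)
    op, left, right = expr
    a = evaluate(left, weights)
    b = evaluate(right, weights)
    if op == "AND":
        return min(a, b)
    if op == "OR":
        return max(a, b)
    raise ValueError(f"unknown operator: {op}")


weights = {"w1": 0.8, "w2": 0.3}
and_score = evaluate(("AND", "w1", "w2"), weights)  # min(d1, d2)
or_score = evaluate(("OR", "w1", "w2"), weights)    # max(d1, d2)
```

Nested expressions such as `("OR", ("AND", "w1", "w2"), "w3")` are evaluated recursively in the same way.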
(3) Retrieving System for Inputting a Sentence and Selecting a Character String Similar Thereto.
A text which contains sentences, not merely a character string, is assumed to be the input, and documents related thereto are retrieved from a database. Keywords are suitably extracted from the inputted text. Then, documents characterized by similar keywords are retrieved. The basic operation is the same as in the above mentioned systems (1) and (2).
(4) System for Highlighting and Outputting a Retrieval Character String.
A retrieval character string is distinguished from a normal character (basic sentence) by highlighting the retrieval character string by means of a bold character, an underline, or putting the retrieval character string in parentheses. Since it is enough that the retrieval character string can be distinguished from the normal characters, the highlight level of the retrieval character string is set at only one level. Examples of patent applications with respect to these outputting processes are "Character String Highlight Outputting Method" (Japanese Patent Application No.61-080035 and Japanese Patent Application Laying Open (KOKAI) No.62-236746), "Document Processing Apparatus" (Japanese Patent Application No.63-127103 and Japanese Patent Laying Open (KOKAI) No.01-297768), "Data Outputting Apparatus" (Japanese Patent Application No.02-328160 and Japanese Patent Laying Open (KOKAI) No.04-195518) and the like.
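A single-level highlight of this kind can be sketched as follows. The function name `highlight` and the default use of square brackets as the marker are assumptions for illustration, and the sketch does not reproduce the methods of the applications cited above.

```python
import re


def highlight(text, terms, left="[", right="]"):
    """Wrap every occurrence of each retrieval string in one marker pair."""
    terms = [term for term in terms if term]
    if not terms:
        return text
    # Longer strings first so that overlapping terms prefer the longest match.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return pattern.sub(lambda m: f"{left}{m.group(0)}{right}", text)


marked = highlight(
    "document retrieval and document display",
    ["document", "retrieval"],
)
```

Because every retrieval character string receives the same marker, the output carries a single highlight level only.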
However, there are the following problems in the conventional retrieval systems as mentioned above.
In the above mentioned system (1), no priority level is assigned to the retrieval words inputted by a user. Thus, when more than one retrieval word is inputted, they are all processed at the same level. For this reason, there is little information available for calculating the relevance level of a candidate document, and noise is therefore likely to be induced in the retrieval result (the precision is lowered). Here, the noise induced in the retrieval result means unnecessary documents which the user did not intend to retrieve. For example, a document which includes only unimportant words among the inputted retrieval words may be noise in the retrieval result.
In the above mentioned system (2), the user must specify the importance level by using a real number, such as 0.2 and the like. This brings about the problems that it takes effort to input and that it is difficult to understand intuitively.
In the above mentioned system (3), at the time of extracting words from a Japanese input text, it is necessary to carry out an analyzing process for determining individual words, which is not the case for English. This is because, unlike in an English sentence, no space is inserted between words in a Japanese text, so that the boundary of each word is not evident. If the position of an individual word is mistaken, a document which is not related to the input text may be retrieved, resulting in the possibility that noise is induced in the retrieval result.
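The boundary ambiguity can be illustrated with a short sketch. The toy vocabulary and the function name `segmentations` are assumptions for the example; a practical analyzer would rely on a full dictionary and some means of choosing among the candidates.

```python
def segmentations(text, vocabulary):
    """Return every way to split text into words from vocabulary."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            for rest in segmentations(text[end:], vocabulary):
                results.append([head] + rest)
    return results


# English: spaces already mark the word boundaries.
english = "document retrieval system".split()

# Japanese: the same characters admit more than one segmentation.
vocabulary = {"東", "東京", "京都", "都"}
candidates = segmentations("東京都", vocabulary)
```

With this toy vocabulary, `candidates` contains two different segmentations of the same three characters, and an analyzer that picks the wrong one would extract a keyword the user did not intend.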
When a priority level is added to each retrieval word at the time of input, it is desirable that each retrieval word included in the retrieval result also be displayed in a manner corresponding to its priority level. However, with only one highlight level for a character string, as in the above mentioned system (4), it is impossible to distinguish between a plurality of character strings having different priority levels, and thus the number of highlight levels implemented by the system (4) is insufficient.