The present invention relates to a document retrieving and delivering technique in which an electronic document is retrieved according to a retrieval condition registered by a user in advance and documents satisfying the condition are delivered to the user.
Recently, a large number of electronic documents (to be referred to as texts herebelow) are being delivered continually to users through electronic mail (e-mail), electronic news, and the like. Information sources which transmit information through the World Wide Web (WWW) are rapidly increasing and hence an immense number of texts have been collected from such information sources using an information collecting robot or the like. There consequently arises a need for a document retrieving and delivering system in which texts containing information requested by a user are retrieved therefrom and are delivered to the user.
JP-A-10-27182 (to be referred to as prior art 1) describes such a document or text retrieving and delivering system. In this system, retrieval condition expressions of a plurality of users are combined with each other to process condition expressions of a plurality of users through one text scanning operation.
However, in prior art 1, the user is required to generate retrieval condition expressions, which leads to two problems as follows.
First, when a rarely used word is specified in a retrieval condition or when generally used words are combined with each other in a complicated manner in a retrieval condition, there appear texts which cannot be retrieved (retrieval leakage).
Second, in contrast with the first problem, when a simple retrieval condition expression containing only generally used words is specified, many documents or texts not suitable for the purpose of the retrieval (to be referred to as retrieval noise) may be retrieved. This leads to a problem that documents desired by the user cannot be easily obtained.
In short, to obtain retrieval results in which texts not retrieved as above are minimized and in which the noise is reduced, it is difficult for the user to appropriately generate a retrieval condition expression.
Japanese Patent Application Serial No. 10-148721 (to be referred to as prior art 2) describes a technique to alleviate the two problems above in a document retrieval system in which documents containing desired information are retrieved from documents (to be referred to as registered documents herebelow) registered in a text database.
In this technique, a keyword (called “feature character string” in prior art 2) is extracted from a text (to be referred to as a seed document) exemplified as a retrieval condition to calculate similarity of the seed document with respect to registered documents.
In prior art 2, the user needs only to exemplify a seed document containing desired information. Namely, the user is relieved from the troublesome job of selecting appropriate retrieval terms for a retrieval condition expression. The user then instructs execution of retrieval to view retrieval results sorted according to the similarity. Therefore, even when the retrieval results include some retrieval noise, the user can easily obtain necessary information.
Next, description will be given of an outline and problems of the prior arts above.
Referring to FIG. 2, an outline of prior art 1 will be described.
In this example, three users, i.e., users 1 to 3, have registered retrieval condition expressions in a document retrieving and delivering system, i.e., document containing “new” and “car”, document containing “USA”, and document containing “used” and “car”, respectively. Under this condition, a scanning operation is conducted on a collected text “price of this new car is . . . ” to determine whether or not the three conditions are satisfied.
The retrieval condition expressions registered by the users are analyzed to extract retrieval terms “new”, “car”, “USA”, and “used”.
The number of retrieval terms extracted is stored for each user in a retrieval term count table. For example, from retrieval condition expression of user 1, i.e., document containing “new” and “car” registered by user 1, two retrieval terms “new” and “car” are extracted and hence “2” is stored in an associated field of the table. In a similar fashion, “1” and “2” are stored in associated fields of the table for users 2 and 3, respectively.
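The term extraction and the retrieval term count table described above can be sketched as follows. This is a minimal illustration only, assuming that each retrieval condition expression is a plain conjunction of terms; the names `conditions`, `extract_terms`, and `build_term_count_table` are hypothetical and do not appear in prior art 1.

```python
# Hypothetical registrations from the FIG. 2 example. Each condition is modeled
# as an AND of terms (an assumption; real expressions may be richer).
conditions = {
    "user1": ["new", "car"],
    "user2": ["USA"],
    "user3": ["used", "car"],
}

def extract_terms(conditions):
    """Return all distinct retrieval terms across users, in first-seen order."""
    seen = []
    for terms in conditions.values():
        for term in terms:
            if term not in seen:
                seen.append(term)
    return seen

def build_term_count_table(conditions):
    """Return {user: number of distinct retrieval terms in that user's condition}."""
    return {user: len(set(terms)) for user, terms in conditions.items()}

all_terms = extract_terms(conditions)
term_count_table = build_term_count_table(conditions)
```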
Next, the system creates a finite automaton to collate all retrieval terms extracted.
In the finite automaton in FIG. 2, a circle indicates a state of the automaton and an arrow denotes a state transition. A character next to an arrow represents the input character which causes the transition of the arrow. A numeral in a circle designates the state number of the state. This example does not show the arrows returning to the initial state which are used when a character not indicated in the automaton is inputted (to be called a failure herebelow).
The system then forms user lists in which each element includes a user identifier of a user having specified a retrieval term. Each list is linked with the retrieval term detection state of the automaton associated with that term. In this example, when “car” is collated, the system refers to the associated user list according to the last state “3”. This indicates that users 1 and 3 have specified “car”.
Description will next be given of the scanning of a text “price of this new car is” in the automaton shown in FIG. 2. In this example, it is detected that the text includes partial character strings in which “car” or “new” appears. In this automaton, a retrieval term having a small circle at an end thereof means that a partial character string matching the term exists in the text. Since partial character strings matching with “car” or “new” appear in the text in FIG. 2, end states 3 and 6 are assigned with a small circle.
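The single-pass collation of all retrieval terms can be sketched as below. This sketch swaps in a general Aho-Corasick style automaton with failure links, which is an assumption and not necessarily the construction of prior art 1; the state numbers it produces do not correspond to those of FIG. 2, and the names `term_users`, `build_automaton`, and `scan` are hypothetical.

```python
from collections import deque

# Retrieval terms and the users who specified them (FIG. 2 example).
term_users = {
    "new": ["user1"],
    "car": ["user1", "user3"],
    "USA": ["user2"],
    "used": ["user3"],
}

def build_automaton(terms):
    """Build goto/failure/output tables for the given retrieval terms."""
    goto = [{}]    # goto[state][char] -> next state
    out = [set()]  # out[state] -> terms detected on reaching this state
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(term)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the initial state
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def scan(text, automaton):
    """Scan the text once and return the set of retrieval terms it contains."""
    goto, fail, out = automaton
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]          # failure transition
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found

automaton = build_automaton(term_users)
found_terms = scan("price of this new car is", automaton)
# Users linked to the detected terms via the user lists.
users_hit = sorted({u for t in found_terms for u in term_users[t]})
```

The text is read only once regardless of how many users or terms are registered, which is the property the combined automaton is intended to provide.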
For the text, the number of retrieval terms matching partial character strings in the text is counted for each user and is stored in a retrieval term appearance count table. For example, since the matching state is detected for “new” and “car” for user 1, “2” is set to the count value. Since only “car” matches for user 3, “1” is counted. For user 2, the matching state does not occur for any partial character string, and hence the counting is not achieved and “0” is kept unchanged for the count value.
The retrieval term count table in which the retrieval term counts extracted from the retrieval condition expressions are stored is compared with the retrieval term appearance count table in which the numbers of retrieval terms appearing in partial character strings in the text are stored. When these tables match each other, it is assumed that the retrieval condition expressions of the user are satisfied and hence the text is delivered to the user. In FIG. 2, the retrieval term count is “2” for user 1 in both tables and hence the text is delivered to user 1. The retrieval term counts are respectively different from each other for users 2 and 3 and hence the text is not delivered to users 2 and 3.
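The comparison of the two tables can be sketched as follows, again assuming conjunctive conditions only. The set `found_terms` stands for the terms detected by the scanning operation, and the function names are hypothetical.

```python
# FIG. 2 example: registered conditions and the terms detected in the text
# "price of this new car is".
conditions = {
    "user1": ["new", "car"],
    "user2": ["USA"],
    "user3": ["used", "car"],
}
found_terms = {"new", "car"}

def term_counts(conditions):
    """Retrieval term count table: number of terms each user registered."""
    return {user: len(set(terms)) for user, terms in conditions.items()}

def appearance_counts(conditions, found_terms):
    """Retrieval term appearance count table: registered terms found in the text."""
    return {user: len(set(terms) & found_terms) for user, terms in conditions.items()}

def recipients(conditions, found_terms):
    """Users whose two counts agree, i.e., whose condition is fully satisfied."""
    required = term_counts(conditions)
    appeared = appearance_counts(conditions, found_terms)
    return [user for user in conditions if required[user] == appeared[user]]

appearance_table = appearance_counts(conditions, found_terms)
deliver_to = recipients(conditions, found_terms)
```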
Prior art 1 has been briefly described.
In accordance with prior art 1, it is possible to implement a document retrieving and delivering system in which a text matching retrieval condition expressions given can be delivered to the user through one scanning operation.
However, the user must generate retrieval condition expressions in prior art 1. There consequently arises a problem, namely, it is not easy for the user to appropriately generate retrieval condition expressions.
Prior art 2 has been proposed to improve the problem above in a document retrieval system.
Referring now to FIG. 20, an outline of prior art 2 will be described.
Prior art 2 is a technique to extract keywords from a sentence of a language, e.g., Japanese, which does not use a separation code between words.
FIG. 20 shows an example to extract keywords (to be described in accordance with a name “tokuchomojiretsu (feature character string)” in prior art 2 herebelow) from a seed document “ . . . . Keitaidenwa no shiyohji no mana ga mondai ni naru (manners of use of a cellular phone cause a problem). . . . ”.
In step 1910, a single character type seed character string extraction program is started to subdivide a seed document 1920 at boundaries of character types such as kanji (Chinese characters) and katakana (angular Japanese phonetic letters) to extract character strings (to be called single character type character strings herebelow) 1921 each including characters of one character type.
In step 1911, a check is made to determine a character type for each of the extracted strings. For a character string of kanji or katakana characters, which possibly forms a complex word, a division probability comparison feature character string extraction program is executed to subdivide any complex word to extract feature character strings. For character strings of the other character types, which have a low probability of forming a complex word, the character strings of a single character type are directly extracted as feature character strings (step 1912).
In step 1913, the feature character strings thus extracted are stored in a work area.
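The subdivision at character type boundaries (step 1910) and the routing by character type (steps 1911 and 1912) can be sketched as below. Two assumptions are made: the Japanese spelling of the seed sentence is reconstructed from the romanized form given above, and character types are decided by Unicode block ranges. The division probability comparison itself is not reproduced; the sketch only marks which strings would be passed to it. All function names are hypothetical.

```python
from itertools import groupby

def char_type(ch):
    """Classify a character by Unicode block (a simplification)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    return "other"

def split_by_char_type(text):
    """Step 1910: split the text into single character type character strings."""
    return [("".join(group), ctype) for ctype, group in groupby(text, key=char_type)]

def route(segments):
    """Steps 1911-1912: kanji and katakana strings may form complex words and
    would be subdivided further; the other strings are extracted directly."""
    to_subdivide = [s for s, t in segments if t in ("kanji", "katakana")]
    direct = [s for s, t in segments if t not in ("kanji", "katakana")]
    return to_subdivide, direct

# Assumed Japanese spelling of "Keitaidenwa no shiyohji no mana ga mondai ni naru".
seed = "携帯電話の使用時のマナーが問題になる"
segments = split_by_char_type(seed)
to_subdivide, direct = route(segments)
```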
Description has been given of an example to extract feature character strings from a seed document in prior art 2.
In prior art 2, according to the number of appearances of the feature character strings extracted as above in the seed document and the number of appearances thereof in each document in a text database, the similarity is calculated for each document to display the documents in the descending order of similarity. A method of calculating similarity has been described in prior art 2 and hence description thereof will be omitted here.
An outline of prior art 2 has been described.
In accordance with prior art 2, feature character strings (keywords) are extracted from a seed document and then similarity of each registered document to the seed document is calculated using the keywords. The user specifies a document containing desired information and then refers to results of retrieval in the descending order of similarity to obtain texts containing necessary information from the text database.
Consequently, when prior art 2 is applied to prior art 1, it is possible to assign, to a registered document including at least one of the keywords extracted from the seed document, similarity with respect to the seed document. This improves the first problem of prior art 1.
However, for each text delivered at every moment, prior art 1 makes a check to determine the matching of the retrieval condition to instantaneously deliver any text matching the condition to the user. Therefore, although similarity with respect to the retrieval condition can be calculated for each text by use of prior art 2, the similarity of one text cannot be compared with the similarity of another text, since each text is processed and delivered individually upon arrival.
Namely, even when prior art 2 is simply applied to a document retrieving and delivering system of prior art 1, the results of retrieval cannot be sorted in the descending order of similarity to be delivered to the user. The second problem of prior art 1 therefore cannot be alleviated.
Additionally, although statistical information in the text database can be used to extract keywords and/or to calculate similarity, the texts delivered are not saved in the document retrieving and delivering system of prior art 1. This leads to a problem that the statistical information cannot be easily obtained.