The present invention relates to a document retrieving and delivering technique in which an electronic document is retrieved according to a retrieval condition registered by a user in advance and documents satisfying the condition are delivered to the user.
Recently, a large number of electronic documents (to be referred to as texts herebelow) have been delivered at every moment to users through electronic mail (e-mail), electronic news, and the like. Information sources which transmit information through the World Wide Web (WWW) are rapidly increasing, and hence an immense number of texts have been collected from such information sources using an information collecting robot or the like. There consequently arises a need for a document retrieving and delivering system in which texts containing information requested by a user are retrieved therefrom and are delivered to the user.
JP-A-10-27182 (to be referred to as prior art 1) describes such a document or text retrieving and delivering system. In this system, the retrieval condition expressions of a plurality of users are combined with each other so that all of them are processed through one text scanning operation.
However, in prior art 1, the user is required to generate retrieval condition expressions, which leads to two problems as follows.
First, when a rarely used word is specified in a retrieval condition or when generally used words are combined in a complicated manner in a retrieval condition, there appear texts which cannot be retrieved (to be referred to as retrieval leakage).
Second, in contrast with the first problem, when a simple retrieval condition expression containing only generally used words is specified, many documents or texts not suitable for the object of the retrieval (to be referred to as retrieval noise) may be retrieved. This leads to a problem that documents desired by the user cannot be easily obtained.
In short, it is difficult for the user to generate a retrieval condition expression that both minimizes the retrieval leakage and reduces the retrieval noise.
Japanese Patent Application Serial No. 10-148721 (to be referred to as prior art 2) describes a technique to alleviate the two problems above in a document retrieval system in which documents containing desired information are retrieved from documents registered in a text database (to be referred to as registered documents herebelow).
In this technique, a keyword (called “feature character string” in prior art 2) is extracted from a text exemplified as a retrieval condition (to be referred to as a seed document) to calculate similarity of the seed document with respect to registered documents.
In prior art 2, the user needs only to exemplify a seed document containing the desired information. Namely, the user is relieved of the troublesome job of selecting appropriate retrieval terms for a retrieval condition expression. The user then instructs execution of retrieval and views the retrieval results sorted according to the similarity. Therefore, even when the retrieval results include some retrieval noise, the user can easily obtain the necessary information.
Next, description will be given of an outline and problems of the prior arts above.
Referring to FIG. 2, an outline of prior art 1 will be described.
In this example, three users, i.e., users 1 to 3, have registered retrieval condition expressions to a document retrieving and delivering system, i.e., document containing “new” and “car”, document containing “USA”, and document containing “used” and “car”, respectively. Under this condition, a scanning operation is conducted on a collected text “price of this new car is . . . ” to determine whether or not the three conditions are satisfied.
The retrieval condition expressions registered by the users are analyzed to extract retrieval terms “new”, “car”, “USA”, and “used”.
The number of retrieval terms extracted is stored for each user in a retrieval term count table. For example, from the retrieval condition expression registered by user 1, i.e., document containing “new” and “car”, two retrieval terms “new” and “car” are extracted and hence “2” is stored in an associated field of the table. In a similar fashion, “1” and “2” are stored in associated fields of the table for users 2 and 3, respectively.
Next, the system creates a finite automaton to collate all retrieval terms extracted.
In the finite automaton in FIG. 2, a circle indicates a state of the automaton and an arrow denotes a state transition. The character next to an arrow is the input character which causes the transition of that arrow. A numeral in a circle designates the state number of that state. This example omits the arrows back to the initial state which are followed when a character not indicated in the automaton is inputted (to be called a failure herebelow).
The system then forms user lists, each including the user identifiers of the users having specified a given retrieval term. Each list is linked with the associated retrieval term detection state of the automaton. In this example, when “car” is collated, the system refers to the associated user list according to the last state “3”. This indicates that users 1 and 3 have specified “car”.
Description will next be given of the scanning of a text “price of this new car is” in the automaton shown in FIG. 2. In this example, it is detected that the text includes partial character strings in which “car” or “new” appears. In this automaton, a retrieval term having a small circle at an end thereof means that a partial character string matching the term exists in the text. Since partial character strings matching “car” or “new” appear in the text in FIG. 2, end states 3 and 6 are marked with a small circle.
For the text, the number of retrieval terms matching partial character strings in the text is counted for each user and is stored in a retrieval term appearance count table. For example, since the matching state is detected for “new” and “car” for user 1, “2” is set to the count value. Since only “car” matches for user 3, “1” is counted. For user 2, the matching state does not occur for any partial character string, and hence the counting is not achieved and “0” is kept unchanged for the count value.
The retrieval term count table in which the retrieval term counts extracted from the retrieval condition expressions are stored is compared with the retrieval term appearance count table in which the numbers of retrieval terms appearing in partial character strings in the text are stored. When the two counts of a user match each other, it is assumed that the retrieval condition expression of the user is satisfied and hence the text is delivered to the user. In FIG. 2, the retrieval term count is “2” for user 1 in both tables and hence the text is delivered to user 1. The retrieval term counts are respectively different from each other for users 2 and 3 and hence the text is not delivered to users 2 and 3.
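The count-table comparison described above can be illustrated with a minimal Python sketch. It is not the implementation of prior art 1: the names `build_trie`, `scan`, and `deliver_to` are hypothetical, and a plain character trie restarted at every text position stands in for the finite automaton with failure transitions. The sketch also assumes, as in the example, that every condition is a pure conjunction of terms.

```python
from collections import defaultdict

END = None  # sentinel key marking a term detection state (never equals a character)

def build_trie(terms):
    """Build a character trie; a node containing END stores the detected term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = term
    return root

def scan(text, root):
    """Return the set of terms that occur as partial character strings of text."""
    found = set()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node:
                break  # corresponds to a failure: no transition for this character
            node = node[ch]
            if END in node:
                found.add(node[END])
    return found

def deliver_to(text, conditions):
    """conditions maps a user id to the set of terms that must all appear."""
    # Retrieval term count table: number of terms in each user's condition.
    term_count = {user: len(terms) for user, terms in conditions.items()}
    # User list per term: which users specified the term.
    user_list = defaultdict(set)
    for user, terms in conditions.items():
        for term in terms:
            user_list[term].add(user)
    # Retrieval term appearance count table, filled from one scan of the text.
    appearance = {user: 0 for user in conditions}
    for term in scan(text, build_trie(user_list)):
        for user in user_list[term]:
            appearance[user] += 1
    # Deliver only where both tables agree for the user.
    return sorted(u for u in conditions if appearance[u] == term_count[u])

conditions = {1: {"new", "car"}, 2: {"USA"}, 3: {"used", "car"}}
print(deliver_to("price of this new car is . . .", conditions))
```

With the three conditions of the example, only user 1 has equal counts in both tables, so the text is delivered to user 1 alone.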
Prior art 1 has been briefly described.
In accordance with prior art 1, it is possible to implement a document retrieving and delivering system in which a text matching retrieval condition expressions given can be delivered to the user through one scanning operation.
However, the user must generate retrieval condition expressions in prior art 1. There consequently arises a problem, namely, it is not easy for the user to appropriately generate retrieval condition expressions.
Prior art 2 has been proposed to improve the problem above in a document retrieval system.
Referring now to FIG. 20, an outline of prior art 2 will be described.
Prior art 2 is a technique to extract keywords from a sentence of a language, e.g., Japanese, which does not use a separation code between words.
FIG. 20 shows an example to extract keywords (to be referred to by the name “tokuchomojiretsu (feature character string)” used in prior art 2 herebelow) from a seed document “. . . . Keitaidenwa no shiyohji no mana ga mondai ni naru (manners of use of a cellular phone cause a problem). . . . ”.
In step 1910, a single character type seed character string extraction program is started to subdivide a seed document 1920 at boundaries of character types such as kanji (Chinese characters) and katakana (angular Japanese phonetic letters) to extract character strings (to be called single character type character strings herebelow) 1921 each including characters of one character type.
In step 1911, a check is made to determine a character type for each of the extracted strings. For a character string of kanji or katakana, which may form a complex word, a division probability comparison feature character string extraction program is executed to subdivide any complex word to extract feature character strings. For character strings of the other character types, which have a low probability of forming a complex word, the character strings of a single character type are directly extracted as feature character strings (step 1912).
In step 1913, the feature character strings thus extracted are stored in a work area.
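The subdivision at character type boundaries in step 1910 can be sketched as follows. This is only an illustration under stated assumptions: the function names `char_type` and `split_by_char_type` are hypothetical, the Unicode block ranges are a simplification, the Japanese rendering of the romanized seed document is assumed, and the division probability comparison applied to kanji and katakana strings in step 1911 is not reproduced.

```python
from itertools import groupby

def char_type(ch):
    """Classify one character by Unicode block (simplified ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"     # CJK Unified Ideographs
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"  # includes the long-vowel mark U+30FC
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    return "other"

def split_by_char_type(text):
    """Cut text wherever the character type changes; return (type, string) pairs."""
    return [(kind, "".join(run)) for kind, run in groupby(text, key=char_type)]

# Assumed Japanese rendering of
# "Keitaidenwa no shiyohji no mana ga mondai ni naru".
seed = "携帯電話の使用時のマナーが問題になる"
for kind, string in split_by_char_type(seed):
    print(kind, string)
```

In this sketch the kanji and katakana strings would be the ones passed on to the complex word subdivision, while the remaining strings would be taken as they are.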
Description has been given of an example to extract feature character strings from a seed document in prior art 2.
In prior art 2, according to the number of appearances of the feature character strings in the seed document as extracted above and the number of appearances thereof in each document in a text database, the similarity is calculated for each document and the documents are displayed in the descending order of similarity. The method of calculating similarity is described in prior art 2 and hence description thereof is omitted here.
An outline of prior art 2 has been described.
In accordance with prior art 2, feature character strings (keywords) are extracted from a seed document and then similarity of each registered document to the seed document is calculated using the keywords. The user specifies a document containing the desired information and then refers to the results of retrieval in the descending order of similarity to obtain texts containing necessary information from the text database.
Consequently, when prior art 2 is applied to prior art 1, it is possible to assign, to a registered document including at least one of the keywords extracted from the seed document, similarity with respect to the seed document. This alleviates the first problem of prior art 1, i.e., the retrieval leakage.
However, prior art 1 checks each text, as it is delivered at every moment, against the retrieval condition and instantaneously delivers any text matching the condition to the user. Therefore, although the similarity of each text with respect to the retrieval condition can be calculated by prior art 2, that similarity cannot be compared with the similarity of any other text.
Namely, even when prior art 2 is simply applied to a document retrieving and delivering system of prior art 1, the results of retrieval cannot be sorted in the descending order of similarity to be delivered to the user. The second problem of prior art 1, i.e., the retrieval noise, therefore remains unsolved.
Additionally, although statistical information in the text database can be used to extract keywords and/or to calculate similarity, the texts delivered are not saved in the document retrieving and delivering system of prior art 1. This leads to a problem that the statistical information cannot be easily obtained.
It is therefore an object of the present invention to provide a system in which according to retrieval conditions inputted by a plurality of users, a check is made to determine whether or not the conditions are satisfied with respect to a text inputted such that the text inputted is delivered to users corresponding to the retrieval conditions satisfied to thereby deliver texts desired by the users.
To solve the problems above, in accordance with the document retrieving and delivering method of the present invention, through the steps described below, similarity of each text acquired is calculated with respect to a document (to be referred to as a seed document herebelow) containing information requested by the user to thereby deliver an appropriate text to the user.
Namely, the document retrieving and delivering method of the present invention includes a retrieval condition registering step of registering retrieval conditions inputted from a plurality of users and a retrieval and delivery step of retrieving texts satisfying the retrieval conditions from text data of document information and delivering the texts retrieved to associated users.
The retrieval and delivery step includes calculating, for each text, a ratio at which the text matches the retrieval conditions (to be called similarity herebelow), determining according to the similarity whether or not the retrieval conditions are satisfied, and delivering the text to the users corresponding to the retrieval conditions satisfied.
By the operation, similarity of the text acquired can be determined with respect to a document including information desired by the user to thereby appropriately deliver the text to the user.
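A minimal sketch of such a similarity-based decision is given below. It rests on assumptions that are not part of the description above: the similarity is taken simply as the fraction of a user's keywords appearing in the text, delivery is decided by a fixed cut-off `threshold`, and the names `similarity` and `recipients` are hypothetical.

```python
def similarity(text, keywords):
    """Assumed measure: fraction of the keywords that appear in the text."""
    if not keywords:
        return 0.0
    hits = sum(1 for keyword in keywords if keyword in text)
    return hits / len(keywords)

def recipients(text, conditions, threshold=0.5):
    """Return the users whose condition is deemed satisfied for this text.

    conditions maps a user id to a list of keywords; threshold is a
    hypothetical cut-off chosen only for illustration.
    """
    return sorted(
        user for user, keywords in conditions.items()
        if similarity(text, keywords) >= threshold
    )

conditions = {1: ["new", "car"], 2: ["USA"], 3: ["used", "car"]}
text = "price of this new car is . . ."
print(recipients(text, conditions))                 # cut-off 0.5
print(recipients(text, conditions, threshold=0.6))  # stricter cut-off
```

Unlike the exact count comparison of prior art 1, a partially matching text can still reach a user: with the cut-off at 0.5 the text goes to users 1 and 3, and raising the cut-off to 0.6 restricts delivery to user 1.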
The retrieval condition inputted from the user may be provided in the form of a document desired by the user. In such a situation, retrieval conditions are generated according to the document in the present invention.