The present invention relates to a method and system for extracting a character string indicative of a feature of contents described in a document, a method and system for searching a document database for a document or documents having contents similar to those described in a document specified by a user with use of the first-mentioned method and system, and a storage medium for storing a searching program therein.
With the spread of personal computers and the Internet, the number of electronic documents has increased explosively in recent years, and this growth is expected to accelerate. In such circumstances, there is a strong demand for users to be able to search quickly and efficiently for a document or documents containing the information they desire.
One of techniques for satisfying such a demand is a full-text search. In the full-text search, documents to be searched are registered as text in a computer system to create a database, and the system searches the database for a document or documents containing a search character string (which will be referred to as a query term, hereinafter) specified by a user. The full-text search is thus characterized in that, since the searching is carried out on the character strings themselves in the documents, any word can be searched for, unlike a prior art keyword searching system based on previously set keywords.
However, in order to reliably search for a document or documents containing information desired by the user, it is necessary for the user to compose a complex search conditional expression accurately indicative of the user's search intention and to enter it into the system. This is a difficult task for ordinary users who are not experts in information search.
To eliminate this difficulty, much attention is now focused on a relevant document searching technique in which the user presents, as an example, a document containing the desired contents (which will be referred to as a ‘seed’ document, hereinafter) and the system searches for a document or documents similar to the seed document.
Disclosed as one of the relevant document searching methods is, for example, a technique (which will be referred to as the prior art 1, hereinafter) for extracting words contained in a seed document through morphological analysis to search for a relevant document or documents based on the extracted words, as in JP-A-8-335222.
In the prior art 1, words contained in a seed document are extracted through morphological analysis to search for a relevant document or documents containing the words. For example, when the seed document is a document 1 of “ . . .  (User's manner when the portable phone is in use becomes important.) . . . ”, words such as  (portable phone),  (manner) and  (important) are extracted through morphological analysis by looking up a word dictionary. As a result, the system can search for a document 2 of “ . . .   (Use of portable phones in trains is banned) . . . ” containing  as a relevant document.
However, the prior art 1, which uses the word dictionary for word extraction, has two problems which will be mentioned below.
The first problem is that, when a word not listed in the word dictionary indicates the seed document's essential contents (which will be referred to as the central concept, hereinafter), the central concept cannot be searched for accurately even when a similarity search is carried out with the other words, because the essential word cannot be extracted as a search word from the seed document. In other words, when the information desired by the user is a new word not listed in the word dictionary, the search undesirably returns a document or documents whose concepts are shifted from the target central concept.
The second problem is that, even when the word desired by the user is listed in the word dictionary, a document or documents having concepts shifted from the central concept may be undesirably retrieved depending on how the words are extracted. For example, words such as , , and  are extracted from the above document 1 of “ . . .  . . . ”. However, there is a likelihood that a document 3 of “ . . .   (I got advice about how to talk on the phone) . . . ” is undesirably calculated low in its similarity because the word  cannot be extracted.
This results from the fact that search words are all extracted from the word dictionary.
The problems in the prior art 1 have been explained above.
For the purpose of solving the above problems, there has been suggested a technique (which will be referred to as the prior art 2, hereinafter) in Japanese Patent Application No. 9-309078, by which character strings each having n consecutive characters of one type such as ‘Kanji’ or ‘Katakana’ (which strings will be referred to as the n-grams, hereinafter) are mechanically extracted according to the character types to search for a relevant document or documents, without using any word dictionary.
In the prior art 2, the way n-grams are extracted is changed according to the character type so as to extract meaningful n-grams (which will be referred to as characteristic strings, hereinafter). For example, 2-grams are mechanically extracted from a character string of Kanji characters (which string will be referred to as a Kanji character string, hereinafter), while from character strings of Katakana characters (which strings will be referred to as Katakana character strings, hereinafter) the longest string of consecutive Katakana characters (which string will be referred to as a Katakana longest character string, hereinafter) is extracted as it is. In this case, characteristic strings such as , , , , , and  are extracted from the above document 1 of “ . . .   . . . ” as a seed document. That is, since the character string  is also extracted without being missed, even the document 3 of “ . . .   . . . ” can be retrieved with a correctly calculated similarity.
In the prior art 2, however, an n-gram may be extracted across the word boundaries within a Kanji character string that forms a compound word. For this reason, this search method assigns a similarity to a document that is not similar to the seed document in contents, which results in a problem that a document not associated with the seed document is undesirably retrieved. For example, a similarity is calculated for the characteristic string  extracted from the document 1 of “ . . .   . . . ” as a seed document, which undesirably results in erroneous retrieval of a document 4 of “ . . .   (In order to prevent charging, it must be grounded.) . . . ” as a relevant document.
For solving the above problem, there has been suggested a technique (which will be referred to as the prior art 3, hereinafter) for extracting a characteristic string using statistical information of 1-gram, as shown in a Journal of the Information Processing Society of Japan, pp. 2286 to 2297, Vol. 38, No. 11, November 1997.
In the prior art 3, with respect to each of the 1-grams appearing in a document to be registered, a probability of the 1-gram forming a head of a word (which probability will be referred to as a head-position probability, hereinafter) as well as a probability of the 1-gram forming a tail of a word (which probability will be referred to as a tail-position probability, hereinafter) are calculated in advance at the time of registering the document. In this case, it is assumed that a word consists of a string of a single type of characters such as Kanji or Katakana (which string will be referred to as a single character type string, hereinafter) and is delimited at a character type boundary such as the boundary between Kanji and Katakana, and that the 1-gram located directly after the character type boundary is regarded as a head 1-gram in a word and the 1-gram located directly before the character type boundary is regarded as a tail 1-gram in a word.
For example, with regard to the Kanji character string  delimited at a character type boundary and extracted from the above document 1 of “ . . .   . . . ”,  is a head 1-gram in the word and  is a tail 1-gram in the word.
For searching for a relevant document or documents, a single character type string is first extracted from a specified seed document. Next a probability of division of the single character type string between continual two of 1-grams in the single character type string (which probability will be referred to as a division probability, hereinafter) is calculated on the basis of a tail-position probability of front one of the continual two 1-grams in the single character type string and a head-position probability of rear one thereof. When the value of a calculated division probability exceeds a predetermined value (which will be referred to as a division threshold, hereinafter), the system performs division of the single character type string thereat.
Explanation will be made as to detailed processing operations of the prior art 3 with a division threshold of 0.050.
First of all, with respect to each of the 1-grams appearing in all documents to be registered, the system counts an occurrence frequency, the number of times of occurrence at the heads of words (which will be referred to as the head-position frequency, hereinafter) and the number of times of occurrence at the tails of words (which will be referred to as the tail-position frequency, hereinafter) at the time of registering the documents and then stores the counted values in an occurrence information file. In the case of the above document 1, the occurrence information obtained for  is an occurrence frequency of 1, a head-position frequency of 1 and a tail-position frequency of 0. FIG. 2 shows an exemplary occurrence information file.
Thereafter, looking up the above occurrence information file, the system calculates head and tail probabilities of each 1-gram and stores them in an occurrence probability file. For example, a head-position probability of 1-gram  is 768/4,740=0.16 and a tail-position probability of 1-gram  is 492/4,740=0.10. FIG. 3 shows an exemplary occurrence probability file.
Explanation will next be made as to how to search for a document or documents in the prior art 3 by referring to a single character type string of  as an example.
First, three pairs of 1-grams, (, ), (, ) and (, ), are extracted from the single character type string of . For each 1-gram pair, the system acquires a tail-position probability of the front 1-gram and a head-position probability of the rear 1-gram from the occurrence probability file previously created at the time of the document registration, and calculates a division probability based on the acquired head-position and tail-position probabilities.
FIG. 4 shows how to calculate division probabilities for the three 1-gram pairs extracted from . In this example, the division probabilities of (, ), (, ) and (, ) are calculated as 0.011, 0.054 and 0.005 respectively. Since the division probability 0.054 of (, ) is larger than the division threshold of 0.050, division is carried out between  and . On the other hand, the division probabilities of (, ) and (, ) are 0.011 and 0.005 respectively. Since these are smaller than the division threshold of 0.050, no division is carried out between these 1-grams. As a result,  is divided between  and  into two characteristic strings of  and .
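The threshold rule of the prior art 3 can be illustrated with a minimal Python sketch. The function name `split_by_threshold` and the placeholder string "ABCD" are assumptions made for illustration only (the actual Kanji string of the example is not reproduced here); the three division probabilities and the threshold of 0.050 are the values quoted above. How each division probability is derived from the tail-position and head-position probabilities is left to FIG. 4 and is not modeled.

```python
def split_by_threshold(text, division_probs, threshold):
    """Split `text` wherever a division probability exceeds `threshold`.

    division_probs[k] is the division probability between text[k] and
    text[k + 1], so one value is expected per adjacent character pair.
    """
    if len(division_probs) != len(text) - 1:
        raise ValueError("need one division probability per adjacent pair")
    pieces = []
    start = 0
    for k, prob in enumerate(division_probs):
        if prob > threshold:
            # Divide between text[k] and text[k + 1].
            pieces.append(text[start:k + 1])
            start = k + 1
    pieces.append(text[start:])
    return pieces


# "ABCD" is a hypothetical stand-in for the four-character Kanji string.
result = split_by_threshold("ABCD", [0.011, 0.054, 0.005], 0.050)
print(result)
```

Only the middle pair exceeds the threshold, so the placeholder string is divided once, in the middle. Because the decision depends on the absolute value of each probability, any pair that happens to exceed 0.050 is divided regardless of its neighbors.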
The detailed processing method in the prior art 3 has been explained above. In this way, consideration is paid in the prior art 3 not to search for a document or documents not similar in contents to the seed document, by extracting characteristic strings using 1-gram statistical information so as not to extract an unsuitable characteristic string across words.
However, the prior art 3 has a problem that, since the system judges division or non-division on the basis of the absolute value of the division probability, an extraction accuracy of the characteristic string as a word is low. For example, with respect to a single character type string of , the system extracts a pair of 1-grams of (, ) and calculates 0.054 as a division probability between the 1-grams.
Since the calculated value is greater than the division threshold of 0.050, division is erroneously carried out, as between  and  in  (which division will be referred to as the erroneous division, hereinafter), with the result that the system undesirably extracts two unsuitable characteristic strings. This leads to a problem that the system undesirably retrieves also a document or documents related to  (‘o-bi’ in Japanese pronunciation) as a relevant document or documents.
As has been explained above, the word extracting method using the word dictionary as in the prior art 1 has a problem that, when a word not listed in the word dictionary indicates the main concept of the seed document, the system unfavorably retrieves a document or documents shifted from the main concept.
Further, the method for simply extracting n-grams from the single character type string according to the character type as in the prior art 2 has a problem that, since the system undesirably extracts n-grams across word boundaries from a Kanji character string that forms a compound word, the system undesirably retrieves a document or documents not associated with the seed document as a relevant document or documents.
Furthermore, the method for calculating the division probability using the 1-gram statistical information and judging division or non-division on the basis of the absolute value of the calculated division probability as in the prior art 3 also has a problem that, since the extraction accuracy of the characteristic string as a word is low, search noise is mixed in, thus resulting in erroneous retrieval of a document or documents shifted from the target main concept as a relevant document or documents.
In order to solve the above problems in the prior arts, it is therefore an object of the present invention to provide a method and system for extracting a characteristic string with less erroneous division.
Another object of the present invention is to provide a method and system for extracting a characteristic string with less erroneous division and thus with less search noise to realize searching of a relevant document or documents with less shift from the main concept of a seed document.
In order to solve the above problems, the characteristic string extracting method in accordance with the present invention extracts a characteristic string from a seed document through operations of steps which follow.
More specifically, the characteristic string extracting method of the present invention includes steps of registering a document and extracting a characteristic string from a seed document,
wherein the document registration step further includes steps of:
reading a document to be registered for document registration (step 1);
dividing character strings in the registered document read in the document reading step by character type boundaries between Kanji and Katakana to extract single character type strings (step 2);
with respect to each of the single character type strings extracted in the above single character type string extracting step, judging a character type thereof and, when the character type is determined to be Kanji or Katakana, with respect to a predetermined length of n-gram in the registered document, counting an occurrence frequency, a frequency of occurrence as a word head (which will be referred to as the head-position frequency, hereinafter), a frequency of occurrence as a word tail (which will be referred to as the tail-position frequency, hereinafter), and a frequency of occurrence of the n-gram itself as a word (which will be referred to as the independent frequency, hereinafter) (step 3);
adding n-gram occurrence information counted by the above occurrence information counting step to occurrence information of the n-gram of the document already registered in a database to calculate occurrence information on the entire database and storing the calculated information in an associated occurrence information file (step 4);
with respect to the n-gram which was counted in its occurrence information in the above occurrence information counting step, acquiring occurrence information of the entire database from the associated occurrence information file to calculate a probability thereof as a word head (which will be referred to as the head-position probability, hereinafter), a probability thereof as a word tail (which will be referred to as the tail-position probability, hereinafter), and a probability of occurrence as the n-gram itself (which will be referred to as the independent probability, hereinafter) and storing the calculated probabilities in the associated occurrence probability file (step 5);
extracting a predetermined length of n-gram from the single character type string extracted in the above single character type string extracting step to count an occurrence frequency in the registered document (step 6);
storing the occurrence frequency counted in the above occurrence frequency counting step in an associated occurrence frequency file (step 7); and
extracting a characteristic string from a seed document,
wherein the characteristic string extracting step further includes steps of:
reading the seed document (step 8);
dividing a character string in the seed document read in the above seed document reading step by character type boundaries to extract single character type strings (step 9); and
with respect to the single character type string extracted in the searching single character type string extracting step, judging a character type thereof (step 10),
wherein, when the character type is of Kanji or Katakana, the system reads the occurrence probability file to acquire an independent probability of a character string ranging from the head of the single character type string to an i-th character, an independent probability of a character string of the head to (i+1)th characters, a head-position probability of the (i+1)th character, and a head-position probability of an (i+2)th character; calculates a probability of division of the single character type string at the i-th character (which will be referred to as a division probability, hereinafter) as a product of the independent probability of the character string of the head to the i-th characters and the head-position probability of the (i+1)th character; calculates a division probability at the (i+1)th character as a product of the independent probability of the character string of the head to the (i+1)th characters and the head-position probability of the (i+2)th character; compares the division probability at the i-th character with the division probability at the (i+1)th character to set the position having the larger division probability as a single character type string division point (which will be referred to as the division point, hereinafter); and extracts the character string from the head to the division point as a characteristic string; when the character type is not of Kanji or Katakana, the system extracts the single character type string per se as the characteristic string; and the system repeats similar operations over the remaining character strings other than the extracted characteristic string to extract another characteristic string.
In order to attain the above objects, further, a method for searching for a relevant document or documents in accordance with the present invention extracts a characteristic string through the above steps to search for a document or documents similar to a seed document using the extracted characteristic string.
More specifically, the relevant document searching method of the present invention includes steps of registering a document and searching for a document or documents similar to a seed document,
wherein the document registration step further includes steps of:
reading a document to be registered for document registration (step 1);
dividing character strings in the registered document read in the document reading step by character type boundaries between Kanji and Katakana to extract single character type strings (step 2);
with respect to each of the single character type strings extracted in the above single character type string extracting step, judging a character type thereof and when determining as a Kanji or katakana type, with respect to a predetermined length of n-gram in the registered document, counting an occurrence frequency, a frequency of occurrence as a word head (which will be referred to as the head-position frequency, hereinafter), a frequency of occurrence as a word tail (which will be referred to as the tail-position frequency, hereinafter), and a frequency of occurrence of the n-gram itself as a word (which will be referred to as the independent frequency, hereinafter) (step 3);
adding n-gram occurrence information counted by the above occurrence information counting step to occurrence information of the n-gram of the document already registered in a database to calculate occurrence information on the entire database and storing the calculated information in an associated occurrence information file (step 4);
with respect to the n-gram which was counted in its occurrence information in the above occurrence information counting step, acquiring occurrence information of the entire database from the associated occurrence information file to calculate a probability thereof as a word head (which will be referred to as the head-position probability, hereinafter), a probability thereof as a word tail (which will be referred to as the tail-position probability, hereinafter), and a probability of occurrence as the n-gram itself (which will be referred to as the independent probability, hereinafter) and storing the calculated probabilities in the associated occurrence probability file (step 5);
extracting a predetermined length of n-gram from the single character type string extracted in the above single character type string extracting step to count an occurrence frequency in the registered document (step 6);
storing the occurrence frequency counted in the above occurrence frequency counting step in an associated occurrence frequency file (step 7); and
extracting a characteristic string from a seed document,
wherein the relevant document searching step further includes steps of:
reading the seed document (step 8);
dividing a character string in the seed document read in the above seed document reading step by character type boundaries to extract single character type strings (step 9);
with respect to the single character type string extracted in the searching single character type string extracting step, judging a character type thereof (step 10),
wherein, when the character type is of Kanji or Katakana, the system reads the occurrence probability file to acquire an independent probability of a character string ranging from the head of the single character type string to an i-th character, an independent probability of a character string of the head to (i+1)th characters, a head-position probability of the (i+1)th character, and a head-position probability of an (i+2)th character; calculates a probability of division of the single character type string at the i-th character (which will be referred to as a division probability, hereinafter) as a product of the independent probability of the character string of the head to the i-th characters and the head-position probability of the (i+1)th character; calculates a division probability at the (i+1)th character as a product of the independent probability of the character string of the head to the (i+1)th characters and the head-position probability of the (i+2)th character; compares the division probability at the i-th character with the division probability at the (i+1)th character to set the position having the larger division probability as a single character type string division point (which will be referred to as the division point, hereinafter); and extracts the character string from the head to the division point as a characteristic string; when the character type is not of Kanji or Katakana, the system extracts the single character type string per se as the characteristic string; and the system repeats similar operations over the remaining character strings other than the extracted characteristic string to extract another characteristic string,
counting occurrence frequencies of all characteristic strings extracted in the above characteristic string extracting step (step 11);
reading the occurrence frequency file for all the characteristic strings extracted in the characteristic string extracting step to acquire occurrence frequencies of the characteristic strings in each document in the database (step 12);
with respect to the characteristic strings extracted in the above characteristic string extracting step, calculating their occurrence frequencies in the seed document counted in the above within-seed-document occurrence frequency counting step as well as similarities between the seed document and the documents in the database on the basis of a predetermined computation expression with use of the occurrence frequencies of the documents within the database acquired in the above within-database occurrence frequency acquiring step (step 13); and
outputting a searched result on the basis of the similarities calculated in the above similarity calculating step (step 14).
The principle of the present invention based on the above document searching method will now be explained.
In the present invention, the steps 1 to 7 are carried out for document registration.
First of all, in the document reading step 1, the system reads a document to be registered. In the next single character type string extracting step 2, the system divides character strings in the registration document read in the above document reading step 1 by character type boundaries of Kanji or Katakana to extract character strings of a single character type. For example, single character type strings of , , , , , , , and  are extracted from the above document 2 of “ . . .   . . . ”.
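The single character type string extracting step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names, the four character classes, and the Unicode block ranges used to recognize Kanji, Katakana and Hiragana are all choices made here, and the sample sentence is an illustrative string that does not come from the documents discussed above.

```python
from itertools import groupby


def char_type(ch):
    """Classify one character by Unicode block (assumed ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block, includes the long-vowel mark
        return "katakana"
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    return "other"


def single_type_strings(text):
    """Cut `text` at every character type boundary.

    Returns a list of (character type, string) pairs in document order.
    """
    return [(ctype, "".join(chars)) for ctype, chars in groupby(text, key=char_type)]


# Illustrative sentence only; it is not taken from document 1 or document 2.
strings = single_type_strings("東京タワーへ行く")
for ctype, s in strings:
    print(ctype, s)
```

`itertools.groupby` groups consecutive characters that share a type, so each group is exactly one single character type string delimited by the boundaries on either side.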
In the occurrence information counting step 3, next, the system judges the character type of each of the single character type strings extracted in the single character type string extracting step 2, and if the character type is of Kanji or Katakana, the system counts, for each n-gram of a predetermined length n in the registration document, an occurrence frequency, a head-position frequency, a tail-position frequency and an independent frequency thereof. For example, assume that the system counts occurrence frequencies, head-position frequencies, tail-position frequencies and independent frequencies of 1-grams and 2-grams from Kanji and Katakana character strings. Then with respect to the single character type strings extracted in the single character type string extracting step 2, the system counts 1 for the occurrence frequency of , 1 for its head-position frequency, 0 for its tail-position frequency and 0 for its independent frequency, and counts 1 for the occurrence frequency of , 1 for its head-position frequency, 0 for its tail-position frequency and 0 for its independent frequency.
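The counting in step 3 might look like the following Python sketch. The function name and the dictionary layout are hypothetical, and the strings "ABCD" and "AB" are placeholders for Kanji or Katakana single character type strings. One assumption is made explicit: an n-gram that makes up a whole single character type string is counted as a head, as a tail, and as an independent occurrence at the same time.

```python
from collections import defaultdict


def count_occurrence_info(strings, lengths=(1, 2)):
    """Count occurrence, head-position, tail-position and independent
    frequencies of n-grams taken from single character type strings."""
    info = defaultdict(
        lambda: {"occurrence": 0, "head": 0, "tail": 0, "independent": 0}
    )
    for s in strings:
        for n in lengths:
            for i in range(len(s) - n + 1):
                entry = info[s[i:i + n]]
                entry["occurrence"] += 1
                at_head = i == 0            # n-gram starts the string
                at_tail = i + n == len(s)   # n-gram ends the string
                if at_head:
                    entry["head"] += 1
                if at_tail:
                    entry["tail"] += 1
                if at_head and at_tail:     # n-gram is the whole string
                    entry["independent"] += 1
    return dict(info)


# Placeholder strings standing in for Kanji or Katakana strings.
info = count_occurrence_info(["ABCD", "AB"])
print(info["A"])
print(info["AB"])
```

With these placeholders, the 1-gram "A" occurs twice, both times at a head and never at a tail, which mirrors the pattern of counts described above for a 1-gram that starts a longer string.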
In the next occurrence information file creating/registering step 4, the system adds the occurrence information of the n-grams counted in the occurrence information counting step 3 to the occurrence information on the documents already registered in the database and stores the accumulated occurrence information in the associated occurrence information file. FIG. 5 shows an exemplary occurrence information file. The illustrated occurrence information file is an example in which the occurrence information extracted in the above occurrence information counting step 3 is stored. The illustrated occurrence information file shows information on an occurrence frequency of 4,740, a head-position frequency of 768, a tail-position frequency of 492 and an independent frequency of 42 for the 1-gram ; and also information on an occurrence frequency of 462, a head-position frequency of 419, a tail-position frequency of 52 and an independent frequency of 48 for the 2-gram .
In the occurrence probability file creating/registering step 5, the system calculates occurrence probabilities of the n-grams whose occurrence information was stored in the occurrence information file creating/registering step 4, and stores the probabilities in the associated occurrence probability file. With respect to the 1-gram , for example, as shown in FIG. 5, the system has counted 4,740 for its occurrence frequency, 768 for its head-position frequency, 492 for its tail-position frequency and 42 for its independent frequency, and thus calculates 0.16 (=768/4,740) for its head-position probability, 0.10 (=492/4,740) for its tail-position probability and 0.01 (=42/4,740) for its independent probability. FIG. 6 shows an exemplary occurrence probability file. The illustrated occurrence probability file shows an example in which the occurrence probabilities calculated from the occurrence information counted in the above occurrence information counting step 3 are stored. That is, the example shows information on a head-position probability of 0.16, a tail-position probability of 0.10 and an independent probability of 0.01 for the 1-gram , and also information on a head-position probability of 0.90, a tail-position probability of 0.11 and an independent probability of 0.10 for the 2-gram .
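The arithmetic of step 5 for the 1-gram can be checked with a short Python sketch. The function name and the dictionary keys are hypothetical; the frequencies are the FIG. 5 values quoted above, and each probability is assumed to be the corresponding frequency divided by the occurrence frequency, as in the quoted quotients.

```python
def occurrence_probabilities(info):
    """Turn the frequencies of one n-gram into head-position,
    tail-position and independent probabilities."""
    occurrence = info["occurrence"]
    if occurrence == 0:
        raise ValueError("occurrence frequency must be positive")
    return {
        "head": info["head"] / occurrence,
        "tail": info["tail"] / occurrence,
        "independent": info["independent"] / occurrence,
    }


# Frequencies quoted for the 1-gram in the exemplary occurrence information file.
one_gram = {"occurrence": 4740, "head": 768, "tail": 492, "independent": 42}
probs = occurrence_probabilities(one_gram)
rounded = {name: round(value, 2) for name, value in probs.items()}
print(rounded)
```

Rounded to two decimal places, the three values agree with the head-position, tail-position and independent probabilities stated for the 1-gram.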
In the next occurrence frequency counting step 6, the system extracts a predetermined length of n-grams from all single character type strings extracted in the single character type string extracting step 2 and counts occurrence frequencies thereof in the registration document. And in the occurrence frequency file creating/registering step 7, the system stores the occurrence frequencies of the n-grams extracted in the above occurrence frequency counting step 6 in the corresponding occurrence frequency file.
FIG. 24 shows a procedure of operations of creating an occurrence frequency file with use of the aforementioned document 2 of “ . . .   . . . ” as an example.
First, in the single character type string extracting step 2, the system extracts all single character type strings from the document 2 as a registration document.
In the next occurrence frequency counting step 6, the system extracts a predetermined length of n-gram from all the single character type strings extracted in the above single character type string extracting step 2, and counts occurrence frequencies thereof in the registration document. In the illustrated example, it is assumed that the system extracts n-grams having lengths of 3 or less from the single character type strings. In this case, the system extracts ,  and  having a length of 1;  and  having a length of 2; and  having a length of 3 from  included in single character type strings 2404; and counts occurrence frequencies thereof in the document 2. As a result, the system counts 2 for the occurrence frequency of  in the document 2 and 1 for the occurrence frequency of  in the document 2.
In the occurrence frequency file creating/registering step 7, the system stores the occurrence frequencies of the n-grams extracted in the occurrence frequency counting step 6 in the corresponding occurrence frequency file. As a result, the system stores in the occurrence frequency file the occurrence frequencies of the n-grams from the document 2 in combination with an identification number of the registration document, in the form of (2,2) for 1-gram , (2,1) for 1-gram , (2,1) for 1-gram , (2,1) for 2-gram , (2,1) for 2-gram  and (2,1) for 3-gram . In this case,  (2,1) means that 2-gram  appears once in the document having an identification number 2.
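Steps 6 and 7 can be sketched in Python as below. The names `count_ngrams`, `register_document` and `frequency_file` are hypothetical, the in-memory dictionary merely stands in for the occurrence frequency file, and the strings "ABC" and "A" are placeholders for the single character type strings of a registration document with identification number 2.

```python
from collections import Counter, defaultdict


def count_ngrams(strings, max_len=3):
    """Count every n-gram of length 1..max_len in the given strings."""
    counts = Counter()
    for s in strings:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts


def register_document(frequency_file, doc_id, strings, max_len=3):
    """Append one (document id, occurrence frequency) pair per n-gram."""
    for gram, freq in count_ngrams(strings, max_len).items():
        frequency_file[gram].append((doc_id, freq))


# Placeholder single character type strings for a document with id 2.
frequency_file = defaultdict(list)
register_document(frequency_file, 2, ["ABC", "A"])
print(frequency_file["A"], frequency_file["AB"], frequency_file["ABC"])
```

Here the 1-gram "A" appears in two strings of the same document and is therefore stored as (2, 2), while every other n-gram is stored as (2, 1), matching the (identification number, occurrence frequency) form described above.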
For searching operations, the system executes the steps 8 to 14.
First, in the seed document reading step 8, the system reads the document 1 as a seed document. In the next searching single character type string extracting step 9, the system divides a character string in the seed document (document 1) read in the seed document reading step 8 by character type boundaries to extract single character type strings.
In the characteristic string extracting step 10, the system judges the character type of each of the single character type strings extracted in the searching single character type string extracting step 9.
If the character type is Kanji or Katakana, the system reads the aforementioned occurrence probability file and acquires the independent probability of the character string from the head to the i-th character of the single character type string, the independent probability of the character string from the head to the (i+1)th character, the head-position probability of the (i+1)th character and the head-position probability of the (i+2)th character. The system then calculates a division probability at the i-th character as the product of the independent probability of the character string from the head to the i-th character and the head-position probability of the (i+1)th character, and calculates a division probability at the (i+1)th character as the product of the independent probability of the character string from the head to the (i+1)th character and the head-position probability of the (i+2)th character. Finally, the system compares the division probabilities at the i-th and (i+1)th characters, selects the position having the larger probability as a division point, and extracts the character string from the head to the division point as a characteristic string.
If the character type is not of Kanji or Katakana, then the system uses the single character type string itself as a characteristic string and repeats operations similar to the above to extract another characteristic string.
FIG. 8 shows an example of how to extract characteristic strings from the single character type string  extracted from the document 1. The system first calculates a division probability at the first character in  as a product of an independent probability of 0.01 for  and a head-position probability of 0.11 for , that is, approximately 0.001 (=0.01×0.11). Similarly, the system calculates a division probability at the second character as a product of an independent probability of 0.10 for  and a head-position probability of 0.36 for , that is, 0.036 (=0.10×0.36). The system then compares these division probabilities and divides the single character type string at the character having the larger probability. In this case, since the division probability 0.036 at the second character is the larger, the single character type string  is divided into  and .
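The comparison of the two division probabilities can be sketched as follows. The probabilities 0.01, 0.11, 0.10 and 0.36 are those given for FIG. 8; the placeholder string "ABCD", the dictionaries standing in for the occurrence probability file, the function names and the handling of ties (the later position is chosen) are assumptions made for this sketch.

```python
def division_probability(s, i, independent, head_position):
    """Division probability after the i-th character (1-based) of s.

    It is the product of the independent probability of the first i characters
    and the head-position probability of the (i+1)th character.
    """
    return independent[s[:i]] * head_position[s[i]]


def divide(s, i, independent, head_position):
    """Compare the division probabilities at positions i and i+1 and cut at the larger."""
    p_i = division_probability(s, i, independent, head_position)
    p_next = division_probability(s, i + 1, independent, head_position)
    cut = i if p_i > p_next else i + 1  # ties go to i+1 (assumed, not stated in the text)
    return s[:cut], s[cut:]


# Placeholder characters (assumed) carrying the probabilities quoted for FIG. 8.
independent = {"A": 0.01, "AB": 0.10}
head_position = {"B": 0.11, "C": 0.36}

characteristic, remainder = divide("ABCD", 1, independent, head_position)
```

Here the division probability at the second character (0.10×0.36) exceeds that at the first character (0.01×0.11), so the placeholder string is cut after its second character, as in FIG. 8.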
FIG. 9 shows an example of the single character type string  which cannot be divided suitably in the prior art 3, and the dividing operation of the present invention is explained for it. First, the system of the present invention calculates 0.0001 (=0.01×0.01) for a division probability at the first character in  as a product of an independent probability (0.01) of  and an independent probability (0.01) of . The system also calculates a division probability at the second character, that is, an occurrence probability of  as a single character type string itself, as 0.10 for the independent probability of . The system compares these probabilities and divides the single character type string at the character having the larger probability. In this case, since the independent probability 0.10 of  is larger,  is divided at the second character, which means that the single character type string  is eventually not divided and is extracted as a group.
In this way, since comparing the division probabilities when dividing the single character type string enables word division that accurately reflects actual occurrence circumstances in the database, the present invention can reduce unsuitable division far more than the aforementioned prior art 3, which performs the division based on the absolute values of the division probabilities.
In the within-seed-document occurrence frequency counting step 11, next, the system counts occurrence frequencies of the characteristic strings in the seed document, extracted in the above characteristic string extracting step 10.
In the within-database occurrence frequency acquiring step 12, with respect to the characteristic strings extracted in the characteristic string extracting step 10, the system looks up the above occurrence frequency file and acquires occurrence frequencies in the documents within the database.
In the similarity calculating step 13, with regard to the characteristic strings extracted in the characteristic string extracting step 10, the system calculates the similarity of each document in the database to the seed document on the basis of the occurrence frequencies in the seed document counted in the within-seed-document occurrence frequency counting step 11 and the occurrence frequencies in the documents of the database acquired in the within-database occurrence frequency acquiring step 12.
For the calculation of the similarities, for example, such a similarity computation expression (1) as disclosed in JP-A-6-110948 and given below may be employed.
A similarity S(i) to document i is expressed as follows.

$$S(i) = \frac{\sum_{j=1}^{n} \bigl(U(j) \times R(j)\bigr)}{\sum_{j=1}^{n} U(j)^{2} \times \sum_{j=1}^{n} R(j)^{2}} \tag{1}$$
Here, U(j) indicates a normalized weight of the j-th n-gram in the seed document and is calculated from the occurrence frequencies of the n-grams in the seed document. R(j) indicates a normalized weight of the j-th n-gram in a document in the database and is calculated from the occurrence frequencies of the n-grams in the documents in the database. The 'normalized weight' is an n-gram occurrence bias in the database: the larger the value of the normalized weight, the more the occurrences of the n-gram are biased toward a specific document. How to calculate the normalized weight is explained in JP-A-6-110948, and thus explanation thereof is omitted herein. n indicates the number of all the documents in the database.
When the document 1 is specified as the seed document and the similarity S(i) for each document i is calculated using the similarity expression (1), the results are:
S(1)=1.0
S(2)=0.262
S(3)=0.048
S(4)=0.0
As a result, in the search result output step 14 the documents are arranged in descending order of similarity, so that documents 1, 2 and 3 are listed in this order. The document 4 is not output as a search result because it has a similarity of 0.
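The search result output step 14 can be sketched as follows, using the similarities S(1) to S(4) given above. The function name and the dictionary of scores are assumptions made for this sketch.

```python
def rank_results(similarities):
    """Drop documents whose similarity is 0 and sort the rest in descending order."""
    hits = [(doc_id, s) for doc_id, s in similarities.items() if s > 0.0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Similarities from the example in which the document 1 is the seed document.
scores = {1: 1.0, 2: 0.262, 3: 0.048, 4: 0.0}
ranked = rank_results(scores)
```

The resulting list contains the documents 1, 2 and 3 in this order and omits the document 4.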
As has been explained above, the similar document searching method of the present invention based on the characteristic string extracting method can mechanically extract character strings from the single character type string without using a word dictionary, unlike the prior art 1. Therefore the present invention can perform the searching operation without missing any word and thus can accurately search for the concept of the seed document.
Further, unlike the prior art 2, which simply extracts n-grams from a single character type string according to the character types, the present invention extracts a group of meaningful n-grams on the basis of statistical information and can realize more accurate searching for the concept of the seed document.
Further, unlike the prior art 3 for performing the division based on the absolute values of the division probabilities, the present invention compares the division probabilities and performs the division based on the larger probability. Accordingly the present invention can realize word division accurately reflecting actual occurrence circumstances in the database and can remarkably reduce the possibility of unsuitable word division. In this way, since the present invention can avoid searching of unsuitable characteristic strings when compared with the prior art 3, the invention can suitably search for the concept of the seed document and can search for a relevant document or documents at a high speed.