1. Field of the Invention
The present invention relates to a dictionary and index creating system for creating a machine-retrievable dictionary and index available for document managing systems, document editing systems and others which work to manage, edit and retrieve document information accumulated as electronic data, through the use of computers.
2. Description of the Related Art
Recently, owing to the widespread use of word processors, personal computers and large-capacity, low-cost storage media such as CD-ROMs, and the development of networks represented by Ethernet, full-text (whole-passage) databases, in which the character information of all or most of the documents (texts) is expressed and accumulated as character code strings, have come into practical and widespread use.
In prior document databases, documents are commonly retrieved by keyword retrieval, which uses keywords prepared for each document. However, this approach has several problems: the laborious keyword preparation makes it difficult to keep up with the growth of the accumulated documents, the keywords become outdated with the passage of time, and relevant documents are missing from the retrieval result because the person preparing the keywords and the person conducting the retrieval interpret them differently. For these reasons, interest has lately turned to so-called full-text retrieval, which does not require keyword preparation.
Full-text retrieval matches a character string given by the user as the retrieval condition against the character strings constituting the accumulated documents and outputs the document(s) satisfying the retrieval condition, so there is no need to prepare keywords in advance. Various methods have been proposed to realize full-text retrieval. A detailed overview is given in, for example, William B. Frakes and Ricardo Baeza-Yates (eds.), “Information Retrieval: Data Structures and Algorithms”, Prentice Hall (1992). From the viewpoint of the index prepared prior to retrieval for the documents undergoing retrieval or being the target of retrieval (hereinafter referred to as retrieval documents), these methods are roughly classified into the following three.
(1) Full-text Scan Method
(2) Signature File Method
(3) Transposition File Method
Of these methods, the full-text scan method (1) matches the retrieval condition character string against the retrieval documents each time a query is issued and returns the retrieval result. There is therefore no need to prepare an index for the retrieval in advance, which saves storage capacity and allows retrieval under complicated conditions. On the other hand, the retrieval speed is relatively slow as compared with the other methods, and from this viewpoint the full-text scan method is not suited to retrieval over a large volume of documents.
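As a minimal sketch of the full-text scan method (illustrative only; the function name `full_text_scan` and the sample data are assumptions, not taken from the cited literature), the retrieval condition character string is matched against every retrieval document each time a query is issued, with no index prepared in advance:

```python
def full_text_scan(documents, query):
    """Return (document number, 0-based position) for every occurrence of query."""
    hits = []
    for doc_id, text in enumerate(documents):
        start = text.find(query)
        while start != -1:
            hits.append((doc_id, start))
            # Continue one character later so overlapping matches are found too.
            start = text.find(query, start + 1)
    return hits


if __name__ == "__main__":
    docs = ["abcabc", "xbc"]  # hypothetical retrieval documents
    print(full_text_scan(docs, "bc"))
```

The cost of each query grows with the total length of the retrieval documents, which is the reason the method is slow for large collections.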
Furthermore, in the signature file method (2), a file called a signature file is constructed in advance as an index for retrieval, and this signature file is searched first to reduce the number of documents that must undergo full-text scanning. In comparison with the above-mentioned method (1), high-speed retrieval becomes feasible, but in general a signature file amounting to several tens of percent of the capacity of the retrieval documents must be constructed and retained.
Still further, the transposition file method (3), commonly known as the inverted file method, involves constructing in advance, as a retrieval index, a transposition file which records the documents in which each character, word or n-character succession (n-gram) occurs, or the positions of those occurrences within the documents, so that the retrieval is made through the use of only this transposition file (that is, without the use of the retrieval documents). This method permits extremely high-speed retrieval as compared with the methods (1) and (2). However, in the case that the retrieval documents are written in Japanese, where the boundaries between words are not clear unlike in the western languages, this method requires several times the capacity of the retrieval documents when the retrieval is conducted on the basis of n-character successions.
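The two-stage idea of the signature file method can be sketched as follows. This is only an illustration under stated assumptions: the signature here is built by superimposing hashed character bigrams into a fixed-width bit mask, which is one common construction and is not specified in the text above; the names `signature` and `signature_search` are hypothetical.

```python
import zlib


def signature(text, bits=64, n=2):
    """Superimpose one hashed bit per character n-gram into a bit mask (assumed scheme)."""
    sig = 0
    for i in range(len(text) - n + 1):
        gram = text[i:i + n].encode("utf-8")
        sig |= 1 << (zlib.crc32(gram) % bits)
    return sig


def signature_search(documents, signatures, query):
    """Filter by signature first, then confirm the survivors by a full-text scan."""
    q_sig = signature(query)
    result = []
    for doc_id, (text, sig) in enumerate(zip(documents, signatures)):
        # A document can contain the query only if all query bits are set.
        if sig & q_sig == q_sig and query in text:
            result.append(doc_id)
    return result


if __name__ == "__main__":
    docs = ["full text retrieval", "signature file method"]  # hypothetical data
    sigs = [signature(d) for d in docs]
    print(signature_search(docs, sigs, "text"))
```

The signature test may let through documents that do not actually contain the query (false matches), which is why the surviving documents are still scanned.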
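A minimal sketch of a transposition (inverted) file over n-character successions is given below, assuming a single retrieval document held as a Python string; the names `build_ngram_index` and `ngram_search` are hypothetical. Retrieval uses only the recorded positions, never the document itself:

```python
from collections import defaultdict


def build_ngram_index(text, n=2):
    """Record the 0-based start positions of every n-character succession."""
    index = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append(i)
    return index


def ngram_search(index, query, n=2):
    """Return start positions of query using only the index (needs len(query) >= n)."""
    if len(query) < n:
        raise ValueError("query must contain at least n characters")
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    candidates = set(index.get(grams[0], []))
    for offset, gram in enumerate(grams[1:], start=1):
        positions = set(index.get(gram, []))
        # Keep a candidate only if the next n-gram follows at the right offset.
        candidates = {p for p in candidates if p + offset in positions}
    return sorted(candidates)


if __name__ == "__main__":
    idx = build_ngram_index("abcabcd")  # hypothetical document
    print(ngram_search(idx, "bca"))
```

Because almost every character position produces one index entry per value of n, the index grows quickly, which illustrates the capacity problem described above.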
Since each of the above-mentioned three methods has an advantage and a disadvantage, it is necessary to use them properly to match each of the document retrieval requests. For instance, for the retrieval of an extremely large volume of document including an extremely large number of characters, such as the whole text of an Unexamined Patent Publication, the high-speed retrieval is essential, and in this case, the above-mentioned method (3) is most suitable.
In order to apply the method (3) to retrieval documents written in languages without spaces between words, such as Japanese and Chinese, a method of constructing a transposition file of one- or two-character successions to realize a high-speed document retrieval system has been proposed in “A Fast Full-Text Search Method for Japanese Text Database” written by Chuichi Kikuchi, Electronic Information Communication Society Paper Magazine, Vol. J75-D-I, No. 9, pp. 836-846 (1992). In addition, a method of constructing a transposition file of one- to three-character successions for the preparation of an index when necessary has been proposed in “Development of n-gram Type Large-Scale Full-text Retrieval Method” written by Sugaya, Kawaguchi, Hatayama, Tada, Kato, Information Processing Society of Japan 53rd National Conference Pre-Draft Collection, 3-235, (1996).
However, according to the prior methods, the index file drawn up comes to twice the capacity of the retrieval documents, and if the number of characters organizing the character succession is increased for the purpose of speed-up, the capacity of the index file increases further, which makes these methods difficult to realize in the case that a limitation is imposed on the usable capacity of a memory unit. Moreover, in the case of a retrieval condition character string such as “katakana (characters inherent in Japanese)” with long character strings and many high-frequency character chains, the amount of data to be retrieved in the index file increases, with the result that the retrieval speed is reduced.
As one possible way to solve these problems, Japanese Unexamined Patent Publication No. 8-249354 discloses a method in which words are cut out even from Japanese retrieval documents through the use of a large-scale word dictionary to construct a transposition file in the same way as for the western languages, so that the full-text retrieval is carried out at high speed on the basis of an arbitrary retrieval condition character string through the use of the constructed transposition file. This method will be referred to hereinafter as a prior index retrieval method.
In the prior index retrieval method, a word index storing the occurrence (appearance) positions of character strings respectively matching with words in the retrieval documents and all of only the maximal (longest) index elements of the index elements paired with the words is constructed as a maximal extension index through the use of a word dictionary being a set of a definite number of words (character strings), thereby arranging index information by far smaller than an inverted file of n-character succession (n-gram string) and having a capacity similar to the capacity of the retrieval documents.
In the retrieval, a string of dictionary words such that each of the characters in the retrieval condition character string is included in at least one of the words is obtained as a cover of the retrieval condition character string. For each of the words organizing the cover, the set of index elements corresponding to its extension words, that is, the dictionary words including that word, is obtained. Of the strings formed from the index element sets corresponding to the words, only the index element strings appearing in succession in the retrieval documents are obtained, and the matching start position of the leading (first) index element is outputted as a retrieval result. Owing to this retrieval, in the case where the retrieval condition character string coincides with a word in the dictionary, or where it can be covered with a small number of dictionary words which appear at a low frequency in the documents, it is possible to conduct the full-text retrieval processing at a relatively high speed and further to considerably overcome the disadvantage of the aforesaid transposition file based on the character chain.
A description will be made hereinbelow of a prior word index creating method and prior document retrieval system according to the prior index retrieval method. First of all, the description will begin with the prior word index creating method. FIG. 27 is a block diagram showing the entire arrangement of a prior word index creating system. In FIG. 27, reference numeral 401 represents a word dictionary storing a finite or definite number of character strings, numeral 402 designates a retrieval document storage for storing retrieval documents undergoing retrieval for which the index preparation (indexing) is made, and numeral 403 denotes a longest match word retrieving means for retrieving a word organizing the longest leftmost partial character string of the specified character strings. Further, numeral 404 depicts a character number storage area for storing the number of characters of the retrieved word and for subtracting the stored value by 1 each time the observing retrieval document position advances by one character.
Moreover, numeral 405 signifies a maximal index element creating means, which reads the retrieval documents from the retrieval document storage 402 and drives the longest match word retrieving means 403. At each character position of the retrieval documents, it passes to the longest match word retrieving means 403 a character string whose length equals the character number of the longest word in the word dictionary 401, so that the longest match words are retrieved successively. If the number of characters of the retrieved word exceeds the value of the character number storage area 404, a set of (a group made by) the word and its occurrence character positional range is outputted as an index element, and the number of characters of the retrieved word is stored in the character number storage area 404. Numeral 406 indicates an index element sorting means for sorting the sets of index elements outputted from the maximal index element creating means 405 by word, and numeral 407 stands for a word index for storing the sorted result of the index element sorting means 406.
An operation of the word index creating system thus arranged will be described hereinbelow with reference to the drawings using a simple dictionary and simple retrieval documents. FIG. 29 is an illustration of an example showing a list of words organizing a word dictionary taken in a dictionary type index retrieving method, FIG. 30 is an illustration of an example of retrieval documents, FIG. 31 is a conceptual illustration of processing for deriving maximal index elements from the FIG. 30 retrieval document through the use of the word dictionary composed of the words shown in FIG. 29, and FIG. 32 is a conceptual illustration of the contents of the word index drawn up from the FIG. 30 retrieval documents using the word dictionary comprising the words shown in FIG. 29.
First, prior to the index preparation, the dictionary data corresponding to the contents shown in FIG. 29 is stored in the word dictionary 401, and the FIG. 30 retrieval document data is put in the retrieval document storage 402. In addition, the character number storage area 404 is set to 0. Further, since the number of characters of the longest word of the FIG. 29 dictionary data reaches 7, the character string length which is designated from the maximal index element creating means 405 toward the longest match word retrieving means 403 results in 7.
In this case, the first 7 characters “A NICHI DEN SHI NO DEN SHI” of the FIG. 30 retrieval document are read out by the maximal index element creating means 405 and presented as a retrieval key to the longest match word retrieving means 403. (Here, each group of letters corresponds to one Japanese character, which may be a “hiragana” character, a “katakana” character or a Chinese character; the alphabetic spellings have no meaning in English, and each Japanese character is represented as a character code, for example an EUC code or a JIS code.) In the word dictionary having the contents shown in FIG. 29, the word constituting the longest leftmost character sub-string of the key is “A NICHI DEN SHI”. Since the number of characters of this word is 4, which is larger than the value 0 set in the character number storage area 404, the index element (A NICHI DEN SHI, [1, 4]) is outputted to the index element sorting means 406, and the value of the character number storage area 404 becomes 4.
Subsequently, the 7 characters “NICHI DEN SHI NO DEN SHI SU”, obtained by advancing the observing character position of the retrieval document by one character, are produced in the maximal index element creating means 405 and designated as a key to the longest match word retrieving means 403, thereby retrieving the word “NICHI DEN” constituting the longest leftmost partial character string. Further, the value of the character number storage area 404 is decreased by one to become 3. However, since the number of characters of “NICHI DEN”, which is 2, is smaller than the value 3 of the character number storage area 404, it is found that this “NICHI DEN” is not maximal (it is included in “A NICHI DEN SHI”), with the result that no index element is outputted. The maximal index element creating means 405 repeats this operation while shifting the observing character position toward the end of the text, and outputs only the maximal index elements shown in FIG. 31 to the index element sorting means 406.
If the above-described processing reaches the end of the retrieval document, the index elements outputted therefrom are arranged in order in units of words in the index element sorting means 406, thus making out the word index shown in FIG. 32.
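The index creation walked through above can be sketched as follows. This is an illustrative reading of the described procedure, not the disclosed implementation: each Japanese character is modelled as one romanized token, `DICTIONARY` is a hypothetical excerpt of the FIG. 29 word list, `DOC` is an assumed 12-character opening of the FIG. 30 document, and the variable `remaining` plays the role of the character number storage area 404.

```python
def w(text):
    """Turn a romanized string into a tuple with one item per Japanese character."""
    return tuple(text.split())


# Hypothetical excerpt of the FIG. 29 dictionary (assumption, not the full list).
DICTIONARY = {w(s) for s in [
    "A NICHI DEN SHI", "NICHI DEN", "DEN", "DEN SHI", "DEN SHI su pi n",
    "su pi n", "su pi n KYO MEI", "su pi n KYO MEI KYU SHU", "KYO MEI",
]}


def longest_match(dictionary, window):
    """Return the longest dictionary word that is a leftmost part of window."""
    for length in range(len(window), 0, -1):
        if window[:length] in dictionary:
            return window[:length]
    return None


def maximal_index_elements(dictionary, doc):
    """Emit (word, (start, end)) with 1-based positions for maximal matches only."""
    max_len = max(len(word) for word in dictionary)
    remaining = 0  # counterpart of the character number storage area 404
    elements = []
    for pos in range(len(doc)):
        if pos > 0 and remaining > 0:
            remaining -= 1  # one character consumed by advancing the position
        word = longest_match(dictionary, doc[pos:pos + max_len])
        if word is not None and len(word) > remaining:
            elements.append((word, (pos + 1, pos + len(word))))
            remaining = len(word)
    return elements


# Assumed opening of the FIG. 30 retrieval document (12 characters).
DOC = w("A NICHI DEN SHI NO DEN SHI su pi n KYO MEI")

if __name__ == "__main__":
    for element in maximal_index_elements(DICTIONARY, DOC):
        print(element)
```

Sorting the emitted elements by word would then give a word index in the sense of FIG. 32; a match is suppressed whenever it ends no later than a previously emitted match.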
Secondly, a description will be given hereinbelow of a prior document retrieval system using the prior word index drawn up by the above-described prior word index creating method. FIG. 28 is a block diagram showing one example of the entire arrangement of the prior document retrieval system. In this illustration, numeral 411 represents a word dictionary, numeral 412 designates a word index created in the FIG. 27 prior word index creating system using the word dictionary 411, and numeral 413 denotes a retrieval condition inputting means for inputting a retrieval condition character string. In addition, numeral 414 denotes a word cover calculating means for retrieving the word dictionary 411 to obtain a word cover of the retrieval condition character string in the form of the set of word cover elements being the combinations of the words in the dictionary and the cover character positional ranges of the retrieval condition character string. Further, numeral 415 depicts an extension word calculating means for retrieving the word dictionary 411 in relation to the word cover element specified to obtain all the words, coinciding with the retrieval condition character string, of the words in the dictionary which constitute the character strings including the words of the word cover elements.
Furthermore, numeral 416 signifies a matching character positional range set calculating means for obtaining all the index elements of the specified word sets to correct matching character positional ranges and further to create matching character positional range sets. Further, numeral 417 designates a connection matching character positional range string calculating means for obtaining all the matching character positional range strings, appearing in succession in the retrieval document, of the specified matching character positional range set string, numeral 418 depicts a matching position set calculating means for obtaining a set of matching start character positions which serves as the leading element of the matching character positional range string, and numeral 419 denotes a retrieval result outputting means for outputting the retrieval result.
An operation of the document retrieval system thus arranged will be described hereinbelow with reference to the drawings using the simple dictionary and retrieval document used in the above description of the prior word index creating system. FIGS. 33 and 34 are conceptual illustrations showing the full-text retrieval processing based upon a word index having the FIG. 32 contents and a word dictionary having the FIG. 29 contents for the retrieval condition character strings “DEN SHI su pi n KYO MEI” and “TO A DEN SHI” (the capital letter string represents a Chinese character and the small-letter string underlined denotes a “katakana” character).
Referring to FIG. 33, the description will begin with the retrieval processing to be conducted when the character string “DEN SHI su pi n KYO MEI” is inputted as the retrieval condition character string from the retrieval condition inputting means 413. First, the word cover calculating means 414 obtains the word cover of the retrieval condition character string as follows. Taking as a key each of the rightmost partial character strings of the retrieval condition character string, namely “DEN SHI su pi n KYO MEI”, “SHI su pi n KYO MEI”, “su pi n KYO MEI”, “pi n KYO MEI”, “n KYO MEI”, “KYO MEI” and “MEI”, the word cover calculating means 414 successively retrieves the words being the longest leftmost partial character strings of the keys in the word dictionary 411, and records them, as the word cover elements, together with their cover character positional ranges in the retrieval condition character string.
In the case of this example, for “DEN SHI su pi n KYO MEI”, 3 words “DEN”, “DEN SHI” and “DEN SHI su pi n” are retrieved or picked up as the leftmost partial words thereof, and (DEN SHI su pi n, [1, 5]), being the set of the word “DEN SHI su pi n” whose number of characters is the largest and its cover character positional range [1, 5] in the retrieval condition character string “DEN SHI su pi n KYO MEI”, is recorded, whereas the leftmost partial words of “SHI su pi n KYO MEI” are not recorded because of absence in the word dictionary 411 assuming the contents of FIG. 29. Further, for “su pi n KYO MEI”, the 2 words “su pi n” and “su pi n KYO MEI” are retrieved as the leftmost partial words, so that (su pi n KYO MEI, [3, 7]), being the set of the longest word “su pi n KYO MEI” and the cover character positional range [3, 7], is recorded, whereas the leftmost partial words of “pi n KYO MEI” and “n KYO MEI” are not recorded because of absence in the word dictionary 411 assuming the contents of FIG. 29. Moreover, for “KYO MEI”, only “KYO MEI” is retrieved as the leftmost partial word, and the set (KYO MEI, [6, 7]), being the combination with the cover character positional range [6, 7], is recorded.
Subsequently, the word cover elements not showing the maximal, that is, the word cover elements whose cover character positional ranges completely lie in the cover character positional ranges of the other word cover elements, are removed from the recorded word cover elements. After the removal, the set of remaining word cover elements cover the retrieval condition character string. More specifically, in the case that the sum-set of the cover character positional ranges of the respective word cover elements of the word cover set is the entire retrieval condition character string, the set of these remaining word cover elements are recorded as a word cover. If the set of word cover elements left after the removal does not cover the retrieval condition character string, the retrieval processing comes to an end after the retrieval result outputting means 419 outputs a predetermined special retrieval result indicative of xe2x80x9cretrieval impossiblexe2x80x9d.
In this instance, of the three word cover elements (DEN SHI su pi n, [1, 5]), (su pi n KYO MEI, [3, 7]) and (KYO MEI, [6, 7]), the cover character positional range [6, 7] of (KYO MEI, [6, 7]) lies fully within the cover character positional range [3, 7] of (su pi n KYO MEI, [3, 7]), and therefore (KYO MEI, [6, 7]) undergoes removal. The remaining word cover elements produce the following set:
H={(DEN SHI su pi n, [1, 5]), (su pi n KYO MEI, [3, 7])} and the sum-set of the cover character positional ranges thereof results in [1, 5]∪[3, 7]=[1, 7], which is the character positional range of the whole retrieval condition character string “DEN SHI su pi n KYO MEI”, so that the aforesaid H is recorded as the word cover for the retrieval condition character string “DEN SHI su pi n KYO MEI”.
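The word cover calculation described above can be sketched as follows. This is an illustrative reading only: characters are modelled as romanized tokens, `DICTIONARY` is a hypothetical excerpt of the FIG. 29 word list, and `word_cover` returns `None` where the text speaks of the “retrieval impossible” result.

```python
def w(text):
    """Turn a romanized string into a tuple with one item per Japanese character."""
    return tuple(text.split())


# Hypothetical excerpt of the FIG. 29 dictionary (assumption, not the full list).
DICTIONARY = {w(s) for s in [
    "DEN", "DEN SHI", "DEN SHI su pi n", "su pi n", "su pi n KYO MEI",
    "KYO MEI", "TO", "A", "A NICHI DEN SHI",
]}


def word_cover(dictionary, query):
    """Return maximal word cover elements (word, (s, e)), or None if not covered."""
    elements = []
    for start in range(len(query)):
        suffix = query[start:]  # rightmost partial character string used as key
        for length in range(len(suffix), 0, -1):
            if suffix[:length] in dictionary:
                elements.append((suffix[:length], (start + 1, start + length)))
                break  # only the longest leftmost word is recorded

    def inside(inner, outer):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    # Remove elements whose range lies completely within another element's range.
    maximal = [
        e for e in elements
        if not any(o is not e and inside(e[1], o[1]) for o in elements)
    ]
    covered = set()
    for _, (s, e) in maximal:
        covered.update(range(s, e + 1))
    if covered != set(range(1, len(query) + 1)):
        return None  # "retrieval impossible"
    return maximal


if __name__ == "__main__":
    print(word_cover(DICTIONARY, w("DEN SHI su pi n KYO MEI")))
    print(word_cover(DICTIONARY, w("TO A DEN SHI")))
```

With this assumed dictionary the first query yields the two overlapping elements of H, while the second yields three short, non-overlapping elements.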
After the word cover calculating means 414 derives the word cover for the retrieval condition character string, the extension word calculating means 415 obtains, for each of the word cover elements in the word cover, the set of extension words which conform to or match with the retrieval condition character string. The “extension word conforming to the retrieval condition character string c” here signifies a word x=p·w·q (p, q denote an arbitrary character string other than number of characters=0) which includes the word w of the observing word cover element (w, [s, e]) as a partial character string and which, if defining a=min(s−1, |p|) and b=min(|c|−e, |q|), satisfies both:
a=0, or c[(s−a) . . . (s−1)]=p[(|p|−a+1) . . . |p|];  (1)
and
b=0, or c[(e+1) . . . (e+b)]=q[1 . . . b].  (2)
In this case, the partial character string from i-th character to j-th character of a character string T (the leading character is the first character) is expressed as T[i . . . j] and the number of characters of the character string T is expressed as |T|.
In this instance, the extension word set of (DEN SHI su pi n, [1, 5]) agreeing with “DEN SHI su pi n KYO MEI” is {DEN SHI su pi n}, and the extension word set of (su pi n KYO MEI, [3, 7]) agreeing with “DEN SHI su pi n KYO MEI” is {su pi n KYO MEI, su pi n KYO MEI KYU SHU}. The word “KAKU su pi n KYO MEI” in the FIG. 29 word dictionary includes “su pi n KYO MEI” as a partial character string, and hence is an extension word of “su pi n KYO MEI”. On the other hand, since the partial character string “KAKU” corresponding to p of the aforesaid x=p·w·q does not coincide with the corresponding partial character string “SHI” of the retrieval condition character string “DEN SHI su pi n KYO MEI”, it is not an extension word conforming to “DEN SHI su pi n KYO MEI”.
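Conditions (1) and (2) can be checked mechanically, as in the following sketch. It is illustrative only: characters are modelled as romanized tokens, `DICTIONARY` is a hypothetical excerpt of the FIG. 29 word list, and p and q are allowed to be empty here, an assumption made because the example above counts the word w itself among its extension words.

```python
def w(text):
    """Turn a romanized string into a tuple with one item per Japanese character."""
    return tuple(text.split())


# Hypothetical excerpt of the FIG. 29 dictionary (assumption, not the full list).
DICTIONARY = {w(s) for s in [
    "DEN SHI su pi n", "su pi n KYO MEI", "su pi n KYO MEI KYU SHU",
    "KAKU su pi n KYO MEI", "KYO MEI", "A", "A NICHI DEN SHI",
]}


def conforms(x, word, s, e, c):
    """True if x = p + word + q satisfies conditions (1) and (2) for query c.

    s and e are the 1-based cover character positions of word within c.
    """
    for k in range(len(x) - len(word) + 1):
        if x[k:k + len(word)] != word:
            continue
        p, q = x[:k], x[k + len(word):]
        a = min(s - 1, len(p))
        b = min(len(c) - e, len(q))
        # (1): the last a characters of p equal c[(s-a)..(s-1)] (1-based).
        left_ok = a == 0 or c[s - 1 - a:s - 1] == p[len(p) - a:]
        # (2): the first b characters of q equal c[(e+1)..(e+b)] (1-based).
        right_ok = b == 0 or c[e:e + b] == q[:b]
        if left_ok and right_ok:
            return True
    return False


def extension_words(dictionary, word, s, e, c):
    """Collect the dictionary words that contain word and conform to c."""
    return {x for x in dictionary if conforms(x, word, s, e, c)}


if __name__ == "__main__":
    query = w("DEN SHI su pi n KYO MEI")
    print(extension_words(DICTIONARY, w("su pi n KYO MEI"), 3, 7, query))
```

Here “KAKU su pi n KYO MEI” is rejected by condition (1), because the character preceding the cover range in the query is “SHI” rather than “KAKU”.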
After obtaining the extension word set of the respective word cover elements being in word cover which is fit for the retrieval condition character string, in terms of the respective extension word sets, the matching character positional range set calculating means 416 obtains an index element which takes as the first term the word being the element of that extension word set from the word index 412, and corrects the second term of each of the obtained index elements to the matching character positional range corresponding to the word of the word cover element which produces that extension word set to attain the set of matching character positional ranges after the correction.
In the case of this example, the index element for the extension word set {DEN SHI su pi n} of (DEN SHI su pi n, [1, 5]) agreeing with “DEN SHI su pi n KYO MEI” is only (DEN SHI su pi n, [6, 10]), and the only element “DEN SHI su pi n” of the extension word set is equal to the word “DEN SHI su pi n” which produces the extension word set. Therefore, the correction of the matching character positional range is unnecessary, and the matching character positional range set is obtained as {[6, 10]}. Similarly, the index element for the extension word set {su pi n KYO MEI, su pi n KYO MEI KYU SHU} of (su pi n KYO MEI, [3, 7]) agreeing with “DEN SHI su pi n KYO MEI” is also only (su pi n KYO MEI, [8, 12]), so that the correction of the matching character positional range is unnecessary and the matching character positional range set results in {[8, 12]}.
After obtaining the matching character positional range set in terms of the respective extension word sets, the connection matching character positional range string calculating means 417 obtains, of the respective matching character positional range strings, all the matching character positional range strings appearing in succession in the retrieval document.
In the case of this example, the difference (that is, 2) in start character position between the element [6, 10] of the matching character positional range set {[6, 10]} for (DEN SHI su pi n, [1, 5]) and the element [8, 12] of the matching character positional range set {[8, 12]} for (su pi n KYO MEI, [3, 7]) is equal to the difference (that is, 2) in start character position of the cover character positional range between the two corresponding word cover elements (DEN SHI su pi n, [1, 5]) and (su pi n KYO MEI, [3, 7]). The two ranges are thus found to appear in succession in the character positional range [6, 12] of the retrieval document, and {([6, 10], [8, 12])} is obtained as the set of matching character positional range strings.
After all the matching character positional range strings occurring in succession in the retrieval document are obtained, the matching position set calculating means 418 subsequently obtains, from the matching character positional range string set, the matching position set being the set of the matching start positions of the first matching character positional ranges of the respective matching character positional range strings, and the retrieval result outputting means 419 outputs this obtained matching position set as a retrieval result.
In the case of this example, the matching position set calculating means 418 obtains the set {6} composed of only 6, which indicates the matching start character position of the leading element [6, 10] of the only string ([6, 10], [8, 12]), and the retrieval result outputting means 419 outputs the obtained set as a retrieval result. This retrieval result indicates that only one portion of the retrieval document matches the retrieval condition character string “DEN SHI su pi n KYO MEI” and that it begins with the 6th character of the retrieval document.
The description made above is about the retrieval processing to be conducted when the character string “DEN SHI su pi n KYO MEI” is inputted as the retrieval condition character string.
As shown in FIG. 34, the retrieval processing to be conducted when the character string “TO A DEN SHI” is inputted as the retrieval condition character string is basically similar to the above description. First, in FIG. 34, the set composed of three elements {(TO, [1, 1]), (A, [2, 2]), (DEN SHI, [3, 4])} is obtainable as the word cover. The extension word set of each word cover element agreeing with “TO A DEN SHI” and the corresponding matching character positional range set are expressed below in the form “word cover element→extension word set→matching character positional range”:
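The connection check and the derivation of the matching position set can be sketched as below. This is a brute-force illustration over all combinations, not the disclosed implementation; the function names are hypothetical, and a combination is accepted when the start positions of the matching ranges differ by the same amounts as the start positions of the corresponding cover ranges, which is the criterion stated in the example above.

```python
from itertools import product


def connected_range_strings(cover_ranges, match_sets):
    """Return the strings of matching ranges that appear in succession.

    cover_ranges: one (s, e) range per word cover element, in order.
    match_sets:   for each cover element, its matching ranges in the document.
    """
    strings = []
    for combo in product(*match_sets):
        shift = combo[0][0] - cover_ranges[0][0]
        # Every range must be displaced from its cover range by the same shift.
        if all(m[0] - c[0] == shift for m, c in zip(combo, cover_ranges)):
            strings.append(combo)
    return strings


def matching_positions(range_strings):
    """Start position of the leading range of every connected string."""
    return sorted({string[0][0] for string in range_strings})


if __name__ == "__main__":
    cover = [(1, 5), (3, 7)]            # cover ranges of the first example
    matches = [[(6, 10)], [(8, 12)]]    # matching range sets of the first example
    strings = connected_range_strings(cover, matches)
    print(strings, matching_positions(strings))
```

A practical implementation would likely sort or hash the ranges rather than enumerate every combination, since the number of combinations grows with the size of each matching range set.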
(TO, [1, 1])→{TO, KYOKU TO, KAN TO, HOKU TO, HOKU HOKU TO, NAN TO, NAN NAN TO}→{[16, 17], [18, 18]}
(A, [2, 2])→{A}→{[19, 19]}
(DEN SHI, [3, 4])→{DEN SHI, DEN SHI su pi n, DEN SHI UN, DEN SHI MITSU DO, DEN SHI KI KI}→{[6, 10], [20, 21], [25, 27]}
Of the element strings of these three kinds of matching character positional range sets, the element string in which the character positions are in succession is only ([18, 18], [19, 19], [20, 21]), with the result that {18} is outputted as the retrieval result.
Although the description has been made above of two examples, in general, as in the first example (“DEN SHI su pi n KYO MEI” is covered with the two words “DEN SHI su pi n” and “su pi n KYO MEI”), in the case that the retrieval character string is covered with words having a relatively large number of characters and overlapping with each other, this prior document retrieval system can ensure effective retrieval processing, for the following reasons.
(1) As the number of characters of a word being a word cover element increases, the number of extension words decreases, and the number of elements of the matching character positional range sets also tends to decrease, so that the computational complexity for obtaining the extension word set and the matching positional range set lessens.
(2) As the overlapping portion between the words constituting the word cover elements increases, the difference between the matching character positional range set of the corresponding extension words and the final retrieval result, that is, the complexity of the useless matching character positional ranges not contributing to the final retrieval result, lessens, thus reducing the complexity in the connection character positional range string calculating means 417.
In the case of the second example (the retrieval for “TO A DEN SHI”), the above-mentioned reasons (1) and (2) do not hold true, and the word cover elements consist of 1 to 2 characters and there is no overlapping portion. For this reason, as represented by (TO, [1, 1])→{HIGASHI, KYOKU TO, HOKU TO, HOKU HOKU TO, NAN TO, NAN NAN TO}, it is required to examine the index elements of a large number of extension words, and like {[8, 12], [22, 23], [27, 29]}, the matching character positional range set includes many useless elements not contributing to the final result, with the result that the efficiency is lowered.
Accordingly, as described above, in the case that the retrieval character string is covered with words comprising a relatively small number of characters and making less overlapping portions, the efficiency of the retrieval processing of the prior document retrieval system employing the word index made out according to the above-described prior index creating method lowers as compared with the case that the retrieval character string is covered with words having a relatively large number of characters and establishing much overlapping portion.
Although it is possible to reduce the situations, in which the retrieval processing efficiency lowers, in a manner of increasing the number of words to be stored in the word dictionary, particularly adding to the word dictionary the frequently appearing words of the long units of words (compound words, phrases, or the like) appearing in the retrieval document, commonly limitation is imposed on the number of words to be stored in the word dictionary, and hence, difficulty is experienced to completely eliminate the reduction of the efficiency.
It is therefore an object of the present invention to provide a dictionary and index creating system and a document retrieval system which are capable of, even if the retrieval character string is covered with words comprising a relatively small number of characters and making less overlapping portions, preventing the reduction of the retrieval efficiency and further of carrying out the high-speed full-text retrieval processing without increasing the index capacity so much.
For this purpose, in accordance with the present invention, a dictionary and index creating system is arranged to create a regular expression dictionary and a word index on the basis of a retrieval document undergoing retrieval and a word dictionary, while a document retrieval system is arranged to retrieve a retrieval character string in the retrieval document through the use of the regular expression dictionary and word index created through the dictionary and index creating system. The dictionary and index creating system and document retrieval system thus arranged are capable of, even if the retrieval character string is covered with words comprising a relatively small number of characters and establishing less overlap with each other, carrying out full-text retrieval processing at a high speed to enhance the retrieval efficiency.
Accordingly, a dictionary and index creating system according to this invention comprises means for creating a regular expression dictionary on the basis of a retrieval document undergoing retrieval and a word dictionary according to a rule depending on each of words of the word dictionary, and means for creating a word index which is composed of sets of a regular expression and a matching character positional range and which is made by a collection of index elements non-deducible from other index elements.
Furthermore, a dictionary and index creating system according to this invention comprises means for creating a regular expression dictionary on the basis of a retrieval document undergoing retrieval and a word dictionary according to a rule depending on an occurrence frequency in a sample document, and means for creating a word index which is composed of sets of a regular expression and a matching character positional range and which is made by a collection of index elements non-deducible from other index elements.
Still further, a dictionary and index creating system according to this invention comprises means for creating a first word index on the basis of a sample document and a word dictionary, and means for creating a regular expression dictionary and a second word index on the basis of a word frequency in the first word index and a retrieval document undergoing retrieval.
Moreover, a dictionary and index creating system according to this invention comprises means for adding a terminal character to before and after a retrieval document undergoing retrieval as occasion demands through the use of an enlarged character set to produce an enlarged retrieval document.
Besides, a dictionary and index creating system according to this invention comprises means for, when a word composed of only arbitrary characters of a character set is not included in a word dictionary, preparing an expansion word dictionary by adding the word to the word dictionary.
On the other hand, in accordance with the present invention, a document retrieval system comprises a word dictionary storage unit, word dictionary retrieving means, a regular expression dictionary storage unit, regular expression dictionary retrieving means, a word index storage unit, word index retrieving means, question inputting means, word calculating means, extension regular expression set calculating means, index element set retrieving means, connection index element calculating means, matching position set calculating means, and retrieval result outputting means.
In addition, a document retrieval system according to this invention includes means for adding a terminal character to before and after a retrieval document undergoing retrieval as occasion demands through the use of an enlarged character set to prepare an enlarged retrieval document.
More specifically, in accordance with an aspect of the present invention, a dictionary and index creating system, designed to create a regular expression dictionary and a word index on the basis of a retrieval document undergoing retrieval and a word dictionary, comprises a retrieval document storage unit for storing a retrieval document composed of a lineup of a finite number of characters included in a predetermined character set, a word dictionary storage unit for storing a word dictionary in which registered are a finite number of words each being a lineup of one or more characters included in the character set, means for reading out one word w from the word dictionary in the word dictionary storage unit and further for making out one or more sets of regular expressions a, b indicative of sets of character strings having the same length except null sets on the character set according to a rule depending on the word w, a regular expression dictionary storage unit for joining the regular expressions a, b to before and after the word w to make out one or more regular expressions awb and further for collecting the regular expressions awb to produce a regular expression dictionary, different from the aforesaid word dictionary, according to a predetermined rule depending on the word w and even for storing the regular expression dictionary, means for retrieving a character string matching with a regular expression in the regular expression dictionary from the retrieval document storage unit and further for creating an index element comprising a set of the regular expression and a matching character positional range in the retrieval document, and a word index storage unit for storing a word index made out by a collection of the index elements decided as being non-deducible (inestimable) from other index elements. 
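The construction of the regular expressions awb and of the index elements described above can be illustrated by the following minimal sketch. It is not the claimed implementation: the function names, the use of Python regular-expression syntax for the character classes, and the 1-based inclusive positional ranges are assumptions made for illustration, and the check that discards index elements deducible from other index elements is deliberately omitted.

```python
import re
from itertools import product


def make_regex_dictionary(words, left_classes, right_classes):
    """Join each left expression a and right expression b around each word w.

    left_classes / right_classes are hypothetical lists of character classes
    written in Python regex syntax; "" stands for the null string.
    """
    regexes = []
    for w in words:
        for a, b in product(left_classes, right_classes):
            regexes.append(a + re.escape(w) + b)
    return regexes


def make_index_elements(regexes, document):
    """Return (regular expression, (start, end)) pairs for every match.

    Positions are 1-based and inclusive (an assumption of this sketch).
    A lookahead is used so that overlapping matches are also reported.
    Because a, w and b each have a fixed length, one match per start
    position is sufficient.
    """
    elements = []
    for rx in regexes:
        for m in re.finditer("(?=(" + rx + "))", document):
            elements.append((rx, (m.start(1) + 1, m.end(1))))
    return elements
```

For instance, with the word "ab", the left-side expressions "" and "[xy]" and the right-side expression "", the document "xabab" yields two index elements for the regular expression ab and one for [xy]ab.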
This dictionary and index creating system can create a regular expression dictionary and a word index which are capable of, when a retrieval character string is covered with words comprising a relatively small number of characters and establishing less overlap with each other, preventing the retrieval efficiency from lowering and further of carrying out higher-speed full-text retrieval processing without increasing the index capacity so much.
Furthermore, in accordance with another aspect of this invention, the above-mentioned dictionary and index creating system is made such that each of the regular expressions a, b to be joined to before and after each word w in the word dictionary takes a character class string or a null string. This also can create a regular expression dictionary and a word index which are capable of, when a retrieval character string is covered with words comprising a relatively small number of characters and establishing less overlap with each other, preventing the lowering of the retrieval efficiency and further of carrying out higher-speed full-text retrieval processing without increasing the index capacity so much.
Still further, in accordance with a different aspect of this invention, a dictionary and index creating system, made to create a regular expression dictionary and a word index on the basis of a retrieval document undergoing retrieval, a word dictionary and word frequency data, comprises a retrieval document storage unit for storing a retrieval document composed of a lineup of a finite number of characters included in a predetermined character set, a word dictionary storage unit for storing a word dictionary in which registered are a finite number of words each being a lineup of one or more characters included in the character set, a word frequency data storage unit for storing word frequency data indicative of an occurrence frequency of each of words of the word dictionary in a sample document comprising a lineup of a finite number of characters included in the predetermined character set, means for reading out one word w from the word dictionary in the word dictionary storage unit and further for making out regular expressions a, b indicative of sets of character strings having the same length except null sets on the character set according to a rule depending on the frequency of the word w in the word frequency data, a regular expression dictionary storage unit for joining the regular expressions a, b to before and after the word w to make out one or more regular expressions awb and further for collecting all the regular expressions awb made out for all the words in the word dictionary to produce a regular expression dictionary different from the aforesaid word dictionary and even for storing the regular expression dictionary, means for retrieving a character string matching with a regular expression in the regular expression dictionary from the retrieval document storage unit and further for creating an index element comprising a set of the regular expression and a matching character positional range in the retrieval document, and a word index storage unit for 
storing a word index made out by a collection of the index elements decided as being non-deducible from other index elements. This dictionary and index creating system can create a regular expression dictionary and a word index which allow a higher-speed retrieval as the word has a higher occurrence frequency in the sample document.
Moreover, in accordance with a different aspect of this invention, a dictionary and index creating system, made to create a regular expression dictionary and a word index on the basis of a retrieval document undergoing retrieval, a word dictionary and a sample document, comprises a retrieval document storage unit for storing a retrieval document composed of a lineup of a finite number of characters included in a predetermined character set, a word dictionary storage unit for storing a word dictionary in which registered are a finite number of words each being a lineup of one or more characters included in the character set, a sample document storage unit for storing a sample document comprising a lineup of a finite number of characters included in a predetermined character set, means for retrieving a character string matching with a word in the word dictionary from the sample document storage unit and further for creating an index element being a set of the word and a matching character positional range in the retrieval document to check whether or not the index element is deducible from other index elements and even for collecting the index elements decided as being non-deducible from the other index elements to produce a first word index, means for producing word frequency data in a manner that the number of index elements to each of words in the first word index is handled as a word frequency, means for reading out one word w from the word dictionary in the word dictionary storage unit and further for making out regular expressions a, b indicative of sets of character strings having the same length except null sets on the character set according to a rule depending on the frequency of the word w in the word frequency data, a regular expression dictionary storage unit for joining the regular expressions a, b to before and after the word w to make out one or more regular expressions awb and further for collecting all the regular expressions awb made out for all 
the words in the word dictionary to produce a regular expression dictionary different from the aforesaid word dictionary and even for storing the regular expression dictionary, means for retrieving a character string matching with a regular expression in the regular expression dictionary from the retrieval document storage unit and further for creating an index element comprising a set of the regular expression and a matching character positional range in the retrieval document, and a word index storage unit for storing a second word index made out by a collection of the index elements decided as being non-deducible from other index elements. This dictionary and index creating system can create a regular expression dictionary and a word index which allow a higher-speed retrieval as the word has a higher occurrence frequency in the sample document and the word dictionary.
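The derivation of word frequency data from a first word index, as described above, may be sketched as follows. The function names and the 1-based inclusive positional ranges are assumptions of this sketch, and the step that discards index elements deducible from other index elements is omitted, so the counts here may be higher than those the described system would obtain.

```python
from collections import Counter


def first_word_index(words, sample_document):
    """Collect (word, (start, end)) for every occurrence of each word.

    Simplification: no deducibility filtering is applied.
    """
    elements = []
    for w in words:
        if not w:
            continue  # words consist of one or more characters
        pos = sample_document.find(w)
        while pos != -1:
            elements.append((w, (pos + 1, pos + len(w))))
            pos = sample_document.find(w, pos + 1)  # allow overlaps
    return elements


def word_frequency_data(index, words):
    """Treat the number of index elements per word as its frequency."""
    counts = Counter(word for word, _range in index)
    return {w: counts.get(w, 0) for w in words}
```

The resulting mapping from word to count can then be supplied to whatever rule selects the regular expressions a, b for each word.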
In the above-mentioned dictionary and index creating system, the means for making out the regular expression according to the rule depending on the word w is composed of means for making out a regular expression composed of the word w through the use of 3N parameters being N frequency limit values, N left-side character class sets and N right-side character class sets if the occurrence frequency of the word w recorded in the word frequency data is below a first frequency limit value, means for joining a character class a being an element in an mth left-side character class set and a character class b being an element in an mth right-side character class set to the word w to make out regular expressions awb in relation to all the possible character classes a, b if the occurrence frequency of the word w recorded in the word frequency data is higher than an mth frequency limit value but is lower than an (m+1)th frequency limit value, and means for joining a character class a being an element in an Nth left-side character class set and a character class b being an element in an Nth right-side character class set to make out regular expressions awb in relation to all the possible character classes a, b if the occurrence frequency of the word w recorded in the word frequency data is more than an (N-1)th frequency limit value. This dictionary and index creating system can create a regular expression dictionary and a word index which allow a higher-speed retrieval as the word has a higher occurrence frequency in the sample document.
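The frequency-dependent rule described above may be sketched as follows. The function and parameter names are hypothetical. The treatment of the tier boundaries is also an assumption: a frequency equal to a limit value is placed in the upper tier, and the Nth character class sets are applied from the Nth limit value upward, which is one possible reading of the text.

```python
import re
from bisect import bisect_right
from itertools import product


def regexes_for_word(word, freq, limits, left_sets, right_sets):
    """Choose character class sets by frequency tier and build awb.

    limits     : ascending list of N frequency limit values
    left_sets  : list of N lists of left-side character classes
    right_sets : list of N lists of right-side character classes
    """
    tier = bisect_right(limits, freq)  # number of limits <= freq
    w = re.escape(word)
    if tier == 0:
        # Below the first limit: the word alone is the regular expression.
        return [w]
    lefts = left_sets[tier - 1]
    rights = right_sets[tier - 1]
    return [a + w + b for a, b in product(lefts, rights)]
```

Under this sketch, a more frequent word is assigned finer character class sets, so its occurrences are spread over a larger number of regular expressions and each index list to be examined at retrieval time becomes shorter.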
Furthermore, in the dictionary and index creating system, the sample document is made up of all or a portion of the retrieval document, thereby creating a regular expression dictionary and a word index which allow a higher-speed retrieval as the word has a higher occurrence frequency in the sample document.
Still further, in the dictionary and index creating system, an enlarged character set is used which is prepared by adding as a terminal character one special character not included in the retrieval document, and the terminal character is added to before and after the retrieval document as occasion demands to produce an enlarged retrieval document, so that the enlarged character set is employed as a character set while the enlarged retrieval document is used as a retrieval document. Accordingly, this can create a regular expression dictionary and a word index, which permits a high-speed retrieval, through the use of the terminal character.
Besides, in the dictionary and index creating system, further included are means for, if a word composed of only c which is an arbitrary character in a determined character set is not included in a given word dictionary, creating an extended word dictionary by adding that word to the word dictionary, and means for creating a regular expression dictionary and a word index through the use of the extended word dictionary as the word dictionary. Thus, through the use of the extended word dictionary produced by adding a one-character word thereto, it is possible to create a regular expression dictionary and a word index which are capable of a high-speed retrieval.
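The two preparatory steps described above, namely the production of an enlarged retrieval document and of an extended word dictionary, may be sketched as follows. The function names are hypothetical, and the choice of terminal character is left to the caller; this sketch only checks that the chosen character appears neither in the character set nor in the retrieval document.

```python
def enlarge(document, charset, terminal):
    """Add the terminal character before and after the document.

    Returns the enlarged retrieval document and the enlarged character set.
    """
    if terminal in charset or terminal in document:
        raise ValueError("terminal character must not occur in the document")
    return terminal + document + terminal, set(charset) | {terminal}


def extend_word_dictionary(words, charset):
    """Add every one-character word that the dictionary lacks."""
    extended = list(words)
    present = set(words)
    for c in sorted(charset):
        if c not in present:
            extended.append(c)
    return extended
```

Because every single character then exists as a word, any question character string over the character set can be covered by words of the extended word dictionary.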
Moreover, in accordance with a still further aspect of this invention, there is provided a document retrieval system comprising a word dictionary storage unit for storing a word dictionary made by a collection of a finite number of words each being a character string on a given character set, word dictionary retrieving means for conducting retrieval to the word dictionary, a regular expression dictionary storage unit for storing a regular expression dictionary made on the basis of a retrieval document undergoing retrieval and being a finite number of lineups of characters included in the character set and the word dictionary, regular expression dictionary retrieving means for performing retrieval to the regular expression dictionary, a word index storage unit for storing a word index created from the retrieval document and the word dictionary, word index retrieving means for performing retrieval to the word index, a question inputting means for inputting as a question character string an arbitrary character string on the character set, word cover calculating means including means for calculating a word cover being a set of word cover elements for the question character string (the word cover element is a pair of a word constituting a partial character string of the question character string in the word dictionary and a cover character positional range, and a character at an arbitrary position in the question character string is included in the cover character positional range of any one of the word cover elements being in word covering) and means for outputting a special retrieval result representative of "retrieval impossible" to retrieval result outputting means if there is no word cover for the question character string, extension regular expression set calculating means for calculating an extension regular expression set for each of word cover elements under the word covering from the regular expression dictionary when a word cover is obtained
(the extension regular expression set is a set of regular expressions including a first term word of each of the word cover elements of the question character string being in word covering, and a set satisfying, for an arbitrary extension question character string including the question character string, the two conditions: (a) including a regular expression matching with a character string in a second character positional range of the extension question character string, which includes a cover character positional range being a second term of the word cover element; and (b) not including a regular expression other than the regular expression set, which matches a character string in a third character positional range of the extension question character string including the second character positional range, in the regular expression dictionary), index element set retrieving means for conducting retrieval to the word index to obtain all index elements in which each of regular expressions of the extension regular expression set is taken as a first term, connection index element calculating means for obtaining all index element strings being elements of each of two or more index element sets and appearing in succession in the document, matching position set calculating means for obtaining a set of matching start character positions of second terms of index elements being leading elements of the index element strings to set it as a retrieval result, and retrieval result outputting means for outputting the retrieval result. This arrangement, when a retrieval character string is covered with words comprising a relatively small number of characters and establishing less overlap with each other, is capable of preventing the impairment of the retrieval efficiency and further of carrying out higher-speed full-text retrieval processing without increasing the index capacity so much.
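The final connection step described above, in which index elements appearing in succession in the document are combined and the matching start positions are obtained, may be illustrated by the following greatly simplified sketch. It assumes that the extension regular expression sets have already been resolved, so that for each word of the word cover the positional ranges at which that word itself occurs in the document are available; the function name, the argument layout and the 1-based inclusive ranges are likewise assumptions of this sketch.

```python
def matching_start_positions(cover, word_ranges):
    """Combine per-word document ranges into question match positions.

    cover       : list of (word, (qs, qe)), ranges within the question
    word_ranges : dict mapping word -> iterable of (ds, de) in the document
    A question occurrence starting at document position p requires, for
    every cover element, a document range equal to (p + qs - 1, p + qe - 1).
    """
    candidates = None
    for word, (qs, qe) in cover:
        starts = {
            ds - qs + 1
            for ds, de in word_ranges.get(word, ())
            if de - ds == qe - qs
        }
        candidates = starts if candidates is None else candidates & starts
    return sorted(p for p in (candidates or ()) if p >= 1)
```

Each cover element votes for the document positions at which the whole question character string could start, and only positions supported by every cover element survive the intersection.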
In the above-mentioned document retrieval system, the word cover calculating means obtains a word cover having the smallest number of word cover elements, so that, when a retrieval character string is covered with words comprising a relatively small number of characters and establishing less overlap with each other, it is possible to prevent the impairment of the retrieval efficiency and further to carry out higher-speed full-text retrieval processing without increasing the index capacity so much.
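One way to obtain a word cover having the smallest number of word cover elements is the classical greedy interval-covering procedure, sketched below. The text above does not state which procedure the word cover calculating means uses, so the greedy method, the function names and the 1-based inclusive ranges are all assumptions of this sketch.

```python
def word_occurrences(question, words):
    """All (word, (start, end)) occurrences of dictionary words."""
    occ = []
    for w in words:
        if not w:
            continue  # words consist of one or more characters
        pos = question.find(w)
        while pos != -1:
            occ.append((w, (pos + 1, pos + len(w))))
            pos = question.find(w, pos + 1)
    return occ


def minimum_word_cover(question, words):
    """Greedy cover: repeatedly take the occurrence that contains the
    first uncovered position and reaches furthest to the right.

    Returns None when no word cover exists ("retrieval impossible").
    """
    occ = word_occurrences(question, words)
    cover = []
    next_pos = 1  # first position not yet covered
    while next_pos <= len(question):
        candidates = [e for e in occ if e[1][0] <= next_pos <= e[1][1]]
        if not candidates:
            return None
        best = max(candidates, key=lambda e: e[1][1])
        cover.append(best)
        next_pos = best[1][1] + 1
    return cover
```

For the question character string "abcde" and the words ab, bcd, cde, a, e and abc, this sketch selects the two overlapping elements (abc, [1, 3]) and (cde, [3, 5]).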
In addition, in the above-mentioned document retrieval system, the word cover calculating means calculates a word cover where the minimum value of the length of the cover character positional range being the second term of the word cover element is the largest. Accordingly, in the case that a retrieval character string is covered with words comprising a relatively small number of characters and establishing less overlap with each other, it is possible to prevent the impairment of the retrieval efficiency and further to carry out higher-speed full-text retrieval processing without increasing the index capacity so much.
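A word cover in which the shortest cover character positional range is as long as possible may be found, for example, by trying length thresholds in descending order and keeping the first threshold for which a cover still exists. The following sketch takes that approach; the procedure, the function names and the 1-based inclusive ranges are assumptions, not the method stated in the text.

```python
def word_occurrences(question, words):
    """All (word, (start, end)) occurrences of dictionary words."""
    occ = []
    for w in words:
        if not w:
            continue  # words consist of one or more characters
        pos = question.find(w)
        while pos != -1:
            occ.append((w, (pos + 1, pos + len(w))))
            pos = question.find(w, pos + 1)
    return occ


def greedy_cover(length, occ):
    """Cover positions 1..length using occ, or return None."""
    cover, next_pos = [], 1
    while next_pos <= length:
        candidates = [e for e in occ if e[1][0] <= next_pos <= e[1][1]]
        if not candidates:
            return None
        best = max(candidates, key=lambda e: e[1][1])
        cover.append(best)
        next_pos = best[1][1] + 1
    return cover


def max_min_length_cover(question, words):
    """Return (minimum length bound, cover), or None if no cover exists."""
    occ = word_occurrences(question, words)
    lengths = sorted({len(w) for w, _range in occ}, reverse=True)
    for min_len in lengths:
        # Keep only occurrences at least min_len characters long.
        kept = [e for e in occ if len(e[0]) >= min_len]
        cover = greedy_cover(len(question), kept)
        if cover is not None:
            return min_len, cover
    return None
```

Any cover whose shortest element has length L uses only occurrences of length L or more, so the largest threshold at which a cover exists gives the optimum value; the returned number is that threshold, which the shortest element of the returned cover equals or exceeds.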