With the widespread use of word processors and personal computers, the development of large-capacity, low-priced memory media such as CD-ROM, and the advancement of Ethernet networking, database systems such as relational databases and full text retrieval databases have come to be widely used.
Such databases handle, as a symbol, a relatively short character string of several characters to hundreds of characters, such as a person's name, place name, organization name, address, classification code or part code. A CSV list of symbols (a string of symbols connected by the comma “,”, such as “MorishitaElectricIndustries, MorishitaCommunicationsIndustries, KyushuMorishitaElectric” in a field of trading partner company names) is stored in one item (field) of the database, and records containing a symbol that is a complete match, a prefix match, a postfix match or an infix match to the query symbol are searched for and retrieved at high speed. In the case of prefix matching, for example, the retrieval condition is “retrieve the record containing a symbol starting with “Morishita” in the trading partner company names field”.
Of these four matching modes, efficient retrieval for complete matching and prefix matching is realized by using the data structure called TRIE (also known as radix searching tree), as described in publications such as “Algorithm Vol. 2” (R. Sedgewick, tr. by Kohei Noshita, et al., Kindai Kagakusha, 1992, ISBN 4-7649-0189-7, pp. 52-72) and “Algorithm and Data Structure Handbook” (G. H. Gonnet, tr. by Mitsuo Gen, et al., Keigaku Shuppan, 1987, ISBN 4-7665-0326-0, pp. 111-122). In addition, where postfix matching is needed, a second TRIE may be constructed from the symbols with their character sequences reversed, and this TRIE may be retrieved with the reversed query.
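The TRIE approach described above can be illustrated with a minimal Python sketch. The class name `Trie`, its methods, and the use of list positions as symbol numbers are assumptions made for illustration only; they are not taken from the cited publications.

```python
class Trie:
    """Minimal character trie (illustrative sketch, not an optimized radix tree)."""

    def __init__(self):
        self.children = {}   # character -> child Trie node
        self.ids = set()     # numbers of symbols that end exactly at this node

    def insert(self, symbol, symbol_id):
        node = self
        for ch in symbol:
            node = node.children.setdefault(ch, Trie())
        node.ids.add(symbol_id)

    def _find(self, key):
        # Walk down the trie along key; None if the path does not exist.
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def exact(self, key):
        # Complete matching: only symbols ending at the node reached by key.
        node = self._find(key)
        return set(node.ids) if node is not None else set()

    def prefix(self, key):
        # Prefix matching: every symbol stored in the subtree below key.
        node = self._find(key)
        if node is None:
            return set()
        found, stack = set(), [node]
        while stack:
            current = stack.pop()
            found |= current.ids
            stack.extend(current.children.values())
        return found


symbols = [
    "MorishitaElectricIndustries",
    "MorishitaCommunicationsIndustries",
    "KyushuMorishitaElectric",
]

forward, backward = Trie(), Trie()
for number, symbol in enumerate(symbols):
    forward.insert(symbol, number)
    backward.insert(symbol[::-1], number)   # reversed copy for postfix matching

print(sorted(forward.prefix("Morishita")))          # [0, 1]
print(sorted(backward.prefix("Industries"[::-1])))  # [0, 1]
```

Postfix matching is obtained here by reversing both the stored symbols and the query. An infix query such as “Electric” has no single starting node in either TRIE, which is the limitation discussed next.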
If infix matching is desired, efficient retrieval processing is difficult with a TRIE, and conventionally a method such as that disclosed in Japanese Laid-open Patent No. Hei 3-42774 has been employed.
In the method disclosed in Japanese Laid-open Patent No. Hei 3-42774, when a symbol dictionary is compiled, each symbol character string is divided character by character, and dictionary information recording the pair of the symbol number and the position at which the character appears in the symbol is created for every character. When the symbol dictionary is retrieved, the query character string is decomposed character by character, the dictionary information corresponding to each character is retrieved, and the set of symbol numbers for which the symbol numbers are identical and the appearance character positions are consecutive is issued as the retrieval result.
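The character-by-character scheme described above can be sketched as follows. This is only an illustration of the general idea under simplifying assumptions (an in-memory index, list positions used as symbol numbers); the function names `build_char_index` and `infix_search` are hypothetical and do not come from the cited publication.

```python
from collections import defaultdict


def build_char_index(symbols):
    """For every character, record (symbol number, appearance position) pairs."""
    index = defaultdict(set)
    for number, symbol in enumerate(symbols):
        for position, ch in enumerate(symbol):
            index[ch].add((number, position))
    return index


def infix_search(index, query):
    """Return the numbers of symbols that contain query as a substring."""
    if not query:
        return set()
    # Every occurrence of the first query character is a candidate start.
    candidates = set(index.get(query[0], ()))
    for offset, ch in enumerate(query[1:], start=1):
        postings = index.get(ch, set())
        # Keep a candidate only if the next query character appears in the
        # same symbol at the consecutive position.
        candidates = {(n, p) for (n, p) in candidates if (n, p + offset) in postings}
        if not candidates:
            break
    return {n for n, _ in candidates}


symbols = [
    "MorishitaElectricIndustries",
    "MorishitaCommunicationsIndustries",
    "KyushuMorishitaElectric",
]
index = build_char_index(symbols)
print(sorted(infix_search(index, "Electric")))  # [0, 2]
```

Note that the index holds one pair for every character of every symbol, and that a query made of frequent characters forces large posting sets to be read and checked.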
In this conventional compiling method of a symbol dictionary, however, when there are more than tens of thousands of distinct symbols, the compiled symbol dictionary file is more than twice as large as the symbol data to be retrieved, and the method is difficult to use if the usable capacity of the memory device is limited.
Also, in the conventional retrieving method of a symbol dictionary, if a symbol that is long and contains many high-frequency characters is retrieved, the quantity of intermediate data to be read out from the symbol dictionary is tremendous, and the retrieval speed is reduced by these read operations and by the checking of consecutive positions.
This disadvantage of the conventional retrieving method may be somewhat alleviated by creating and recording the dictionary information for every chain of N consecutive characters (“N-gram”) instead of for every single character. However, when retrieving a symbol such as “199800000123A”, which begins with the year and is followed by a multi-digit integer composed mostly of consecutive zeros, there are many symbols that incidentally coincide in the first 10 characters or more. If N is about 2 to 4, the amount of data to be read out from the symbol dictionary is therefore still large, and the retrieval speed is reduced.
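The N-gram variant and its remaining weakness can be illustrated with the sketch below. The synthetic part codes, the function names, and the count of postings read (used here as a rough stand-in for the amount of data read from the dictionary) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_ngram_index(symbols, n):
    """For every N-gram, record (symbol number, start position) pairs."""
    index = defaultdict(set)
    for number, symbol in enumerate(symbols):
        for position in range(len(symbol) - n + 1):
            index[symbol[position:position + n]].add((number, position))
    return index


def ngram_infix_search(index, query, n):
    """Return (matching symbol numbers, number of postings read)."""
    if len(query) < n:
        raise ValueError("query must contain at least n characters")
    candidates = None
    postings_read = 0
    for offset in range(len(query) - n + 1):
        postings = index.get(query[offset:offset + n], set())
        postings_read += len(postings)
        if candidates is None:
            candidates = set(postings)  # occurrences of the first N-gram
        else:
            # The N-gram at this offset must start `offset` characters after
            # the candidate start position in the same symbol.
            candidates = {(s, p) for (s, p) in candidates if (s, p + offset) in postings}
    return {s for s, _ in candidates}, postings_read


# Hypothetical part codes: year, zero-padded serial number, suffix letter.
codes = [f"1998{serial:08d}A" for serial in range(1, 201)]
index = build_ngram_index(codes, 3)
hits, postings_read = ngram_infix_search(index, "199800000123A", 3)
print(sorted(hits))                 # [122]
print(postings_read > len(codes))   # True
```

Although only one symbol matches, N-grams such as “199” and “000” occur in every code, so the number of postings read is several times the number of symbols.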
Further, if the number N of characters in the chain is increased, the number of distinct N-character chains that appear increases abruptly, so that the symbol dictionary is hard to compile and the capacity of the compiled symbol dictionary increases because of the housekeeping information. In the conventional retrieval method of a symbol dictionary, when a symbol that is long and contains many high-frequency characters is retrieved, complete matching takes the longest processing time among the four matching modes, and in an application where complete matching accounts for the majority of queries, the average retrieval speed is reduced.
Thus, in the conventional compiling method of a symbol dictionary, the compiled symbol dictionary file is more than twice as large as the symbol data to be retrieved, and the method is difficult to use if the usable capacity of the memory device is limited.
Moreover, in the conventional retrieval method of a symbol dictionary, if a symbol that is long and contains many high-frequency characters is retrieved, the amount of data to be read out from the symbol dictionary is tremendous, and the retrieval speed is reduced.
If the number N of characters in the chain is increased, the number of distinct N-character chains that appear increases abruptly, it is hard to compile a symbol dictionary with small housekeeping information, and the capacity of the compiled symbol dictionary increases.
In a compiling method of a symbol dictionary of the invention, a meta-symbol dictionary, which gathers shorter symbols called “meta-symbols” for covering the symbols in the symbol data, is compiled automatically, and each symbol in the symbol data is covered with the meta-symbols in this meta-symbol dictionary. By compiling the meta-symbol appearance information recorded for every meta-symbol, the information on how each symbol is covered can be retrieved at high speed for matching modes up to and including infix matching, and the size of the compiled symbol dictionary can be reduced.
In a retrieving method of a symbol dictionary of the invention, a query string is covered with meta-symbols by retrieving the meta-symbol dictionary contained in the compiled symbol dictionary file, and the retrieval results of the right and left extension meta-symbols of the original covering meta-symbols are added to this covering result. High speed retrieval is then possible for all matching modes, including infix matching, by seeking the set of symbol numbers commonly contained in every element set of the covering results that cover the query string or its right and left extension character strings. Moreover, in an application where complete matching accounts for the majority of queries, symbol retrieval is possible without decreasing the average retrieval speed.