Retrieving information on a character string from a dictionary has long been a widely used operation, particularly in text processing applications such as spelling correction, kana-kanji (Japanese syllabary) conversion, and keyword retrieval. The data structure and the method of dictionary retrieval are therefore primary factors in determining the speed of processing. Accordingly, a structure and a method are desired that achieve high-speed retrieval while keeping the storage cost at an acceptable level.
Among the methods proposed for this purpose, the most widely used data structure is called a TRIE. The TRIE is a type of tree structure; the time required to look up a dictionary entry depends almost entirely on the length of the input character string, and the data compression efficiency is relatively good. The TRIE is disclosed in Knuth D. E., The Art of Computer Programming, Vol. 3, Chapter 6, Sorting and Searching, Addison-Wesley, 1973.
FIG. 1 schematically shows a TRIE dictionary, taking names written in kana characters (Japanese syllabary) as an example. FIG. 1(A) shows the portion corresponding to names that start with the syllable "ma," e.g., Matsushita, Matsuki, Matsuda, Masuda, and Matsushima.
Words whose left substrings, viewed from their start, are the same are collected together and expressed as a tree structure in which one character corresponds to one node. The set of characters that can follow a given substring is connected to a child link extending from a parent node (the node corresponding to the last character of that left substring), and the elements of the character set are interconnected by sibling links. In the example in this figure, the set of characters that can start a word is first connected through a child link of a root node 10, and its elements are joined by sibling links to form a list. Next, for the characters "su" and "tsu," which in this example can follow the first character "ma," nodes 13 and 14 are located ahead of a child link 11 of "ma" and are joined by a sibling link 12. In other words, the words having the left substring "ma" are collected into a tree structure having the node for "ma" as a parent and the nodes 13 and 14 as children. Ahead of the child link 15 of the node 14 are nodes corresponding to the three characters "ki," "shi," and "da," which form the set of characters that can follow the left substring "matsu" and which are interconnected by sibling links 16 and 17. Thus, the TRIE expresses a set of words by means of nodes corresponding to the characters and child and sibling links connecting the nodes.
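The child-link and sibling-link layout described above can be sketched as follows. This is a minimal illustration, not the structure of FIG. 1 itself: the names `Node`, `insert`, `siblings`, and the `is_word_end` flag are assumptions introduced here, and each word is given as a list of romanized kana syllables.

```python
class Node:
    """One TRIE node: a character plus one child link and one sibling link."""

    def __init__(self, char):
        self.char = char
        self.child = None         # first character that can follow this node
        self.sibling = None       # next alternative character at the same position
        self.is_word_end = False  # assumption: marks the end of a stored word


def insert(root, word):
    """Insert a word (a list of syllables), reusing shared left substrings."""
    parent = root
    for ch in word:
        prev, node = None, parent.child
        # Trace the sibling links looking for an existing node for `ch`.
        while node is not None and node.char != ch:
            prev, node = node, node.sibling
        if node is None:
            node = Node(ch)
            if prev is None:
                parent.child = node   # first child of this parent
            else:
                prev.sibling = node   # append to the sibling list
        parent = node
    parent.is_word_end = True


def siblings(node):
    """Collect the characters reached by tracing sibling links from `node`."""
    chars = []
    while node is not None:
        chars.append(node.char)
        node = node.sibling
    return chars


root = Node(None)  # root node; it carries no character
for name in (["ma", "tsu", "shi", "ta"], ["ma", "tsu", "ki"],
             ["ma", "tsu", "da"], ["ma", "su", "da"],
             ["ma", "tsu", "shi", "ma"]):
    insert(root, name)

ma = root.child
print(siblings(ma.child))        # characters that can follow "ma"
print(siblings(ma.child.child))  # characters that can follow "matsu"
```

The order of the sibling lists in this sketch follows insertion order, which need not match the order drawn in the figure.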
Retrieval begins by extracting the first character of an input character string and searching the set of characters connected to the child link extending from the root node of the dictionary. More particularly, the input character is compared with each node of the dictionary (that is, with the character corresponding to that node) while the sibling links are traced in sequence. If the characters coincide, the child link of that node is traced and the next character of the input character string is compared with the set of characters that follows. If they do not coincide, the sibling links are traced further in search of a character that coincides with the input character. In a very large TRIE, the chain of sibling links can become long and retrieval can take substantial time; in such cases the TRIE is used together with a hash method applied to a certain number of characters from the start of the word. In addition, if the word set (subtree) corresponding to a first character or hash value is stored in a contiguous region, then even when the TRIE is held in external storage the number of random accesses remains relatively small and the access time does not increase greatly. Therefore, as long as the sibling links are not excessively long, words of various lengths can all be retrieved at high speed even if the word length is not known in advance.
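The retrieval of a settled character string described above can be sketched as follows. It is a minimal illustration under assumptions: the names `Node`, `insert`, `lookup`, and the `is_word_end` flag are introduced here for the example, and the hash method and contiguous storage mentioned above are not modeled.

```python
class Node:
    """TRIE node with one child link and one sibling link (illustrative)."""

    def __init__(self, char):
        self.char = char
        self.child = None
        self.sibling = None
        self.is_word_end = False  # assumption: marks the end of a stored word


def insert(root, word):
    """Insert a word given as a list of syllables."""
    parent = root
    for ch in word:
        prev, node = None, parent.child
        while node is not None and node.char != ch:
            prev, node = node, node.sibling
        if node is None:
            node = Node(ch)
            if prev is None:
                parent.child = node
            else:
                prev.sibling = node
        parent = node
    parent.is_word_end = True


def lookup(root, word):
    """Return True if `word` is stored in the TRIE."""
    parent = root
    for ch in word:
        node = parent.child
        # Compare the input character with each node along the sibling links.
        while node is not None and node.char != ch:
            node = node.sibling
        if node is None:
            return False  # sibling list exhausted without a coinciding character
        parent = node     # coincidence: trace the child link for the next character
    return parent.is_word_end


root = Node(None)
for name in (["ma", "tsu", "shi", "ta"], ["ma", "tsu", "ki"],
             ["ma", "tsu", "da"], ["ma", "su", "da"],
             ["ma", "tsu", "shi", "ma"]):
    insert(root, name)

print(lookup(root, ["ma", "tsu", "da"]))  # stored word
print(lookup(root, ["ma", "tsu"]))        # only a left substring of stored words
```

The number of outer-loop steps equals the length of the input, which reflects why the lookup time depends mainly on the input length as long as the sibling lists stay short.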
It is to be noted that the retrieval character string will hereinafter be referred to, as circumstances require, as a character candidate lattice. The meaning of this term is explained by taking as an example a regular expression such as [maarakaku]tsu[shimi][a-ko]. In this regular expression, each of [maarakaku], tsu, [shimi], and [a-ko], which are matched individually and in sequence, is called a column and, as will become clear, each column can specify not only a single character but also a plurality of characters. If the characters that can match within a column are laid out in the vertical direction and the sequence of columns is laid out in the horizontal direction, a two-dimensional arrangement is obtained. This is why the character string is called a lattice.
The above description concerned the retrieval of a settled character string. A description will now be given, with the aid of the candidate character string lattice defined above, of the case in which the input character string contains ambiguity, such as the case in which a plurality of characters is possible in one column or the case in which there is a wild card character that matches an arbitrary character. Retrieval requests for such input occur frequently in actual applications, for example when an ambiguous retrieval is required as a result of character recognition or when users do not remember the spelling of a word correctly. Such input can be expressed generally by a regular expression. For example, the regular expression in which the first character is "ma" or "ya," the second character is ambiguous, and the following characters are "shita" is "[maya]?shita." In order to accept such input in the retrieval of a dictionary expressed with the TRIE:
If a column allows a plurality of characters, the sibling links are traced for each candidate character. For each candidate found, the child link is traced, and the character comparisons for the next column are performed in parallel.
In the case of a wild card matching all characters, the child link is traced for every character of the character set, and each is followed by a comparison with the characters in the next column.
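The two rules above can be sketched as a recursive search over a lattice. This is an illustration under assumptions, not the procedure of the text verbatim: the lattice is represented as a list of columns, each column being a set of candidate characters or the hypothetical marker `WILDCARD`; the sketch traces each sibling list once and tests membership in the column rather than tracing it once per candidate; and the name "Yamashita" is added only to exercise the "ya" branch.

```python
WILDCARD = None  # assumption: marker for a column that matches any character


class Node:
    """TRIE node with one child link and one sibling link (illustrative)."""

    def __init__(self, char):
        self.char = char
        self.child = None
        self.sibling = None
        self.is_word_end = False


def insert(root, word):
    """Insert a word given as a list of syllables."""
    parent = root
    for ch in word:
        prev, node = None, parent.child
        while node is not None and node.char != ch:
            prev, node = node, node.sibling
        if node is None:
            node = Node(ch)
            if prev is None:
                parent.child = node
            else:
                prev.sibling = node
        parent = node
    parent.is_word_end = True


def search(root, lattice):
    """Return every stored word that matches the lattice column by column."""
    results = []

    def walk(parent, i, prefix):
        if i == len(lattice):
            if parent.is_word_end:
                results.append(prefix)
            return
        column = lattice[i]
        node = parent.child
        while node is not None:  # trace the sibling links
            if column is WILDCARD or node.char in column:
                # Matching branch: trace the child link into the next column.
                walk(node, i + 1, prefix + [node.char])
            node = node.sibling

    walk(root, 0, [])
    return results


root = Node(None)
for name in (["ma", "tsu", "shi", "ta"], ["ma", "tsu", "shi", "ma"],
             ["ma", "su", "da"], ["ya", "ma", "shi", "ta"]):
    insert(root, name)

# Lattice for the regular expression "[maya]?shita".
lattice = [{"ma", "ya"}, WILDCARD, {"shi"}, {"ta"}]
for match in search(root, lattice):
    print("".join(match))
```

Each recursive call corresponds to one traced child link, so the number of calls grows with the number of branches that survive each column.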
Now assume that, for the input, n[i] candidate characters exist at a certain column i, and that, at the level of the dictionary corresponding to column i, N[i] characters exist per node. The work quantity at column i is then proportional to T[i], which is expressed by the following equation: ##EQU1## where F[i-1] represents the number of branches when moving from column i-1 to column i, i.e., the number of child links traced, and E(x) represents the expected value of x. Strictly speaking, the number of branches F[i] at column i depends on the frequency of each character at each column, since F[i] counts the cases in which the character comparison succeeds and the search advances to the next column. If the frequency is assumed to be almost constant, F[i] can be considered proportional to the product of the previous number of branches F[i-1], the number of candidate characters n[i] at the corresponding column of the input character string, and the expected value E(N[i]) of the number of characters connected to the current nodes in the dictionary, divided by the total number of character categories Nc. F[i] can therefore be expressed as follows:
$$F[i] = \frac{n[i]}{Nc} \times E(N[i]) \times F[i-1] \qquad \text{[Equation 2]}$$
where F[0] = 1.
As is clear from these equations, the work quantity T[i] is proportional to the sequential product F[0] × F[1] × ... × F[i-1], so the total work quantity depends on how far the number of branches F[i] can be reduced. It is to be noted that, in Equation (2), n[i] depends only on the retrieval character string lattice, while E(N[i]) depends only on the TRIE structure and is independent of the retrieval character string lattice.
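As a rough numerical illustration of Equation (2), the recurrence for the number of branches can be evaluated as follows. The function name `branch_counts` and all of the numbers used (a character set of 50 categories, the per-column candidate counts, and the expected fan-outs) are hypothetical values chosen only for the example; they are not taken from any dictionary.

```python
def branch_counts(n, expected_N, Nc):
    """Evaluate F[i] = n[i] / Nc * E(N[i]) * F[i-1] with F[0] = 1.

    n          -- n[i]: number of candidate characters in column i of the lattice
    expected_N -- E(N[i]): expected number of characters per node at that level
    Nc         -- total number of character categories
    Returns the list [F[0], F[1], ..., F[len(n)]].
    """
    F = [1.0]
    for n_i, e_i in zip(n, expected_N):
        F.append(n_i / Nc * e_i * F[-1])
    return F


# Hypothetical lattice "[maya]?shita": two candidates, a wild card
# (all Nc characters), then two settled characters.
Nc = 50
n = [2, Nc, 1, 1]
expected_N = [40, 10, 5, 3]  # hypothetical fan-out of the TRIE per level

for i, f in enumerate(branch_counts(n, expected_N, Nc)):
    print(f"F[{i}] = {f:.3f}")
```

In this sketch the wild card column multiplies the number of branches by the full expected fan-out E(N[i]), whereas a settled column multiplies it by E(N[i]) / Nc, which shows how the factor n[i] contributed by the lattice governs the growth of the work.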