1. Field of the Invention
The present invention relates to an apparatus and method, for use in a character string retrieving device, etc., of collectively determining whether or not at least one symbol string exists in data in a given text, etc.
2. Description of the Related Art
Recently, users have come to require a process of collectively determining whether or not symbol strings exist in data in a given text, etc. A symbol string refers to a sequence of characters and other symbols. That is, a character string is one kind of symbol string. The determination function is often referred to as multi-string matching or multi-string retrieval.
Efficient methods used with a conventional multi-string matching apparatus include the Aho-Corasick (AC) method suggested by Aho et al. (A. V. Aho and M. J. Corasick: "Efficient String Matching: An Aid to Bibliographic Search" CACM Vol. 18 No. 6 1975), the method of structuring a Deterministic Finite Automaton (DFA) corresponding to the AC method, and the Flying Algorithm for Searching Terms (FAST) method suggested by Uraya ("High-speed Character String Collating Algorithm FAST" published in Information Processing Academy Publication Vol. 30 No. 9 1989, Japanese Laid-open Patent Publication S64-74619, Japanese Patent Application S62-231091).
Described first are the AC method and the multi-string matching algorithm obtained by converting the AC method into a DFA.
The AC method collates character strings using a finite automaton called a Pattern Matching Machine (PMM).
The collating operation according to the AC method is described as follows.
First, a state number is set to 1 as an initial value. Then, symbols are read character by character from an input text. For each input symbol, it is determined to which state the current state is switched. When no transition is assigned to the current state for the input symbol, it is assumed that the collation has failed, and a transition is directed from the current state to its possible state in case of transition failure (a failure state). If no transition is assigned to that failure state for the input symbol either, then a transition is directed to the failure state of that state, and so on.
Since transitions are assigned for all symbols from the initial state 1, the transition failure stops at the initial state at worst. Thus, the transition is repeated for each input symbol from the text. When a symbol string to be accepted is defined for a state, the symbol string and its position in the text are output.
FIG. 1A shows the PMM of the AC method in which three symbol strings {ab, bc, bd} are used as retrieval keys (keys). The PMM shown in FIG. 1A contains 6 states, that is, 1, 2, 3, 4, 5, and 6. Solid line arrows show a next transition state. Broken line arrows show possible states in case of transition failure. "¬a,b" indicates an input symbol other than a and b. The states 4, 5, and 6 (s4, s5, s6) are assigned "ab", "bc", and "bd" as output keywords respectively.
FIG. 1B shows the operations of the PMM responsive to the input symbol string "cabcz". When the symbol c is input, it corresponds to an input symbol other than a and b. Therefore, the next state remains 1 and no output is generated. Next, when the symbol a is input, the current state is switched to the next state 2. When the symbol b is input, it is switched to the next state 4. At this time, the symbol string "ab" defined for the state 4 is output.
However, since the state 4 is not assigned its next transition state, it is temporarily switched to a possible state 3 in case of transition failure when the symbol c is input, and then a next transition state is searched for. Since the state 5 is defined as the next state by the symbol c, the transition is directed to this state and the symbol string "bc" is output. When the symbol z is next input, the transition is directed to the state 1, thereby terminating the entire operation.
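The collating operation described above can be sketched as follows. This is a minimal illustration, not the implementation of Aho et al.: the function names build_pmm and search, the breadth-first state numbering (chosen so that the keys {ab, bc, bd} yield the state numbers described for FIG. 1A), the 0-based text positions, and the brute-force way of finding failure states are all assumptions made for the example.

```python
def build_pmm(keys):
    """Build goto, failure and output tables; state 1 is the initial state."""
    # Number the key prefixes breadth-first (assumption made to mirror FIG. 1A).
    states = {"": 1}
    for depth in range(1, max(map(len, keys)) + 1):
        for key in keys:
            if len(key) >= depth and key[:depth] not in states:
                states[key[:depth]] = len(states) + 1

    goto = {s: {} for s in states.values()}
    failure, output = {}, {}
    for prefix, s in states.items():
        if prefix:
            goto[states[prefix[:-1]]][prefix[-1]] = s
        # Failure state: longest proper suffix that is also a prefix of a key.
        failure[s] = 1
        for i in range(1, len(prefix)):
            if prefix[i:] in states:
                failure[s] = states[prefix[i:]]
                break
        # Keys accepted in this state: every key that ends here.
        output[s] = [k for k in keys if prefix and prefix.endswith(k)]
    return goto, failure, output


def search(text, keys):
    """Return (hits, transitions); hits are (0-based start position, key)."""
    goto, failure, output = build_pmm(keys)
    state, hits, transitions = 1, [], 0
    for pos, sym in enumerate(text):
        # Follow failure states until a transition exists or state 1 is reached.
        while state != 1 and sym not in goto[state]:
            state = failure[state]
            transitions += 1
        state = goto[state].get(sym, 1)   # state 1 accepts every symbol
        transitions += 1
        hits.extend((pos - len(k) + 1, k) for k in output[state])
    return hits, transitions


hits, transitions = search("cabcz", ["ab", "bc", "bd"])
print(hits, transitions)
```

For the input "cabcz" this sketch reports "ab" at position 1 and "bc" at position 2, and counts 7 transitions for 5 input symbols because two failure transitions (from state 4 and from state 5) are taken.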
Thus, according to the AC method, an additional transition is made each time a transition failure occurs, that is, each time an input symbol arrives for which the next state is not defined. Therefore, n input symbols bring fewer than 2n transitions in the finite state machine (finite state automaton). Normally, the probability that the leading character of a key makes a hit increases with an increasing number of keys. Since the number of transition failures increases correspondingly, the collating speed in the AC method gradually becomes slower with the increasing number of keys.
As described above, the process speed of the AC method is slowed by transition failures for which the next state is not specified. According to the DFA method, a next state is uniquely determined in response to an input symbol. Therefore, n input symbols always bring exactly n transitions in the finite state machine, and a collation is performed at a high speed. Aho et al. disclose a method of converting the state transition machine of the AC method into a DFA.
FIG. 1C shows the finite state machine corresponding to the state transition machine in the AC method for the symbol strings {ab, bc, bd}. In FIG. 1C, the "state" indicates the current state, and the "next" indicates the next state reached when the symbol in the "input" is entered. States s1, s2, s3, s4, s5, and s6 respectively correspond to states 1, 2, 3, 4, 5, and 6. The representation "¬a,b" indicates symbols other than a or b.
FIG. 1D shows the operations of the finite state machine for the input symbol string "cabcz". The initial state is 1. No transition failures such as those shown in FIG. 1B occur in the state transitions shown in FIG. 1D. The number of state transitions equals the number of symbols, 5, contained in the input symbol string "cabcz".
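The DFA form can be illustrated with a short sketch. It does not reproduce the table of FIG. 1C, which is not given in the text; instead it derives a transition table directly from the keys, under the assumption that each state stands for the longest suffix of the text read so far that is also a prefix of some key. The names build_dfa and run_dfa and the 0-based positions are hypothetical.

```python
def build_dfa(keys):
    """Return (table, output); table[state][symbol] is the next state."""
    # Breadth-first numbering of key prefixes; state 1 is the empty prefix.
    states = {"": 1}
    for depth in range(1, max(map(len, keys)) + 1):
        for key in keys:
            if len(key) >= depth and key[:depth] not in states:
                states[key[:depth]] = len(states) + 1

    alphabet = sorted(set("".join(keys)))

    def longest_known_suffix(s):
        # The empty string is always a state, so this always terminates.
        return next(s[i:] for i in range(len(s) + 1) if s[i:] in states)

    table = {
        states[p]: {a: states[longest_known_suffix(p + a)] for a in alphabet}
        for p in states
    }
    output = {states[p]: [k for k in keys if p and p.endswith(k)] for p in states}
    return table, output


def run_dfa(text, keys):
    """Return (hits, trace); exactly one transition per input symbol."""
    table, output = build_dfa(keys)
    state, hits, trace = 1, [], []
    for pos, sym in enumerate(text):
        # Symbols that occur in no key always lead back to state 1.
        state = table[state].get(sym, 1)
        trace.append(state)
        hits.extend((pos - len(k) + 1, k) for k in output[state])
    return hits, trace


hits, trace = run_dfa("cabcz", ["ab", "bc", "bd"])
print(hits, trace)
```

For "cabcz" the trace of states is 1, 2, 4, 5, 1, that is, five transitions for five symbols, and the same keys "ab" and "bc" are reported as in the AC method.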
Also in the FAST method known as a high-speed collating method, character strings are collated as in the AC method by preparing a PMM for an input key set. The collating operation according to FAST method is described as follows.
First, the state number is set to 0 as an initial state. The collation start position in the input text is set at a position apart from the beginning of the text by a shortest key length, with the length of the shortest key in an input key set defined as the shortest key length.
Next, data is read character by character from the collation start position to the left on the text. It is determined according to the input symbol to which state the current state is switched. If the transition is not defined, the collation start position is shifted to the right by a predetermined amount corresponding to an input symbol, and then the collation is restarted.
Thus, the text is scanned from right to left as long as the state transition is defined for an input symbol. Then, the pattern of a character string is extracted. When the transition is not defined, the collation start position is shifted to right in the text by the amount of shift defined for the input symbol.
FIG. 2A shows the PMM of the FAST method in which three symbol strings {state, east, smart} are used as retrieval keys. The PMM shown in FIG. 2A comprises 14 states, that is, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13. Solid line arrows indicate a transition direction, and broken line arrows indicate a shifted direction.
A state transition is defined in reverse order of the sequence of symbols contained in each retrieval key. The "Depth" indicates the depth of each state of the PMM. For the states 5, 9, and 13 (S5, S9 and S13), the symbol strings "state", "east", and "smart" are respectively defined as output keywords.
FIG. 2B is a table showing next transition states and the amount of shift when a symbol is input to each state of the PMM. In FIG. 2B, the number in the first row indicates the state number, and the symbol in the first column indicates an input symbol. The "(Other)" indicates an input symbol other than a, e, m, r, s, and t. On this table, a positive number element indicates the state number of a next transition state corresponding to the input symbol. A negative number element indicates the amount of a shift obtained through a corresponding input symbol.
FIG. 2C shows the operation of the PMM corresponding to the input symbol string `aaseastate`. The initial state is 0. In this case, "east" is the shortest symbol string of the symbol strings "state", "east", and "smart". Since its length is 4 characters, the shortest key length is 4. Thus, a collation process is performed from right to left with the position t the shortest key length of 4 characters away from the rightmost point of the input symbol string defined as a collation start position.
If the collation process fails, then an absolute value of the amount of the shift is calculated by multiplying the amount of the shift defined for the input symbol by -1, and the collation start position is shifted to the right by the resultant amount. Then, the state number is set to 0 and the collation is resumed.
When the symbol t is entered at the initial state 0 written at the position of the symbol t of the input symbol string shown in FIG. 2C, a transition is directed to the state 6 according to the table of FIG. 2B. When the symbol s is input, a transition is directed to the state 7. Next, when the symbol a is input, a transition is directed to the state 8. When the symbol e is input, a transition is directed to the state 9 and the symbol string "east" defined for the state 9 is output.
When the symbol s is input next, the collation start position is shifted to the right by 7 because -7 is set in the state 9 as the amount of a shift instead of a next transition state. Then, control is returned to the initial state 0 and the collation is resumed with the position of the symbol e after a shift operation as a new collation start position. The collation process continues as described above, and the symbol string "state" is output when a transition is directed to the state 5.
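The right-to-left collation can be illustrated with the following sketch. It is not the FAST method itself: the shift table of FIG. 2B is not reproduced in the text, so a simpler shift rule in the style of the Set Horspool algorithm (a shift chosen from the symbol at the collation start position, measured from that position) is swapped in as an assumption. The names build_reverse_trie and search_right_to_left are hypothetical, and positions are 0-based.

```python
def build_reverse_trie(keys):
    """Build a PMM over the reversed keys; state 0 is the initial state."""
    trie = [{}]      # trie[state] -> {symbol: next state}
    output = {}      # state -> key accepted in that state
    for key in keys:
        state = 0
        for sym in reversed(key):
            if sym not in trie[state]:
                trie.append({})
                trie[state][sym] = len(trie) - 1
            state = trie[state][sym]
        output[state] = key
    return trie, output


def search_right_to_left(text, keys):
    """Return (0-based start position, key) pairs found in text."""
    trie, output = build_reverse_trie(keys)
    shortest = min(map(len, keys))
    # Assumed shift rule: distance from a symbol to the end of a key,
    # capped at the shortest key length. Not the FAST shift table.
    shift = {}
    for key in keys:
        for j, sym in enumerate(key[:-1]):
            shift[sym] = min(shift.get(sym, shortest), len(key) - 1 - j)

    hits = []
    end = shortest - 1               # first collation start position
    while end < len(text):
        state, pos = 0, end
        # Scan to the left while a transition is defined.
        while pos >= 0 and text[pos] in trie[state]:
            state = trie[state][text[pos]]
            if state in output:
                hits.append((pos, output[state]))
            pos -= 1
        # Shift the collation start position to the right and restart.
        end += shift.get(text[end], shortest)
    return hits


print(search_right_to_left("aaseastate", ["state", "east", "smart"]))
```

With the keys inserted in the order {state, east, smart}, the reverse trie in this sketch has 14 states, and states 5, 9, and 13 accept "state", "east", and "smart". For "aaseastate" the sketch reports "east" at position 3 and "state" at position 5. Because the shift rule differs, the individual shift amounts need not match those of FIG. 2B.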
The above described process of retrieving multiple character strings is performed for a database, a word processor, and a device such as a full-text search device, etc.
A full-text search device refers to an apparatus which, when data is retrieved using a full-text search index, retrieves a character string to check whether or not the retrieval result is correct. A full-text search index here refers to an index, such as a signature file or an inverted file holding no word appearance positions in the text, for which the data returned by a retrieval process is not always a correct answer to the input keyword.
For example, assume that a keyword `John Smith` is retrieved from an English index. Since an index is a collection of words delimited by a space, `John Smith` is equal to `John AND Smith`. However, if a document is searched with a search condition of `John AND Smith`, the retrieval result also contains `John` and `Smith` which appear separately, thereby obtaining excess results. In this case, whether or not the result is correct can be checked by retrieving a character string.
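The check described in this example can be sketched as follows. The document collection, the function names index_candidates and verify, and the way the index is simulated (a word-set lookup without positions) are all hypothetical and serve only to illustrate the two steps.

```python
def index_candidates(documents, phrase):
    """Simulate an index without word positions: AND of the phrase's words."""
    words = phrase.split()
    return [
        doc_id
        for doc_id, body in documents.items()
        if all(word in body.split() for word in words)
    ]


def verify(documents, candidates, phrase):
    """Keep only candidates in which the phrase occurs as a character string."""
    return [doc_id for doc_id in candidates if phrase in documents[doc_id]]


# Hypothetical documents for illustration only.
documents = {
    1: "John Smith wrote the report",
    2: "John met Adam Smith yesterday",
    3: "Smith is absent",
}
candidates = index_candidates(documents, "John Smith")
print(candidates, verify(documents, candidates, "John Smith"))
```

Here the index step returns documents 1 and 2, and the character string check removes document 2, in which `John` and `Smith` appear separately.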
A problem with the above described conventional character string collating method arises in a speed-capacity relation in a portion corresponding to the state transition of the PMM.
In the AC method, the storage capacity can be reduced using a list structure to represent a state transition portion. However, pointers should be sequentially traced in the list structure, and therefore data are slowly accessed and the collating operation is performed at a lower speed.
Although the collation speed of the DFA-processed AC method is high, the table structure as shown in FIG. 1C should be used to indicate all state transitions defined for all input symbols. However, this requires a considerable storage capacity.
Assume that there are 256 types of input symbols (8-bit code), that the number of states is N, and that pointers are 4 bytes long. In a table form, a state requires 256 pointers to the next state, one pointer to a possible state in case of transition failure, and one pointer to an output symbol string. Therefore, a storage capacity of N × (256 + 1 + 1) × 4 bytes is required.
Normally, since the number of states N increases with the number of retrieval keys, a required storage capacity becomes large when the number of keys is large. Therefore, it is not practical to design a character string collating device based on the DFA-processed AC method.
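The storage estimate can be evaluated with a few lines of code. The function name dfa_table_bytes and the sample value of 100,000 states are assumptions chosen only to show the order of magnitude.

```python
def dfa_table_bytes(num_states, num_symbols=256, pointer_bytes=4):
    """Bytes for a table with, per state, one next-state pointer per symbol,
    one failure-state pointer and one output pointer."""
    return num_states * (num_symbols + 1 + 1) * pointer_bytes


print(dfa_table_bytes(6))        # the 6-state PMM for {ab, bc, bd}
print(dfa_table_bytes(100_000))  # a hypothetical large key set
```

Each state costs 1,032 bytes under these assumptions, so the 6-state example needs 6,192 bytes, while 100,000 states would need about 103 megabytes.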
Since the state transition or shift is similarly defined for all input symbols in the FAST method, the table structure as shown in FIG. 2B is required. Therefore, with the configuration of the character string collating apparatus based on the FAST method, a considerably large storage capacity is also required.