In recent years, document digitization has been progressing in a variety of fields, leading to demand for methods of searching documents efficiently. One such method matches the characters in a document against a pattern specified by a regular expression. Regular expressions, which are set forth in, e.g., Non-Patent Document 1, are a notation that represents the class of languages referred to as regular languages. It is well known in the art that string matching techniques that use regular expressions as search conditions are based on a DFA (deterministic finite automaton).
The string matching technique by the DFA is based upon a model of a state transition machine (an automaton). The state transition machine comprises a set of states and a state transition function. The state transition function determines the next state from the current state and the input character. In the string matching technique using the DFA, the state transition machine reads the input text character by character and makes a transition to the next state obtained by applying the state transition function to the tuple of the current state and the input character. This method allows matching to be performed by scanning the text only once without backtracking, thereby enabling high-speed string matching. When matching is performed against a plurality of conditions, a finite automaton with output (a Moore machine) is also employed: the DFA is extended so that an output is defined for each state, in order to distinguish which condition has been matched.
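The single-pass scan described above can be sketched as follows. This is a minimal illustration, not the table of any cited document: the three-state table `DELTA`, the output map `OUTPUTS`, and the function name `dfa_scan` are hypothetical, chosen so that the automaton reports every occurrence of the keyword "ab" over the alphabet {a, b}.

```python
# Hypothetical DFA that detects the keyword "ab" in text over {a, b}.
# DELTA maps (current state, input character) -> next state.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}

# Moore-style output: a label attached to a state, identifying which
# matching condition has been satisfied when that state is entered.
OUTPUTS = {2: "ab"}


def dfa_scan(text, delta, outputs, start=0):
    """Scan text once and return (end_index, label) for every match."""
    state = start
    hits = []
    for i, ch in enumerate(text):
        # Exactly one table lookup per input character, no backtracking.
        state = delta[(state, ch)]
        if state in outputs:
            hits.append((i, outputs[state]))
    return hits


print(dfa_scan("aabab", DELTA, OUTPUTS))
```

Because every (state, character) pair has an entry, the scan never retries a character; the cost of that guarantee is a table with one entry per pair.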
A DFA state transition function is determined by the regular expression that serves as the matching condition. A well-known procedure, set forth in, e.g., Non-Patent Document 1, first converts the regular expression into an NFA (nondeterministic finite automaton) and then converts the NFA into the DFA. While the string matching technique by the DFA provides high-speed processing, it has the drawback that the state transition table that implements the DFA state transition function tends to become enormous.
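The NFA-to-DFA step can be illustrated with the textbook subset construction. The sketch below is an assumption-laden example rather than the procedure of any cited document: the NFA in `MOVES` (a hand-built automaton for strings ending in "ab"), the empty epsilon map `EPS`, and the function names are all hypothetical. It also shows where the table size comes from, since the resulting DFA needs one entry per (state, character) pair.

```python
from collections import deque


def epsilon_closure(states, eps):
    """Return all NFA states reachable from `states` via epsilon moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(start, moves, eps, alphabet):
    """Build a DFA whose states are sets of NFA states.

    Returns (dfa_start, table), where table maps
    (dfa_state, character) -> dfa_state.
    """
    d_start = epsilon_closure({start}, eps)
    table, queue, seen = {}, deque([d_start]), {d_start}
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            targets = set()
            for s in current:
                targets |= moves.get((s, ch), set())
            nxt = epsilon_closure(targets, eps)
            table[(current, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return d_start, table


# Hypothetical NFA for strings over {a, b} that end in "ab".
MOVES = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
EPS = {}  # this small NFA happens to need no epsilon moves

dfa_start, dfa_table = subset_construction(0, MOVES, EPS, "ab")
dfa_states = {state for state, _ in dfa_table}
print(len(dfa_states), "DFA states,", len(dfa_table), "table entries")
```

In this toy case the table stays small, but in general the number of DFA states can grow exponentially in the number of NFA states, and the table grows with it.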
The matching condition shown in FIG. 52, which is disclosed in Patent Document 3, is taken as an example. FIG. 53 shows the failed function and the state transition table generated from the matching condition of FIG. 52 for a conventional finite automaton with output. As shown, a state transition table holding 90 tuples (18 states × 5 character kinds) needs to be generated.
As a method of solving this problem, Patent Document 1 and Patent Document 2 describe reducing the memory capacity required for the state transition table: after the state transition table based upon the AC (Aho-Corasick) technique is converted into a DFA, the transitions to the initial state and to its next state are removed from the table. The string matching techniques shown in Patent Document 1 and Patent Document 2, however, do not permit a general regular expression to be a matching target, because the matching target is limited to fixed string keywords.
Patent Document 3 also describes a method of reducing the state transition table, by defining a failed function in the DFA. With the method shown in Patent Document 3, however, a transition may fail again in the state reached through the failed function; in other words, transition failures can occur in a chain. In such a case, the failed function must be referenced repeatedly, which reduces the matching speed.
The matching condition shown in FIG. 52 is again taken as an example. FIG. 54 shows the failed function and the state transition table according to Patent Document 3 that are generated from the matching condition of FIG. 52.
Consider a case where the matching condition is that shown in FIG. 52 and the input string is “aaca.”
The method disclosed in Patent Document 3 first initializes the state to State 1. Next, the first character “a” is read, and the state makes a transition to State 3, as indicated in the column for the input character “a” in the row for State 1 of the state transition table. The second character “a” is then read, and the state similarly makes a transition from State 3 to State 6. Reading the third character “c” causes a transition from State 6 to State 10. When the fourth character “a” is read, however, State 10 has no transition destination corresponding to the character “a,” so the state first makes a transition to State 5, which is the destination defined by the failed function for State 10. Because State 5 has no transition destination corresponding to the character “a” either, the state makes a transition to State 2, which is the failure destination for State 5. State 2 likewise has no transition destination corresponding to the character “a,” so the state makes a transition to State 1, which is the failure destination for State 2. In State 1, State 3 exists as the transition destination corresponding to the character “a,” so the state makes a transition to State 3. As described above, four references to the state transition table, each with a state transition, are made for the fourth input character alone, so that a total of seven state transitions is required for four input characters. In this manner, the method according to Patent Document 3 may in some cases fail repeatedly in state transitions, and must look up a transition destination for every failure.
For this reason, the number of references to the state transition table increases, which degrades matching performance.
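The chained failures in this walk-through can be reproduced with a short simulation. The sketch below is illustrative only: `GOTO` and `FAIL` contain just the entries named in the text above (the full table of FIG. 54 is not reproduced here), so the result is meaningful only for the input “aaca”, and the function name `scan_with_failures` is hypothetical.

```python
# Only the transitions named in the walk-through (states numbered as in the text).
GOTO = {(1, "a"): 3, (3, "a"): 6, (6, "c"): 10}

# Failure destinations named in the walk-through: 10 -> 5 -> 2 -> 1.
FAIL = {10: 5, 5: 2, 2: 1}


def scan_with_failures(text, goto, fail, start=1):
    """Return the list of states visited, including failure transitions."""
    state = start
    path = [state]
    for ch in text:
        # Follow failure links until some state has a transition on ch
        # or a state without a failure link is reached.
        while (state, ch) not in goto and state in fail:
            state = fail[state]
            path.append(state)
        # If the current state has a transition on ch, take it; otherwise
        # the character is skipped and the state is unchanged.
        if (state, ch) in goto:
            state = goto[(state, ch)]
            path.append(state)
    return path


path = scan_with_failures("aaca", GOTO, FAIL)
print(path)           # states visited, starting from State 1
print(len(path) - 1)  # number of state transitions
```

The last four entries of the path are all caused by the fourth character, which is the repeated referencing that the text identifies as the source of the slowdown.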
Non-Patent Document 1    J. E. Hopcroft, J. D. Ullman, “Formal Languages and Their Relation to Automata,” Addison-Wesley (1969)
Patent Document 1    Japanese Unexamined Patent Application Publication 2004-103035
Patent Document 2    Japanese Unexamined Patent Application Publication 2004-103034
Patent Document 3    Japanese Unexamined Patent Application Publication 2994926