Pattern matching for determining whether or not a specific pattern exists in input data is an elemental technology in the field of information processing, and its applications are wide-ranging. For example, these applications include text search in a word processor, DNA analysis in biotechnology, detection of a computer virus lurking in email, and so forth.
One means for implementing pattern matching is a method using a finite automaton (also known as a finite state device or a finite state machine).
A finite automaton for pattern matching is created from a pattern or a set of patterns.
As an example, an NFA (Non-deterministic Finite Automaton) and a DFA (Deterministic Finite Automaton) that accept the three patterns “ABC”, “CAB”, and “ABCD” will be described.
FIG. 1 is a view showing one example of an NFA. Also, FIG. 2 is a view showing one example of a DFA. The difference between the NFA and the DFA will be described later.
A finite automaton for pattern matching starts from an initial state, and makes a transition to the next state through a branch corresponding to an input character. When a state (shown by double circles in the drawing) corresponding to the last character is reached, it is considered that a pattern is detected.
The above operation is repeatedly performed for all the characters from the beginning to the end of a text.
There are two expression types of finite automaton: NFA and DFA.
The DFA is a finite automaton where once the current state and an input are determined, the next state is uniquely determined, as indicated by the word “deterministic”. Meanwhile, the NFA is a finite automaton where the next state is not uniquely determined.
For example, when the NFA shown in FIG. 1 is in state 0, there are three transition destinations corresponding to an input character ‘A’: state 0, state 1, and state 2.
In a case where the NFA is operated on a sequential processing computer, when there exists a plurality of transition destinations from any given state, this state is put on a stack, and then one of the plurality of transition destinations is selected to make a state transition. Then, the NFA is tracked until there is no state transition or the end of the text is reached.
Afterwards, one of the states is extracted from the stack, a return is made to that state, and a transition destination different from the previous one is selected and a state transition is made.
The above operation is repeated until the stack becomes empty.
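The stack-based procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the automaton of FIG. 1: the state numbering, the helper names (build_nfa, destinations, backtracking_search), and the choice to push the untried transition destinations (rather than the branching state itself) onto the stack are all hypothetical.

```python
def build_nfa(patterns):
    """Build a hypothetical NFA: one chain of states per pattern, all
    starting from state 0. The numbering is illustrative only."""
    transitions = {}   # (state, character) -> list of next states
    accepting = {}     # accepting state -> pattern it detects
    next_state = 1
    for pattern in patterns:
        state = 0
        for ch in pattern:
            transitions.setdefault((state, ch), []).append(next_state)
            state = next_state
            next_state += 1
        accepting[state] = pattern
    return transitions, accepting


def destinations(transitions, state, ch):
    """State 0 loops on every character so a pattern may begin anywhere."""
    loop = [0] if state == 0 else []
    return loop + transitions.get((state, ch), [])


def backtracking_search(text, patterns):
    """Return sorted (end_position, pattern) pairs found by backtracking."""
    transitions, accepting = build_nfa(patterns)
    matches = set()
    stack = [(0, 0)]   # pending (state, text position) choices
    while stack:
        state, pos = stack.pop()          # return to a saved choice
        while True:
            if state in accepting:
                matches.add((pos, accepting[state]))
            if pos == len(text):
                break                     # end of the text reached
            dests = destinations(transitions, state, text[pos])
            if not dests:
                break                     # no state transition is possible
            for alternative in dests[1:]:
                stack.append((alternative, pos + 1))   # try these later
            state, pos = dests[0], pos + 1
    return sorted(matches)


print(backtracking_search("ABCD", ["ABC", "CAB", "ABCD"]))
```

For the text "ABCD" this sketch reports "ABC" ending at position 3 and "ABCD" ending at position 4. Each entry popped from the stack corresponds to one return to a past choice.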
In the case where the NFA is operated on a sequential processing computer as described above, the behavior of turning back to a past state and restarting a state transition, that is, backtracking, occurs. Due to the effect of backtracking, the search speed based on the NFA is lower than that based on the DFA.
Meanwhile, the number of states and the number of state transitions (number of branches) included in the DFA tend to be greater than those of the NFA. Therefore, the size of a memory for storing the DFA is greater than that of the NFA. Also, it is known that a large amount of computational effort is needed to create the DFA.
As discussed above, the only downside of the NFA is a decrease in search speed caused by backtracking. Backtracking arises from the restriction that a plurality of transition destinations cannot be searched simultaneously on a sequential processing computer. That is, parallel processing is required to suppress backtracking.
Consequently, a method has been proposed in which an NFA is represented by combinations of flip-flops and various gates (AND or OR), these combinations are embedded as a circuit in a device such as an LSI, and pattern matching is performed using the circuit. The method is described in a paper titled “Fast Regular Expression Matching using FPGAs” by R. Sidhu and V. K. Prasanna, Field-Programmable Custom Computing Machines (FCCM), Rohnert Park, Calif., USA, April 2001.
By circuitizing the NFA as described above, backtracking, which is the drawback of the NFA, can be eliminated. This is because all the flip-flops and the gates are operable in parallel in the circuit.
FIG. 3 is a view showing one example of input patterns. Also, FIG. 4 is a view showing an NFA for accepting the patterns as shown in FIG. 3. Also, FIG. 5 is a view showing one example of a circuit diagram representing the NFA as shown in FIG. 4 by flip-flops and gates.
The three patterns “AB*C”, “A[B|C]”, and “CAB” shown in FIG. 3 include regular expressions. A regular expression is an expression that can define simple patterns.
“B*” included in the first pattern “AB*C” represents a sequence of zero or more Bs. Hence, the first pattern matches text “AC”, “ABC”, “ABBC”, and so forth.
“[B|C]” included in the second pattern “A[B|C]” represents B or C. Hence, the second pattern matches text “AB” and “AC”.
As shown in FIG. 5, an input of an NFA circuit 10 is a character 22, which is an element of text 20 that is to be searched. The characters 22 of the text 20 are given to the NFA circuit 10 sequentially, one by one, starting from the leading character.
The NFA circuit 10 sets a pattern detection signal 30-X (1≦X≦3) to 1 each time it detects an X-th pattern. Meanwhile, other pattern detection signals 30-1˜30-3 are set to 0. Also, two or more of the pattern detection signals 30-1˜30-3 may become 1 at the same time due to the non-deterministic characteristic of the NFA.
The circuitization of the NFA as shown in FIG. 4 is carried out through a state circuitization step and a state transition circuitization step.
In the state circuitization step, one state of the NFA is replaced by one flip-flop. When the state is effective, an output value of the corresponding flip-flop is 1.
A comparator for comparing a character that is a transition condition (=character given to a branch of the NFA) and a character 22 is placed in the state transition circuitization step. When both characters match, the comparator outputs 1.
A logical AND of the output of the comparator and the output of the flip-flop of a transition source is taken, and the logical AND is used as an input of the flip-flop of a transition destination. Also, when there exists a transition from a plurality of states to one state, a logical OR of the logical ANDs from a plurality of transition sources is taken, and the logical OR is used as an input of the flip-flop of a transition destination.
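The two circuitization steps can be modeled in software as a rough sketch. The transition list below is a hypothetical NFA for the patterns “AB*C”, “A[B|C]”, and “CAB”, and is not necessarily the NFA of FIG. 4 or the circuit of FIG. 5. The flip-flop names and the functions clock and run are assumptions made for illustration.

```python
# Hypothetical NFA: (source flip-flop, branch character, destination flip-flop).
# "start" models the initial state, which is always active.
TRANSITIONS = [
    ("start", "A", "p1_a"), ("p1_a", "B", "p1_a"), ("p1_a", "C", "p1_hit"),     # AB*C
    ("start", "A", "p2_a"), ("p2_a", "B", "p2_hit"), ("p2_a", "C", "p2_hit"),  # A[B|C]
    ("start", "C", "p3_c"), ("p3_c", "A", "p3_ca"), ("p3_ca", "B", "p3_hit"),  # CAB
]
DETECT = ["p1_hit", "p2_hit", "p3_hit"]   # stand-ins for signals 30-1 to 30-3


def clock(flip_flops, ch):
    """One clock cycle. Every new flip-flop value is computed from the old
    outputs only, which models all gates operating in parallel."""
    nxt = {name: 0 for name in flip_flops}
    nxt["start"] = 1
    for src, cond, dst in TRANSITIONS:
        comparator = 1 if ch == cond else 0    # input character vs. branch character
        and_gate = flip_flops[src] & comparator
        nxt[dst] |= and_gate                   # OR over all transition sources
    return nxt


def run(text):
    """Feed the text one character per clock; return the detection signals
    observed after each character."""
    names = {"start"} | {t[0] for t in TRANSITIONS} | {t[2] for t in TRANSITIONS}
    flip_flops = {name: 0 for name in names}
    flip_flops["start"] = 1
    outputs = []
    for ch in text:
        flip_flops = clock(flip_flops, ch)
        outputs.append(tuple(flip_flops[d] for d in DETECT))
    return outputs


print(run("ABC"))   # [(0, 0, 0), (0, 1, 0), (1, 0, 0)]
print(run("AC"))    # [(0, 0, 0), (1, 1, 0)]
```

Because each cycle updates every flip-flop from the previous outputs at once, no stack and no backtracking are needed. The text "AC" raises the first and second detection signals in the same cycle.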
A pattern matching method using a circuitized NFA has an advantage of providing very fast search speed since a dedicated circuit for searching for desired patterns is configured.
However, the pattern matching method using a circuitized NFA has a problem in which if a plurality of patterns exist, it is difficult to identify patterns that match text.
Hereinafter, this problem will be described in detail.
The simplest pattern identification method is to check the values of pattern detection signals 30-1˜30-N outputted by the NFA circuit 10 as shown in FIG. 5 individually. Here, N is the number of patterns.
If the value of a pattern detection signal 30-X (1≦X≦N) is 1, this means that the X-th pattern is detected. In this method, there is a need to provide a circuit for checking all the values of 10000 pattern detection signals 30-1˜30-10000, assuming that the number of patterns is 10000. Thus, when the number of patterns is large, the feasibility of this method is low in terms of gate size, wiring capacity, and operation speed.
Consequently, as a more advanced pattern identification method, a conventional method using a priority encoder is disclosed in a paper titled “The design and implementation of a NFA pattern matching circuit for NIDS” by Ono Masato (Graduate School of System and Information Technology, University of Tsukuba) et al., IEIC Technical Report CPSY2004-17 (Institute of Electronics, Information and Communication Engineers).
The priority encoder is a circuit for encoding an input bit string to a numerical value. Generally, input N bits are converted into a numerical value between 0 and (N−1), and the numerical value after encoding is represented by log₂(N) bits.
Even when a plurality of bits in an input bit string becomes 1, each bit is given a priority to determine an output value. If a bit with a high priority is 1, a bit with a priority lower than that is ignored.
If no regular expression is included in the pattern, a bit string of pattern detection signals 30-1˜30-N is encoded to a numerical value by using the priority encoder, and the type of the pattern included in the text can be identified by referring to the numerical value after encoding.
For example, when N=8192, the numerical value after encoding is between 0 and 8191 and is represented in 13 bits (log₂(8192)=13).
That is, there is no need to directly refer to 8192 pattern detection signals 30-1˜30-8192, so the circuit scale is reduced.
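The behavior of a priority encoder can be illustrated with a short software model. This is a sketch only: the function names priority_encode and output_width are hypothetical, and the rule that a lower bit index has a higher priority is an assumption, since the priority order is not specified here.

```python
def priority_encode(bits):
    """Return the index of the highest-priority bit that is 1, or None if
    every bit is 0. Assumption: a lower index means a higher priority."""
    for index, bit in enumerate(bits):
        if bit:
            return index      # lower-priority bits that are 1 are ignored
    return None


def output_width(n):
    """Number of bits needed to represent the values 0 to n-1."""
    return (n - 1).bit_length()


signals = [0] * 8192
signals[5] = 1
signals[100] = 1                 # ignored: bit 5 has the higher priority
print(priority_encode(signals))  # 5
print(output_width(8192))        # 13
```

Only the 13-bit encoded value has to be examined instead of all 8192 signals. The bit at index 100 is lost, which is acceptable only when the priority order itself implies that the lower-priority pattern was also detected.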
However, in the conventional method using a priority encoder, it is not always possible to identify the type of pattern in a case where a regular expression is included in the pattern.
The reason for this will be described below by using a concrete example.
As described above, when using a priority encoder, a priority should be defined for each input bit. In other words, if a priority cannot be uniquely defined for each bit, the type of pattern included in the text cannot be accurately identified by referring to the numerical value after encoding.
FIG. 6 is a view showing combinations of values of pattern detection signals 30-1˜30-3 outputted by the NFA circuit 10 as shown in FIG. 5.
Focusing on the columns of pattern detection signal 30-1 and pattern detection signal 30-2 of input table 15 as shown in FIG. 6, there are four combinations of their values: 00, 01, 10, and 11. This means that either one or both of the values of pattern detection signal 30-1 and pattern detection signal 30-2 may be 1.
That is, it can be seen that because pattern detection signal 30-1 and pattern detection signal 30-2 are not in a subordinate relationship, priorities between them cannot be determined.
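The four combinations can be reproduced with a small check. This sketch uses Python's re module rather than the NFA circuit 10, and it does not reproduce FIG. 6. The pattern “A[B|C]” is written as A(?:B|C) on the assumption that it means "A followed by B or C". The function names and the sample texts are hypothetical.

```python
import re

# Assumed Python equivalents of the patterns "AB*C", "A[B|C]" and "CAB".
PATTERNS = [r"AB*C", r"A(?:B|C)", r"CAB"]


def detection_signals(text):
    """For each character position, return a tuple of 0/1 flags telling
    which patterns match some substring that ends at that position."""
    rows = []
    for end in range(1, len(text) + 1):
        row = tuple(
            int(any(re.fullmatch(p, text[start:end]) for start in range(end)))
            for p in PATTERNS
        )
        rows.append(row)
    return rows


def observed_pairs(texts):
    """Collect every (first signal, second signal) pair that occurs."""
    return {row[:2] for text in texts for row in detection_signals(text)}


print(sorted(observed_pairs(["AB", "ABC", "AC"])))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The text "AB" raises only the second signal, "ABC" raises only the first at its last character, and "AC" raises both. Neither signal implies the other, so no fixed priority between them lets an encoder report both detections.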
As explained above, in a pattern matching method using a circuitized NFA, no practical method has been established for identifying a pattern that matches text in a case where a regular expression is included in the pattern.