1. Field of the Invention
The present invention relates to a string matching device, and more particularly to a multi-stage parallel multi-character string matching device.
2. Description of the Related Art
The string matching algorithm proposed by Alfred V. Aho and Margaret J. Corasick, generally called the AC algorithm, is an effective method of exact string matching that is capable of locating all keywords in a string in a one-pass search. One important application of the AC algorithm is in network intrusion detection systems (NIDS), of which SNORT is a well-known example.
A search tree based on the AC algorithm is called an AC-trie. Please refer to FIG. 1, which illustrates an AC-trie constructed from the keyword set {enhappy, happy, happen, happygo}. In FIG. 1, circles with a number inside represent states, and double circles represent output states; i.e., when a state transition reaches a double-circled state (state 7, for example), a matching string has been found (“enhappy” and “happy”). Solid lines represent goto functions and dashed lines represent failure functions; the roles of the goto functions and failure functions are explained below. Each state except the initial state has a failure function that links to the initial state or to another state. To keep the figure easy to read, the failure functions that link to the initial state (that of state 1, for example) are not depicted in FIG. 1. When the failure function of a state links to a state other than the initial state, the string represented by the former state contains, as a suffix, the string represented by the latter state. For example, the failure function of state 7 links to state 12, and the string represented by state 7 (“enhappy”) contains the string represented by state 12 (“happy”).
When the AC-trie of FIG. 1 is used to perform string matching, the initial state 0 is first set as the current state, and the input string is processed one character at a time. For each character of the input string, the goto functions of the current state are evaluated to find a matching one, which determines the next state; if none of the goto functions matches, the state pointed to by the failure function is assigned as the current state and the matching continues from there. Via the failure functions, every state can eventually reach the initial state. As the goto functions of the initial state cover all characters, the next state can always be determined via the goto functions and the failure functions when processing each character. A matching cycle starts when an input character is received and ends when the next state has been determined via the goto functions and the failure functions. After a character has been matched and the next state determined, the next state is assigned as the current state, which receives the next character and enters the next matching cycle. From the foregoing description of the matching process, it can be seen that by forming an AC-trie and using the goto functions and failure functions to perform string matching, a one-pass search suffices to locate all occurrences of the matched keywords, and the search time is a linear function of the length n of the input string, i.e., O(n).
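The matching process described above can be sketched in software as follows. This is a minimal illustrative sketch in Python, not the implementation of any particular system; the names `build_ac_trie` and `search` are hypothetical. State numbers are assigned in insertion order, so they agree with the states cited in the text (7, 12, 14, 2) only because the keywords are inserted in the order listed; agreement with the remaining states of FIG. 1 is an assumption.

```python
from collections import deque

def build_ac_trie(keywords):
    """Build goto, failure and output tables for a keyword set."""
    goto = [{}]        # goto[state][char] -> next state; state 0 is the initial state
    output = [set()]   # keywords recognized on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)

    # Breadth-first traversal: a state's failure link is computed only
    # after the links of all shallower states are known.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to the initial state
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also outputs every keyword its failure target outputs.
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output

def search(text, goto, fail, output):
    """One-pass search; returns (start_index, keyword) pairs."""
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a goto function matches.
        while state and ch not in goto[state]:
            state = fail[state]
        # The initial state covers all characters: unmatched ones stay at 0.
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i - len(word) + 1, word))
    return matches

keywords = ["enhappy", "happy", "happen", "happygo"]
goto, fail, output = build_ac_trie(keywords)
result = sorted(search("enhappygo", goto, fail, output))
print(result)
```

With the input “enhappygo”, the search reports “enhappy” and “happy” on reaching the state for “enhappy”, then follows that state's failure link on the character g and goes on to report “happygo”. Note that the inner `while` loop may follow several failure links for a single input character.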
Please refer to FIG. 2(a), which illustrates a scenario of using the AC-trie of FIG. 1 to perform a string matching process. As illustrated in FIG. 2(a), when the string matching process reaches state 14 (or state 7), the current state transfers to state 2 (or state 12, respectively) according to the failure function, and the matching process continues from there. FIG. 2(b) illustrates the NFA (nondeterministic finite automaton) string matching process corresponding to FIG. 2(a). The NFA allows multiple states to be active simultaneously, and state 0 is always active; i.e., whenever state 0 detects an input character matching the key character e or h, the NFA starts a string matching process. In FIG. 2(b), a first string matching process for matching “happen” and a second string matching process for matching “enhappy” proceed simultaneously while “en” is being matched; the second string matching process and a third string matching process for matching “happygo” proceed simultaneously while “happy” is being matched.
As can be seen from the foregoing description, an AC-trie has a form similar to a DFA (deterministic finite automaton), because only one state is active at a time. In the NFA string matching, however, the states linked by failure functions are active simultaneously in each matching cycle. As all states link to the initial state through failure functions, the initial state is always active. An AC-trie can therefore use the failure functions to attain an effect similar to that of an NFA. An AC-trie in DFA form keeps only one state active, which is advantageous for software implementation because the instructions of a program are executed sequentially; however, because more than one state transition may be needed to process a single character, an AC-trie in DFA form is disadvantageous for hardware implementation.
If multiple states of an AC-trie are allowed to be active simultaneously, the AC-trie operates in the manner of an NFA without any failure function, which is advantageous for hardware implementation. In addition, as each state of an AC-trie represents a unique string, the distance between a state and the initial state is defined as the depth of the state. At any given time, at most one of the states of a same depth is active, because the states of a same depth represent different strings of the same length. Having more than one state of a same depth active simultaneously would contradict the definition of the AC-trie and therefore cannot happen. If states of a same depth are assigned to a same level, then, when the NFA is implemented in hardware, each level requires only one register to hold its state. For example, if the longest string in a set of keywords has a length of q, then at most q registers are needed, one for the state of each level.
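The NFA-style operation with one register per level can be modeled in software as follows. This is a hedged sketch only: the function names `build_trie` and `nfa_search` and the list-of-registers representation are assumptions made for illustration and do not describe any particular hardware structure. No failure function is used; every register is updated once per input character.

```python
def build_trie(keywords):
    """Build a plain keyword trie (goto functions only, no failure functions)."""
    goto = [{}]      # goto[state][char] -> next state; state 0 is the initial state
    terminal = {}    # output state -> keyword that ends at that state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        terminal[state] = word
    return goto, terminal

def nfa_search(text, keywords):
    """Simulate the NFA with one register per depth level."""
    goto, terminal = build_trie(keywords)
    q = max(len(word) for word in keywords)   # length of the longest keyword
    # registers[d] holds the single active state of depth d, or None.
    # Index 0 models the always-active initial state, so only q real
    # registers (depths 1..q) would be needed.
    registers = [None] * (q + 1)
    registers[0] = 0
    matches = []
    for i, ch in enumerate(text):
        nxt = [None] * (q + 1)
        nxt[0] = 0                            # the initial state stays active
        for depth in range(q):
            state = registers[depth]
            if state is not None:
                # A state of depth d can only activate a state of depth d + 1.
                nxt[depth + 1] = goto[state].get(ch)
        registers = nxt
        for state in registers[1:]:
            if state in terminal:
                word = terminal[state]
                matches.append((i - len(word) + 1, word))
    return matches

keywords = ["enhappy", "happy", "happen", "happygo"]
result = sorted(nfa_search("enhappygo", keywords))
print(result)
```

Because every level is updated from the level directly above it in the same step, each input character costs exactly one update per register, in contrast to the variable number of failure transitions of the DFA form.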
As mentioned above, the AC-trie is suited to processing data in a character-oriented manner. That is, when an AC-trie is used for string matching, only one character, rather than multiple characters, can be inspected at a time, and prior art hardware structures based on the AC-trie can process only one character per clock cycle, so that the maximum number of characters processed per unit time is limited by the clock frequency of the hardware.
In addition, as semiconductor technology continues to progress, it has become easier to design and develop hardware structures according to practical needs, and more circuits can be implemented in a same area. However, it remains difficult to increase the operating speed of circuits, and circuits operating at high speeds tend to consume more power. Taking a general-purpose CPU as an example, to address these problems, multiple cores can be implemented in a chip to improve performance through parallel operation. Likewise, if a hardware device for string matching can inspect multiple characters in a matching cycle, its performance will be greatly enhanced.