Pattern matching algorithms, which detect the occurrence of a pattern in an input string of characters, are widely used in information retrieval applications (e.g., data mining, bibliographic searching, search and replace text editing, and word processing) and in content inspection applications (e.g., network intrusion detection systems, virus/worm detection using signature matching, IP address lookup in network routers, and DNA sequence matching).
For many applications, it is necessary to search an input string for multiple patterns. A conventional multi-pattern matching algorithm is the Aho-Corasick (AC) algorithm. The AC algorithm locates all occurrences of a number of patterns in an input string by constructing a finite state machine that embodies the patterns. For example, this algorithm can be used to detect virus/worm signatures in a data packet stream by running the data packet stream through the finite state machine character by character (e.g., byte by byte).
The AC algorithm constructs the finite state machine in three pre-processing stages commonly referred to as the goto stage, the failure stage, and the next stage. In the goto stage, a deterministic finite state automaton (DFA) or search trie is constructed for a given set of patterns. The DFA constructed in the goto stage includes various states for an input string, and transitions between the states based on characters of the input string. Each transition between states in the DFA is based on a single character of the input string. The failure and next stages add transitions between the states of the DFA to ensure that a string of length n can be searched in exactly n cycles. More specifically, the failure and next transitions allow the state machine to transition from one branch of the trie to another branch that is the next best (i.e., the longest prefix) match in the DFA. Once the pre-processing stages have been performed, the DFA can then be used to search any target string for all of the patterns in the pattern set.
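The three pre-processing stages described above can be illustrated with a minimal sketch. It is not taken from any particular implementation; the function name build_ac_dfa, the list-of-dictionaries representation, and the explicit alphabet argument are assumptions made for illustration, and every pattern is assumed to use only characters from that alphabet.

```python
from collections import deque

def build_ac_dfa(patterns, alphabet):
    """Illustrative Aho-Corasick pre-processing (hypothetical helper).

    Returns (next_table, output, fail), where next_table[state][char]
    gives the next state for every state and every alphabet character.
    """
    # Goto stage: insert each pattern into a trie, one character per edge.
    goto = [{}]        # goto[state][char] -> child state
    output = [set()]   # patterns recognized on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)

    # Failure and next stages: breadth-first traversal of the trie.
    fail = [0] * len(goto)
    next_table = [dict() for _ in goto]
    queue = deque()
    for ch in alphabet:
        child = goto[0].get(ch, 0)   # missing edges at the root loop back to it
        next_table[0][ch] = child
        if child:
            queue.append(child)
    while queue:
        state = queue.popleft()
        # A state also reports every pattern reported by its failure state.
        output[state] |= output[fail[state]]
        for ch in alphabet:
            if ch in goto[state]:
                child = goto[state][ch]
                # Failure link: longest proper suffix that is also a trie prefix.
                fail[child] = next_table[fail[state]][ch]
                next_table[state][ch] = child
                queue.append(child)
            else:
                # Next transition: reuse the failure state's transition.
                next_table[state][ch] = next_table[fail[state]][ch]
    return next_table, output, fail

next_table, output, fail = build_ac_dfa(["he", "she", "his", "hers"], "ehirs")
print(sorted(output[5]))   # state reached by "she" -> ['he', 'she']
```

Because the next table in this sketch holds an entry for every state and every alphabet character, a search over it consumes exactly one character per step, which corresponds to the n-cycle property noted above.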
During the search stage, the AC DFA processes one character or byte at a time (e.g., in a serial fashion), and each state transition is stored in a memory. Accordingly, the AC DFA transitions to a different state based on each character of the input string. Thus, for each character in an input string, a memory lookup operation is performed to access the goto transitions from the current state of the AC DFA, which are then compared with the input character to determine the next state.
Content inspection systems deployed in a network need to detect the presence of multiple signatures in an input stream of packets at network line speeds. As network line speeds increase, conventional search engines employing the AC DFA technique are increasingly unable to perform searches at line speed because a memory lookup operation is typically performed for each character of the input string.
In an article entitled “Multi-Byte Regular Expression Matching with Speculation,” authored by Daniel Luchaup et al., the authors propose searching an input string for regular expressions using multiple DFA engines in parallel with speculation. For example, in a search circuit having two DFA engines, the input string is divided into two non-overlapping portions or “chunks,” and the first and second chunks are processed in parallel by the first and second DFA engines, respectively. While the start state of the first DFA engine is known, the start state of the second DFA engine (which should be the final state of the first DFA engine) is not initially known, and is speculatively set to the initial DFA state. More specifically, the first and second DFA engines process their respective input chunks in parallel during a parallel processing stage until the first DFA engine completes processing the first chunk. Then, during a validation stage, the first DFA engine starts processing the second chunk and, for each character of the second chunk, compares its active state with the state previously reached by the second DFA engine at that character until a match is found, at which point the first and second DFA engines are said to be in agreement and the remaining states speculated by the second DFA engine are validated.
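The two-engine scheme described above can be sketched in software as follows. This is a sequential simulation written only to illustrate the parallel processing and validation stages, not the authors' implementation; the names speculative_run and run_serial, the even split into two chunks, and the small "ab" transition table are all assumptions.

```python
# Hypothetical DFA for the pattern "ab" over the alphabet {a, b}.
NEXT_TABLE = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}

def run_serial(next_table, text, state=0):
    """Reference single-engine run: returns the final state."""
    for ch in text:
        state = next_table[state][ch]
    return state

def speculative_run(next_table, text, start_state=0):
    """Simulate two engines; return (final_state, validation_steps).

    validation_steps is the number of second-chunk characters the first
    engine had to re-process before agreeing with the second engine, or
    None if no agreement was reached.
    """
    mid = len(text) // 2
    chunk1, chunk2 = text[:mid], text[mid:]

    # Parallel processing stage (simulated one engine after the other).
    state1 = run_serial(next_table, chunk1, start_state)   # engine 1, known start
    speculated = []                                        # engine 2, guessed start
    state2 = start_state
    for ch in chunk2:
        state2 = next_table[state2][ch]
        speculated.append(state2)

    # Validation stage: engine 1 re-processes chunk 2 until the states agree.
    for i, ch in enumerate(chunk2):
        state1 = next_table[state1][ch]
        if state1 == speculated[i]:
            # Same state at the same position: the rest of engine 2's
            # speculated states are valid, so its final state can be used.
            return speculated[-1], i + 1
    return state1, None

print(speculative_run(NEXT_TABLE, "abab"))   # (2, 1): agreement after 1 character
print(speculative_run(NEXT_TABLE, "aabb"))   # (0, 2): agreement only at the last character
```

In the second call the first engine must re-process the entire second chunk before the states agree, so the total work is no less than a serial scan of the input string.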
Although effective in improving the average processing speed of AC DFA search systems, the system disclosed by Luchaup et al. cannot guarantee how many characters of the second input chunk need to be processed by the first DFA engine to achieve validation, and thus undesirably lacks predictability. In addition, because the number of iterations that the validation stage requires to resolve speculation cannot be predicted with certainty, the worst-case processing speed is not improved over search systems having a single DFA engine, even though the average processing speed is improved.
Thus, there is a need for an improved speculative DFA search system that can guarantee an improvement in the worst-case processing speed and provide more predictable behavior.
Like reference numerals refer to corresponding parts throughout the drawing figures.