Regular expression search operations are employed in various applications including, for example, intrusion detection systems (IDS), virus protections, policy-based routing functions, internet and text search operations, document comparisons, and so on. A regular expression can simply be a word, a phrase or a string of characters. For example, a regular expression including the string “gauss” would match data containing gauss, gaussian, degauss, etc. More complex regular expressions include metacharacters that provide certain rules for performing the match. Some common metacharacters are the wildcard “.”, the alternation symbol “|”, and the character class symbol “[ ].” Regular expressions can also include quantifiers such as “*” to match 0 or more times, “+” to match 1 or more times, “?” to match 0 or 1 times, {n} to match exactly n times, {n,} to match at least n times, and {n,m} to match at least n times but no more than m times. For example, the regular expression “a.{2}b” will match any input string that includes the character “a” followed by exactly 2 instances of any character followed by the character “b” including, for example, the input strings “abbb,” “adgb,” “a7yb,” “aaab,” and so on.
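By way of illustration only, the quantifier example above can be checked with a short script. The sketch below assumes Python's standard `re` module as the matcher; the two non-matching strings “ab” and “axyzb” are added here for contrast and do not appear in the example above.

```python
import re

# The example pattern from the text: "a", exactly two of any character, then "b".
pattern = re.compile(r"a.{2}b")

# The first four strings come from the text; the last two are illustrative
# counterexamples (too few and too many characters between "a" and "b").
samples = ["abbb", "adgb", "a7yb", "aaab", "ab", "axyzb"]

# search() reports a match anywhere in the input, mirroring "includes" in the text.
results = {s: bool(pattern.search(s)) for s in samples}

for s, hit in results.items():
    print(f"{s!r}: {'match' if hit else 'no match'}")
```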
Traditionally, regular expression searches have been performed using software programs executed by one or more processors, for example, associated with a network search engine. For example, one conventional search technique that can be used to search an input string of characters for multiple patterns is the Aho-Corasick (AC) algorithm. The AC algorithm locates all occurrences of a number of patterns in the input string by constructing a finite state machine that embodies the patterns. More specifically, the AC algorithm constructs the finite state machine in three pre-processing stages commonly referred to as the goto stage, the failure stage, and the next stage. In the goto stage, a deterministic finite state automaton (DFA) or search trie is constructed for a given set of patterns. The DFA constructed in the goto stage includes various states for an input string, and transitions between the states based on characters of the input string. Each transition between states in the DFA is based on a single character of the input string. The failure and next stages add additional transitions between the states of the DFA to ensure that a string of length n can be searched in exactly n cycles. More specifically, the failure and next transitions allow the state machine to transition from one branch of the tree to another branch that is the next best (i.e., the longest prefix) match in the DFA. Once the pre-processing stages have been performed, the DFA can then be used to search any target for all of the patterns in the pattern set.
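The goto and failure stages described above can be sketched in software as follows. This is a minimal, illustrative sketch only: the function names `build_automaton` and `search` are hypothetical, and the next stage is omitted, so failure links are followed at search time rather than being precomputed into direct transitions.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of patterns."""
    goto = [{}]   # goto[state][char] -> next state (the search trie)
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognized on reaching each state

    # Goto stage: insert each pattern into the trie, one character per edge.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Failure stage: breadth-first, link each state to the state for the
    # longest proper suffix of its path that is also a prefix of a pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, automaton):
    """Return (start offset, pattern) for every occurrence of every pattern."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        # On a mismatch, fall back along failure links to the next-best prefix.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
hits = search("ushers", automaton)
print(hits)
```

The automaton is built once per pattern set and can then be reused to search any number of input strings.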
One problem with prior string search engines utilizing the AC algorithm is that they are not well suited for performing wildcard or inexact pattern matching. As a result, some search engines complement the AC search technique with a non-deterministic finite automaton (NFA) engine that is better suited to search input strings for inexact patterns, particularly those that include quantifiers such as “*” to match 0 or more times, “+” to match 1 or more times, “?” to match 0 or 1 times, {n} to match exactly n times, {n,} to match at least n times, and {n,m} to match at least n times but no more than m times.
For example, commonly-owned U.S. Pat. No. 7,539,032 discloses a content search system that implements search operations for regular expressions that specify one or more exact patterns and one or more inexact patterns. The system delegates exact pattern search operations to a DFA engine that is dedicated to performing exact pattern search operations, and delegates inexact pattern search operations to an NFA engine that is dedicated to performing inexact pattern search operations. The match results of the exact pattern search operations and the match results of the inexact pattern search operations are then combined to generate a result code that indicates whether an input string matches one or more regular expressions specifying the exact and inexact patterns.
More specifically, FIG. 1 shows a content search system 100 of the type disclosed in commonly-owned U.S. Pat. No. 7,539,032. Search system 100 includes a system interface 110, a DFA engine 120, a data management unit 130, an NFA engine 140, and a result memory 150. The system interface 110 facilitates communication between search system 100 and an external network (e.g., the Internet), and is coupled to data management unit 130. More specifically, data management unit 130 receives input strings from the network via the system interface 110, selectively forwards portions of input strings to DFA engine 120 and/or NFA engine 140 for search operations, and coordinates communication between DFA engine 120, NFA engine 140, and result memory 150.
As disclosed in U.S. Pat. No. 7,539,032, the DFA engine 120 is configured to perform exact string match search operations to determine whether an input string contains exact patterns specified by one or more regular expressions, and the NFA engine 140 is configured to perform an inexact string match search operation to determine whether the input string contains one or more inexact patterns specified by one or more regular expressions. The DFA engine 120 is implemented according to the AC algorithm, and the NFA engine 140 is implemented using various circuits (e.g., microprocessors, microcontrollers, programmable logic such as FPGAs and PLDs) that can execute microprograms that embody the inexact patterns to be searched for. The result memory 150 includes a plurality of storage locations each for storing a result code that contains one or more match ID (MID) values, one or more trigger bits, and one or more microprogram indices. Each MID value identifies a corresponding exact pattern stored in the DFA engine that is matched by the input string, each trigger bit indicates whether the exact pattern identified by a corresponding MID value is part of a regular expression that requires inexact pattern search operations to be performed by the NFA engine, and each microprogram index can be used by the NFA engine to retrieve a microprogram that contains commands for implementing the inexact pattern search operation.
As mentioned above, content search system 100 implements regular expression search operations by delegating exact pattern matching functions to DFA engine 120 and delegating inexact pattern matching functions to NFA engine 140. For example, to determine whether an input string matches the regular expression R10=“acid[a-n]{10,20}rain” using the search system 100 shown in FIG. 1, the exact patterns “acid” and “rain” are loaded into DFA engine 120, and a microprogram embodying the inexact pattern “[a-n]{10,20}” is loaded into the NFA engine 140. Further, the result memory 150 is loaded with MID values for “acid” and “rain” and with a microprogram index identifying the microprogram that embodies the inexact pattern “[a-n]{10,20}.” Then, during search operations, data management unit 130 forwards the input string to DFA engine 120, which in turn compares the input string to the prefix and suffix patterns “acid” and “rain.” If a portion of the input string matches the prefix pattern “acid,” the DFA engine 120 generates a first match index corresponding to the prefix pattern “acid,” and in response thereto, the result memory 150 generates a result code that activates the NFA engine 140 and tells the NFA engine 140 the location of the microprogram embodying the inexact pattern “[a-n]{10,20}.” Once activated, the NFA engine 140 begins searching the input string for the inexact pattern “[a-n]{10,20}” and also combines the exact match results from the DFA engine 120 for the suffix pattern “rain” to determine whether the input string contains the regular expression R10=“acid[a-n]{10,20}rain.”
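The division of labor described above for R10 can be modeled in software as follows. This is an illustrative sketch only and is not the hardware of FIG. 1: plain substring search stands in for the exact matching of DFA engine 120, a compiled character-class pattern stands in for the microprogram of NFA engine 140, and the names `find_all` and `matches_r10` are hypothetical.

```python
import re

PREFIX, SUFFIX = "acid", "rain"         # exact patterns (the DFA engine's role)
INEXACT = re.compile(r"[a-n]{10,20}")   # inexact pattern (the NFA engine's role)

def find_all(text, literal):
    """Return every start offset of literal in text (stand-in for exact matching)."""
    offsets, start = [], text.find(literal)
    while start != -1:
        offsets.append(start)
        start = text.find(literal, start + 1)
    return offsets

def matches_r10(text):
    """Combine exact and inexact results for R10 = acid[a-n]{10,20}rain."""
    suffix_starts = set(find_all(text, SUFFIX))
    for p in find_all(text, PREFIX):
        gap_start = p + len(PREFIX)
        # A prefix hit triggers the inexact check on the text that follows it.
        for gap_len in range(10, 21):
            gap_end = gap_start + gap_len
            # The gap must end exactly where a suffix hit begins, and every
            # character in the gap must belong to the class [a-n].
            if gap_end in suffix_starts and INEXACT.fullmatch(text, gap_start, gap_end):
                return True
    return False

print(matches_r10("acid" + "abcdefghijkl" + "rain"))  # gap of 12 characters in [a-n]
print(matches_r10("acid" + "abc" + "rain"))           # gap too short
```

The inexact check runs only after a prefix hit, which mirrors the triggering of the NFA engine by a match in the DFA engine.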
Although the delegation of different portions (e.g., sub-expressions) of a regular expression to DFA and NFA search engines improves performance over conventional single-engine approaches, the limited resources of the NFA engine 140 can be quickly exhausted when searching for some types of regular expressions. For example, to implement search operations for the regular expression R11=“ab.*cd” in search system 100, the prefix string “ab” is delegated to DFA engine 120 and the inexact pattern “.*cd” (which includes the unbounded sub-expression “.*”) is delegated to the NFA engine 140. If the DFA engine 120 detects a match with “ab”, the NFA engine 140 is triggered, and it activates an NFA state that starts searching the input string for the suffix pattern “cd”. However, because the “cd” can appear after zero or more instances of any character (thus making the “.*” an unbounded sub-expression), the corresponding NFA state remains active indefinitely (even after the suffix “cd” is found), which in turn indefinitely consumes a corresponding portion of the NFA engine's limited resources. When configured as such, for each rule or regular expression that includes the unbounded sub-expression “.*”, a corresponding portion of the NFA engine's resources is indefinitely consumed, thereby not only undesirably degrading performance but also limiting the number of regular expressions or rules that the NFA engine can concurrently search for.
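The resource behavior described above can be illustrated with a simplified software model. The sketch below is an assumption-laden illustration rather than a description of NFA engine 140: each rule of the form prefix.*suffix is given one hypothetical “active state” entry, and the function name `scan` and the sample input are invented for the example.

```python
def scan(text, rules):
    """rules maps a rule name to (prefix, suffix) for an expression prefix.*suffix."""
    active = {}           # rule name -> offset at which its prefix first ended
    matches = []          # (end offset, rule name) for each reported match
    active_counts = []    # number of active ".*" states after each character
    for i in range(len(text)):
        seen = text[: i + 1]
        # Active states test for their suffix. A hit is reported, but the state
        # is not retired, because ".*" can still absorb further input.
        for name in sorted(active):
            suffix = rules[name][1]
            starts_after_prefix = i - len(suffix) + 1 > active[name]
            if starts_after_prefix and seen.endswith(suffix):
                matches.append((i, name))
        # A prefix hit (the exact-match engine's role here) activates the state.
        for name, (prefix, _) in rules.items():
            if seen.endswith(prefix):
                active.setdefault(name, i)
        active_counts.append(len(active))
    return matches, active_counts

matches, active_counts = scan("xxabyycdzzcd", {"R11": ("ab", "cd")})
print(matches)
print(active_counts)
```

In this model the count of active states never decreases once a prefix has been seen, which corresponds to the indefinite consumption of NFA resources noted above.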
In addition, using the result memory to trigger the NFA engine and to provide a microprogram index in response to a match in the DFA engine is inefficient. Further, operating DFA and NFA engines of search system 100 in a parallel manner can allow partial match results from the engines to be generated out-of-order, in which case additional processing is performed (e.g., by the NFA engine) to re-order the partial match results so that match results between the input string and the regular expressions can be generated.
Thus, there is a need for a search system that can capitalize on the advantages of DFA and NFA search techniques with greater efficiency and without unnecessarily consuming the limited resources of either the DFA or NFA engine.
Like reference numerals refer to corresponding parts throughout the drawing figures.