The present invention relates generally to processing regular expressions, and more specifically, to distributing large sets of regular expressions to a fixed number of state machine engines, including extended deterministic finite automata engines, in an efficient manner.
A growing number of applications, including virus scanners, intrusion detection, and text analytics, use regular expressions to define patterns that the system must find in input character streams. The regular expressions generally are modeled as one or more finite state machines, including nondeterministic finite automata (NFAs), which can be relatively easy to construct and often provide relative flexibility, and deterministic finite automata (DFAs), which provide predictability and can support high performance processing.
Unlike DFAs, in NFAs an input to a state may result in a transition to more than one state. In general, NFAs may be converted into DFAs using a powerset construction, or subset construction, algorithm. DFAs may be more efficiently executed than equivalent NFAs, since each state transition in a DFA requires an input and results in a single state. However, DFAs also potentially may have an exponentially higher number of states, as compared to equivalent NFAs.
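By way of illustration only, the powerset construction and the resulting state growth can be sketched as follows. This is a minimal sketch, not the method of any particular system. It assumes an NFA without epsilon transitions, stored as a dictionary from (state, symbol) pairs to sets of successor states. The names subset_construction and nth_from_end_nfa are hypothetical, as is the example language (strings over {a, b} whose n-th symbol from the end is "a").

```python
from collections import deque


def subset_construction(nfa, start, alphabet):
    """Convert an epsilon-free NFA into a DFA whose states are frozensets of NFA states.

    nfa maps (state, symbol) -> set of successor states (assumed representation).
    Returns the set of reachable DFA states and the DFA transition table.
    """
    start_set = frozenset([start])
    dfa = {}                 # (frozenset, symbol) -> frozenset
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of everything reachable from any NFA state in the current set.
            target = frozenset(
                t for s in current for t in nfa.get((s, symbol), ())
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, dfa


def nth_from_end_nfa(n):
    """Hypothetical example NFA with states 0..n over the alphabet {a, b}.

    It accepts strings whose n-th symbol from the end is 'a'. State 0 loops on
    both symbols and nondeterministically guesses which 'a' is the marked one.
    """
    nfa = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        nfa[(i, "a")] = {i + 1}
        nfa[(i, "b")] = {i + 1}
    return nfa


if __name__ == "__main__":
    for n in range(1, 6):
        dfa_states, _ = subset_construction(nth_from_end_nfa(n), 0, "ab")
        print(n + 1, "NFA states ->", len(dfa_states), "DFA states")
```

For this example language the NFA has n + 1 states, whereas the constructed DFA has 2^n reachable states, because the DFA must track which of the last n input symbols were "a".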
As a result, systems that deal with large sets of regular expressions typically may face state explosion, a scenario in which translation of a regular expression results in a DFA with an exponentially larger number of states than the corresponding NFA. State growth resulting from one or a few regular expressions often may be manageable, but as the size of a regular expression increases, and as more regular expressions are combined, the state explosion problem becomes increasingly relevant and can become unmanageable for most systems. This problem is acute for large sets of regular expressions, for example, numbering in the thousands. Thus, execution of the DFAs often presents a performance bottleneck.
Typically, at runtime each DFA may be executed by one of a relatively small number of execution engines. The execution engines operate in parallel, each taking a character from an input stream and performing a state transition that depends on the current state and the input character. Extended DFA engines often offer limited local memory for variables and arithmetic operations to augment the pure state-transition semantics. For example, variables may be used to implement a more efficient counter than a pure state machine is able to implement. Such extensions may be helpful, for example, in implementing regular expressions that include bounded repetitions.
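By way of illustration only, the benefit of a counter variable for bounded repetitions can be sketched as follows. This is a minimal sketch under stated assumptions and does not represent any particular engine. The function name match_bounded, its parameters, and the example pattern a{3,5}b are hypothetical. The local variable count stands in for the limited local memory of an extended DFA engine.

```python
def match_bounded(text, symbol="a", terminator="b", low=3, high=5):
    """Full-match text against symbol{low,high}terminator.

    Uses two control states plus one counter (hypothetical illustration),
    instead of one control state per possible repetition count.
    """
    count = 0            # counter held in the engine's assumed local memory
    state = "COUNTING"
    for ch in text:
        if state == "COUNTING" and ch == symbol and count < high:
            count += 1   # arithmetic operation replaces a state transition
        elif state == "COUNTING" and ch == terminator and count >= low:
            state = "ACCEPT"
        else:
            return False  # no valid transition: equivalent to a dead state
    return state == "ACCEPT"


if __name__ == "__main__":
    for sample in ["aab", "aaab", "aaaaab", "aaaaaab"]:
        print(sample, match_bounded(sample))
```

A pure DFA for the same pattern would need a separate state for each count value from zero up to the upper bound, so its size grows with the bound, whereas the counter-based version keeps the same two control states regardless of the bound.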
Hardware-based, relatively high-performance systems are available to compile regular expressions into equivalent DFAs. Solutions for distributing regular expressions among DFA engines are available in which the fundamental problem of state explosion is simply ignored, for applications in which the set of regular expressions is sufficiently small. Solutions are available that implement NFAs instead of DFAs, or implement hybrid systems that combine the use of NFAs and DFAs. However, the use of NFAs cannot always provide the required level of deterministic high-speed performance, making the scalability and performance of such systems uncertain.
Other solutions are available that implement only a single DFA engine, but in this case all of the regular expressions must be processed by the same engine. Heuristics-based approaches are available to determine utilization of extended resources, but the heuristics do not typically take into account multiple recognition engines. In addition, “brute-force,” or extensive simulation, approaches are available, but these generally do not scale to large numbers of regular expressions.