This disclosure relates generally to the field of finite state automata (FSAs), and more particularly to identifying and handling subexpression overlaps in FSA transformations that are associated with regular expression decompositions.
Packet content scanning is an essential part of network security and monitoring applications. Intrusion detection systems such as Snort (http://www.snort.org) rely heavily on regular expressions to express increasingly complex attack patterns. A typical way of matching regular expressions in a stream of input characters is by simulating the input on a Finite State Automaton (FSA), which may be a nondeterministic FSA (NFA) or a deterministic FSA (DFA), compiled from the regular expression. For example, FIG. 1 shows an example of an FSA 100 comprising a DFA that detects the regular expression “abc.*def.*ghi” in an input data stream. The regular expression “abc.*def.*ghi” is in Perl Compatible Regular Expressions (PCRE) format. The FSA 100 is modeled as a directed graph. The FSA states are shown in circles, the state transitions are shown using directed edges, and the sets of input characters resulting in the transitions (i.e., the transition rules) are given in the rectangular boxes. The initial state of the FSA is labeled as state 0, with intermediate states numbered 1 to 8, leading up to a match of the regular expression at state number 9. The plurality of transition rules governs transitions between the states. Note that if the regular expression is non-anchored, additional transitions that point to state 0 and state 1 would be needed in FIG. 1. Similarly, if the regular expression is anchored, the FSA must include an explicit invalid state, together with additional transitions pointing to the invalid state for state/input combinations without a valid next state.
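The table-driven simulation described above can be illustrated with a minimal Python sketch. It is not the construction used for FIG. 1: the function names `build_chain_dfa` and `matches` are hypothetical, the input alphabet is assumed to be lowercase ASCII, and the pattern is assumed to be a non-anchored chain of literals separated by “.*” (here “abc”, “def”, and “ghi”). The resulting table happens to have ten states numbered 0 to 9, but its transition rules are not claimed to match those drawn in FIG. 1.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: inputs are lowercase ASCII only


def build_chain_dfa(literals, alphabet=ALPHABET):
    """Build a DFA table for the non-anchored pattern lit1.*lit2.*...*litN.

    Returns (table, accept) where table[state][ch] is the next state.
    Each literal contributes a KMP-style block of states offset by `base`.
    """
    table = []
    base = 0
    for lit in literals:
        m = len(lit)
        # By default every character falls back to the first state of the block.
        rows = [dict.fromkeys(alphabet, base) for _ in range(m)]
        rows[0][lit[0]] = base + 1
        x = 0  # local restart state used for mismatch transitions
        for j in range(1, m):
            for ch in alphabet:
                rows[j][ch] = rows[x][ch]      # mismatch: behave like restart state
            rows[j][lit[j]] = base + j + 1     # match: advance by one state
            x = rows[x][lit[j]] - base
        table.extend(rows)
        base += m
    # Final state is absorbing: once the whole pattern is seen, stay matched.
    table.append(dict.fromkeys(alphabet, base))
    return table, base


def matches(table, accept, data):
    """Simulate the DFA on `data`, one table lookup per input character."""
    state = 0
    for ch in data:
        state = table[state][ch]
        if state == accept:
            return True
    return False


table, accept = build_chain_dfa(["abc", "def", "ghi"])
```

With this table, `matches(table, accept, "xxabcyydefzzghi")` reports a match, while an input such as "abcghidef", in which the literals appear in the wrong order, does not.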
FIG. 2 shows an example of an FSA 200 comprising a DFA that is a transformation of the FSA 100 that was shown in FIG. 1. FSA 200 also detects the regular expression “abc.*def.*ghi”. If the original regular expression is non-anchored, it may be decomposed, or split, into independent subexpressions “abc”, “def”, and “ghi”, allowing the transformed DFA 200 to match each of the subexpressions independently. Starting at state 0 (zero), the leftmost column of states and transition rules detects the presence of “abc” in the input stream by proceeding through states 1 and 4 to state 7. At state 7, a first register is set, indicating that “abc” was matched. Then, proceeding through states 2 and 5 to state 8 detects a match of “def”. In state 8, the first register is tested, and if the first register is set, then a second register is set, indicating the presence of “abc.*def”. Lastly, proceeding through states 3 and 6 to state 9 detects a match of “ghi”, and, in state 9, the second register is tested. If the second register is set, then a match of the whole regular expression “abc.*def.*ghi” is indicated. The transformed FSA 200 may be implemented as three parallel DFAs with a post-processor. The first and second registers may be 1-bit registers located in the post-processor. The transformed FSA 200 of FIG. 2 includes significantly fewer state transitions than the initial FSA 100 of FIG. 1 while implementing the same functionality, which reduces the amount of memory needed to store the FSA 200.
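The register-based post-processing described above can be sketched in Python as follows. This is an illustrative approximation only, not the implementation of FIG. 2: the names `kmp_table` and `decomposed_match` are hypothetical, each subexpression is scanned with a standard Knuth-Morris-Pratt matcher rather than the DFA of the figure, and the Boolean list `regs` stands in for the 1-bit registers of the post-processor. The sketch assumes at least two subexpressions and assumes that the subexpressions do not overlap one another in the input.

```python
def kmp_table(lit):
    """Standard KMP failure table for one literal subexpression."""
    fail = [0] * len(lit)
    k = 0
    for i in range(1, len(lit)):
        while k and lit[i] != lit[k]:
            k = fail[k - 1]
        if lit[i] == lit[k]:
            k += 1
        fail[i] = k
    return fail


def decomposed_match(data, subexprs=("abc", "def", "ghi")):
    """Match sub1.*sub2.*...*subN using independent scanners plus registers.

    Assumes len(subexprs) >= 2 and non-overlapping subexpressions.
    """
    fails = [kmp_table(s) for s in subexprs]
    pos = [0] * len(subexprs)             # per-scanner progress
    regs = [False] * (len(subexprs) - 1)  # stand-ins for the 1-bit registers
    last = len(subexprs) - 1
    for ch in data:
        for i, lit in enumerate(subexprs):
            k = pos[i]
            while k and ch != lit[k]:
                k = fails[i][k - 1]
            if ch == lit[k]:
                k += 1
            if k == len(lit):
                # Subexpression i matched: run its post-processor action.
                if i == 0:
                    regs[0] = True             # set the first register
                elif i < last:
                    if regs[i - 1]:            # test previous register
                        regs[i] = True         # then set the next one
                elif regs[i - 1]:
                    return True                # whole expression matched
                k = fails[i][k - 1]            # keep scanning for later matches
            pos[i] = k
    return False
```

For example, `decomposed_match("xxabcyydefzzghi")` reports a match, whereas `decomposed_match("defabcghi")` does not, because “def” completes before the first register has been set and the second register therefore remains clear when “ghi” is seen.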