1. Technical Field
The present invention relates to the processing of regular expressions, and, in particular, to improved state-diagram formations and state-transition processing.
2. Description of Related Art
Packet-data communication, such as communication over the Internet, is extremely popular, and is becoming more so every day. People and companies routinely use Internet-connected computers and networks to conduct their affairs. Myriad types of data are transmitted over the Internet, such as personal correspondence, medical information, financial information, business plans, etc. Unfortunately, not all uses of the Internet are benign; on the contrary, a significant percentage of the data that is transmitted over the Internet every day is malicious. Examples of this type of data are viruses, spyware, malware, worms, etc.
Not unexpectedly, an entire industry has developed to combat these vicious attempts to disrupt and harm Internet-based communications, along with the networks and computers used by those who engage in Internet-based communications. This industry, and the effort to fight these malicious threats generally, is often referred to as “intrusion prevention.” One important aspect of intrusion prevention involves identifying known threats (files that are or contain viruses, worms, spyware, malware, etc.) by particular data patterns contained therein. These data patterns are sometimes referred to as “signatures” of the security threats.
As such, data (e.g. IP) packets flowing through, towards, or from a particular router, switch, network, etc. are often screened for the presence of these signature data patterns, perhaps by an intermediate device, functional component, or other entity sometimes referred to as a “bump in the wire.” When particular packets, or sequences of packets, are identified as containing at least one of these signatures, those packets (or, again, sequences of packets) may be “quarantined,” much as people or animals identified as, or suspected of, carrying a particular disease would be, so that those packets cannot cause harm to any more networks and/or computers. These packets, removed from the normal flow of data traffic, can then be further analyzed without holding up that traffic generally.
It can thus be appreciated that it would be advantageous for a network device to be able to quickly and accurately identify these signature data patterns across one or more packets, and to do so in a way that uses relatively little in the way of computing resources such as processing time and memory.
These signature data patterns are often expressed using what is known as a “regular expression,” which is an instance of a system of notation that can, fairly elegantly, represent complicated data patterns that, by their presence, may indicate a potential security threat. As a small example, a regular expression such as “.+[a]{5}[b]” may be used to represent a data pattern that could be stated as “one or more of (+) any type of character (.), followed by five consecutive ‘a’ characters ([a]{5}), followed by a ‘b’ character ([b]).” Thus, the data being analyzed, often referred to as the “subject,” would have to contain that data pattern at least once to be considered to match that regular expression, which often is also referred to as a “regex.”
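As a purely illustrative sketch, the example regex above can be tried against a few subjects using Python's standard `re` module. The helper name `matches` and the sample subjects are assumptions made for this illustration and are not drawn from any particular intrusion-prevention product. Note that in Python's default mode the `.` metacharacter does not match a newline.

```python
import re

# The example pattern from the text: one or more of any character,
# followed by five consecutive 'a' characters, followed by a 'b'.
PATTERN = re.compile(r".+[a]{5}[b]")


def matches(subject: str) -> bool:
    """Return True if the pattern occurs at least once in the subject."""
    return PATTERN.search(subject) is not None


print(matches("xaaaaab"))  # True: 'x', then five 'a' characters, then 'b'
print(matches("aaaaab"))   # False: no character is left over for '.+'
print(matches("xaaaab"))   # False: only four 'a' characters
```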
The screening for these particular data patterns is often implemented using a state machine, which is typically generated from a particular regex. Basically, the characters (e.g. “@”) and character classes (e.g. “[0-9],” i.e. “any digit”) in the regex define the transitions between states in the machine. A particular data subject, perhaps the payload of one or more packets, would then be evaluated using the state machine; this evaluation essentially involves starting in an initial state, and using the characters in the subject to try to transition through the state machine. If the right sequence of characters is present in the subject to cause the processing of the state machine to arrive at what is known as a “match state,” then the subject is considered to match the regex that was used to generate the state machine, and the packet or packets that contained that subject may be quarantined for further analysis.
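The evaluation described above can be sketched with a small hand-built state machine. The sketch assumes a hypothetical regex, “@[0-9]” (an “@” character followed by any digit), chosen only because it combines the character and character-class examples mentioned above; the state names and function names are likewise assumptions made for illustration.

```python
# Hypothetical three-state machine for the assumed regex "@[0-9]".
START, SEEN_AT, MATCH = 0, 1, 2
DIGITS = set("0123456789")


def next_state(state: int, ch: str) -> int:
    """Return the next state given the current state and one subject character."""
    if state == START:
        return SEEN_AT if ch == "@" else START
    if state == SEEN_AT:
        if ch in DIGITS:
            return MATCH
        # A second '@' keeps the machine waiting for a digit; anything else resets.
        return SEEN_AT if ch == "@" else START
    return MATCH  # the match state is never left once reached


def subject_matches(subject: str) -> bool:
    """Drive the machine with each character; stop early on reaching the match state."""
    state = START
    for ch in subject:
        state = next_state(state, ch)
        if state == MATCH:
            return True
    return False
```

Under these assumptions, a subject such as “user@7” reaches the match state, while “@x9” does not, because the “x” returns the machine to its initial state.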
State machines are often also referred to as “automata,” and are generally of one of two types: a deterministic finite automaton (DFA) (a deterministic state machine, a.k.a. a DFA state machine) or a nondeterministic finite automaton (NFA) (a nondeterministic state machine, a.k.a. an NFA state machine). In general, a DFA will have no ambiguity; that is, from a given state, a given character in the subject will result in either zero or one valid transitions to a next state. In contrast, an NFA will have ambiguity in that, from a given state, a given character can, and often does, match more than one transition to a next state. Thus, in an NFA, multiple paths through the state machine may be valid for the same subject.
In general, then, DFAs are typically faster and more straightforward to execute, but involve a higher number of states, while NFAs typically can be implemented with many fewer states, and thus use far less memory, but are more complex to execute, since multiple valid paths through the state machine must be assessed. DFAs and NFAs each have other advantages and disadvantages relative to the other, in addition to those mentioned here.
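The distinction can be made concrete with a small data-structure sketch. Here a transition table maps a (state, character) pair to the set of possible next states; the table layout, the name `is_deterministic`, and the two sample tables are assumptions made for illustration, and transitions that consume no character are ignored for simplicity.

```python
from typing import Dict, Set, Tuple

# (current state, character) -> set of possible next states
Transitions = Dict[Tuple[int, str], Set[int]]


def is_deterministic(transitions: Transitions) -> bool:
    """True if every (state, character) pair leads to at most one next state."""
    return all(len(targets) <= 1 for targets in transitions.values())


# In the first table each pair has exactly one target; in the second,
# state 0 has two valid transitions on the character 'a'.
dfa_like: Transitions = {(0, "a"): {1}, (1, "b"): {2}}
nfa_like: Transitions = {(0, "a"): {0, 1}, (1, "b"): {2}}
```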
Further with respect to regex processing, the popularity of the Perl scripting language has had a significant effect on the art of processing regular expressions. The Perl syntax incorporates regex processing with a powerful and popular regex feature set. The Perl regex syntax is the most widely used ‘flavor’ today, and regex engines are often referred to as being ‘Perl compatible.’ There is a FOSS (Free and Open Source Software) project called PCRE (Perl Compatible Regular Expressions), which is a library in the C programming language for compiling and processing regular expressions using the Perl syntax and feature set. This library is used in existing Intrusion Prevention System (IPS) products.
PCRE works at least in part by converting a regex to an NFA state machine. Again, an NFA state machine is referred to as being non-deterministic because, in some states, there may be more than one valid transition out to respective next states. Using an NFA gives a lot of flexibility in terms of language syntax, but processing an NFA can require many match attempts on branches of the state machine that ultimately fail. The process of backing up to a prior state after a failed match attempt is called ‘backtracking.’ The conventions used in PCRE call for attempting to match as much text as possible, and backtracking if the match fails. For this reason, backtracking is common and performance suffers. Note that the type of search done using the PCRE engine is called a ‘depth-first’ search, because it tries the deepest paths through the NFA state machine first, and then backtracks to try other branches.
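The greedy, depth-first behavior described above can be sketched with a deliberately tiny backtracking matcher. This is not PCRE's implementation; it is a simplified illustration that assumes a pattern language of only literal characters, “.”, and the “+” and “*” quantifiers, and the names `parse`, `BacktrackingMatcher`, and `failed_branches` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (atom, quantifier); the atom "." matches any character


def parse(pattern: str) -> List[Token]:
    """Split a pattern into (atom, quantifier) pairs; quantifier is '', '+' or '*'."""
    tokens: List[Token] = []
    i = 0
    while i < len(pattern):
        atom, quant = pattern[i], ""
        if i + 1 < len(pattern) and pattern[i + 1] in "+*":
            quant = pattern[i + 1]
            i += 1
        tokens.append((atom, quant))
        i += 1
    return tokens


class BacktrackingMatcher:
    def __init__(self, pattern: str) -> None:
        self.tokens = parse(pattern)
        self.failed_branches = 0  # branches tried and abandoned in the last search

    @staticmethod
    def _atom_matches(atom: str, ch: str) -> bool:
        return atom == "." or atom == ch

    def _match_here(self, t: int, subject: str, pos: int) -> bool:
        if t == len(self.tokens):
            return True  # every token has been satisfied
        atom, quant = self.tokens[t]
        if quant == "":
            return (pos < len(subject)
                    and self._atom_matches(atom, subject[pos])
                    and self._match_here(t + 1, subject, pos + 1))
        # Greedy: first consume as many characters as possible.
        end = pos
        while end < len(subject) and self._atom_matches(atom, subject[end]):
            end += 1
        minimum = pos + 1 if quant == "+" else pos
        while end >= minimum:
            if self._match_here(t + 1, subject, end):
                return True
            self.failed_branches += 1  # this branch failed; retry with one fewer character
            end -= 1
        return False

    def search(self, subject: str) -> bool:
        """Try each start position in turn; True if the pattern matches at any of them."""
        self.failed_branches = 0
        return any(self._match_here(0, subject, start)
                   for start in range(len(subject) + 1))
```

In this sketch, searching the subject “xxab” with the pattern “.+ab” succeeds only after the greedy “.+” has consumed the whole subject and then given characters back twice, which is the kind of repeated work the passage above attributes to backtracking.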
As also noted above, there is a different type of state machine, called a DFA state machine (or just DFA), which does not require backtracking. It is possible to convert an NFA to a DFA with some restrictions on syntax features (e.g., no backreferences). A DFA is more like a traditional state machine that is driven from state to state based on events (i.e. characters in the subject being searched). As noted above, DFAs are generally faster than NFAs, but usually have more states, and thus require more memory.
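One well-known way to perform such a conversion is the subset (powerset) construction, in which each DFA state stands for a set of NFA states. The sketch below is a minimal illustration under stated assumptions: the NFA has no transitions that consume no character, the alphabet is supplied explicitly, and the function names and sample NFA are hypothetical rather than taken from any existing library.

```python
from collections import deque
from typing import Dict, FrozenSet, Set, Tuple

NfaTransitions = Dict[Tuple[int, str], Set[int]]
DfaState = FrozenSet[int]
DfaTransitions = Dict[Tuple[DfaState, str], DfaState]


def subset_construction(nfa: NfaTransitions, start: int, accepting: Set[int],
                        alphabet: str) -> Tuple[DfaTransitions, DfaState, Set[DfaState]]:
    """Build a DFA whose states are the reachable sets of NFA states."""
    start_set: DfaState = frozenset({start})
    dfa: DfaTransitions = {}
    dfa_accepting: Set[DfaState] = set()
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current & accepting:
            dfa_accepting.add(current)
        for ch in alphabet:
            # Union of every NFA state reachable from 'current' on this character.
            target = frozenset(t for s in current for t in nfa.get((s, ch), ()))
            if not target:
                continue  # no valid transition on this character
            dfa[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, start_set, dfa_accepting


def run_dfa(dfa: DfaTransitions, start: DfaState, accepting: Set[DfaState],
            subject: str) -> bool:
    """Follow exactly one transition per character; no backtracking is needed."""
    state = start
    for ch in subject:
        if (state, ch) not in dfa:
            return False
        state = dfa[(state, ch)]
    return state in accepting


# Sample NFA over the alphabet {a, b} accepting subjects that end in "ab".
sample_nfa: NfaTransitions = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa, dfa_start, dfa_accepting = subset_construction(sample_nfa, 0, {2}, "ab")
```

For this small sample the resulting DFA has no more states than the NFA, but in the worst case the number of reachable state sets grows exponentially with the number of NFA states, which is consistent with the memory trade-off noted above.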
Recent versions of PCRE include an Application Program Interface (API) purporting to implement a DFA search, though PCRE does not generate a DFA in this case. Instead, these versions walk an NFA in a breadth-first manner. This process is, in essence, like generating a DFA on the fly, every time the search is performed. The results are the same as those of a true DFA search, but the search is very slow.
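The general idea of walking an NFA breadth-first can be sketched as follows. This is not PCRE code; it is a minimal illustration in which the set of all currently active NFA states is advanced one subject character at a time, which amounts to computing a DFA state on the fly at every step. The function name, the sample NFA, and the assumption that subjects contain only the characters “a” and “b” are all hypothetical.

```python
from typing import Dict, Set, Tuple

NfaTransitions = Dict[Tuple[int, str], Set[int]]


def nfa_search_breadth_first(nfa: NfaTransitions, start: int,
                             accepting: Set[int], subject: str) -> bool:
    """Advance every active NFA state in lockstep; never back up in the subject."""
    active = {start}
    if active & accepting:
        return True
    for ch in subject:
        # The new active set plays the role of a DFA state, recomputed on every search.
        active = {t for s in active for t in nfa.get((s, ch), ())}
        if active & accepting:
            return True
        if not active:
            return False  # no path through the NFA survives this character
    return False


# Sample NFA: state 0 loops on 'a' and 'b', so the pattern "ab" may begin anywhere.
sample_nfa: NfaTransitions = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
```

Each subject character is examined only once, so no backtracking occurs, but the set computation that a precompiled DFA would have performed once is instead repeated during every search.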