§1.1 Field of the Invention
The present invention concerns matching an arbitrary-length string with one of a number of known regular expressions, and/or configuring state machines for this purpose. Such matching may be used for network intrusion detection and prevention.
§1.2 Background Information
Although Regular Expressions (referred to as “RegExes”) have been widely used in network security applications, their inherent complexity often limits the total number of RegExes that can be detected using a single chip at a reasonable throughput. This limit on the number of RegExes impairs the scalability of today's RegEx detection systems. The scalability of existing schemes is generally limited by the traditional paradigm of per-character state processing and state transition detection.
RegExes are used to flexibly represent complex string patterns in many applications ranging from Network Intrusion Detection and Prevention Systems (“NIDPS”) (See, e.g., “Bro Intrusion Detection System,” http://www.bro-ids.org, and “Snort Network Intrusion Detection System,” http://www.snort.org, both incorporated herein by reference.) to compilers (See, e.g., A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools (2nd Edition) (Addison Wesley, 2006), incorporated herein by reference.) and DNA multiple sequence alignment (See, e.g., the articles, A. N. Arslan, “Multiple Sequence Alignment Containing a Sequence of Regular Expressions,” CIBCB, pp. 1-7 (2005), and Y. S. Chung, W. H. Lee, C. Y. Tang, and C. L. Lu, “RE-MuSiC: a Tool for Multiple Sequence Alignment with Regular Expression Constraints,” Nucleic Acids Res (Web Server issue), no. 35, pp. W639-644, 2007, both incorporated herein by reference.). For instance, RegExes are used for representing programming language tokens in compilers or for formulating constraints in DNA multiple sequence alignment. In particular, the NIDPSs Bro and Snort, and the Linux Application Level Packet Classifier (L7 filter) (See, e.g., J. Levandoski, E. Sommer, and M. Strait, “Application Layer Packet Classifier for Linux,” http://l7-filter.sourceforge.net, incorporated herein by reference.) use RegExes to represent attack signatures or packet classifiers.
A NIDPS RegEx detection system for high speed networks should satisfy the following two requirements: (1) scalability and (2) high throughput. A scalable detection system can accommodate more policies (rules), which increases the flexibility of the NIDPS. A NIDPS should also support ever increasing traffic demand. The number of RegExes in the Snort signature set increased from 1131 RegExes in 2006 to 2290 RegExes in 2009, and the typical RegEx length keeps growing.
Internet carriers and vendors have successfully experimented with 100-Gbps equipment. In September 2008, Verizon successfully performed a 100-Gbps transmission for more than 646 miles. (See, e.g., “Verizon and Nokia Siemens Networks Set New Record for 100 Gbps Optical Transmission,” http://newscenter.verizon.com, incorporated herein by reference.) Cisco's global IP traffic forecast report states that global IP traffic will nearly double every two (2) years through the end of 2012 and, consequently, annual global IP traffic will exceed half a zettabyte (a zettabyte being 10^21 bytes) in 2013. (See, e.g., “Networking Solutions White Paper,” http://www.cisco.com, incorporated herein by reference.) No current state-of-the-art RegEx detection system is capable of supporting such large RegEx sets (i.e., RegEx sets of 1000 RegExes or more, such as Snort, which includes at least 2290 RegExes at this time), let alone the even larger RegEx sets expected in the future, at the high speeds demanded (e.g., 40 Gbps to 100 Gbps, or higher). Thus, there is a need to support current and expected future RegEx sets with small memory requirements. Detection at line-speed is another desirable attribute of a NIDPS. It would be desirable to be able to use very high speed on-chip memory, even for large RegEx sets, thereby supporting the higher throughput required for today's and tomorrow's line speeds.
Finite Automata (“FA”) are the de-facto tools to address the RegEx detection problem. For RegEx detection on an input, the FA starts at an initial state. Then, for each character in the input, the FA transitions to the new state determined by the previous state and the current input character. If the resulting state is unique, the FA is called a Deterministic Finite Automaton (“DFA”); otherwise, it is called a Non-Deterministic Finite Automaton (“NFA”). (See, e.g., N. Wirth, Compiler Construction (Addison Wesley, 1996), incorporated herein by reference.)
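The per-character processing described above can be illustrated with a minimal sketch. The toy pattern (an input containing "ab"), the table layout, and the function names run_dfa and run_nfa are assumptions made for illustration only and are not taken from any referenced system. The DFA keeps exactly one current state, whereas the NFA keeps a set of simultaneously active states.

```python
# Illustrative sketch only: toy automata that accept any input containing "ab".
# Table layouts and names are hypothetical.

def run_dfa(table, start, accepting, text):
    """DFA: exactly one state lookup per input character."""
    state = start
    for ch in text:
        row = table[state]
        # The None key is used here as the default transition for this state.
        state = row.get(ch, row[None])
    return state in accepting

def run_nfa(table, start, accepting, text):
    """NFA: a set of active states; one character may activate several states."""
    active = {start}
    for ch in text:
        nxt = set()
        for s in active:
            row = table.get(s, {})
            # The None key is used here as a wildcard taken in addition to
            # any transition labeled with the specific character.
            nxt |= row.get(ch, set()) | row.get(None, set())
        active = nxt
    return bool(active & accepting)

# DFA: 0 = nothing matched, 1 = just saw 'a', 2 = saw "ab" (accepting).
DFA = {
    0: {"a": 1, None: 0},
    1: {"a": 1, "b": 2, None: 0},
    2: {None: 2},
}

# NFA: state 0 loops on every character and also guesses that an 'a' starts "ab".
NFA = {
    0: {"a": {1}, None: {0}},
    1: {"b": {2}},
    2: {None: {2}},
}

if __name__ == "__main__":
    for sample in ("xxaab", "ba", "ab", ""):
        print(repr(sample), run_dfa(DFA, 0, {2}, sample), run_nfa(NFA, 0, {2}, sample))
```

In this sketch the DFA loop performs a single table lookup per character, while the NFA loop iterates over every active state, which is the source of the difference in per-character cost.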
NFA and DFA represent two extreme cases of the tradeoff between performance and resource usage. A DFA has constant time complexity per character since, by definition, it guarantees only one state transition per character. This constant per-character time complexity allows a DFA to achieve high-speed RegEx detection, which makes DFA the currently preferred approach. This is especially the case for software solutions, since a DFA can be executed fast and serially on commodity CPUs. An NFA, on the other hand, allows multiple simultaneous state transitions, which leads to a higher time complexity, and typically requires massive parallelism, making it harder to implement in software. However, the price paid for the high speed of DFA is its prohibitively large memory requirement.
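The memory cost of a DFA can be illustrated with a well-known textbook example that is not drawn from the references above: the pattern "the k-th character from the end is 'a'" over the alphabet {a, b}. An NFA for it needs k+1 states, while the equivalent DFA obtained by subset construction needs 2^k states. The following sketch counts the reachable DFA states; the function names are hypothetical.

```python
# Illustrative sketch only: count the DFA states produced by subset
# construction for the pattern (a|b)*a(a|b){k-1}. Names are hypothetical.

def nfa_step(state, ch, k):
    """NFA with states 0..k: state 0 loops on 'a' and 'b' and moves to 1 on 'a';
    states 1..k-1 advance on any character; state k is accepting with no exits."""
    out = set()
    if state == 0:
        out.add(0)
        if ch == "a":
            out.add(1)
    elif state < k:
        out.add(state + 1)
    return out

def count_dfa_states(k):
    """Explore every subset of NFA states reachable from the start subset."""
    start = frozenset({0})
    seen = {start}
    work = [start]
    while work:
        cur = work.pop()
        for ch in "ab":
            nxt = frozenset(t for s in cur for t in nfa_step(s, ch, k))
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)

if __name__ == "__main__":
    for k in range(1, 9):
        print(f"k={k}: NFA states={k + 1}, DFA states={count_dfa_states(k)}")
```

Each DFA state here effectively remembers which of the last k characters were 'a', so the number of states doubles with every increment of k while the NFA grows by a single state.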
Recent applications, most notably deep packet inspection for NIDPS, require RegEx detection scalable to high quantities of complex RegExes. This need for scalability makes the DFA impractical for such applications, due to its large memory requirement. Since NFA is not fast, RegEx detection at high speeds, scalable to large sets of complex RegExes, is still an open issue. Thus, a better NIDPS is needed.
The present inventors note that three characteristics common to most of today's state-of-the-art RegEx detection systems limit the scalability of these systems. First, RegExes consist of a variety of different “components,” such as character classes or repetitions. (See, e.g., “Perl Compatible Regular Expressions,” http://www.pcre.org/, incorporated herein by reference.) This variety makes it hard to identify a single method that efficiently detects all of these different components of a RegEx concurrently. Nevertheless, in most cases today, these heterogeneous components are detected by a state machine that consists of homogeneous states. This leads to inefficiencies that limit the scalability of such systems. Second, the order of components in a RegEx is preserved in the state machine detecting this RegEx. However, it may be beneficial to change the order in which the components of a RegEx are detected. For instance, exact strings (simple strings) are easy to detect and are expected to appear less frequently in the input, whereas other components (for instance, character classes) may appear more frequently. (The present inventors found that by reordering the detection of the components, the trade-off between detection complexity and appearance frequency can be exploited for better scalability.) Finally, most RegExes share similar components. In the traditional FA approaches, a state machine is used to detect a component in a RegEx. This state machine is duplicated whenever a similar component appears in multiple RegExes. Furthermore, most of the time, the RegExes sharing this component cannot appear at the same time in the input. As a result, the duplication of the same state machine for different RegExes introduces redundancies, which limit the scalability of the RegEx detection system.
Based on the foregoing three observations of the inventors, a LaFA scheme is introduced: a novel and inventive detection method that resolves the scalability issue of the current RegEx detection paradigm. The scalable, compact Look-ahead Finite Automata (“LaFA”) data structure requires a small memory footprint and makes it feasible to implement large RegEx sets using only very fast, small on-chip memory.