§1.1. Field of the Invention
The present invention concerns pattern matching using regular expression matching. More specifically, the present invention concerns generating and using a finite automaton for regular expression matching.
§1.2. Background Information
Embodiments consistent with the present invention may be used in various applications that require regular expression matching. Such applications may include, for example, file search by an operating system or software application, syntax checking by compilers, and network security. The network security applications are described in detail below.
§1.2.1 Deep Packet Inspection for Network Security
Deep Packet Inspection (“DPI”) is a crucial technique used in today's Network Intrusion Detection Systems (“NIDS”). DPI is used to compare incoming packets, byte-by-byte, against patterns stored in a database to identify specific viruses, attacks, and/or protocols. Early DPI methods relied on exact string matching for attack detection. (See, e.g., the references: S. Wu and U. Manber, “A Fast Algorithm for Multi-Pattern Searching,” Dept. of Computer Science, University of Arizona, Tech. Rep. (1994) (incorporated herein by reference); A. V. Aho and M. J. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” Commun. of the ACM, Vol. 18, No. 6, pp. 333-340 (1975) (incorporated herein by reference); S. Dharmapurikar and J. W. Lockwood, “Fast and Scalable Pattern Matching for Network Intrusion Detection Systems,” IEEE J SEL AREA COMM, Vol. 24, No. 10, pp. 1781-1792 (2006) (incorporated herein by reference); and N. Tuck, T. Sherwood, B. Calder, and G. Varghese, “Deterministic Memory-Efficient String Matching Algorithms for Intrusion Detection,” Proc. of IEEE INFOCOM (2004) (incorporated herein by reference).) On the other hand, recent DPI methods use regular expression matching (See, e.g., the references: F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, “Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection,” Proc. of ACM/IEEE ANCS (2006) (incorporated herein by reference); S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner, “Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection,” Proc. of ACM SIGCOMM (2007) (incorporated herein by reference); R. Smith, C. Estan, and S. Jha, “XFA: Faster Signature Matching with Extended Automata,” IEEE Symposium on Security and Privacy (2008) (incorporated herein by reference); and M. Becchi and P. Crowley, “A Hybrid Finite Automaton for Practical Deep Packet Inspection,” Proc. of ACM CoNEXT (2007) (incorporated herein by reference)) because it provides better flexibility in the representation of ever-evolving attacks. (See, e.g., the reference, R. Sommer and V. Paxson, “Enhancing Byte-Level Network Intrusion Detection Signatures with Context,” Proc. of the ACM Conference on Computer and Communications Security (CCS) (2003) (incorporated herein by reference).) Indeed, regular expression matching has been widely used in many NIDSes such as Snort (See, e.g., “A Free Lightweight Network Intrusion Detection System for UNIX and Windows,” available online at http://www.snort.org (incorporated herein by reference)), Bro (See, e.g., Bro Intrusion Detection System, available online at http://www.broids.org (incorporated herein by reference)), and several network security appliances from Cisco Systems (See, e.g., “Cisco IPS Deployment Guide,” available online at http://www.cisco.com (incorporated herein by reference)). It has become the de facto standard for content inspection.
§1.2.2 Using Deterministic Finite Automatons (“DFAs”) and Nondeterministic Finite Automatons (“NFAs”) to Represent Regular Expressions
Despite its ability to represent attacks with flexibility, regular expression matching introduces significant computational and storage challenges. Deterministic Finite Automatons (“DFAs”) and Nondeterministic Finite Automatons (“NFAs”) are two typical representations of regular expressions. Given a set of regular expressions, one can easily construct the corresponding NFA. The DFA can be further constructed from the NFA using a subset construction scheme. (See, e.g., the reference, J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation—International Edition, (2nd Ed) (Addison-Wesley, 2003) (incorporated herein by reference).)
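By way of illustration only, the subset construction mentioned above may be sketched as follows. This is a minimal textbook-style sketch, not an implementation taken from any cited reference. The function names, the dictionary-based NFA encoding, and the example NFA for (a|b)*ab are assumptions introduced here for illustration.

```python
from collections import deque


def epsilon_closure(states, eps):
    """Return every NFA state reachable from `states` through epsilon moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(start, accepting, delta, eps, alphabet):
    """Build a DFA whose states are sets of NFA states.

    delta: dict mapping (nfa_state, symbol) -> set of NFA states
    eps:   dict mapping nfa_state -> set of NFA states (epsilon moves)
    Returns (dfa_start, dfa_accepting, dfa_transitions, dfa_states).
    """
    d_start = epsilon_closure({start}, eps)
    d_trans, d_accept = {}, set()
    seen, queue = {d_start}, deque([d_start])
    while queue:
        cur = queue.popleft()
        if cur & accepting:  # any accepting NFA state makes the subset accepting
            d_accept.add(cur)
        for sym in alphabet:
            moved = set()
            for s in cur:
                moved |= delta.get((s, sym), set())
            nxt = epsilon_closure(moved, eps)
            d_trans[(cur, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return d_start, d_accept, d_trans, seen


# Hypothetical NFA for (a|b)*ab: state 0 loops on a and b, 0 -a-> 1, 1 -b-> 2.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
d_start, d_accept, d_trans, d_states = subset_construction(
    start=0, accepting={2}, delta=delta, eps={}, alphabet="ab")
print(len(d_states))  # 3 DFA states: {0}, {0, 1}, {0, 2}
```

Each DFA state is a set of NFA states, so the number of DFA states is bounded only by the number of subsets of NFA states.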
DFAs and NFAs have quite different performance and memory usage characteristics. A DFA has at most one active state during the entire matching process. Therefore, a DFA requires only one state traversal for each character processed. This results in a deterministic memory bandwidth requirement. The main problem of using a DFA to represent regular expressions is the DFA's severe state explosion problem (See, e.g., F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, “Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection,” Proc. of ACM/IEEE ANCS (2006) (incorporated herein by reference)), which often leads to a prohibitively large memory requirement. In contrast, an NFA represents regular expressions with much less memory. However, this memory reduction comes with the tradeoff of a high and unpredictable memory bandwidth requirement (because the number of concurrently active states in an NFA is unpredictable during matching). Processing a single character in a packet with an NFA may induce a large number of state traversals. This causes a large number of memory accesses, which limits matching speed.
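The tradeoff described above can be made concrete with a small sketch. The pattern (a|b)*a(a|b){n}, the helper names, and the dictionary encoding below are assumptions chosen for illustration; they do not come from the cited references. The NFA for this pattern has only n+2 states but can have up to n+2 states active at once, whereas the equivalent DFA has a single active state but 2^(n+1) states in total.

```python
def build_nfa(n):
    """NFA for (a|b)*a(a|b){n}: state 0 loops; states 1..n+1 form a chain."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n + 1):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta


def nfa_step(active, sym, delta):
    """Advance every active NFA state on one input symbol."""
    nxt = set()
    for s in active:
        nxt |= delta.get((s, sym), set())
    return frozenset(nxt)


def max_active_states(text, delta):
    """Peak number of simultaneously active NFA states while scanning `text`."""
    active = frozenset({0})
    peak = len(active)
    for ch in text:
        active = nfa_step(active, ch, delta)
        peak = max(peak, len(active))
    return peak


def count_dfa_states(delta, alphabet="ab"):
    """Number of DFA states reachable via subset construction (no epsilon moves)."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        cur = todo.pop()
        for sym in alphabet:
            nxt = nfa_step(cur, sym, delta)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


n = 3
nfa = build_nfa(n)
print(max_active_states("a" * 10, nfa))  # 5 active NFA states (n + 2)
print(count_dfa_states(nfa))             # 16 DFA states (2 ** (n + 1))
```

In this sketch, each additional unit of n adds one NFA state and one more potentially active state per input character, but doubles the number of DFA states.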
Recent research has pursued a tradeoff between computational complexity and storage complexity for regular expression matching. (See, e.g., the references: F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, “Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection,” Proc. of ACM/IEEE ANCS (2006) (incorporated herein by reference); S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner, “Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection,” Proc. of ACM SIGCOMM (2007) (incorporated herein by reference); R. Smith, C. Estan, and S. Jha, “XFA: Faster Signature Matching with Extended Automata,” IEEE Symposium on Security and Privacy (2008) (incorporated herein by reference); M. Becchi and P. Crowley, “A Hybrid Finite Automaton for Practical Deep Packet Inspection,” Proc. of ACM CoNEXT (2007) (incorporated herein by reference); R. Sommer and V. Paxson, “Enhancing Byte-Level Network Intrusion Detection Signatures with Context,” Proc. of the ACM Conference on Computer and Communications Security (CCS) (2003) (incorporated herein by reference); and S. Kumar, J. Turner, and J. Williams, “Advanced Algorithms for Fast and Scalable Deep Packet Inspection,” Proc. of ACM/IEEE ANCS (2006) (incorporated herein by reference).) Among these proposed solutions, some (See, e.g., the M. Becchi and P. Crowley reference and the R. Sommer and V. Paxson reference), like the present invention, seek to design a hybrid finite automaton fitting between DFAs and NFAs. Unlike the present invention, however, these proposed automatons, though compact and fast when processing common traffic, suffer from poor performance in the worst cases. This is because none of them can guarantee an upper bound on the number of active states during the matching process. This weakness can potentially be exploited by attackers to construct worst-case traffic that can slow down the NIDS and cause malicious traffic to escape inspection. In fact, the design of a finite automaton with a small (larger than one) but bounded number of active states remains an open and challenging problem.
§1.2.3 Related Work in Regular Expression Matching
Most of the current research in regular expression matching focuses on reducing the memory usage of DFAs and can be classified into (1) transition reduction, (2) state reduction, or (3) hybrid finite automaton. Each of these memory usage reduction techniques is described below.
“Transition reduction” schemes reduce the memory usage of a DFA by eliminating redundant transitions. The D2FA (See, e.g., S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner, “Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection,” Proc. of ACM SIGCOMM (2007) (incorporated herein by reference)), proposed by Kumar et al., is a representative method in this category. It eliminates redundant transitions in a DFA by introducing default transitions, thereby reducing memory usage. However, the number of memory accesses per input character increases. After the D2FA, many other schemes, such as the CD2FA (See, e.g., the references: S. Kumar, J. Turner, and J. Williams, “Advanced Algorithms for Fast and Scalable Deep Packet Inspection,” Proc. of ACM/IEEE ANCS (2006) (incorporated herein by reference); and M. Becchi and P. Crowley, “An Improved Algorithm to Accelerate Regular Expression Evaluation,” Proc. of ACM/IEEE ANCS (2007) (incorporated herein by reference)), were proposed to improve the D2FA's worst-case run-time performance and construction complexity.
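The idea of default transitions can be illustrated with a simplified sketch. The code below is not the D2FA construction algorithm of the cited reference; it only shows, under assumed data structures (`labeled`, `default`) and an assumed three-state example DFA, how a lookup falls back through default transitions and why this costs additional memory accesses per input character.

```python
def next_state(state, sym, labeled, default):
    """Follow default transitions until a labeled transition on `sym` exists.

    labeled: dict state -> {symbol: next_state}, holding only the stored
             (non-redundant) transitions
    default: dict state -> fallback state; a root state has no entry and
             must store a labeled transition for every symbol
    Returns (next_state, number_of_states_visited).
    """
    visited = 1
    while sym not in labeled[state]:
        state = default[state]
        visited += 1
    return labeled[state][sym], visited


# Hypothetical DFA over {a, b, c} accepting strings that end in "ab".
# The full table needs 9 transitions; here only 4 are stored.
labeled = {
    0: {"a": 1, "b": 0, "c": 0},  # root state: stores every symbol
    1: {"b": 2},                  # differs from state 0 only on "b"
    2: {},                        # identical to state 0
}
default = {1: 0, 2: 0}

state, accesses = 0, 0
for ch in "cab":
    state, visited = next_state(state, ch, labeled, default)
    accesses += visited
print(state, accesses)  # final state 2 after 3 state visits
```

In this example the saving is five transitions out of nine, at the cost of up to two state visits for a character processed in state 1 or state 2.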
“State reduction” schemes reduce the memory usage of a DFA by alleviating its state explosion. Since many regular expressions interact with others, the composite DFA for multiple regular expressions could possibly be extremely large. This is referred to as “state explosion”. Yu et al. (See, e.g., F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, “Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection,” Proc. of ACM/IEEE ANCS (2006) (incorporated herein by reference)) and Jiang et al. (See, e.g., J. Jiang, Y. Xu, T. Pan, Y. Tang, and B. Liu, “Pattern-Based DFA for Memory-Efficient and Scalable Multiple Regular Expression Matching,” Proc. of IEEE ICC, pp. 1-5 (May 2010) (incorporated herein by reference)) propose to combine regular expressions into multiple DFAs instead of one to eliminate the state explosion. Although state reduction schemes reduce memory usage, they usually require many more DFAs. This, in turn, increases the memory bandwidth demand linearly with the number of DFAs used. The XFA uses auxiliary memory to significantly reduce memory usage. (See, e.g., the references: R. Smith, C. Estan, and S. Jha, “XFA: Faster Signature Matching with Extended Automata,” IEEE Symposium on Security and Privacy (2008) (incorporated herein by reference); and R. Smith, C. Estan, S. Jha, and S. Kong, “Deflating the Big Bang: Fast and Scalable Deep Packet Inspection with Extended Finite Automata,” Proc. of ACM SIGCOMM (2008) (incorporated herein by reference).) Unfortunately, however, the creation of an XFA requires a lot of manual work, which is error-prone and inefficient. Further, its performance is non-deterministic. The reference, M. Becchi and S. Cadambi, “Memory-Efficient Regular Expression Search Using State Merging,” Proc. of IEEE INFOCOM, pp. 1064-1072 (May 2007) (incorporated herein by reference), proposed an algorithm to merge DFA states by introducing labels on their input and output transitions. The reference, S. Kumar, B. Chandrasekaran, J. Turner, and G. Varghese, “Curing Regular Expressions Matching Algorithms from Insomnia, Amnesia, and Acalculia,” Proc. of ACM/IEEE ANCS (2007) (incorporated herein by reference), proposed history-based finite automatons that record history information during matching, which captures one of the major reasons for DFA state explosion and reduces the memory cost. However, recording history increases the worst-case complexity and thus compromises scalability.
“Hybrid finite automaton” schemes aim at designing automatons that occupy the middle ground between NFAs and DFAs so that the strengths of both can be obtained. Becchi et al. proposed a hybrid finite automaton called Hybrid-FA, which consists of a head DFA and multiple tail-NFAs/tail-DFAs. (See, e.g., M. Becchi and P. Crowley, “A Hybrid Finite Automaton for Practical Deep Packet Inspection,” Proc. of ACM CoNEXT (2007) (incorporated herein by reference).) Although a Hybrid-FA can achieve an average-case memory bandwidth requirement similar to that of a single DFA with significantly reduced memory usage, its worst-case memory bandwidth requirement is unpredictable and varies when the regular expression rule set is updated. Lazy DFA (See, e.g., R. Sommer and V. Paxson, “Enhancing Byte-Level Network Intrusion Detection Signatures with Context,” Proc. of the ACM Conference on Computer and Communications Security (CCS) (2003) (incorporated herein by reference)) is another automaton used to leverage the advantages of both NFAs and DFAs. Its main function is to store only frequently used DFA states in memory, while leaving others in NFA representation. When an uncommon DFA state is required, the Lazy DFA has to be extended at run-time from the NFA. Consequently, although the Lazy DFA is fast and memory-efficient in common cases, in the worst case the whole DFA needs to be expanded, making it vulnerable to malicious traffic.
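The on-demand expansion described for the Lazy DFA may be sketched as follows. This is a simplified illustration and not the implementation of the cited reference: the class name `LazyDFA`, its methods, and the example NFA are assumptions, and the sketch caches every expanded transition without the eviction policy that a memory-limited system would need.

```python
class LazyDFA:
    """DFA whose transitions are expanded from an NFA only when first needed."""

    def __init__(self, delta, start, accepting):
        self.delta = delta                  # (nfa_state, symbol) -> set of NFA states
        self.start = frozenset({start})
        self.accepting = frozenset(accepting)
        self.cache = {}                     # (dfa_state, symbol) -> dfa_state

    def step(self, cur, sym):
        key = (cur, sym)
        if key not in self.cache:
            # Cache miss: expand this DFA transition from the NFA at run-time.
            nxt = set()
            for s in cur:
                nxt |= self.delta.get((s, sym), set())
            self.cache[key] = frozenset(nxt)
        return self.cache[key]

    def matches(self, text):
        cur = self.start
        for ch in text:
            cur = self.step(cur, ch)
        return bool(cur & self.accepting)


# Hypothetical NFA for (a|b)*ab (no epsilon moves).
lazy = LazyDFA({(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}},
               start=0, accepting={2})
print(lazy.matches("aab"), len(lazy.cache))  # True, with 3 transitions expanded
```

Input that keeps reaching previously unseen subsets forces a cache miss on every character, which corresponds to the worst case noted above in which the whole DFA is eventually expanded.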
Thus, there is a need for improved techniques and apparatus for regular expression matching.