In the field of network intrusion detection, regular expressions are frequently used to detect whether network data includes malicious data in a specific format, so as to judge whether a network intrusion has occurred. Because regular expressions are flexible and highly expressive, they are widely used in this field.
In order to use a regular expression to match data, a regex engine generally needs to be constructed based on the regular expression. There are currently two types of regex engines: Non-deterministic Finite Automaton (NFA) regex engines and Deterministic Finite Automaton (DFA) regex engines. Because the backtracking feature of an NFA cannot be changed, the matching speed of an NFA cannot be significantly improved. Therefore, the DFA regex engine is currently in wide use. However, the DFA itself suffers from the problem of state expansion of the state machine.
The basic working principles of a DFA regex engine are as follows: first, a regular expression (or all regular expressions in a regular expression set) is pre-compiled, according to specific rules, into a deterministic finite state machine; next, a character string to be checked is used as the input of the finite state machine to drive its state transitions; finally, it is checked whether the character string has matched a specific regular expression during the state-transition process. Each state in the finite state machine includes two basic elements: (1) a match list and (2) a state transition array. The match list includes the serial number of a regular expression. If the match list is not null, the input data stream has matched the regular expression corresponding to the serial number when the state machine reaches this state; otherwise, no match has occurred. The state transition array allows the current state to decide which state to jump to according to an input character: the length of the array is the number of all possible input characters, the indices of the array are the possible input characters, and each value in the array is the serial number of the state to jump to when the character corresponding to its index is input in the current state.
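The two elements of a state described above can be illustrated with a minimal sketch. It assumes a byte-oriented input (256 possible characters) and a hand-built three-state machine for a hypothetical rule with serial number 1, `.*ab`; the names `State`, `build_example_dfa`, and `scan` are illustrative and do not come from any particular engine.

```python
from dataclasses import dataclass, field

ALPHABET_SIZE = 256  # assumption: one slot per possible input byte


@dataclass
class State:
    # Serial numbers of the regular expressions matched on reaching this state.
    match_list: list = field(default_factory=list)
    # transitions[b] is the serial number of the state to jump to on byte b.
    transitions: list = field(default_factory=lambda: [0] * ALPHABET_SIZE)


def build_example_dfa():
    """Hand-built DFA for the hypothetical rule 1: '.*ab'."""
    s0, s1, s2 = State(), State(), State(match_list=[1])
    for s in (s0, s1, s2):
        s.transitions[ord("a")] = 1  # an 'a' always (re)starts a partial match
    s1.transitions[ord("b")] = 2     # 'a' followed by 'b' reaches the matching state
    return [s0, s1, s2]


def scan(dfa, data):
    """Feed data through the DFA; return (offset, rule) pairs for each match."""
    hits = []
    current = 0  # state 0 is the start state
    for offset, byte in enumerate(data):
        current = dfa[current].transitions[byte]  # one array lookup per input byte
        for rule in dfa[current].match_list:      # a non-empty list signals a match
            hits.append((offset, rule))
    return hits
```

For example, `scan(build_example_dfa(), b"xxabyab")` reports rule 1 at offsets 3 and 6. Each input byte costs exactly one array lookup, which is why a DFA matches quickly, while every state carries a full 256-entry array.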
The state transition array is the cause of state expansion in a DFA engine. Supposing that the input character set is the 256-character extended American Standard Code for Information Interchange (ASCII) table, the state transition array of each state is an int-type or short-type array with a length of 256, and occupies 1 KB or 512 B of memory, respectively. Owing to the complexity of network intrusions, in practical network intrusion detection a plurality of complex regular expressions may be applied to the same segment of network data for matching, so the number of states of the compiled DFA state machine may reach an order of magnitude of 10^4 to 10^5, which can lead to memory exhaustion of the system.
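The memory figures above follow from simple arithmetic, sketched below. The sketch assumes a 4-byte int and a 2-byte short, and counts only the state transition arrays (match lists and other bookkeeping are ignored); the function names are illustrative.

```python
ALPHABET_SIZE = 256  # one transition entry per possible input byte


def transition_memory_bytes(num_states, entry_bytes):
    """Total bytes used by the transition arrays of num_states states."""
    return num_states * ALPHABET_SIZE * entry_bytes


def report():
    """Return (states, entry width in bytes, size in MiB) for the quoted ranges."""
    rows = []
    for num_states in (10**4, 10**5):
        for entry_bytes in (2, 4):  # assumed widths of short and int
            total = transition_memory_bytes(num_states, entry_bytes)
            rows.append((num_states, entry_bytes, total / 2**20))
    return rows


if __name__ == "__main__":
    for num_states, entry_bytes, mib in report():
        print(f"{num_states:>7} states x {entry_bytes}-byte entries: {mib:7.1f} MiB")
```

Under these assumptions, 10^5 states with 4-byte entries need 102,400,000 bytes (about 97.7 MiB) for the transition arrays of a single state machine. Note also that a 16-bit entry can address at most 65,536 states, so a machine at the upper end of the range needs the wider entry type.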
Three solutions to the state expansion problem of the DFA regex engine are currently available:
1. A method for reducing the number of state transitions of a DFA state machine is disclosed in the article “Algorithms to accelerate multiple regular expressions matching for deep packet inspection[c]” Proceedings of the 2006 Conference on Applications, Technologies, Architectures and Protocols for Computer Communications. New York: ACM Press, 2006: 229-350 by KUMAR S, DHARMAPURIKAR S, YU F, et al in 2006, the article “Advanced algorithms for fast and scalable deep packet inspection[c]” Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communication Systems. New York: ACM Press, 2006: 81-92 by KUMAR S, TURNER J, WILLIAMS J, et al in 2006, and the article “An improved DFA for fast regular expression matching [J]” ACM SIGCOMM Computer Communication Review, 2008, 38(5):29-40 by FICARA D, ESTAN C, JHA S, PROCISSI G, et al in 2008 (the contents of the above articles are incorporated herein by reference in their entirety). This method saves memory by adding “edges” to the state machine. However, it has the following problems. First, the introduced “edges” lengthen the transition path of the state machine, which lowers the matching efficiency of the regex engine. Second, the method lacks universality, because an “edge” cannot be added for every state. The memory-saving effect depends on the ratio of states to which an “edge” can be added to all states, and therefore ultimately depends on the regular expressions themselves. That is, for some regular expressions the method saves memory well, whereas for others the memory-saving effect is poor. The method is therefore not suitable for a network intrusion detection system that needs to use a large number of regular expressions.
2. A method for reducing the number of states of a DFA state machine is disclosed in the article “Xfa: Faster signature matching with extended automata[c]” Proceedings of the 2008 IEEE Symposium on Security and Privacy. Washington, D.C.: IEEE, 2008:187-201 by SMITH R, ESTAN C, JHA S, et al in 2008, and the article “Memory-efficient regular expression search using state merging[c]” INFOCOM 2007: 26th IEEE International Conference on Computer Communications. Washington, D.C.: IEEE, 2007:1064-1072 by BECCHI M and CADAMBI S in 2007 (the contents of the above articles are incorporated herein by reference in their entirety). This method compresses the states of the DFA state machine by introducing state bits, so as to reduce the number of states. However, it also has the following defects. First, the memory occupied by the additional information (the state bits) cannot be ignored when the DFA state machine is comparatively complex. Second, the method also lacks universality: for example, it can largely solve the state expansion caused by the regular expression .*ab.*cd|.*ef.*g, but cannot solve the state expansion caused by the regular expression .*ab[^\n]*cd|.*ef[^\n]*gh.
3. A method for alphabet compression is disclosed in the article “An improved algorithm to accelerate regular expression evaluation[c]” Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems. New York: ACM Press, 2007: 145-154 by BECCHI M and CROWLEY P in 2007 (the contents of the above article are incorporated herein by reference in their entirety). This method shortens the state transition array of each state in the state machine by compressing the alphabet, so as to reduce the memory consumption of the DFA state machine. However, it has the following defects. First, although the state transition array is shortened, a mapping between the compressed alphabet and the complete alphabet must be performed during each actual state transition, which reduces the matching efficiency of the DFA state machine. Second, the method is suitable only for the situation in which the same target state can be reached when characters are received in different source states, so its application scope is very limited.
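The trade-offs described for methods 1 and 3 can be made concrete on a tiny example. The sketch below is a deliberately simplified illustration, not the algorithms of the cited articles: it assumes a hand-built three-state DFA for a hypothetical rule `.*ab`, models an added “edge” as a default edge to a manually chosen state, and models alphabet compression as merging bytes whose transitions are identical in every state. All function names are illustrative.

```python
ALPHABET_SIZE = 256


def full_tables():
    """Uncompressed tables of a hand-built DFA for the hypothetical rule '.*ab'."""
    tables = [[0] * ALPHABET_SIZE for _ in range(3)]
    for row in tables:
        row[ord("a")] = 1        # an 'a' always (re)starts a partial match
    tables[1][ord("b")] = 2      # 'a' then 'b' reaches the matching state
    return tables


# --- Method 1 (simplified): default edges -------------------------------

def add_default_edges(tables, default_of):
    """Per state, keep only the entries that differ from its default state."""
    sparse = []
    for state, row in enumerate(tables):
        parent = default_of[state]
        if parent is None:       # the root state keeps every entry
            sparse.append(dict(enumerate(row)))
        else:
            sparse.append({b: t for b, t in enumerate(row) if t != tables[parent][b]})
    return sparse


def step_default(sparse, default_of, state, byte):
    """Return (next_state, hops); each hop is an extra lookup for the same byte."""
    hops = 0
    while byte not in sparse[state]:
        state = default_of[state]  # follow the default edge without consuming input
        hops += 1
    return sparse[state][byte], hops


# --- Method 3 (simplified): alphabet compression ------------------------

def compress_alphabet(tables):
    """Merge bytes whose transition columns are identical in every state."""
    class_of_column = {}
    byte_to_class = []
    for b in range(ALPHABET_SIZE):
        column = tuple(row[b] for row in tables)
        byte_to_class.append(class_of_column.setdefault(column, len(class_of_column)))
    compact = [[0] * len(class_of_column) for _ in tables]
    for b, cls in enumerate(byte_to_class):
        for s, row in enumerate(tables):
            compact[s][cls] = row[b]
    return byte_to_class, compact


def step_compressed(byte_to_class, compact, state, byte):
    """One transition costs two lookups: byte -> class, then class -> state."""
    return compact[state][byte_to_class[byte]]
```

With `default_of = [None, 0, 0]`, the 768 stored entries of this example shrink to 257, but a transition such as state 1 on `a` now takes one extra hop. With alphabet compression, the three arrays shrink to three entries each, at the price of a 256-entry mapping table and one additional lookup per input byte. Both savings depend entirely on how similar the states and character columns of the particular machine happen to be.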
It can be seen from the above that, although various solutions have been proposed for the state expansion problem of the DFA regex engine, each of them still has obvious defects. A novel technical solution capable of better solving the state expansion problem of the DFA is therefore needed.