Many network security applications in today's networks are based on deep packet inspection, which checks not only the header portion but also the payload portion of a packet. For example, traffic monitoring, layer-7 filtering and network intrusion detection all require an accurate analysis of packet content, matching it against a predefined data set of patterns to identify specific classes of applications, viruses, attack signatures, etc. Traditionally, the data sets consisted of a number of signatures to be searched with string matching algorithms, but nowadays regular expressions are used due to their increased expressiveness and ability to describe a wide variety of payload signatures. Multi-pattern regex matching, in which packet payloads are matched against a large set of patterns, is an important algorithm in network security applications.
As networks grow, the number of new applications and new protocols increases every day, so the patterns in the data set change very fast and must be updated in network security applications frequently. This demands multi-pattern regex matching algorithms that can be compiled in a very short time. On the other hand, because network security applications need to process packets online, in real time and at high speed, the multi-pattern regex matching engine affects throughput and latency. This poses a significant challenge for the performance and memory footprint of the multi-pattern regex matching engine.
Nowadays, most processor vendors are increasing the number of cores in a single chip. This trend is observed not only in multi-core processors but also in many-core processors. Since deep packet inspection is often a bottleneck in packet processing, exploiting parallelism in multi-core and many-core architectures is a key to improving overall performance.
The winning criteria for a multi-pattern regex matching engine are performance, memory footprint, compilation time and scalability. On the other hand, only a few patterns in the data set use complex regex syntax; most of the regex patterns in the data set are simple, as described in the following and illustrated in Table 1. Strings may be presented as an exact sequence of symbols or digits, for example “hello”. Strings may or may not be anchored to a certain position inside the data set, for example “^.{3}hello”. Parts of the string may be presented as character sets, for example “he[l-n]o”. Strings may be case-sensitive or case-insensitive, e.g. per symbol.
TABLE 1: Example of simple regex patterns

regex syntax | Example
simple string | “hello”
anchor | “^.{3}hello” may correspond to “hello” with offset 3
character set | “he[l-n]o” may correspond to “helo”, “hemo” and “heno”
case (in)sensitive | “(?i)hello”
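The four pattern classes of Table 1 can be tried out with a short sketch. It uses Python's standard `re` module purely for illustration; the helper name `matches` and the sample payloads are assumptions made for this example and are not part of the described engine.

```python
import re

# Patterns copied from Table 1; the payloads used with them are made up.
simple_string = re.compile(r"hello")
anchored = re.compile(r"^.{3}hello")         # "hello" at offset 3
char_set = re.compile(r"he[l-n]o")           # "helo", "hemo" or "heno"
case_insensitive = re.compile(r"(?i)hello")  # any mix of upper and lower case


def matches(pattern, payload):
    """Hypothetical helper: True if the pattern occurs in the payload."""
    return pattern.search(payload) is not None


if __name__ == "__main__":
    for name, pattern, payload in [
        ("simple string", simple_string, "say hello"),
        ("anchor", anchored, "abchello"),
        ("character set", char_set, "hemo"),
        ("case insensitive", case_insensitive, "HeLLo"),
    ]:
        print(name, matches(pattern, payload))
```

Note that the anchored pattern only matches when exactly three symbols precede “hello”, which is what distinguishes it from the unanchored simple string.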
Existing technical approaches implement regex matching by using finite automata, such as the non-deterministic finite automaton (NFA), deterministic finite automaton (DFA), multiple deterministic finite automaton (mDFA), delayed input deterministic finite automaton (D2FA), hybrid finite automaton (HFA) and extended finite automaton (XFA). mDFA divides the rules into different groups and compiles each group into its own DFA. D2FA compresses the outgoing edges of each DFA state by using a default path. HFA compresses DFA states by combining DFA and NFA. XFA compresses DFA states by adding bit or counter variables to each state. However, all of these NFA-based and DFA-based approaches have limitations. NFAs have excessive time complexity, while DFAs have excessive space complexity and long compile times. mDFA increases the required memory bandwidth, since each of the DFAs must access the main memory once for each byte in the payload. D2FA has a longer compile time than DFA and cannot solve the excessive space complexity that arises during DFA compilation. HFA cannot compress DFA states for simple regex expressions. With XFA, the many variable read/write operations decrease the overall performance.
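The space complexity problem of DFAs can be illustrated with a textbook example that is not taken from the approaches above: the pattern "an 'a' followed by exactly n further symbols at the end of the input" over the alphabet {a, b}. Its NFA needs only n + 2 states, but the subset construction yields 2^(n+1) DFA states. The function name `count_dfa_states` and the hard-coded NFA are assumptions made for this sketch.

```python
def count_dfa_states(n, alphabet="ab"):
    """Count DFA states reachable by subset construction for (a|b)*a(a|b){n}.

    NFA states: 0 is the start state and loops on every symbol; reading 'a'
    in state 0 also enters state 1; states 1..n advance on any symbol;
    state n + 1 is accepting and has no outgoing edges.
    """

    def step(state_set, symbol):
        nxt = set()
        for s in state_set:
            if s == 0:
                nxt.add(0)
                if symbol == "a":
                    nxt.add(1)
            elif s <= n:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            nxt = step(current, symbol)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


if __name__ == "__main__":
    for n in (1, 3, 8):
        print(f"n={n}: NFA states={n + 2}, DFA states={count_dfa_states(n)}")
```

Each additional counted symbol doubles the number of DFA states, which is why compiling large rule sets into a single DFA can become impractical in both memory and compile time.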
Another solution extracts the longest string from each regex pattern, compiles all of these strings and matches the packet payload against them by using exact string matching algorithms such as Aho-Corasick (AC), Modified Wu-Manber (MWM), Trie, etc. After the exact string matching, a set of possible match results is obtained, and the respective regex pattern has to be verified for each match result, one by one. The set of possible match results, however, can be large, especially for regex patterns whose longest string is short, e.g. only 1 or 2 bytes. Verifying all of the respective regex patterns dramatically decreases the overall performance.
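The two-stage scheme described above can be sketched as follows. This is a minimal illustration under several assumptions: the functions `longest_literal` and `prefilter_match` are hypothetical, the literal extraction is naive and only valid for the simple pattern forms of Table 1 (literals, “^”, “.{n}” and character sets, without case-insensitive flags), and a plain substring test stands in for a real multi-string matcher such as Aho-Corasick.

```python
import re


def longest_literal(pattern):
    """Naive literal extraction; assumes Table 1 style patterns only."""
    # Blank out character sets and {n} counts, then split on what is left
    # of the regex metacharacters to obtain the literal pieces.
    stripped = re.sub(r"\[[^\]]*\]|\{[^}]*\}", " ", pattern)
    pieces = re.split(r"[^A-Za-z0-9]+", stripped)
    return max(pieces, key=len)


def prefilter_match(patterns, payload):
    """Return (candidates, confirmed) for one payload."""
    # Stage 1: exact string matching (substring test as a stand-in for AC).
    candidates = [p for p in patterns if longest_literal(p) in payload]
    # Stage 2: verify every candidate with the full regex, one by one.
    confirmed = [p for p in candidates if re.search(p, payload)]
    return candidates, confirmed


if __name__ == "__main__":
    patterns = ["hello", "^.{3}hello", "he[l-n]o", "a[0-9]b"]
    candidates, confirmed = prefilter_match(patterns, "xyzhello banana")
    print("candidates:", candidates)
    print("confirmed: ", confirmed)
```

In this made-up payload, the patterns “he[l-n]o” and “a[0-9]b” pass the first stage only because their longest literals (“he” and “a”) are short, and both are rejected by the second stage. Such wasted verifications are the overhead the paragraph above refers to.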