Field of the Invention
The present disclosure relates to the field of network intrusion detection technology, and more particularly to a method for compressing matching automata through common prefixes in regular expressions.
Description of the Related Art
Finding known suspicious patterns in network data streams is a vital task for modern networks. Network Intrusion Detection Systems (NIDS) perform this function by dividing network traffic into two groups: suspicious and benign. Suspicious traffic is often identified by a byte-by-byte examination of the data in a particular network datagram. A NIDS contains a set of patterns that are known to be suspicious. These known suspicious patterns are typically expressed as fixed-strings or as regular expressions. A fixed-string represents a single, unchanging string of characters (binary or otherwise), such as “Host” or “User-agent.” As the name implies, a fixed-string will never match any pattern other than itself. A regular expression, however, describes an entire language of strings that the expression can recognize. For example, the regular expression “ab*cd” can match the inputs acd, abcd, abbcd, and so on. Regular expressions therefore give the pattern creator far more expressive power. This power can be used to write more dynamic patterns that not only eliminate false positives but also defeat evasive tactics employed by rogue traffic. The wider context captured by regular expressions makes them far more valuable to a NIDS than fixed-strings, as they provide both more specific and more dynamic matching. However, this improved matching comes at the cost of greater complexity.
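The difference between the two pattern types can be illustrated with a minimal sketch, here using the regular expression module of the Python standard library. The function names and sample payloads are hypothetical and chosen only for illustration; the sketch does not represent how any particular NIDS performs matching.

```python
import re

# Fixed-string pattern: matches only the exact byte sequence.
FIXED = b"Host"

# Regular expression from the text: 'a', zero or more 'b', then 'cd'.
PATTERN = re.compile(rb"ab*cd")


def fixed_match(payload: bytes) -> bool:
    """Return True if the fixed-string occurs anywhere in the payload."""
    return FIXED in payload


def regex_match(payload: bytes) -> bool:
    """Return True if the whole payload belongs to the language of ab*cd."""
    return PATTERN.fullmatch(payload) is not None


# Hypothetical inputs: the first three are in the language, the last is not.
inputs = [b"acd", b"abcd", b"abbcd", b"abd"]
results = {p: regex_match(p) for p in inputs}
```

In this sketch a single expression accepts acd, abcd, and abbcd, whereas a fixed-string approach would need one pattern per variant.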
The greater complexity of regular expression matching requires more careful use of resources in order to match efficiently. Greater efficiency is typically achieved in one of three ways: by improving the matching algorithm, as in Hybrid Automata, Extended Finite Automata (XFA), and Delayed Input Deterministic Finite Automata (D²FA); by exploiting parallelism in hardware, using Field Programmable Gate Arrays (FPGA), Graphics Processing Units (GPU), Cell processors, and Ternary Content Addressable Memory (TCAM); or by both exploiting hardware parallelism on modern General Purpose Processors (GPP) and creating an architecture-friendly layout of the matching automaton, as in GPP-grep.
Despite these advances in matching architectures and hardware, matching is still plagued by a significant problem: the substantial size of the matching automaton. Larger matching automata, regardless of the architecture or hardware implementation, require more resources during matching. In some instances, a matching automaton can grow too large to fit in the available memory even for only a couple hundred regular expressions, while the average NIDS typically has thousands of regular expressions. Efforts have been made to compress the Deterministic Finite Automaton (DFA), but such compression leads to more overhead during matching. Finally, complex regular expressions, particularly those with extensive repetition and counting, when compiled into a single Non-deterministic Finite Automaton (NFA), can create matching automata with many redundant paths, because minimizing this redundancy is a hard problem.
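The growth in automaton size caused by counting can be illustrated with a small sketch. The expression family (a|b)*a(a|b){n} is a standard textbook example and is an assumption of this illustration; it is not taken from any particular NIDS rule set. The function name and the state numbering are likewise hypothetical. The sketch applies the subset construction to a hand-built NFA with n + 2 states and counts the reachable DFA states.

```python
def dfa_state_count(n: int) -> int:
    """Count reachable DFA states for (a|b)*a(a|b){n} via subset construction.

    Assumed NFA layout (illustrative only):
      state 0      : start, loops to itself on 'a' and 'b'; also goes to 1 on 'a'
      states 1..n  : state i goes to i + 1 on 'a' or 'b'
      state n + 1  : accepting, no outgoing transitions
    """

    def step(states: frozenset, symbol: str) -> frozenset:
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)            # self-loop on every symbol
                if symbol == "a":
                    nxt.add(1)        # guess that this 'a' is the marked one
            elif s <= n:
                nxt.add(s + 1)        # count one more trailing symbol
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    work = [start]
    while work:
        cur = work.pop()
        for sym in "ab":
            nxt = step(cur, sym)
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)


# NFA size grows linearly (n + 2 states); DFA size is compared for small n.
counts = {n: dfa_state_count(n) for n in range(5)}
```

For this family the DFA must track which of the last n + 1 input symbols were 'a', so the number of reachable DFA states is 2^(n+1) while the NFA has only n + 2 states.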