Pattern matching algorithms are widely used in a variety of network communication applications. For example, Intrusion Detection Systems (IDS) use pattern matching in deep packet inspection, for purposes such as detecting known signatures of malicious content.
A common approach currently used for this type of pattern matching is the Aho-Corasick algorithm, first described by Aho and Corasick in “Efficient String Matching: An Aid to Bibliographic Search,” Communications of the ACM 18(6), pages 333-340 (1975), which is incorporated herein by reference. The Aho-Corasick algorithm uses a deterministic finite automaton (DFA) to represent the pattern set. The input stream is inspected symbol by symbol by traversing the DFA: given the current state and the next symbol from the input, the DFA indicates the transition to the next state. Reaching certain states of the DFA indicates to the IDS that the input may be malicious and should be handled accordingly.
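By way of illustration only, the symbol-by-symbol DFA traversal described above may be sketched in Python as follows. This is a minimal textbook-style construction and not an implementation taken from the cited reference. The function names `build_dfa` and `scan`, the sample patterns, and the choice to derive the alphabet from the patterns themselves are assumptions made for the example.

```python
from collections import deque


def build_dfa(patterns):
    """Build a full Aho-Corasick DFA with one transition per (state, symbol)."""
    # Assumption: the alphabet is just the set of symbols used by the patterns.
    alphabet = sorted(set("".join(patterns)))

    # Phase 1: build a trie of the patterns. State 0 is the root.
    goto = [{}]
    accepting = [set()]
    for pat in patterns:
        state = 0
        for sym in pat:
            if sym not in goto[state]:
                goto.append({})
                accepting.append(set())
                goto[state][sym] = len(goto) - 1
            state = goto[state][sym]
        accepting[state].add(pat)

    # Phase 2: breadth-first pass that fills in every missing transition,
    # using failure links so that the result is a complete DFA.
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for sym in alphabet:
        nxt = goto[0].get(sym)
        if nxt is None:
            delta[0][sym] = 0
        else:
            delta[0][sym] = nxt
            queue.append(nxt)
    while queue:
        state = queue.popleft()
        # Inherit matches that end at the longest proper suffix state.
        accepting[state] |= accepting[fail[state]]
        for sym in alphabet:
            nxt = goto[state].get(sym)
            if nxt is None:
                delta[state][sym] = delta[fail[state]][sym]
            else:
                delta[state][sym] = nxt
                fail[nxt] = delta[fail[state]][sym]
                queue.append(nxt)
    return delta, accepting


def scan(delta, accepting, text):
    """Traverse the DFA over text and return (start_index, pattern) matches."""
    state = 0
    hits = []
    for pos, sym in enumerate(text):
        # A symbol that appears in no pattern sends the automaton to the root.
        state = delta[state].get(sym, 0)
        for pat in accepting[state]:
            hits.append((pos - len(pat) + 1, pat))
    return hits


delta, accepting = build_dfa(["he", "she", "his", "hers"])
matches = sorted(scan(delta, accepting, "ushers"))
```

In this sketch each input symbol costs exactly one table lookup, at the price of storing a transition for every pair of state and symbol, including the pairs that merely fall back toward the root.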
Basic implementations of the Aho-Corasick algorithm require a large amount of memory, since the DFA contains one transition rule for each pair of a current state and a next symbol. There have therefore been a number of attempts to compress such DFAs. For example, U.S. Patent Application Publication 2008/0046423, whose disclosure is incorporated herein by reference, describes a method for multi-character multi-pattern matching using a compressed DFA, with each transition based on multiple characters of the input stream. Each state of the compressed DFA represents multiple consecutive states of the original DFA, and each transition between the states of the compressed DFA is a combination of all of the transitions between the multiple consecutive states of the original DFA. The method can be implemented using a Ternary Content-Addressable Memory (TCAM) to store the transitions of the compressed DFA and to compare them against multiple characters of the input stream at a time in order to detect patterns in the input stream.
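To make the memory cost of an uncompressed DFA concrete, the following short Python sketch estimates the size of a full transition table that holds one entry per pair of state and symbol. The figures used (a 256-symbol byte alphabet, 4-byte next-state entries, and 100,000 states) are hypothetical values chosen for illustration and are not taken from the cited publications.

```python
def full_table_bytes(num_states, alphabet_size=256, bytes_per_entry=4):
    """Size of a table with one next-state entry per (state, symbol) pair."""
    return num_states * alphabet_size * bytes_per_entry


# Hypothetical pattern set that yields 100,000 DFA states.
size_bytes = full_table_bytes(100_000)
size_mib = size_bytes / 2**20
print(f"{size_bytes} bytes (about {size_mib:.1f} MiB)")
```

Under these assumptions the table occupies 102,400,000 bytes, or roughly 97.7 MiB, which indicates why compression of the transition table is attractive.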