A wide variety of methods for data compression are known in the art. Many Web servers, for example, use the GZIP algorithm to compress Hypertext Transfer Protocol (HTTP) symbol streams that they transmit. GZIP compresses data in the DEFLATE format, which is defined in Request for Comments (RFC) 1951 of the Internet Engineering Task Force (IETF), by Deutsch, entitled “DEFLATE Compressed Data Format Specification” (1996), which is incorporated herein by reference. GZIP initially compresses the symbol stream using the LZ77 algorithm, as defined by Ziv and Lempel in “A Universal Algorithm for Sequential Data Compression,” IEEE Transactions on Information Theory (1977), pages 337-343, which is incorporated herein by reference. LZ77 operates generally by replacing recurring strings of symbols with pointers to their previous occurrences. In the next stage of GZIP, the output of the LZ77 compression operation is further compressed by Huffman encoding, as is known in the art. The compressed HTTP stream is decompressed at the destination by Huffman decoding followed by LZ77 decompression.
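The pointer-replacement idea behind LZ77 can be illustrated with a minimal sketch. This is a toy encoder and decoder, not the DEFLATE format of RFC 1951: the function names, the window size, the minimum match length, and the token representation (a literal character or a `(distance, length)` tuple) are all assumptions chosen for illustration, and the Huffman stage is omitted.

```python
def lz77_compress(data: str, window: int = 32, min_len: int = 3):
    """Toy LZ77: emit literal symbols or (distance, length) back-pointers.

    window and min_len are illustrative values, not those of DEFLATE.
    """
    out = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlapping copy).
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            out.append((best_dist, best_len))  # pointer to previous occurrence
            i += best_len
        else:
            out.append(data[i])  # literal symbol
            i += 1
    return out


def lz77_decompress(tokens):
    """Rebuild the symbol stream by copying from already-decoded output."""
    buf = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                buf.append(buf[-dist])  # copy one symbol from dist back
        else:
            buf.append(tok)
    return "".join(buf)


tokens = lz77_compress("abcabcabc")
restored = lz77_decompress(tokens)
```

In this sketch, the repeated text after the first "abc" is represented by a single pointer that refers back three symbols, and decompression reproduces the original string by following that pointer.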
Pattern matching algorithms are widely used in a variety of network communication applications. For example, Intrusion Detection Systems (IDS) use pattern matching in deep packet inspection. The packet content is typically checked against multiple patterns simultaneously for purposes such as detecting known signatures of malicious content.
The most common approach used at present in this type of multi-pattern matching is the Aho-Corasick algorithm, which was first described by Aho and Corasick in “Efficient String Matching: An Aid to Bibliographic Search,” Communications of the ACM 18(6), pages 333-340 (1975), which is incorporated herein by reference. (The term “multi-pattern matching,” as used in the context of the present patent application and in the claims, refers to scanning a sequence of symbols for multiple patterns simultaneously in a single process.) The Aho-Corasick algorithm uses a deterministic finite automaton (DFA) to represent the pattern set. The input stream is inspected symbol by symbol by traversing the DFA: given the current state and the next symbol from the input, the DFA indicates the transition to the next state. Reaching certain “accepting states” of the DFA indicates to the IDS that the input may be malicious and should be handled accordingly.
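The symbol-by-symbol traversal described above can be sketched as follows. This is a minimal illustration using the goto/failure-link form of the Aho-Corasick automaton, which is equivalent to, but more compact than, a fully expanded DFA transition table. The function names `build_automaton` and `scan`, the data layout, and the sample patterns are assumptions made for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie edges, failure links, and per-state output sets."""
    goto = [{}]    # goto[state][symbol] -> next state
    fail = [0]     # failure link of each state
    out = [set()]  # patterns recognized on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)  # this state is an accepting state
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out


def scan(text, automaton):
    """Traverse the automaton over text; return (start index, pattern) pairs."""
    goto, fail, out = automaton
    state, matches = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # fall back to the longest viable suffix
        state = goto[state].get(ch, 0)
        for pat in sorted(out[state]):
            matches.append((pos - len(pat) + 1, pat))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
found = scan("ushers", automaton)
```

Each input symbol is consumed exactly once, and all patterns are checked in the same pass. A state with a non-empty output set plays the role of an accepting state.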
A number of methods have been proposed for finding patterns in compressed data. For example, Navarro and Raffinot describe methods for searching for a pattern in a text without decompressing it in “A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text,” Tenth Annual Symposium on Combinatorial Pattern Matching (1999), which is incorporated herein by reference. Another approach of this sort is described by Farach and Thorup, in “String Matching in Lempel-Ziv Compressed Strings,” 27th Annual ACM Symposium on the Theory of Computing (1995), pages 703-712, which is also incorporated herein by reference.