Many networking devices inspect the content of packets to detect security hazards (for example, by matching known signatures) and to make load-balancing decisions. These devices reside between the server and the client and perform Deep Packet Inspection (DPI). Using DPI, a device can examine the payload (and possibly also the header) of a packet, searching for protocol non-compliance, viruses, spam, intrusions, or other predefined criteria, in order to decide whether the packet may pass, must be dropped, or should be routed to a different destination.
One of the challenges in performing DPI is traffic compression. In order to save bandwidth and to speed up web browsing, most major sites use compression. Recent studies show that approximately 38% of websites compress their traffic; among the top 1,000 sites, the figure rises to 78%. As the majority of this information is dynamic, it does not lend itself to conventional caching technologies. Therefore, compression is a top issue in the Web community.
The first generation of compression is intra-file compression, i.e. the compressed file contains back-references to earlier positions within the same file. Two very common first-generation compression methods are Gzip and Deflate, both of which were developed in the 1990s and are widely used for HTTP compression. Both methods combine the LZ77 algorithm with Huffman coding. LZ77 compression reduces the size of the string representation by spotting repeated strings within a sliding window of the uncompressed data. The algorithm replaces each repeated string with a (distance, length) pair, where distance indicates how many bytes back the earlier occurrence of the string begins and length indicates the string's length. Huffman coding receives the LZ77 symbols as input and reduces the symbol coding size by encoding frequent symbols with fewer bits. Gzip and Deflate work well for compressing each individual response, but in many cases a large amount of data is common to a group of pages, a property referred to as inter-response redundancy. Therefore, compression methods of the next generation are inter-file: a single dictionary can be referenced by several files. An example of a compression method that uses a shared dictionary is SDCH.
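The (distance, length) mechanism described above can be illustrated with a minimal decoding sketch. The token layout used here (a list mixing literal byte strings with `(distance, length)` tuples) and the function name `lz77_decode` are assumptions made for illustration only; they are not the actual Deflate bitstream format, and the Huffman stage is omitted.

```python
def lz77_decode(tokens):
    """Expand a hypothetical LZ77 token stream into the original bytes.

    Each token is either a literal `bytes` object or a `(distance, length)`
    tuple that copies `length` bytes starting `distance` bytes back in the
    output produced so far.
    """
    out = bytearray()
    for token in tokens:
        if isinstance(token, bytes):
            out += token  # literal bytes are emitted unchanged
        else:
            distance, length = token
            if distance <= 0 or distance > len(out):
                raise ValueError("back-reference points outside the output")
            # Copy byte by byte so that a reference may overlap the bytes
            # it is currently producing (length > distance).
            for _ in range(length):
                out.append(out[-distance])
    return bytes(out)


if __name__ == "__main__":
    # "abc" followed by a pointer that repeats it twice.
    print(lz77_decode([b"abc", (3, 6)]))
```

The byte-by-byte copy is a deliberate choice: it lets a short literal be expanded into a long run by a single back-reference whose length exceeds its distance.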
Shared Dictionary Compression over Hypertext Transfer Protocol (HTTP), abbreviated SDCH, was proposed by Google Inc.; accordingly, Google Chrome (Google's browser) supports it by default. Android is a software stack for mobile devices that includes an operating system, middleware and key applications. SDCH code also appears in the Android platform and is likely to be used there in the near future. Therefore, a solution for pattern matching on shared-dictionary compressed data is essential for this platform as well. SDCH is complementary to Gzip and Deflate, i.e. it can be applied before Gzip. On webpages containing Google search results, the data size reduction obtained by applying SDCH compression before Gzip is about 40% better than that of Gzip alone.
The idea of the shared dictionary approach is to transmit the data that is common to all responses only once, and thereafter to send only the parts of each response that differ. In SDCH terminology, the common data is called the dictionary and the differences are stored in a delta file. Specifically, a dictionary is composed of the data used by the compression algorithm, as well as metadata describing its scope and lifetime. The scope is specified by the domain and path attributes; thus, a user can download several dictionaries, even from the same server.
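The dictionary-plus-delta idea can be sketched as follows. This is a toy illustration only: the `COPY`/`ADD` instruction tuples and the function names `encode_delta` and `decode_delta` are hypothetical and do not reproduce the actual SDCH delta encoding. The sketch uses Python's standard `difflib.SequenceMatcher` merely as a convenient way to find spans shared between the dictionary and a response.

```python
from difflib import SequenceMatcher


def encode_delta(dictionary: bytes, target: bytes):
    """Describe `target` as COPY instructions into `dictionary` plus ADD literals."""
    ops = []
    matcher = SequenceMatcher(None, dictionary, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared span: reference it by (offset, length) in the dictionary.
            ops.append(("COPY", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes that appear only in the response are sent literally.
            ops.append(("ADD", target[j1:j2]))
        # "delete" spans exist only in the dictionary; nothing is emitted.
    return ops


def decode_delta(dictionary: bytes, ops) -> bytes:
    """Rebuild the response from the dictionary and the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "COPY":
            _, offset, length = op
            out += dictionary[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    dictionary = b"<html><head><title>Results</title></head><body>"
    response = b"<html><head><title>Results</title></head><body><p>cats</p>"
    delta = encode_delta(dictionary, response)
    assert decode_delta(dictionary, delta) == response
    print(delta)
```

In this sketch the dictionary is transferred once, and each subsequent response is represented only by its instruction list, which stays small when most of the response is covered by `COPY` instructions.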
Multi-pattern matching on compressed traffic requires two time-consuming phases: traffic decompression and pattern matching. Currently, most security tools either do not scan compressed traffic at all, or ensure that no compressed traffic is exchanged by rewriting the HTTP header between the original client and server. The first approach harms security and may cause malicious activity to go undetected, while the second harms the performance and bandwidth of both client and server. The few security tools that do handle compressed HTTP traffic first reconstruct the full page by decompressing it and only then perform a signature scan. Since security tools are expected to operate at network speed, this option is usually not feasible.
String matching is an essential building block for numerous applications and has therefore been studied extensively. Some of the fundamental algorithms are Boyer-Moore, which solves the problem for a single pattern, and Aho-Corasick and Wu-Manber, which solve it for multiple patterns. The basic idea of the Boyer-Moore algorithm is that more information is gained by matching patterns from the right than from the left. This observation makes it possible to heuristically reduce the number of required comparisons. The Wu-Manber algorithm uses the same observation but provides a solution for the multi-pattern matching problem. The Aho-Corasick algorithm builds a Deterministic Finite Automaton (DFA), i.e. a Finite State Machine (FSM), from the patterns. Thereafter, the input text is scanned by the DFA in a single pass.
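As an illustration of the automaton-based approach, the following is a minimal sketch of Aho-Corasick matching. It uses the common goto/failure-link formulation rather than a fully expanded DFA table, and the names `build_automaton` and `search` are chosen here for illustration; it is not presented as the implementation used by any particular tool.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for the given patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on entering each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first traversal: failure links of shallower states are ready
    # before deeper states need them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


if __name__ == "__main__":
    print(search("ushers", ["he", "she", "his", "hers"]))
```

Each input character is consumed exactly once; the failure links only move the automaton to a shorter suffix of what has already been read, so the text itself is never rescanned.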
There are also several works that target the problem of pattern matching on Lempel-Ziv compressed data. Specifically, a solution for Gzip-compressed HTTP traffic, called ACCH, is known in the art. This solution utilizes the fact that the Gzip compression algorithm works by eliminating repetitions of strings using back-references (pointers) to the repeated strings. ACCH stores information produced by the pattern matching algorithm for the already-scanned uncompressed traffic; when it encounters a pointer, it uses this information to determine whether a match is possible in the referenced area or whether scanning of that area can be skipped. This solution shows that pattern matching on Gzip-compressed HTTP traffic, including the overhead of decompression, is faster than performing pattern matching on regular traffic. A similar conclusion regarding files (as opposed to traffic) is known in the art.
However, all of these algorithms are geared toward first-generation compression methods, and there are no pattern matching algorithms for inter-file compression schemes such as the rapidly spreading SDCH. It would therefore be advantageous to provide a method and a system that perform pattern matching on SDCH-based traffic without first decompressing the compressed data.