The present invention pertains to the field of data compression techniques, in particular, lossless data compression techniques for efficient transmission of internet traffic over data communications links such as satellite, terrestrial wireless or wired links.
Analysis of internet traffic reveals that for certain content types, which constitute a significant portion of the total traffic, a high degree of redundancy exists in the transmitted data. This redundancy manifests itself in the form of macro redundancies and micro redundancies. Macro redundancies are duplications of long byte strings, which occur when the same or similar data entities (typically comprising hundreds of bytes or more) are repeatedly transmitted on a link between two end points. Micro redundancies occur due to the fine-grained syntax underlying the byte sequences, which imposes a structure so that some smaller byte patterns (typically a few bytes in length) occur more frequently than others. Both of these types of redundancies must be fully exploited by lossless data compression techniques to transmit the data most efficiently. The benefit is conservation of communication link resources (such as channel bandwidth and power) as well as improvement in user experience due to lower latency and faster response time.
Redundancies in the data stream can appear at many levels. At the highest level, an entire web page or document that was previously transmitted may be retransmitted on the data stream (for example, due to a user repeating the request for such an entity); at a lower level, an object within a web page (such as an image belonging to an advertisement in a web page) may be frequently retransmitted, because it is common across multiple popular web pages; or at the lowest level, a byte segment that was previously transmitted may reappear on the data stream. Each of these redundancies can be exploited by preventing the retransmission of the duplicate data, provided appropriate memory and processing techniques are employed at both ends of the connection. Further, the range over which redundancies occur in the data stream (i.e., the separation, in terms of the number of transmitted bytes, between an occurrence of a byte segment and its redundant occurrence) can span from a few bytes to several tens or hundreds of megabytes. This range depends on several factors, such as the type of content, the speed of the link, the usage pattern of the user, and the number of users attached to the end point. Moreover, the redundancies can be micro redundancies, where the duplications are only a few bytes long, or macro redundancies, where the duplications are much longer.
Lossless data compression is a powerful technique that compresses data streams for transmission over a communications link by reducing data redundancies within the data streams, facilitating improved efficiency and utilization of link capacity. Lossless data compression algorithms exploit statistical redundancy to represent data more concisely, without losing information. A compressor is used to compress packets at one end of the link; at the other end of the link, a decompressor losslessly recovers the original packets. There exists a class of data compression techniques referred to as long-range data compression. Long-range data compression refers to compression techniques that compress data based on a relatively large data dictionary reflecting one or more data streams over a corresponding historical length of time (e.g., the length of time being proportional to the size of the dictionary: the larger the dictionary, the larger the storage capacity to cover longer periods of historical data). Some of the common current techniques for long-range data compression belong to the Lempel-Ziv family of compressors (LZ77 and LZ78, and derivatives thereof, such as gzip, compress, or V.44). Another class of data compression techniques exists, referred to as short-range data compression techniques. Rather than relying on a large dictionary (a long historical view of the data stream), short-range data compression techniques operate on small data sets. Examples include grammar-based algorithms, such as Yang-Kieffer (YK) universal data compression (see, e.g., U.S. Pat. Nos. 6,400,289 and 6,492,917). Grammar-based algorithms construct a context-free grammar that derives a single string (the data being compressed), and may also apply arithmetic coding, an entropy coding method driven by statistical predictions.
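The relationship between dictionary size and the range over which redundancy can be exploited can be illustrated with a minimal Python sketch. It uses the standard-library zlib module (DEFLATE, an LZ77 derivative of the kind used by gzip) purely as a stand-in, not as the technique of any particular system; its history window is at most 32 KiB, so a repeat that lies farther back than the window cannot be matched. The segment and filler sizes and the helper name compressed_size are assumptions chosen for illustration.

```python
import os
import zlib

WINDOW = 32 * 1024  # DEFLATE's maximum history window, in bytes


def compressed_size(data: bytes) -> int:
    """Size of the data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


# A random segment is incompressible on its own, so any saving on its
# second occurrence must come from a back-reference to the first one.
segment = os.urandom(4096)
filler = os.urandom(2 * WINDOW)  # unrelated traffic, longer than the window

near = segment + segment           # repeat falls inside the history window
far = segment + filler + segment   # repeat falls outside the history window

# Lossless: decompression recovers the original bytes exactly.
assert zlib.decompress(zlib.compress(near, 9)) == near

# Extra bytes needed to send the second occurrence in each case.
near_cost = compressed_size(near) - compressed_size(segment)
far_cost = compressed_size(far) - compressed_size(segment + filler)

print("cost of nearby repeat :", near_cost, "bytes")
print("cost of distant repeat:", far_cost, "bytes")
```

Under these assumptions, the nearby repeat should add only a small number of bytes, while the distant repeat costs about as much as sending the segment again. That gap is what a larger dictionary is meant to close.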
Such current compression approaches, however, exhibit distinct disadvantages, especially in applications involving the compression of communications traffic (e.g., Internet traffic) that is classified into multiple streams at different priority levels for transport over communications links or channels.
Existing lossless data compression techniques have a stringent requirement that the packets cannot be reordered or lost during transport from the compressor to the decompressor. When traffic is transported as prioritized streams, however, this requirement can only be met on a per-stream basis, but not for the aggregate traffic as a whole. This is because a packet transported on a higher priority stream can overtake a packet transported on a lower priority stream. For example, where a higher priority packet is compressed later in time than a lower priority packet, but is given transmission priority over the lower priority packet, the higher priority packet (while actually later in time at the compressor) will arrive at the decompressor earlier than the lower priority packet. Hence, the packets will arrive at the decompressor out of order, which would result in a failure of the decompression. Consequently, traditional compression techniques can be applied only on a per-stream basis and not on the aggregate traffic, which results in a significant sacrifice in performance. One such performance sacrifice manifests itself as a requirement that the total available memory pool be a priori sub-divided into smaller pools, each respectively associated with one data stream. Because the size of the memory pool is a significant factor in determining compression performance, this sub-division adversely impacts the compression (e.g., in the efficiency of link utilization). Another performance sacrifice comprises an inability to exploit inter-stream redundancies; that is, redundancies between different streams cannot be exploited where the compression is applied on a per-stream basis.
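Both effects described above can be reproduced in a minimal Python sketch, again using the standard-library zlib module only as a stand-in for a history-based compressor. The packet contents, the function names, and the choice of raw DEFLATE with Z_SYNC_FLUSH as packet framing are all assumptions made for illustration; they are not taken from any specific transport system.

```python
import zlib

RAW = -15  # raw DEFLATE (no header), so failures stem from missing history


def compress_in_order(packets):
    """Compress packets one by one against a single shared history."""
    comp = zlib.compressobj(wbits=RAW)
    # Z_SYNC_FLUSH emits all output for this packet but keeps the history,
    # so later packets can refer back to earlier ones.
    return [comp.compress(p) + comp.flush(zlib.Z_SYNC_FLUSH) for p in packets]


def decompress_in_arrival_order(chunks):
    """Return the decoded packets, or None if the decoder rejects the input."""
    decomp = zlib.decompressobj(wbits=RAW)
    try:
        return [decomp.decompress(c) for c in chunks]
    except zlib.error:
        return None


# Hypothetical traffic: the same request is sent first on a low priority
# stream and later on a high priority stream.
low = b"GET /images/banner-ad.png HTTP/1.1\r\nHost: example.com\r\n\r\n" * 4
high = low

# Aggregate compression: one history shared by both streams.
chunks = compress_in_order([low, high])

in_order = decompress_in_arrival_order(chunks)
# The high priority packet overtakes the low priority one in transit.
overtaken = decompress_in_arrival_order([chunks[1], chunks[0]])

# Per-stream compression: the high priority stream has its own history,
# so reordering between streams is harmless, but the earlier copy on the
# other stream cannot be used as a reference.
per_stream_high = compress_in_order([high])[0]

print("decoded correctly in order       :", in_order == [low, high])
print("decoded correctly when overtaken :", overtaken == [high, low])
print("high packet, shared history      :", len(chunks[1]), "bytes")
print("high packet, per-stream history  :", len(per_stream_high), "bytes")
```

In this sketch the second compressed packet consists mostly of a back-reference into the first, which is why it is small with a shared history and also why it cannot be decoded when it arrives first.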
What is needed, therefore, is an approach for lossless compression of data traffic (e.g., Internet traffic), in applications involving the compression of traffic that is classified into multiple data streams at different priority levels, where the approach facilitates data compression of the traffic on an aggregate level (as opposed to a per-stream basis), to improve the efficiency for transmission over communications links or channels (e.g., satellite, terrestrial wireless and wired links).