The present invention pertains to the field of data compression techniques and, in particular, to lossless data compression techniques for efficient transmission of Internet traffic over data communications links, such as satellite, terrestrial wireless, or wired links.
Analysis of Internet traffic reveals that, for certain content types that constitute a significant portion of the total traffic, a high degree of redundancy exists in the transmitted data. This redundancy manifests itself in the form of macro redundancies and micro redundancies. Macro redundancies are duplications of long byte strings, which occur when the same or similar data entities (typically comprising hundreds of bytes or more) are repeatedly transmitted on a link between two end points. Micro redundancies arise from the fine-grained syntax underlying the byte sequences, which imposes a structure such that some smaller byte patterns (typically a few bytes in length) occur more frequently than others. To transmit the data most efficiently, lossless data compression techniques must fully exploit both types of redundancy. The benefits are conservation of communication link resources (such as channel bandwidth and power) and an improved user experience due to lower latency and faster response times.
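By way of illustration only, the micro redundancies described above can be observed by counting how often each short byte pattern occurs in a sample of data. The following is a minimal sketch; the function name `pattern_counts`, the pattern length of two bytes, and the markup fragment are all assumptions chosen for the illustration and are not taken from any particular system.

```python
from collections import Counter


def pattern_counts(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte pattern in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Hypothetical markup fragment: its syntax forces some patterns to recur.
sample = b"<td>1</td><td>2</td>"
counts = pattern_counts(sample)

# The pattern b"td" is imposed by the markup and occurs 4 times,
# whereas b"1<" depends on the payload and occurs only once.
frequent = counts[b"td"]
rare = counts[b"1<"]
```

A skewed distribution of this kind is what a statistical coder can exploit; a uniform distribution would offer no such gain.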
Redundancies in the data stream can appear at many levels. At the highest level, an entire web page or document that was previously transmitted may be retransmitted on the data stream (for example, due to a user repeating the request for such an entity). At a lower level, an object within a web page (such as an image belonging to an advertisement) may be frequently retransmitted because it is common to multiple popular web pages. At the lowest level, a byte segment that was previously transmitted may reappear on the data stream. Each of these redundancies can be exploited by preventing the retransmission of the duplicate data, provided that appropriate memory and processing techniques are employed at both ends of the connection. Further, the range over which redundancies occur in the data stream (e.g., the separation, in terms of the number of transmitted bytes, between an occurrence of a byte segment and its redundant occurrence) can span from a few bytes to several tens or hundreds of megabytes. The range depends on several factors, such as the type of content, the speed of the link, the usage pattern of the user, and the number of users attached to the end point. Moreover, the redundancies can be micro redundancies, where the duplications are only a few bytes long, or much longer macro redundancies.
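The notion of range described above can be made concrete with a short sketch that records, for each repeated byte segment, the distance in bytes back to its most recent earlier occurrence. The function name `redundancy_ranges` and the fixed segment length are assumptions made for this illustration; practical systems may identify segments quite differently.

```python
def redundancy_ranges(data: bytes, seg_len: int = 4) -> list[int]:
    """For every seg_len-byte segment that has appeared before, return the
    distance in bytes back to its most recent earlier occurrence."""
    last_seen: dict[bytes, int] = {}
    ranges: list[int] = []
    for pos in range(len(data) - seg_len + 1):
        seg = data[pos:pos + seg_len]
        if seg in last_seen:
            ranges.append(pos - last_seen[seg])
        last_seen[seg] = pos  # remember the latest position of this segment
    return ranges


# b"abcd" appears at offset 0 and again at offset 6, so the range is 6.
example = redundancy_ranges(b"abcdXXabcd")  # [6]
```

The largest value returned indicates how much history a compressor would need to retain in order to exploit every such repetition in the sample.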
Lossless data compression is a powerful technique that compresses data streams for transmission over communications links by reducing data redundancies within the data streams, facilitating improved efficiency and utilization of link capacity. Lossless data compression algorithms exploit statistical redundancy to represent data more concisely, without losing information. A compressor compresses packets at one end of the link, and a de-compressor at the other end of the link losslessly recovers the original packets. There exists a class of data compression techniques referred to as long-range data compression. Long-range data compression refers to compression techniques that compress data based on a relatively large data dictionary reflecting one or more data streams over a corresponding historical length of time (e.g., the length of time being proportional to the size of the dictionary: the larger the dictionary, the greater the storage capacity to cover longer periods of historical data). Some of the common current techniques for long-range data compression belong to the Lempel-Ziv family of compressors (LZ77 and LZ78, and derivatives thereof, such as gzip, compress, or V.44). Another class of data compression techniques is referred to as short-range data compression. Rather than relying on a large dictionary (a long historical view of the data stream), short-range data compression techniques operate on small data sets. Examples include grammar-based algorithms, such as Yang-Kieffer (YK) universal data compression (see, e.g., U.S. Pat. Nos. 6,400,289 and 6,492,917). Grammar-based algorithms, for example, construct a context-free grammar that derives a single string, and may also apply arithmetic coding driven by statistical predictions.
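The compressor and de-compressor relationship described above can be sketched using Python's standard `zlib` module, which implements DEFLATE (an LZ77 derivative). This is an illustration only: the history held identically at both ends is supplied through the `zdict` preset-dictionary parameter, the helper names and sample packets are hypothetical, and DEFLATE's 32 KB window is far smaller than the dictionaries contemplated for long-range compression.

```python
import zlib


def compress_with_history(packet: bytes, history: bytes) -> bytes:
    """Compress one packet against history known to both ends of the link."""
    comp = zlib.compressobj(zdict=history)
    return comp.compress(packet) + comp.flush()


def decompress_with_history(blob: bytes, history: bytes) -> bytes:
    """Recover the original packet; requires the same history as the compressor."""
    decomp = zlib.decompressobj(zdict=history)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical traffic: the new packet largely repeats an earlier one.
history = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
packet = b"GET /about.html HTTP/1.1\r\nHost: example.com\r\n\r\n"

blob = compress_with_history(packet, history)
recovered = decompress_with_history(blob, history)
assert recovered == packet  # lossless: the packet is recovered exactly
```

The sketch also shows why the two ends must stay synchronized: if the de-compressor is given a different history, `zlib` raises an error rather than returning the packet.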
Current long-range data compression techniques, however, suffer from significant disadvantages. For example, such techniques require a dictionary or cache at both the compression and decompression ends, where (as explained in further detail below) the cache at the decompression end is required to be at least as large as the cache at the compression end. Further, in a system where a communications hub supports a multitude of end-user communications terminals (e.g., a satellite hub supporting a multitude of end-user satellite terminals, potentially amounting to tens of thousands of terminals per hub), the hub is required to maintain a compression cache for each end-user terminal. Such existing long-range data compression techniques thus suffer from scalability issues. For example, compression performance improves as the size of the respective compression and decompression caches increases. Accordingly, in order to increase the size of the decompression cache within each end-user terminal, the respective compression cache corresponding to that terminal must similarly be increased in the hub. It follows that, for example, in a case where a hub supports 10,000 terminals and the compression is applied at the hub side (on the outroute transmissions), a 1 GB increase in the cache size of each terminal requires a 1 GB increase in each respective compressor cache in the hub, amounting to a total memory increase of 10,000 GB within the hub (1 GB for the compression cache of each terminal).
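The scaling arithmetic in the foregoing example can be restated as a short calculation. The function name `hub_memory_increase_gb` is an assumption made for this illustration, and the sketch assumes one compressor cache per terminal with the same increase applied to every cache, as in the example.

```python
def hub_memory_increase_gb(num_terminals: int, increase_per_terminal_gb: float) -> float:
    """Total additional compressor-cache memory required at the hub when the
    cache for every supported terminal grows by the same amount."""
    return num_terminals * increase_per_terminal_gb


# 10,000 terminals, each cache enlarged by 1 GB: 10,000 GB more at the hub.
total_gb = hub_memory_increase_gb(10_000, 1)
```

The hub-side cost grows linearly with the number of terminals, whereas each terminal bears only its own increase.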
What is needed, therefore, is a resource-efficient, scalable approach for high-compression-gain, lossless long-range compression of data traffic (e.g., Internet traffic) in systems where a communications hub supports a multitude of communications terminals.