The present invention pertains to the field of data compression techniques, in particular, lossless data compression techniques for efficient transmission of internet traffic over data communications links such as satellite, terrestrial wireless, or wired links.
Analysis of internet traffic reveals that for certain content types, which constitute a significant portion of the total traffic, a high degree of redundancy exists in the transmitted data. This redundancy manifests itself in the form of macro redundancies and micro redundancies. Macro redundancies are duplications of long byte strings, which occur when the same or similar data entities (typically comprising hundreds of bytes or more) are repeatedly transmitted on a link between two end points. Micro redundancies occur due to the fine grain syntax underlying the byte sequences, which imposes a structure such that some smaller byte patterns (typically a few bytes in length) occur more frequently than others. Both types of redundancy must be fully exploited by lossless data compression techniques in order to transmit the data most efficiently. The benefit is conservation of communication link resources (such as channel bandwidth and power), as well as an improved user experience due to lower latency and faster response time.
Redundancies in the data stream can appear at many levels. At the highest level, an entire web page or document that was previously transmitted may be retransmitted on the data stream (for example, due to a user repeating the request for such an entity); at a lower level, an object within a web page (such as an image belonging to an advertisement in a web page) may be frequently retransmitted, because it is common across multiple popular web pages; and at the lowest level, a byte segment that was previously transmitted may reappear on the data stream. Each of these redundancies can be exploited by preventing the retransmission of the duplicate data, provided appropriate memory and processing techniques are employed at both ends of the connection.
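The object-level case above can be illustrated with a minimal sketch, which is not a description of any particular system. It assumes each end point keeps a cache of previously transferred objects keyed by a SHA-256 digest, and that the two caches stay synchronized because every object crosses the link in order. The names `send` and `receive`, the message tuples, and the choice of digest are hypothetical.

```python
import hashlib


def send(sender_cache, obj):
    """Return the message placed on the link for `obj` (illustrative only)."""
    digest = hashlib.sha256(obj).digest()
    if digest in sender_cache:
        # Duplicate: transmit a short reference instead of the object.
        return ("ref", digest)
    sender_cache[digest] = obj
    return ("raw", obj)


def receive(receiver_cache, message):
    """Reconstruct the original object from a message."""
    kind, payload = message
    if kind == "ref":
        # The object was transmitted earlier; look it up locally.
        return receiver_cache[payload]
    receiver_cache[hashlib.sha256(payload).digest()] = payload
    return payload


# Hypothetical usage: the same page is requested twice.
tx_cache, rx_cache = {}, {}
page = b"<html><body>example page</body></html>" * 50
first = send(tx_cache, page)
second = send(tx_cache, page)
restored = [receive(rx_cache, first), receive(rx_cache, second)]
```

In this sketch the second transfer carries only a 32-byte digest, while the receiver still recovers the full page. Cache eviction, loss of synchronization, and partial (byte-segment) matches are deliberately left out.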
The range (i.e., the separation, in terms of the number of transmitted bytes, from an occurrence of a byte segment to its redundant occurrence) over which redundancies occur in the data stream can span from a few bytes to several tens or hundreds of megabytes. The range depends on several factors, such as the type of content, the speed of the link, the usage pattern of the user, and the number of users attached to the end point. Moreover, the redundancies can be micro redundancies, where the duplications are only a few bytes long, or macro redundancies, where the duplications are much longer.
Some of the common techniques for internet data compression belong to the Lempel-Ziv family of compressors (LZ77, LZ78, or their derivatives such as gzip, compress, or Hughes V.44), or more recently grammar transform based compressors (for example, the Hughes Network Systems Inc., YK Compressor). The problem with these compression techniques is that they become overly complex and impractical (for stream data compression applications) when their dictionary, grammar, or history window size is increased significantly. These techniques can only use data within a relatively short history window (or equivalently, a small dictionary or grammar) that ranges from a few tens of kilobytes to a few megabytes in size. Consequently, they are only capable of exploiting redundancies that fall within that relatively small span of consecutive bytes. Since internet web traffic exhibits redundancies across tens of megabytes or more, these techniques cannot be directly used to translate such long range redundancies into compression gain.
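The effect of a bounded history window can be observed with a small experiment, sketched here under stated assumptions. Python's `zlib` module (DEFLATE, whose history window is at most 32 KB) stands in for the LZ77 family, and a block of pseudo-random bytes stands in for content that has no internal redundancy. The helper name `repeated_block_ratio` and the block sizes are illustrative choices.

```python
import random
import zlib


def repeated_block_ratio(block_size, seed=0):
    """Compress block+block and return compressed size / block size.

    A ratio near 1 means the second copy was found in the history window;
    a ratio near 2 means the repetition was not exploited.
    """
    rng = random.Random(seed)
    # Pseudo-random bytes: incompressible on their own.
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    data = block + block
    return len(zlib.compress(data, 9)) / block_size


# Repeat distance of 16 KB: inside DEFLATE's 32 KB window.
near = repeated_block_ratio(16 * 1024)
# Repeat distance of 100 KB: outside the window.
far = repeated_block_ratio(100 * 1024)
```

Under these assumptions, `near` is expected to come out close to 1 and `far` close to 2: the identical second copy costs almost nothing when it lies within the window and is paid for in full when it lies beyond it.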
Another important limitation of these techniques is that they cannot compress entities that have already been compressed at the source. For example, an embedded image in a web page is typically already compressed (as a GIF, PNG, or JPEG object). Processing such an object with these techniques yields no further gain and may actually increase the size of the object, which is undesirable.
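This limitation can be checked with a short sketch, assuming `zlib` as a stand-in for the LZ family and a synthetic word stream as a stand-in for web text. The output of a first compression pass plays the role of an object that was already compressed at the source. The helper `make_text` and its parameters are hypothetical.

```python
import random
import zlib


def make_text(n_words=20000, vocab_size=64, seed=0):
    """Build a synthetic, redundant text from a small random vocabulary."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = [
        "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
        for _ in range(vocab_size)
    ]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode("ascii")


text = make_text()
once = zlib.compress(text, 9)    # first pass: removes most redundancy
twice = zlib.compress(once, 9)   # second pass: input is already compressed

sizes = {"original": len(text), "once": len(once), "twice": len(twice)}
```

Comparing the entries of `sizes` shows the first pass shrinking the text substantially, while the second pass leaves the size essentially unchanged and typically adds a few bytes of framing overhead.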
A further disadvantage of the LZ family of compressors is that they are inherently ill-suited to using arithmetic coding for entropy coding of the LZ compressor tokens in a manner that fully exploits the optimality of arithmetic coding. It is well known that arithmetic coding is among the most efficient forms of entropy coding. Consequently, the performance of LZ-type compressors is in general suboptimal. Grammar-based compressors, however, do not possess this shortcoming. In fact, the combination of a grammar transform and arithmetic coding (i.e., a grammar-based compressor) has been shown to outperform the LZ77 and LZ78 compressors. Grammar-based compressors and grammar-based decompressors are described in U.S. Pat. No. 6,400,289 B1, Jun. 4, 2002, and U.S. Pat. No. 6,492,917 B1, Dec. 10, 2002, the entire contents of which are incorporated herein by reference.
What is needed is a technique for lossless data compression that improves the efficiency of the transmission of internet traffic over communication links such as satellite or terrestrial links, and that is capable of compressing entities that have already been compressed at the source, given sufficient compressor memory (cache size).