Although the amount of available digital information has mushroomed in recent years, limited data storage capacities and data communications bandwidths sometimes threaten the practicality of distributing this information. To deal with storage and bandwidth limitations, the use of data compression has become almost universal.
Various data compression techniques are available, two of the more popular being known as "zip" and "gzip". Both of these compression techniques utilize some form of pattern matching compression, in which a string beginning at a current data element is represented by referencing a previous, identical string. A well-known example of pattern matching compression (variations of which are used within both zip and gzip) is referred to as "LZ77".
Pattern matching compression involves sequentially examining data elements (also referred to herein as characters) and strings of data elements from a data input stream, and noting any strings that are repetitions of previously encountered identical strings. When the algorithm encounters an occurring string that matches a previously encountered string, the algorithm records two values in place of the occurring string: a length value and a displacement or distance value. The length value indicates the length of the matching strings. The displacement value indicates the number of elements back in the input stream to the previously occurring and matching string.
When the algorithm encounters a data element that cannot be matched to a previously encountered string, the algorithm records the value of the element itself. Such an element is referred to as a "literal" or "literal element."
Typically, the compressed data stream comprises literals with interspersed length/displacement pairs. A length element is always followed by a displacement element in the compressed data stream.
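The literal and length/displacement encoding described above can be illustrated with a minimal sketch. It is not the zip or gzip implementation: the token representation (a one-character string for a literal, a tuple for a length/displacement pair), the `MIN_MATCH` threshold, and the `window` size are assumptions made for illustration, and the search is deliberately exhaustive.

```python
MIN_MATCH = 3  # assumed: shorter matches are emitted as literals


def compress(data: str, window: int = 4096):
    """Return tokens: a 1-char str is a literal, a tuple is (length, displacement)."""
    out = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Exhaustive search of earlier positions (slow; for illustration only).
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= MIN_MATCH:
            out.append((best_len, best_dist))  # length first, then displacement
            i += best_len
        else:
            out.append(data[i])  # unmatched element is recorded as a literal
            i += 1
    return out


def decompress(tokens):
    """Rebuild the input by copying literals and re-reading earlier output."""
    buf = []
    for tok in tokens:
        if isinstance(tok, tuple):
            length, dist = tok
            for _ in range(length):
                buf.append(buf[-dist])  # element-by-element copy allows overlap
        else:
            buf.append(tok)
    return "".join(buf)


tokens = compress("abcabcabc")  # ['a', 'b', 'c', (6, 3)]
```

In this sketch a match may overlap the current position, which is why the decoder copies one element at a time.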
In implementing a compression engine for pattern matching compression, it is usually desired to avoid repeated exhaustive searches of prior data elements. Instead, there is usually some way to record the locations of different strings as they are encountered, to ease the job of finding such strings when processing subsequent characters and strings. In many implementations, one or more lookup tables or hash tables are created and updated as the compression proceeds. A hash table contains a plurality of entries, each pointing to a linked list of previous input stream locations. As the algorithm advances through an input stream, it references the hash table and the linked lists to find previous matching strings, and also updates the table and lists to account for newly encountered data.
As an example, suppose a hash table such as this is indexed by three characters, and that the compression algorithm is attempting to match the string "bdeefis . . .". Referencing the hash table yields a linked list that leads to all previous strings that begin with the three characters "bde". The algorithm performs a string compare at locations of all such previous strings, to determine which yields the longest matching string.
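The three-character lookup in this example can be sketched as follows. A Python dictionary of position lists stands in for the hash table and its linked lists; the function names `index_position` and `longest_match` and the sample input are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict

KEY_LEN = 3  # the table is indexed by three characters


def index_position(data, i, table):
    """Record position i under the three characters that begin there."""
    if i + KEY_LEN <= len(data):
        table[data[i:i + KEY_LEN]].append(i)


def match_length(data, j, i):
    """Count how many elements starting at j equal those starting at i."""
    k = 0
    while i + k < len(data) and data[j + k] == data[i + k]:
        k += 1
    return k


def longest_match(data, i, table):
    """Return (length, displacement) of the longest earlier match, or (0, 0)."""
    best = (0, 0)
    # Only positions that share the first three characters are compared.
    for j in table.get(data[i:i + KEY_LEN], ()):
        k = match_length(data, j, i)
        if k > best[0]:
            best = (k, i - j)
    return best


data = "bdeefg--bdexyz--bdeefis"
table = defaultdict(list)
for p in range(16):          # index everything before the current string
    index_position(data, p, table)

result = longest_match(data, 16, table)  # (5, 16): "bdeef" found 16 elements back
```

Here the key "bde" leads to two earlier locations, and the string compare selects the one that yields the longer match.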
As an improvement to this scheme, multiple hash tables are sometimes maintained, corresponding to different match lengths. For example, one hash table might be indexed with three characters, while another is indexed with four. In this case, the hash table and linked lists with the largest number of index characters are referenced first. Tables and lists with smaller numbers of index characters are referenced only if needed.
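A minimal sketch of the multiple-table variant, assuming one table keyed on four characters and one keyed on three, might look like this. The names `build_tables` and `candidates` are hypothetical, and dictionaries of position lists again stand in for hash tables and linked lists.

```python
from collections import defaultdict

KEY_LENS = (4, 3)  # consulted in this order: largest number of index characters first


def build_tables(data, end):
    """Index every position before `end` in one table per key length."""
    tables = {n: defaultdict(list) for n in KEY_LENS}
    for p in range(end):
        for n in KEY_LENS:
            if p + n <= len(data):
                tables[n][data[p:p + n]].append(p)
    return tables


def candidates(data, i, tables):
    """Return (key length used, earlier positions) for the string at i."""
    for n in KEY_LENS:
        found = tables[n].get(data[i:i + n])
        if found:
            return n, found  # shorter-key tables are not consulted
    return 0, []


first = "bdee..bdex..bdeefis"
hit4 = candidates(first, 12, build_tables(first, 12))    # (4, [0])

second = "bdex..bdey"
hit3 = candidates(second, 6, build_tables(second, 6))    # (3, [0])
```

In the first input the four-character table already narrows the search to a single location; in the second, no four-character key matches, so the three-character table is referenced.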
The preceding discussion is somewhat simplified, but is sufficient for understanding the characteristics of pattern matching compression that are pertinent to the invention. Further details regarding compression techniques can be found in M. Nelson & J. Gailly, The Data Compression Book, (2d ed. 1996), which is hereby incorporated by reference. In addition, specifications for the DEFLATE compressed data format used by both zip and gzip, and for the gzip file format, can be found in Internet RFCs 1951 and 1952, respectively, which are also incorporated by reference.
The scheme described above works well for individual files. Groups of files can also be compressed by concatenating them so that matches can be found across file boundaries. This is referred to as "cross-file" pattern matching compression. Especially for short files, cross-file compression typically yields a smaller output than independently compressing the individual files, because the longer input stream makes it more likely that matches will be found among the earlier data elements.
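The benefit of cross-file compression for short, similar files can be observed with a short sketch that uses Python's standard `zlib` module. The four sample "files" are invented for illustration, and actual savings depend on the data.

```python
import zlib

# Four short, similar files (hypothetical sample data).
files = [
    b"<html><head><title>Page %d</title></head>"
    b"<body><p>Hello from page %d.</p></body></html>" % (n, n)
    for n in range(1, 5)
]

# Independent compression: each file is its own input stream.
separate = sum(len(zlib.compress(f)) for f in files)

# Cross-file compression: the files are concatenated into one input stream,
# so later files can reference strings that occurred in earlier files.
combined_stream = zlib.compress(b"".join(files))
combined = len(combined_stream)

print("separate:", separate, "bytes; combined:", combined, "bytes")
```

For inputs like these, the concatenated stream compresses to fewer bytes than the sum of the individually compressed files, since each later file largely repeats strings found in the earlier ones.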
In a server environment, or any other environment where files are to be distributed through limited-bandwidth distribution channels, it is generally desired to store files in their compressed formats. This avoids the need for the server to recompress the files every time they are transmitted. If groups of files are to be transmitted, they can be concatenated, compressed using cross-file compression, and stored in their concatenated and compressed state.
In many situations, however, it is not possible to predict which combinations of files will be requested from a server. In these situations, the files must be compressed and stored individually, thus forgoing the advantages of cross-file compression. Alternatively, the files can be stored uncompressed, and then concatenated and compressed (using cross-file compression) in response to client requests. However, this places a heavy computational load on the server, since each request requires the requested files to be compressed anew.