Although the amount of available digital information has mushroomed in recent years, limited data storage capacities and data communications bandwidths sometimes threaten the practicality of distributing this information. To deal with storage and bandwidth limitations, the use of some type of data compression has become almost universal.
Various data compression techniques are available, two of the more popular being known as "zip" and "gzip". In accordance with these prior art techniques, a compressed data stream consists of a series of output blocks corresponding to successive blocks of input data. Each input data block is compressed in two passes: a first pass using sliding window compression techniques, and a second pass using minimum redundancy coding techniques.
Sliding window compression involves sequentially examining data elements (also referred to herein as characters) and strings of data elements from a data input stream, and noting any strings that are repetitions of previously encountered identical strings. The term "sliding window" is used because the algorithm searches for previously encountered strings only within a "window" on the data input string, wherein the window includes only a defined number of data elements prior to the currently examined data element. The window moves as the algorithm progresses through the data input string.
When the algorithm encounters a string that matches a previously encountered string that is within the sliding window, the algorithm records two values: a length value and a displacement or distance value. The length value indicates the length of the matching string. The displacement value indicates the number of elements back in the input stream to the previously occurring string that the current string matches.
When the algorithm encounters a data element that is not part of a matching string, the algorithm records the value of the element itself. Such an element is referred to as a "literal" or "literal element."
Typically, the compressed data stream comprises literals with interspersed length/displacement pairs. A length element is always followed by a displacement element in the compressed data string.
A well-known example of sliding window compression is referred to as "LZ77".
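By way of non-limiting illustration, the sliding window search described above can be sketched in Python as follows. The function names, the greedy longest-match strategy, the brute-force search, and the default window size, minimum match length and maximum match length are assumptions made for this sketch only; practical implementations typically use hashing to locate candidate strings.

```python
def lz77_tokens(data: bytes, window: int = 32768, min_len: int = 3, max_len: int = 258):
    """Greedy sliding-window tokenizer (illustrative brute-force search).

    Returns a list of ("lit", byte) and ("match", length, displacement) tuples.
    All default parameter values are assumptions of this sketch.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_disp = 0, 0
        # Only positions inside the window are candidates for a match.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            # ">=" keeps the nearest candidate when two matches tie in length.
            if length >= min_len and length >= best_len:
                best_len, best_disp = length, i - j
        if best_len:
            tokens.append(("match", best_len, best_disp))
            i += best_len          # continue after the matching string
        else:
            tokens.append(("lit", data[i]))
            i += 1                 # continue with the next element
    return tokens


def lz77_decode(tokens) -> bytes:
    """Rebuild the input from literals and length/displacement pairs."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, length, disp = tok
            for _ in range(length):
                out.append(out[-disp])   # copy from `disp` elements back
    return bytes(out)
```

For the input `b"the cat, the dog"`, the second occurrence of `"the "` is recorded as a length of 4 with a displacement of 9, and every other element is recorded as a literal.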
Minimum redundancy coding, also referred to as Huffman coding or prefix coding, represents different data element values (from an a priori known stream of data elements) by codes (bit sequences), one code for each data element value. The codes are defined such that different values may be represented by bit sequences of different lengths, but such that a parser can always parse a coded string, unambiguously, value-by-value. The correspondence between codes and data element values is defined by what is referred to as a "coding tree." A coding tree is typically optimized for a specific set of data elements. To calculate an optimized coding tree, the data element set is analyzed to rank each possible element value according to its frequency of occurrence in the data set. Those values that occur most frequently are assigned codes with relatively short bit lengths, while less frequently occurring values are assigned longer codes.
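As a minimal sketch of how such a coding tree might be calculated, the following Python fragment builds a prefix code from element frequencies using the standard Huffman construction. The function names and the representation of codes as strings of "0" and "1" characters are assumptions chosen for readability; they are not taken from any particular implementation.

```python
import heapq
from collections import Counter


def huffman_code(data):
    """Return {value: bit string} for the element values occurring in data."""
    freq = Counter(data)
    if len(freq) == 1:                      # degenerate case: a single value
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, unique tiebreak, {value: code so far}).
    heap = [(w, n, {value: ""}) for n, (value, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)     # the two least frequent subtrees
        w1, _, c1 = heapq.heappop(heap)
        merged = {v: "0" + c for v, c in c0.items()}
        merged.update({v: "1" + c for v, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def encode(data, code):
    return "".join(code[v] for v in data)


def decode(bits, code):
    """Parse a coded string value-by-value; unambiguous for a prefix code."""
    inverse = {c: v for v, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out
```

For the string `"aaaaaaaabbbbccd"`, this sketch assigns code lengths of 1, 2, 3 and 3 bits to `a`, `b`, `c` and `d` respectively, so that the most frequent value receives the shortest code.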
In the public implementation of the gzip compression method, sliding window compression is used in a first pass, with the compressed output being stored in two different buffers: one for literal and length values, and another for displacement values. Minimum redundancy coding is performed in a second compression pass performed on the two output buffers of the first pass. Storing the first pass output in two different buffers allows convenient statistical analysis of the respective output data in order to calculate two corresponding coding trees. One coding tree is calculated for use in coding the literal and length values, while another coding tree is calculated for use in coding the displacement values. The coding trees are recalculated for each block of data. Separate coding trees are used for the two buffers because the data elements of the two buffers are of different compositions: in the case of gzip, the displacement values are 16 bits in size, while 9 bits can fully specify literal or length values.
FIGS. 1-3 illustrate two-pass compression in accordance with the prior art, using a combination of sliding window compression and minimum redundancy coding. FIG. 1 illustrates a first compression pass which implements sliding window compression. FIG. 1 shows an input stream 10, a literal/length buffer 12, and a displacement buffer 14. Suppose that the compression algorithm has reached the character "x.sub.1" in input stream 10 (processing from left to right), and that this character does not form part of a string that can be matched to any previous string. Since this character is not part of a matching string, it is written as a literal (Lit) to the literal/length buffer 12. An arrow in FIG. 1 indicates the process of writing x.sub.1 to the next available location in literal/length buffer 12. To differentiate this 8-bit literal value from an 8-bit length value, a value of zero is stored in the next location in the displacement buffer 14. Now suppose that the compression algorithm reaches the character "t" (indicated by reference numeral 15) that forms the first letter of the word or string "the", where the string "the" can be found in previously examined characters of the input string. FIG. 1 indicates the length of the string (three characters) and the displacement back to the most recent previous occurrence of the same string (six characters). In this case, the length value is written to the next available location in the literal/length buffer 12 and the non-zero displacement value is written to the next available location in the displacement buffer 14.
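The buffer layout of FIG. 1 can be sketched, purely for illustration, as two parallel arrays. The names `store` and `load`, the tuple representation of the first pass output, and the storing of each length minus an assumed minimum match length of 3 (so that it fits in 8 bits) are assumptions of this sketch rather than details stated above.

```python
from array import array

MIN_MATCH = 3  # assumption: lengths are stored biased by the minimum match length


def store(tokens):
    """Write first pass output into a literal/length buffer and a displacement buffer."""
    lit_len = array("B")   # 8-bit literal or length values
    disp = array("H")      # unsigned short (typically 16-bit) displacements; 0 marks a literal
    for tok in tokens:
        if tok[0] == "lit":
            lit_len.append(tok[1])
            disp.append(0)                     # zero displacement flags a literal
        else:
            _, length, distance = tok
            lit_len.append(length - MIN_MATCH)
            disp.append(distance)              # non-zero displacement flags a length
    return lit_len, disp


def load(lit_len, disp):
    """Read both buffers in step, as the second pass would."""
    tokens = []
    for value, distance in zip(lit_len, disp):
        if distance == 0:
            tokens.append(("lit", value))
        else:
            tokens.append(("match", value + MIN_MATCH, distance))
    return tokens
```

Storing the literal "x" followed by the three-character match at a displacement of six yields the literal/length entries 120 and 0 and the displacement entries 0 and 6, so that each element of first pass output consumes one entry in each buffer.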
FIG. 2 illustrates a second compression pass, using minimum redundancy coding. The second pass takes place whenever either the literal/length buffer 12 or the displacement buffer 14 becomes full. In the second pass, the contents of the two buffers 12 and 14 are Huffman coded and merged into a single output stream 16. This involves first defining a coding tree optimized for the literal/length buffer 12 and another coding tree optimized for the displacement buffer 14. Then, the literal/length buffer and displacement buffer are read at the same time. A zero value in the displacement buffer indicates that a literal is present at the corresponding place in the literal/length buffer, and a non-zero value in the displacement buffer indicates that a length value is present at that index. Each literal or length element is Huffman encoded, and the results are copied to output stream 16. Whenever a length value is written to the output stream 16, its corresponding non-zero displacement value is encoded after it in the stream. Most encoders do not write the displacement verbatim as a 16-bit value--rather, the displacement is output as a Huffman-encoded slot number. Such a slot number specifies one of a plurality of value ranges (the value ranges can have different sizes). The slot number is followed by a second value (having a small number of bits) that pinpoints a specific value within the range indicated by the slot number. The slot number and the following second value will be generally referred to herein as a "slot-type" designation.
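One possible slot-type designation is sketched below. The particular slot layout is an assumption borrowed from the distance code table of RFC 1951 (30 slots covering displacements 1 through 32768, with 0 to 13 following bits); the text above does not prescribe any specific layout, and the function names are hypothetical.

```python
import bisect


def build_slots(n_slots: int = 30):
    """Return [(base displacement, number of following bits)] for each slot.

    Layout assumed from the RFC 1951 distance codes: ranges double in size
    every two slots.
    """
    slots, base = [], 1
    for slot in range(n_slots):
        extra_bits = max(0, slot // 2 - 1)
        slots.append((base, extra_bits))
        base += 1 << extra_bits
    return slots


SLOTS = build_slots()
BASES = [base for base, _ in SLOTS]


def to_slot(disp: int):
    """Map a displacement to (slot number, following-bit count, following value)."""
    if not 1 <= disp <= 32768:
        raise ValueError("displacement outside the assumed range")
    slot = bisect.bisect_right(BASES, disp) - 1   # last slot whose base <= disp
    base, extra_bits = SLOTS[slot]
    return slot, extra_bits, disp - base


def from_slot(slot: int, offset: int) -> int:
    """Recover the displacement from a slot-type designation."""
    return SLOTS[slot][0] + offset
```

Under this assumed layout, a displacement of 6 falls in slot 4 (which covers the values 5 and 6) and is pinpointed by a single following bit with the value 1. Only the slot number would be Huffman coded; the following bits would be written verbatim.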
FIG. 3 shows the steps that are performed in producing compressed output stream 16 from input stream 10. A first step 30 comprises determining whether there is a string match at the next data element to be examined--whether the data element is the starting element of a string that has previously occurred. If the result of this test is true, a step 32 is performed of writing the length of the string to the literal/length buffer 12 and the corresponding displacement value to the displacement buffer 14. Otherwise, if the result of step 30 is false, a step 34 is performed of writing a literal element to the literal/length buffer and writing a value of zero to the displacement buffer 14.
After either of these steps, a decision step 36 is performed, determining whether either of the two buffers (the literal/length buffer 12 or the displacement buffer 14) is full. If not, processing continues with the next data element in step 30. If a literal element has just been processed, the next element is the one immediately following the character just processed. If a matching string has just been processed, the next data element is the one following the matching string.
If one of the buffers has become full, a step 40 is performed of calculating a coding tree for the values in the literal/length buffer 12 and coding the buffer values with the calculated coding tree. A step 42 is performed of calculating a separate Huffman coding tree for the values in the displacement buffer 14 and coding those buffer values with the calculated coding tree. A step 44 comprises compiling or concatenating the coded values and outputting them in compressed output stream 16. The Huffman coding trees themselves are also output as part of the compressed output stream 16, for subsequent use in decompression. The two buffers are cleared during step 44.
After step 44, the process continues back at step 30, with the next data element.
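The control flow of FIG. 3 can be summarized by the following sketch, which groups first pass output into blocks and hands each block over when either buffer becomes full. The generator name, the tuple representation of matches and literals, the use of a single capacity for both buffers, and the flushing of a final partial block at the end of the input are assumptions of this sketch; the coding of steps 40-44 is left to the caller.

```python
def first_pass_blocks(tokens, capacity: int):
    """Yield (literal/length buffer, displacement buffer) pairs, one per output block.

    `tokens` holds ("lit", value) and ("match", length, displacement) tuples,
    i.e. the outcome of the match test of step 30 for successive elements.
    """
    lit_len, disp = [], []
    for tok in tokens:
        if tok[0] == "match":                  # step 32: length and displacement
            lit_len.append(tok[1])
            disp.append(tok[2])
        else:                                  # step 34: literal and zero displacement
            lit_len.append(tok[1])
            disp.append(0)
        if len(lit_len) >= capacity or len(disp) >= capacity:   # step 36
            yield lit_len, disp                # steps 40-44 would code and output these
            lit_len, disp = [], []             # buffers are cleared in step 44
    if lit_len:
        yield lit_len, disp                    # assumption: flush the final partial block
```

Each yielded pair corresponds to one output block, for which a fresh pair of coding trees would be calculated.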
The description above is somewhat simplified, but is sufficient for understanding the characteristics of two-pass compression that are pertinent to the invention. Further details regarding compression techniques, including sliding window and minimum redundancy compression techniques, can be found in M. Nelson & J. Gailly, The Data Compression Book (2d ed. 1996), which is hereby incorporated by reference. In addition, specifications for the gzip and zip compression techniques can be found in Internet RFCs 1951 and 1952, which are also incorporated by reference.
Although the technique illustrated by FIGS. 1-3 is effective, it is not very efficient with regard to its use of buffers. One inefficiency results from the fact that each literal occupies three bytes of storage (the 8-bit literal value itself, and the 16-bit displacement value of zero). Some prior art attempts to solve this problem by not storing a zero displacement value in the displacement buffer, and instead using a bitmap to indicate whether an entry in the literal/length buffer is an 8-bit match length or an 8-bit literal.
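The storage cost of the two prior art layouts can be compared with a short calculation. The function names are hypothetical, and the bitmap is assumed to hold one flag bit per literal/length entry, rounded up to whole bytes.

```python
def bytes_with_zero_marker(n_literals: int, n_matches: int) -> int:
    """Every entry uses 1 byte of literal/length storage plus a 2-byte displacement."""
    return 3 * (n_literals + n_matches)


def bytes_with_bitmap(n_literals: int, n_matches: int) -> int:
    """Displacements are stored for matches only; a bitmap flags literal vs. length."""
    entries = n_literals + n_matches
    bitmap = (entries + 7) // 8        # assumed: one flag bit per entry
    return entries + 2 * n_matches + bitmap
```

For a block containing 1000 literals and 200 matches, the first layout occupies 3600 bytes and the bitmap layout occupies 1750 bytes.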
Another inefficiency results when one of the buffers fills up before the other buffer, so that the remaining space in the other buffer is not utilized. Although the displacement buffer is typically allocated with a smaller size than the literal/length buffer, it is impossible to size the buffers relative to each other so that they will fill up at the same time--since this depends on the characteristics of the data being compressed.
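The dependence of buffer utilization on the input data can be illustrated with a small simulation. The sketch assumes the bitmap variant, in which a displacement entry is stored only for matches, and assumes that filling stops as soon as either buffer is full; the function name and the capacities used are hypothetical.

```python
def fill_until_full(is_match_flags, lit_len_capacity: int, disp_capacity: int):
    """Return (literal/length entries used, displacement entries used) at flush time.

    `is_match_flags` gives, for successive elements of first pass output,
    True for a match and False for a literal.
    """
    lit_len_used = disp_used = 0
    for is_match in is_match_flags:
        if lit_len_used == lit_len_capacity or disp_used == disp_capacity:
            break                      # one buffer is full: the block is flushed
        lit_len_used += 1              # every element uses a literal/length entry
        if is_match:
            disp_used += 1             # only matches use a displacement entry
    return lit_len_used, disp_used
```

With capacities of 8 and 4 entries, input consisting only of literals fills the literal/length buffer while leaving the displacement buffer empty, whereas input consisting only of matches fills the displacement buffer while leaving half of the literal/length buffer unused.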
A further inefficiency results from the way data is stored in the buffers. Although displacements may be as large as the window size (up to 32767 in the case of gzip), almost all displacements are significantly smaller, and therefore would benefit from a more compact encoding, rather than the reservation of the full 16 bits.
The inventor has found a way to make more efficient use of buffers when implementing a two-pass compression scheme.