The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for high-throughput compression of data.
Data storage systems commonly employ data compression to increase the effective storage capacity of the physical storage media within the data storage system. One common data compression technique employed in GZIP compression is dynamic Huffman compression. A data compressor that employs a dynamic Huffman compression architecture encodes input data blocks (also referred to herein as “data pages”) utilizing a Lempel-Ziv77 (LZ77) encoder, extracts an optimal Huffman code for each LZ77-encoded data page, and then encodes each LZ77-encoded data page utilizing the optimal Huffman code for that data page to obtain compressed output data. The outputs of a dynamic Huffman compressor include the compressed output data and a code description of the optimal Huffman code utilized to encode each data page.
The LZ77 encoder achieves compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement “each of the next length characters is equal to the characters exactly a distance of characters behind the character in the uncompressed stream”. The “distance” is sometimes called the “offset” instead.
To spot matches, the LZ77 encoder keeps track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes called sliding window compression. The LZ77 encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. The larger the sliding window is, the longer back the LZ77 encoder may search for creating references.