This disclosure relates to data processing and data storage, and more specifically, to initializing a pseudo-dynamic data compression system with predetermined history data typical of actual data.
Data storage systems commonly employ data compression to increase the effective storage capacity of the physical storage media within the data storage system. One common data compression technique employed in GZIP compression is dynamic Huffman compression. A data compressor that employs a dynamic Huffman compression architecture encodes input data blocks (also referred to herein as “data pages”) utilizing a Lempel-Ziv77 (LZ77) encoder, extracts an optimal Huffman code for each LZ77-encoded data page, and then encodes each LZ77-encoded data page utilizing the optimal Huffman code for that data page to obtain compressed output data. The outputs of a dynamic Huffman compressor include the compressed output data and a code description of the optimal Huffman code utilized to encode each data page.
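By way of illustration only, the step of extracting an optimal Huffman code for a data page can be sketched as follows. The function name `huffman_code_lengths` and the string input are hypothetical. The sketch computes unrestricted Huffman code lengths from symbol frequencies, whereas a GZIP-compatible encoder would additionally limit code lengths and operate on the LZ77 literal/length and distance alphabets.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for an optimal Huffman code.

    Illustrative sketch: no length limit is applied, unlike Deflate.
    """
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth}); the unique
    # tie_breaker keeps Python from ever comparing the dicts.
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaabbc")
total_bits = sum(lengths[s] for s in "aaaabbc")
```

In a dynamic Huffman compressor, a description of these per-page code lengths would accompany the compressed output data, which is the source of the code description overhead discussed herein.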
GZIP and other dynamic Huffman encoders are widely used due in part to their generally robust compression performance. However, reconstructing the optimal Huffman code from the code description during decompression is a time-consuming process that increases data access latency. In addition, for small data pages, the length of the code description, which may be on the order of hundreds of bytes, is significant compared to the length of the compressed data page and therefore adversely impacts the compression ratio achieved.
In light of the drawbacks associated with dynamic Huffman encoders, pseudo-dynamic compression can be utilized as an alternative. A pseudo-dynamic compressor may also encode input data pages with an LZ77 encoder, but utilizes a fixed set of K prefix codes to encode the LZ77-encoded data pages. The outputs of a pseudo-dynamic compressor include the compressed output data and a code index identifying which of the K prefix codes was used to encode each data page. Because the prefix codes are predetermined, there is no decompression latency penalty associated with reconstructing the optimal Huffman code for each data page from the code description. Instead, the prefix codes can be accessed via a simple memory lookup utilizing the code index. In addition, the code index, which can be on the order of two bytes or less, is significantly shorter than the code description of the optimal Huffman codes.
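As a minimal sketch of how a code index might be chosen, the following example evaluates each of K predetermined prefix codes against the symbols of one data page and keeps the cheapest. The table `CODE_TABLE`, its K = 2 entries, the four-symbol alphabet, and the minimum-total-length selection rule are all assumptions made for illustration and are not taken from this disclosure.

```python
from collections import Counter

# Hypothetical table of K = 2 predetermined prefix codes, each given as a
# map from symbol to code length in bits. Both satisfy the Kraft equality.
CODE_TABLE = [
    {"a": 1, "b": 2, "c": 3, "d": 3},  # skewed toward "a"
    {"a": 2, "b": 2, "c": 2, "d": 2},  # uniform
]

def select_code(symbols):
    """Return (code_index, encoded_bits) for the cheapest predetermined code."""
    counts = Counter(symbols)
    costs = [
        sum(code_lengths[s] * n for s, n in counts.items())
        for code_lengths in CODE_TABLE
    ]
    index = min(range(len(costs)), key=costs.__getitem__)
    return index, costs[index]

skewed_choice = select_code("aaaaaabc")
uniform_choice = select_code("abcdabcd")
```

Under these assumptions, only the small integer index needs to be stored with the compressed page, and a decompressor can recover the code with a table lookup such as `CODE_TABLE[index]`.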
Lempel-Ziv-based compression algorithms have an initial transient phase in which compression of a data block is less efficient than in the subsequent stationary phase, in which the data becomes essentially ergodic and the history data structure (e.g., history buffer) of the Lempel-Ziv encoder is filled with typical data from the data block. The present disclosure appreciates that the inefficiency of the transient phase for each data block arises because the history data structure, by reference to which a conventional Lempel-Ziv encoder encodes a data block, is empty when compression of the data block begins.
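The cost of an empty history at the start of a small data block can be illustrated with the preset-dictionary (`zdict`) option of Python's standard `zlib` module, which seeds the Lempel-Ziv window before compression begins. This is a general illustration of seeding a history with typical data, not the system of this disclosure. The helper names, the sample page, and the sample history bytes are invented for the example.

```python
import zlib

def compress_page(page: bytes, history: bytes = b"") -> bytes:
    """Compress one page, optionally seeding the LZ77 window with history."""
    if history:
        comp = zlib.compressobj(level=9, zdict=history)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(page) + comp.flush()

def decompress_page(blob: bytes, history: bytes = b"") -> bytes:
    """Decompress one page; the same history must be supplied if one was used."""
    if history:
        decomp = zlib.decompressobj(zdict=history)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

# Invented sample data: the history resembles the page it will help encode.
HISTORY = (
    b'{"user_id": 1000, "status": "active", "region": "eu-west"}\n'
    b'{"user_id": 1001, "status": "inactive", "region": "us-east"}\n'
)
PAGE = b'{"user_id": 1234, "status": "active", "region": "eu-west"}\n'

plain = compress_page(PAGE)
seeded = compress_page(PAGE, HISTORY)
print(len(PAGE), len(plain), len(seeded))
```

With an empty history, the encoder finds almost nothing to reference in so short a page, whereas the seeded encoder can emit back-references from the first byte. Both compressor and decompressor must hold identical history data for the seeded output to be decodable.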