This disclosure relates to data processing and data storage, and more specifically, to determining prefix codes for pseudo-dynamic data compression utilizing clusters formed based on compression ratio.
Data storage systems commonly employ data compression to increase the effective storage capacity of the physical storage media within the data storage system. One common data compression technique employed in GZIP compression is dynamic Huffman compression. A data compressor that employs a dynamic Huffman compression architecture encodes input data blocks (also referred to herein as “data pages”) utilizing a Lempel-Ziv77 (LZ77) encoder, extracts an optimal Huffman code for each LZ77-encoded data page, and then encodes each LZ77-encoded data page utilizing the optimal Huffman code for that data page to obtain compressed output data. The outputs of a dynamic Huffman compressor include the compressed output data and a code description of the optimal Huffman code utilized to encode each data page.
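By way of illustration only, the extraction of an optimal Huffman code for one data page can be sketched as follows. The sketch assumes the page has already been reduced to a sequence of symbols (for example, the output of an LZ77 encoder) and computes only the code length assigned to each symbol. The function names `huffman_code_lengths` and `encoded_bits` are hypothetical and are not taken from GZIP or from any particular compressor.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for an optimal Huffman code."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie breaker, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths


def encoded_bits(symbols, lengths):
    """Total payload size in bits when each symbol uses its code length."""
    return sum(lengths[s] for s in symbols)


page = "aaaabbc"  # stand-in for one LZ77-encoded data page
lengths = huffman_code_lengths(page)
print(lengths, encoded_bits(page, lengths))
```

In a dynamic Huffman compressor, a description of these code lengths would accompany the compressed output for each data page so that the decompressor can rebuild the same code.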
GZIP and other dynamic Huffman encoders are widely used due in part to their generally robust compression performance. However, reconstructing the optimal Huffman code from the code description during decompression is a time-consuming process that increases data access latency. In addition, for small data pages, the length of the code description, which may be on the order of hundreds of bytes, is significant compared to the length of the compressed data page and therefore adversely impacts the compression ratio achieved.
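The effect of the code description on small data pages can be seen with simple arithmetic. The following sketch uses purely hypothetical figures (a 200-byte code description and a payload that compresses to half the page size) chosen only to illustrate the trend; they are not measurements of any compressor, and `effective_ratio` is an illustrative helper.

```python
def effective_ratio(page_bytes, payload_bytes, header_bytes):
    """Compression ratio once per-page header overhead is counted."""
    return page_bytes / (payload_bytes + header_bytes)


HEADER = 200  # hypothetical code description length in bytes

# The same 2:1 payload compression is assumed for both page sizes.
small = effective_ratio(4096, 2048, HEADER)
large = effective_ratio(65536, 32768, HEADER)
print(f"4 KiB page:  {small:.3f}")
print(f"64 KiB page: {large:.3f}")
```

Under these assumptions, the fixed-size description costs the small page a noticeably larger share of its achievable ratio than it costs the large page.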
In light of the drawbacks associated with dynamic Huffman encoders, pseudo-dynamic compression can be utilized as an alternative. A pseudo-dynamic compressor may also encode input data pages with an LZ77 encoder, but utilizes a fixed set of K prefix codes to encode the LZ77-encoded data pages. The outputs of a pseudo-dynamic compressor include the compressed output data and a code index identifying which of the K prefix codes was used to encode each data page. Because the prefix codes are predetermined, there is no decompression latency penalty associated with reconstructing the optimal Huffman code for each data page from the code description. Instead, the prefix codes can be accessed via a simple memory lookup utilizing the code index. In addition, the code index, which can be on the order of two bytes or less, is significantly shorter than the code description of the optimal Huffman codes.
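A minimal sketch of the pseudo-dynamic approach follows. It assumes each of the K predetermined prefix codes is represented as a table mapping symbols to code lengths, and it assumes the compressor picks the code that yields the shortest encoding of the page. That selection rule, the table layout, and the name `select_code_index` are assumptions made for illustration; the text above states only that a code index identifies which code was used.

```python
def select_code_index(symbols, code_tables):
    """Return (index, bits) of the fixed code giving the shortest encoding.

    Returns (None, None) if no table covers every symbol of the page.
    """
    best_index, best_bits = None, None
    for index, lengths in enumerate(code_tables):
        # A code that lacks a symbol of this page cannot encode it.
        if any(s not in lengths for s in symbols):
            continue
        bits = sum(lengths[s] for s in symbols)
        if best_bits is None or bits < best_bits:
            best_index, best_bits = index, bits
    return best_index, best_bits


# K = 2 hypothetical predetermined prefix codes, given as code lengths.
CODE_TABLES = [
    {"a": 1, "b": 2, "c": 2},  # favors pages dominated by "a"
    {"a": 2, "b": 2, "c": 1},  # favors pages dominated by "c"
]

print(select_code_index("ccca", CODE_TABLES))
print(select_code_index("aab", CODE_TABLES))
```

Only the selected index needs to be stored with the compressed page, since the decompressor can hold the same K tables and retrieve the code by lookup.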
A key factor in the compression performance achieved by pseudo-dynamic compressors is the choice of the K prefix codes. In US 2013/01135123 A1, a technique for generating the K prefix codes is described in which K groups of data blocks having similar literal frequencies at the output of an LZ77 encoder are formed and then the conventional Huffman algorithm is utilized to determine an optimal prefix code for each data group. The present disclosure recognizes that this technique tends to be inefficient in the construction of the K groups. US 2015/0162936 A1 discloses an alternative technique in which the K groups are instead formed on the basis of an entropy computation. However, the present disclosure recognizes that formation of the K groups based on entropy does not in all cases yield the greatest compression ratio.