1. Field of the Invention
The present invention relates generally to lossless data compression, and more particularly to a method for iteratively reducing the entropy of backward references in lossless data compression.
2. Background
Existing lossless byte stream compression techniques, such as the Deflate data compression algorithm, exploit commonly occurring patterns in the source input data in order to increase the amount of compression. Lossless compression techniques may be categorized according to the type of source input data they are designed to compress. It is known that no lossless compression technique can efficiently compress all possible types of input data and that completely random input data streams cannot be compressed. For this reason, many different lossless byte stream compression techniques exist that are designed either with a specific type of input data in mind or with specific assumptions about what kinds of redundancy the source input data are likely to contain. For example, the LZ77-based Deflate algorithm used by gzip is part of the compression process of Portable Network Graphics (PNG) and HTTP. For multimedia compression, the discrete cosine transform based JPEG algorithm and the discrete wavelet transform based JPEG2000 algorithm take advantage of the specific characteristics of input images, such as the common phenomenon of contiguous two-dimensional areas of similar pixel values.
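By way of illustration only, the dependence of Deflate on redundancy in the source input data can be observed with the zlib module of the Python standard library. The following is a minimal sketch; the input sizes, the fixed random seed, and the helper name ratio are assumptions chosen for the example.

```python
import random
import zlib

# Fixed seed so that the pseudo-random input is the same on every run.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(1 << 16))  # 64 KiB, no useful redundancy
text = b"backward reference " * 3450                       # about 64 KiB, highly repetitive


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (zlib wraps Deflate)."""
    return len(zlib.compress(data, 9)) / len(data)


noise_ratio = ratio(noise)
text_ratio = ratio(text)
print(f"random input:     {noise_ratio:.3f}")
print(f"repetitive input: {text_ratio:.3f}")
```

In this sketch the repetitive input is expected to shrink to a small fraction of its size, while the pseudo-random input is expected to stay at roughly its original size.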
A typical lossless byte stream compression method, such as the Deflate data compression algorithm, consists of two compression phases: backward reference selection followed by entropy coding, such as Huffman coding. The backward reference selection phase selects a backward reference for each block of uncompressed data elements so as to reduce the entropy of the backward references. A backward reference, i.e., a match, found in the backward reference selection phase is typically described by a tuple (d, l), where d is the backward reference distance from a search point to the data element in the input stream following the match, and l is the length of the backward reference. The entropy coding phase encodes the source input data elements and the backward references. The backward references selected in the backward reference selection phase constitute a statistical model, or measurement, of the input data. The entropy coding phase maps the input data to bit sequences using this statistical model in such a way that frequently encountered data, i.e., “probable data”, will produce shorter bit sequences than “improbable data”.
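The two phases described above can be sketched as follows. This is a simplified illustration under stated assumptions and not the Deflate algorithm itself: the search is a brute-force greedy scan, d is taken to be the distance from the current position back to the start of the match, the entropy coding phase is represented only by a Shannon entropy measurement, and the names tokenize, decode, and entropy_bits are hypothetical.

```python
import math
from collections import Counter

MIN_MATCH = 3  # shortest match length that Deflate encodes


def tokenize(data: bytes, window: int = 32768, max_len: int = 258):
    """Phase 1 (greedy sketch): emit literals (ints) and backward references (d, l)."""
    tokens = []
    i = 0
    while i < len(data):
        best_d, best_l = 0, 0
        # Scan from the nearest candidate outwards, so ties keep the smaller distance.
        for j in range(i - 1, max(0, i - window) - 1, -1):
            l = 0
            # A match may run past position i and overlap the data being encoded.
            while l < max_len and i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = i - j, l
        if best_l >= MIN_MATCH:
            tokens.append((best_d, best_l))
            i += best_l
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def decode(tokens) -> bytes:
    """Rebuild the input by copying l bytes from d positions back for each reference."""
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            d, l = t
            for _ in range(l):
                out.append(out[-d])
        else:
            out.append(t)
    return bytes(out)


def entropy_bits(symbols) -> float:
    """Shannon entropy in bits per symbol: a lower bound for what phase 2 can achieve."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())


data = b"abcabcabcabc"
tokens = tokenize(data)
print(tokens)                # three literals followed by one backward reference
print(entropy_bits(tokens))  # entropy of the token stream passed to the entropy coder
```

For the input shown, the first three bytes are emitted as literals and the remaining nine bytes are covered by the single backward reference (3, 9).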
The conventional backward reference selection algorithm in a Deflate implementation, such as zlib, favors the backward reference with the smallest backward distance among multiple available backward references. This approach is satisfactory when compressing specific data files for which there is some a priori knowledge, such as statistics of the distribution of backward distances. To achieve fast data compression, this approach generally does not apply the backward reference selection algorithm to the input data iteratively so as to generate a more accurate statistical model, or measurement, of the input data. However, a more accurate statistical model of the input data greatly enhances entropy encoding performance by generating more compact entropy codes for the input data. As a result, backward reference selection that does not take the statistics of the source input data into consideration is limited, and tends to increase the entropy code length of the source input data and backward references. Improvements that are compatible with current lossless byte stream compression methods can thus lead to significant savings in both data transfer bandwidth costs and persistent storage of data.
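The limitation described above can be illustrated with a small, contrived sketch. The candidate lists, the two-pass arrangement, and the function names pick_nearest, pick_most_frequent, and distance_entropy are assumptions made for this example; the second selection rule is a hypothetical illustration of using distance statistics and is not presented as the method of the invention or as the behavior of zlib. The entropy measure also ignores the extra bits that Deflate spends on large distances.

```python
import math
from collections import Counter


def pick_nearest(candidates):
    """Conventional rule: longest match, ties broken by the smallest distance."""
    return min(candidates, key=lambda c: (-c[1], c[0]))


def pick_most_frequent(candidates, distance_counts):
    """Hypothetical rule: longest match, ties broken by the most common distance."""
    return min(candidates, key=lambda c: (-c[1], -distance_counts[c[0]], c[0]))


def distance_entropy(refs) -> float:
    """Shannon entropy, in bits per reference, of the selected distances."""
    counts = Counter(d for d, _ in refs)
    total = len(refs)
    return sum(n / total * math.log2(total / n) for n in counts.values())


# Contrived candidates (d, l): at each position two matches of equal length exist.
positions = [
    [(4, 5), (100, 5)],
    [(7, 5), (100, 5)],
    [(2, 5), (100, 5)],
    [(9, 5), (100, 5)],
]

nearest = [pick_nearest(c) for c in positions]

# A first pass gathers distance statistics; a second pass selects using them.
stats = Counter(d for c in positions for d, _ in c)
informed = [pick_most_frequent(c, stats) for c in positions]

print(distance_entropy(nearest))   # four distinct distances
print(distance_entropy(informed))  # one repeated distance
```

In this contrived input, always choosing the nearest match yields four different distances, whereas selection informed by the distance statistics yields a single repeated distance, so the entropy of the distance stream is lower.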