As the speed and size of networked computer systems continue to increase, so does the amount of data stored within, and exchanged among, such systems. Though effort has been focused on developing larger and denser storage devices as well as faster networking technologies, continually increasing demand for storage space and networking bandwidth has led to the development of technologies that further optimize storage space and bandwidth currently available on existing storage devices and networks. One such technology is data compression, where data saved to a storage device or transmitted across a network, for example, is modified to reduce the number of bytes required to represent the data. Accordingly, data compression may reduce the storage and bandwidth required to store and/or transmit the data.
Data compression can be divided into two general categories: lossy data compression and lossless data compression. As the terms imply, lossy data compression allows for some loss of fidelity in the compressed (e.g., encoded) information, while lossless data compression provides that the decompressed data be an exact copy of the original data, with no alterations or errors. While lossy data compression may be suitable for applications that process audio, image and/or video data, a great many other data processing applications benefit from the fidelity provided by lossless data compression.
Lossless compression techniques may use DEFLATE, which combines Lempel-Ziv (LZ77) compression with Huffman encoding. LZ77 performs compression by matching the current input data sequence against a copy of that sequence occurring earlier in the input data stream. If a match is found, the matched sequence is replaced by a reference to the earlier copy, encoded as a length-distance (L, D) pair. The length-distance pair indicates the equivalent of the statement “go back D characters from the current input data location, and copy L characters from that location.” To spot matches, an LZ77 encoder keeps track of the most recent data in the input data stream. The data structure in which this data is held is called a sliding window, because it advances through the input data stream as data is consumed. The LZ77 encoder maintains this window to look for matches, and a corresponding LZ77 decoder maintains the same window to resolve the matches to which the LZ77 encoder refers.
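The length-distance mechanism described above can be illustrated with a minimal sketch. It is not DEFLATE's actual matcher: the function names `lz77_encode` and `lz77_decode`, the 32-byte window, the 3-byte minimum match length, and the token format (a plain integer for a literal byte, a `(length, distance)` tuple for a match) are all assumptions chosen for readability.

```python
def lz77_encode(data: bytes, window_size: int = 32, min_match: int = 3):
    """Greedy LZ77 sketch (hypothetical helper, not a production encoder).

    Emits a list of tokens: an int for a literal byte, or a
    (length, distance) tuple for a back-reference.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Only positions inside the sliding window are candidates.
        for j in range(max(0, i - window_size), i):
            length = 0
            # A match may run past position i, overlapping the data
            # that is currently being encoded.
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_len, best_dist))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decode(tokens) -> bytes:
    """Rebuild the original bytes by replaying literals and back-references."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            length, distance = tok
            # "Go back `distance` bytes and copy `length` bytes",
            # one byte at a time so overlapping copies work.
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(tok)
    return bytes(out)


tokens = lz77_encode(b"abcabcabc")
print(tokens)               # [97, 98, 99, (6, 3)]
print(lz77_decode(tokens))  # b'abcabcabc'
```

In this example the first three bytes are emitted as literals, and the remaining six bytes are covered by a single pair meaning "go back 3, copy 6". The copy overlaps its own output, which is why the decoder copies byte by byte.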
Huffman encoding is an entropy encoding process used for lossless data compression. Huffman encoding may use a variable-length code table for encoding a source symbol, where the table is derived from an estimated or measured probability of occurrence for each possible value of the source symbol. Huffman encoding may create a prefix-free code, in which no code word is a prefix of any other, and in which the length of each code word is approximately proportional to the negative logarithm of the probability of the corresponding symbol needing to be encoded. Accordingly, the more likely a symbol is to be encoded, the shorter its bit sequence will be.
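The derivation of a variable-length code table from measured symbol frequencies can be sketched as follows. This is an illustrative construction only: the function name `huffman_codes` and the representation of code words as strings of "0" and "1" are assumptions, and DEFLATE itself uses a canonical, length-limited form of Huffman coding that this sketch does not reproduce.

```python
import heapq
from collections import Counter


def huffman_codes(data: bytes) -> dict:
    """Build a Huffman code table (symbol -> bit string) from byte frequencies.

    Hypothetical helper for illustration; not DEFLATE's canonical coding.
    """
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code word.
        return {next(iter(freq)): "0"}
    # Heap entries are (count, tie_breaker, partial code table). The unique
    # tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(count, idx, {sym: ""})
            for idx, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; their symbols each
        # gain one leading bit.
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


codes = huffman_codes(b"aaaabbc")
for sym, code in sorted(codes.items()):
    print(chr(sym), code)
```

For the input `b"aaaabbc"`, the most frequent symbol `a` receives a one-bit code word while `b` and `c` each receive two-bit code words, so frequent symbols cost fewer bits.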
Thus, the first (LZ77) stage of compression looks for duplicate series of bytes (e.g., a replicated string) and replaces each duplicate with a back-reference (e.g., a pointer) to the previous location of the identical string. The second (Huffman encoding) stage replaces commonly used symbols with shorter representations and less commonly used symbols with longer representations.
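As a brief illustration of both stages working together, Python's standard `zlib` module exposes a DEFLATE implementation (wrapped in a small zlib header and checksum). The sample input and the variable names below are assumptions chosen only to show a lossless round trip on repetitive data.

```python
import zlib

# Highly repetitive sample input (an assumption for this demonstration).
original = b"the quick brown fox " * 50

compressed = zlib.compress(original, 9)  # level 9 favors compression ratio
restored = zlib.decompress(compressed)

# Lossless: the decompressed data is an exact copy of the original.
assert restored == original
print(len(original), "bytes before,", len(compressed), "bytes after")
```

Because the input repeats the same 20-byte phrase, nearly all of it can be expressed as back-references, and the compressed output is much smaller than the original.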