LZ77 Compression Algorithm.
Compression algorithms strive to reduce an amount of data without sacrificing the information within the data. One type of compression algorithm, referred to as the LZ77 algorithm, achieves compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream. A match is encoded by a pair of numbers called a length-distance pair (the “distance” is sometimes called the “offset” instead).
To spot matches, the encoder keeps track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a “sliding window” (as such, LZ77 is sometimes called sliding window compression). The encoder keeps the most recent data within the sliding window to look for matches (and the decoder likewise will keep this data to interpret the matches the encoder refers to).
FIG. 1 shows a simple example of an LZ77 encoding scheme. As observed in FIG. 1, the bit patterns of a preceding (earlier or older) portion 101 of a bit stream 100 are compared against a current portion 102 of the bit stream. If a sequence of bits is found in the current portion 102 that matches a sequence of bits in the preceding portion 101, the sequence of bits in the current portion 102 is replaced with a reference to the same sequence of bits in the earlier portion 101. For example, the bit sequence in the current portion 102 would be replaced with a reference to bit sequence 103 in the earlier portion 101.
The reference that is inserted for bit sequence 102 identifies the length 104 of bit sequence 102 (which also is the same as the length of bit sequence 103) and the location of bit sequence 103. Here, the location of bit sequence 103 is expressed as a “distance” 105 from the current portion 102 to the matching bit sequence 103. As such, the LZ77 compression scheme encodes a bit sequence 102 as a “length, distance pair” that is inserted in the bit stream in place of sequence 102. Upon decoding the compressed stream, when the decoder reaches the length, distance pair that is embedded in the bit stream in place of bit sequence 102, it simply uses the distance part of the length, distance pair to refer back to the start of bit sequence 103 and reproduces the correct bit sequence for portion 102 of the decoded stream by reproducing a number of bits from the start of bit sequence 103 that is equal to the length component of the length, distance pair.
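The encode and decode steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the scheme of FIG. 1 itself: it operates on bytes rather than bits, uses a brute-force greedy search, and the function names, the 4096-byte window, and the minimum match length of 3 are assumptions chosen only for the example.

```python
def lz77_encode(data: bytes, window: int = 4096, min_len: int = 3, max_len: int = 258):
    """Greedy LZ77 sketch: returns a list of literals (ints) and (length, distance) tuples."""
    out = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Only the most recent `window` bytes are searched (the sliding window).
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past position i and overlap the current data.
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_len:
            out.append((best_len, best_dist))  # length, distance pair
            i += best_len
        else:
            out.append(data[i])  # literal byte
            i += 1
    return out


def lz77_decode(tokens) -> bytes:
    """Rebuild the original bytes by copying `length` bytes from `distance` back."""
    buf = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            length, distance = t
            start = len(buf) - distance
            for k in range(length):
                # Byte-by-byte copy so that overlapping matches work.
                buf.append(buf[start + k])
        else:
            buf.append(t)
    return bytes(buf)
```

For the input `b"abcabcabcabc"`, the sketch emits three literals followed by a single pair with length 9 and distance 3, and decoding the tokens reproduces the input.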
DEFLATE Compression Algorithm.
The DEFLATE compression scheme, which is used by gzip, zlib, PKZip and WinZip, combines the LZ77 compression algorithm with other compression schemes to effect a comprehensive overall compression scheme.
FIG. 2 shows an overview of the DEFLATE compression algorithm. As observed in FIG. 2, after LZ77 compression, the compressed bit stream 200 can be viewed as a series of length/distance pairs 201_1, 201_2, . . . 201_M intermixed with literals 202_1, 202_2, . . . 202_N. Literals correspond to bit patterns within the original bit stream for which no earlier identical pattern could be identified within the applicable window for conversion into a length/distance pair.
The DEFLATE compression algorithm then proceeds to incorporate a next level of compression 203 upon the LZ77 compressed stream 200. The next level of compression 203 introduces two different types of Huffman encoding that together replace more common bit patterns of the length/distance pairs 201 and literals 202 with smaller codes 204 and less common bit patterns of the length/distance pairs 201 and literals 202 with larger codes 205. A first type of Huffman encoding is used to encode literals and lengths. A second type of Huffman encoding is used to encode distances. By representing more common bit patterns of the LZ77 compressed stream 200 with fewer bits, the overall size of the information as presented in the final DEFLATE compressed stream 206 should be reduced.
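The principle that more common symbols receive shorter codes can be illustrated with a generic Huffman construction. The following is a textbook sketch, not the procedure of FIG. 2: the function name is hypothetical, it computes only code lengths rather than the codes themselves, and it does not enforce the maximum code length that DEFLATE imposes.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a classic Huffman tree."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A lone symbol still needs one bit.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie-breaker, symbols under this subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, tie, s1 + s2))
        tie += 1
    return lengths
```

For the input `"aaaabbc"`, the most frequent symbol `a` receives a 1-bit code while `b` and `c` each receive 2-bit codes.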
A representation of the first type of Huffman encoding, used for literals and lengths, is presented in FIG. 3. As observed in FIG. 3, literal information is broken down on a byte-by-byte basis. As a byte corresponds to 8 bits of information, there are 2^8=256 different literal byte values (from 0 to 255 in decimal terms). Each literal byte value corresponds to a node in a Huffman tree, where the identities of the nodes correspond 1:1 with the values of the literals (i.e., a literal byte of 00000000 corresponds to a Huffman tree node identity of 0, a literal byte of 00000001 corresponds to a Huffman tree node identity of 1, . . . , a literal byte of 11111111 corresponds to a Huffman tree node identity of 255).
Each Huffman tree node has an associated encoding value that is directly inserted into the bit stream as an encoded symbol for that tree node's corresponding literal byte. Thus, for instance, Huffman tree node 0 has a Huffman encoding of 00110000 and Huffman tree node 255 has a Huffman encoding of 111111111. As such, a literal byte of 00000000 in stream 200 will be encoded in the DEFLATE compressed bit stream 206 as 00110000, and a literal byte of 11111111 in stream 200 will be encoded in the DEFLATE compressed bit stream 206 as 111111111. Notably, a literal byte of 00000000 has a higher probability of occurrence than a literal byte of 11111111, and, as such, the encoding of a literal byte of 00000000 in stream 200 consumes less bit space (00110000 has 8 bits) in the finally encoded bit stream 206 than the encoding of a literal byte of 11111111 in stream 200 (111111111 has 9 bits).
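The two example encodings above coincide with the fixed Huffman code table defined in RFC 1951, in which literal values 0 through 143 receive 8-bit codes starting at 00110000 and literal values 144 through 255 receive 9-bit codes starting at 110010000. Assuming that table is the one intended, the literal encodings can be sketched as follows (the function name is hypothetical, and only literals are covered):

```python
def fixed_literal_code(value: int) -> str:
    """Bit string for a literal byte under the RFC 1951 fixed Huffman table (assumed)."""
    if not 0 <= value <= 255:
        raise ValueError("literal byte must be in 0..255")
    if value <= 143:
        # 8-bit codes 00110000 .. 10111111
        return format(0b00110000 + value, "08b")
    # 9-bit codes 110010000 .. 111111111
    return format(0b110010000 + (value - 144), "09b")
```

Under this assumption, `fixed_literal_code(0)` yields `00110000` and `fixed_literal_code(255)` yields `111111111`, matching the values given in the text.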
The Huffman tree also has a node with an identity of 256. That node corresponds to the appearance in stream 200 of an end of block (EOB) symbol. In the DEFLATE compression scheme, the overall data is broken down into smaller blocks and the demarcation between neighboring blocks is marked with an EOB symbol. For simplicity, an EOB symbol is not shown in stream 200 nor is its encoded value shown in stream 206.
The Huffman tree includes an additional 29 nodes having identities 257 through 285 that are used to encode the length information (i.e., the match length) of a length, distance pair. The length information can be 3 to 258 bytes. Here, tree identities 257 through 264 and 285 correspond to specific (and more frequently encountered) lengths (specifically, identity 257 corresponds to a length of 3 bytes, identity 258 corresponds to a length of 4 bytes, . . . etc., . . . identity 264 corresponds to a length of 10 bytes and identity 285 corresponds to a length of 258 bytes). Each of identities 257 through 264 and 285 is encoded with 6 bits or less (with more frequent lengths consuming less than 6 bits and less frequent lengths consuming up to 6 bits).
Identities 265 through 284 of the Huffman tree are used to specify length ranges rather than specific lengths. Here, lengths within a range of 11 bytes to 257 bytes are specified across identities 265 through 284. Each Huffman tree node identity corresponds to a different range of lengths. For example, identity 265 corresponds to a length range of 11 or 12 bytes. By contrast, identity 284 corresponds to a length range of 227 bytes to 257 bytes. In order to specify a particular length from a Huffman tree node identity that corresponds to a range of lengths, “extra bits” are added to the encoding of the Huffman tree node identity. For example, one extra bit is added to the encoding for Huffman tree node identity 265 so that two lengths (11 or 12 bytes) can be specified. By contrast, 5 extra bits are added to the encoding for Huffman tree node identity 284 so that 31 different lengths (i.e., any one of lengths 227 through 257 inclusive) can be individually specified.
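The mapping from a match length to a node identity plus extra bits can be sketched with a small lookup table. The base lengths and extra-bit counts below are taken from the length code table of RFC 1951; the names `LENGTH_BASE`, `LENGTH_EXTRA_BITS`, and `length_to_symbol` are hypothetical and used only for illustration.

```python
import bisect

# Smallest length covered by each of identities 257..285 (RFC 1951 table).
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
# Number of extra bits that follow the code for each identity.
LENGTH_EXTRA_BITS = [0] * 8 + [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4 + [0]


def length_to_symbol(length: int):
    """Return (node identity, number of extra bits, extra-bit value) for a length."""
    if not 3 <= length <= 258:
        raise ValueError("length must be in 3..258")
    # Find the last identity whose base length does not exceed `length`.
    idx = bisect.bisect_right(LENGTH_BASE, length) - 1
    return 257 + idx, LENGTH_EXTRA_BITS[idx], length - LENGTH_BASE[idx]
```

For example, a length of 12 maps to identity 265 with one extra bit set to 1, and a length of 257 maps to identity 284 with five extra bits holding the value 30.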
Notably, the encodings for any of Huffman node identities 0 through 285 are “non overlapping” (prefix-free), which means that no encoding forms the leading bits of any other encoding. For example, if one of the shortest encodings is 1010, no other encoding, shortest or otherwise, begins with the bit sequence 1010. As such, when the fully encoded bit stream is decoded, each individual encoded symbol is easy to recognize and can only correspond to one 8-bit pattern if a literal, or to one length identity if a length. As discussed above, some encoded lengths have associated extra bits. As observed in stream 206, any extra bits are appended to the encoded length. Thus, for instance, if a specific bit sequence is recognized in stream 206 by a decoder as corresponding to node identity 265, it is then immediately recognized that the next bit after the specific bit sequence must be the extra bit for that node identity. As another example, if a specific bit sequence is recognized in stream 206 by a decoder as corresponding to node identity 284, it is then immediately recognized that the next five bits after the specific bit sequence must be the extra bits for that node identity.
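The recognition of a symbol followed by its extra bits can be sketched with a toy decoder. The four-entry code table below is invented purely for illustration and is not a real DEFLATE table; the sketch also reads the extra bits as a plain most-significant-bit-first binary number, which is a simplifying assumption, and it ignores the distance that would follow each length in an actual stream.

```python
# Hypothetical prefix-free code table: bit string -> node identity.
TOY_CODES = {"0": 65, "10": 265, "110": 284, "111": 256}
# For length identities with extra bits: (number of extra bits, base length).
EXTRA = {265: (1, 11), 284: (5, 227)}


def decode_symbols(bits: str):
    """Walk a string of '0'/'1' characters and return the decoded symbols."""
    out = []
    pos = 0
    while pos < len(bits):
        code = ""
        # Because no code is a prefix of another, the first match is the symbol.
        while code not in TOY_CODES:
            if pos >= len(bits):
                raise ValueError("truncated bit stream")
            code += bits[pos]
            pos += 1
        sym = TOY_CODES[code]
        if sym in EXTRA:
            n, base = EXTRA[sym]
            # The next n bits must be the extra bits for this identity.
            out.append(("length", base + int(bits[pos:pos + n], 2)))
            pos += n
        elif sym == 256:
            out.append(("eob",))
            break
        else:
            out.append(("literal", sym))
    return out
```

Feeding the decoder the bits `0`, `10` + `1`, `110` + `11110`, and `111` in sequence yields a literal 65, a length of 12, a length of 257, and an end of block marker.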
Distances are encoded according to a similar technique as lengths but a different Huffman tree is utilized (Huffman tree of second type, not shown). The second type of Huffman tree used for distances has 30 nodes instead of 286 (as with literal/length encodings) and is used to encode any distance from 1 byte to 32,768 bytes. Again, more frequent distances correspond to a lower tree node identity and fewer bits in the encoded symbol, whereas less common distances correspond to a higher tree node identity, more bits in the encoded symbol and the use of extra bits. For example, the 30th node in the second type of Huffman tree is used to specify any distance within a range of 24,577 bytes to 32,768 bytes, and 13 extra bits are utilized in conjunction with the encoded bit pattern for the 30th node to specify a particular one of the 8,192 distances within this range.
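The distance mapping can be sketched in the same manner as the length mapping. The table below is generated from the pattern of the RFC 1951 distance code table, in which the 30 codes are numbered 0 through 29 (so the 30th node is code 29) and each pair of codes after the first four uses one more extra bit than the pair before it. The names `DIST_BASE`, `DIST_EXTRA_BITS`, and `distance_to_code` are hypothetical.

```python
import bisect

# Extra bits per distance code 0..29: 0,0,0,0,1,1,2,2,...,13,13 (RFC 1951 pattern).
DIST_EXTRA_BITS = [max(0, c // 2 - 1) for c in range(30)]

# Smallest distance covered by each code; each code spans 2**extra_bits distances.
DIST_BASE = []
_next = 1
for _bits in DIST_EXTRA_BITS:
    DIST_BASE.append(_next)
    _next += 1 << _bits


def distance_to_code(distance: int):
    """Return (distance code, number of extra bits, extra-bit value)."""
    if not 1 <= distance <= 32768:
        raise ValueError("distance must be in 1..32768")
    code = bisect.bisect_right(DIST_BASE, distance) - 1
    return code, DIST_EXTRA_BITS[code], distance - DIST_BASE[code]
```

Under this table, a distance of 32,768 maps to code 29 with 13 extra bits holding the value 8,191, since 13 extra bits can select among 2^13 = 8,192 distances.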
A problem with the decoding of a DEFLATE encoded data-stream is the sheer complexity of the decoding process which consumes a large number of CPU instructions when executed in software with generic instructions.