1. Field of the Invention
This invention relates to systems and methods for lossless compression of data.
2. Background of the Invention
Lossless data compression is a class of data compression algorithms that allow the original data to be perfectly reconstructed from the compressed data. By contrast, lossy data compression permits reconstruction of only an approximation of the original data, although this usually allows for improved compression ratios.
DEFLATE is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool and was later specified in RFC 1951. DEFLATE is widely used, for example in GZIP compressed files, PNG (Portable Network Graphics) image files and the ZIP file format for which Katz originally designed it.
LZ77 compression works by finding sequences of data that are repeated. The term “sliding window” is used; all it really means is that at any given point in the data, there is a record of what characters went before. A 32K sliding window means that the compressor (and decompressor) have a record of what the last 32768 (32*1024) characters were. When the next sequence of characters to be compressed is identical to one that can be found within the sliding window, the sequence of characters is replaced by two numbers: a distance, representing how far back into the window the sequence starts, and a length, representing the number of characters for which the sequence is identical.
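By way of illustration only, the replacement of repeated sequences by distance and length pairs may be sketched as follows. The function names, the token representation (an integer for a literal, a tuple for a match), and the brute-force search are assumptions made for this example; the minimum and maximum match lengths of 3 and 258 follow the limits used by DEFLATE.

```python
def lz77_compress(data: bytes, window: int = 32768,
                  min_len: int = 3, max_len: int = 258) -> list:
    """Greedy, brute-force LZ77 (illustrative only, not an optimized search).

    Returns a list of tokens: an int for a literal byte, or a
    (distance, length) tuple for a back-reference into the window.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Try every start position still inside the sliding window.
        for j in range(max(0, i - window), i):
            k = 0
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            # ">=" prefers the nearer candidate when lengths tie.
            if k > 0 and k >= best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decompress(tokens: list) -> bytes:
    """Rebuild the original bytes from literal and (distance, length) tokens."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            distance, length = tok
            # Copy byte by byte so that a match may overlap its own output.
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(tok)
    return bytes(out)
```

For the input `b"abcabcabcabc"`, this sketch produces three literals followed by the single pair (3, 9); the match is longer than its distance, which is why the decompressor copies one byte at a time.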
The compressor uses a chained hash table to find duplicated strings, using a hash function that operates on typically 2 or 3-byte sequences. At any given point during compression, let XYZ be the next 3 input bytes to be examined (not necessarily all different, of course). First, the compressor examines the hash chain for XYZ. If the chain is empty, the compressor simply writes out X as a literal byte and advances one byte in the input. If the hash chain is not empty, indicating that the sequence XYZ (or, if we are unlucky, some other 3 bytes with the same hash function value) has occurred recently, the compressor compares all strings on the XYZ hash chain with the actual input data sequence starting at the current point, and selects the longest match.
The compressor searches the hash chains starting with the most recent strings, to favor small distances and thus take advantage of the Huffman encoding. The hash chains are singly linked. There are no deletions from the hash chains; the algorithm simply discards matches that are too old. To avoid the worst-case situation, very long hash chains are arbitrarily truncated at a certain length, determined by a run-time parameter.
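The chained hash table search described above may be sketched as follows, under stated assumptions. The hash function `hash3`, the dictionary `head` (mapping a hash value to the most recent position), the list `prev` (linking each position to the previous position with the same hash), and the default chain limit of 128 are all hypothetical choices made for this example and are not taken from any particular implementation.

```python
def hash3(data: bytes, pos: int) -> int:
    """Hypothetical 15-bit hash of the 3 bytes starting at pos."""
    return ((data[pos] << 10) ^ (data[pos + 1] << 5) ^ data[pos + 2]) & 0x7FFF


def insert_string(data: bytes, pos: int, head: dict, prev: list) -> None:
    """Push position pos onto the front of its singly linked hash chain."""
    if pos + 3 <= len(data):
        h = hash3(data, pos)
        prev[pos] = head.get(h, -1)   # link to the previous chain head
        head[h] = pos                 # pos becomes the most recent entry


def find_longest_match(data: bytes, pos: int, head: dict, prev: list,
                       window: int = 32768, max_chain: int = 128,
                       max_len: int = 258) -> tuple:
    """Walk the chain for the 3 bytes at pos, most recent candidate first.

    Returns (length, distance); length 0 means no candidate was found.
    """
    best_len, best_dist = 0, 0
    if pos + 3 > len(data):
        return best_len, best_dist
    cand = head.get(hash3(data, pos), -1)
    chain = max_chain                 # truncate very long chains
    while cand >= 0 and chain > 0:
        if pos - cand > window:
            break                     # too old; later entries are older still
        n = 0
        while n < max_len and pos + n < len(data) and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:              # strict ">" keeps the nearer match on ties
            best_len, best_dist = n, pos - cand
        cand = prev[cand]
        chain -= 1
    return best_len, best_dist
```

Because candidates with a colliding hash value are compared against the actual input, a collision costs only a wasted comparison and never produces an incorrect match.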
To improve overall compression, the compressor optionally defers the selection of matches (“lazy matching”): after a match of length N has been found, the compressor searches for a longer match starting at the next input byte. If it finds a longer match, it truncates the previous match to a length of one (thus producing a single literal byte) and then emits the longer match. Otherwise, it emits the original match, and, as described above, advances N bytes before continuing.
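A minimal sketch of lazy matching is given below for illustration. The helper `longest_match` uses a brute-force search in place of the hash chains, and both function names and the token representation are assumptions made for this example.

```python
def longest_match(data: bytes, pos: int, window: int = 32768,
                  max_len: int = 258) -> tuple:
    """Brute-force stand-in for the hash chain search: (length, distance)."""
    best_len, best_dist = 0, 0
    # Scan candidates from nearest to farthest within the window.
    for cand in range(pos - 1, max(0, pos - window) - 1, -1):
        n = 0
        while n < max_len and pos + n < len(data) and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_dist = n, pos - cand
    return best_len, best_dist


def lazy_tokens(data: bytes, min_len: int = 3) -> list:
    """Tokenize with lazy matching: ints are literals, tuples are (distance, length)."""
    tokens = []
    pos = 0
    while pos < len(data):
        length, dist = longest_match(data, pos)
        if length < min_len:
            tokens.append(data[pos])
            pos += 1
            continue
        # Defer the decision: is there a longer match one byte later?
        next_len = longest_match(data, pos + 1)[0] if pos + 1 < len(data) else 0
        if next_len > length:
            # Truncate the current match to a single literal byte; the loop
            # then re-evaluates the longer match starting at pos + 1.
            tokens.append(data[pos])
            pos += 1
        else:
            tokens.append((dist, length))
            pos += length
    return tokens
```

For the input `b"abc_bcdef_abcdef"`, a greedy compressor would take the 3-byte match "abc" at the final "abcdef", whereas this sketch emits the literal "a" followed by the 5-byte match "bcdef" at distance 7.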
Lempel-Ziv-Storer-Szymanski (LZSS) was created in 1982 by James Storer and Thomas Szymanski. The LZSS decompressor has the form:
    For each item, fetch a “literal/copy” bit from the compressed file.
    0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
    1: copy item: the decoder grabs the next 2 bytes from the compressed file and breaks them into a 4-bit “length” and a 12-bit “distance”. The 4 “length” bits are decoded into a length from 3 to 18 characters. The decoder then finds the text that starts that “distance” back from the current end of the decoded text, and copies “length” characters from that previously decoded text to the end of the decoded text.
    Repeat from the beginning until there are no more items in the compressed file.
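The decompressor steps above may be sketched as follows. The description does not fix a bit or byte layout, so this example assumes one: the literal/copy bits are packed into a flag byte that precedes each group of up to eight items and are consumed least significant bit first, and the 2-byte copy item is big-endian with the distance in the upper 12 bits and the length, stored as length minus 3, in the lower 4 bits. Other LZSS variants lay these fields out differently.

```python
def lzss_decode(src: bytes) -> bytes:
    """Decode an LZSS stream under the layout assumed in the text above."""
    out = bytearray()
    i = 0
    while i < len(src):
        flags = src[i]                    # eight literal/copy bits (assumed packing)
        i += 1
        for bit in range(8):
            if i >= len(src):
                break                     # no more items in the compressed data
            if (flags >> bit) & 1 == 0:
                out.append(src[i])        # 0: literal byte, passed straight through
                i += 1
            else:
                if i + 2 > len(src):
                    raise ValueError("truncated copy item")
                word = (src[i] << 8) | src[i + 1]
                i += 2
                distance = word >> 4          # 12-bit distance (assumed upper bits)
                length = (word & 0xF) + 3     # 4 bits decode to lengths 3..18
                if distance == 0 or distance > len(out):
                    raise ValueError("distance points outside decoded text")
                # Copy one byte at a time so overlapping copies work.
                for _ in range(length):
                    out.append(out[-distance])
    return bytes(out)
```

Under this assumed layout, the six bytes `08 61 62 63 00 33` (hexadecimal) decode to "abcabcabc": three literals followed by one copy item with distance 3 and length 6.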
A Huffman code is a prefix code prepared by a special algorithm. Each code is a series of bits, either 0 or 1, representing an element in a specific “alphabet” (such as the set of ASCII characters, which is the primary but not the only use of Huffman coding in DEFLATE).
A Huffman algorithm starts by assembling the elements of the “alphabet,” each one being assigned a “weight,” a number that represents its relative frequency within the data to be compressed. These weights may be guessed at beforehand, or they may be measured exactly from passes through the data, or some combination of the two. In any case, the elements are selected two at a time, the elements with the lowest weights being chosen. The two elements are made to be leaf nodes of a node with two branches. The new node is assigned a weight equal to the sum of the weights of its two children and is returned to the pool of elements, and the selection is repeated until only one node remains.
When all nodes have been recombined into a single “Huffman tree,” then by starting at the root and selecting 0 or 1 at each step, you can reach any element in the tree. Each element now has a Huffman code, which is the sequence of 0's and 1's that represents that path through the tree.
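The construction of a Huffman tree and the reading of codes from it may be sketched as follows. The function name, the use of a binary heap, and the tie-breaking counter are assumptions made for this example; when weights tie, other equally valid trees exist.

```python
import heapq
from itertools import count


def huffman_codes(weights: dict) -> dict:
    """Build a Huffman tree from {symbol: weight}; return {symbol: bit string}.

    Symbols are assumed not to be tuples, since tuples mark internal nodes.
    """
    order = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(w, next(order), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # a lone symbol still needs one bit
    while len(heap) > 1:
        # Select the two lowest-weight nodes and join them under a new node
        # whose weight is the sum of the two.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w0 + w1, next(order), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: branch on 0 and 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: the path so far is the code
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes
```

With the weights A=5, B=9, C=12, D=13, E=16 and F=45, the most frequent element F receives a 1-bit code, C, D and E receive 3-bit codes, and the rarest elements A and B receive 4-bit codes.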
Now, it should be fairly easy to see how such a tree, and such a set of codes, could be used for compression. If compressing ordinary text, for example, probably more than half of the ASCII character set could be left out of the tree altogether. Frequently used characters, like ‘E’ and ‘T’ and ‘A,’ will probably get much shorter codes, and even if some codes are actually made longer, they will be the ones that are used less often.
However, there is also the question: how do you pass the tree along with the encoded data? It turns out that there is a fairly simple way, if you modify slightly the algorithm used to generate the tree.
In the classic Huffman algorithm, a single set of elements and weights could generate multiple trees. In the variation used by the DEFLATE standard, there are two additional rules. First, elements that have shorter codes are placed to the left of those with longer codes. (For example, if elements D and E wind up with the longest codes, they would be all the way to the right.) Second, among elements with codes of the same length, those that come first in the element set are placed to the left. (If D and E end up being the only elements with codes of that length, then D will get the 0 branch and E the 1 branch, as D comes before E.) It turns out that when these two restrictions are placed upon the trees, there is at most one possible tree for every set of elements and their respective code lengths. The code lengths are all that we need to reconstruct the tree, and therefore all that we need to transmit.
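The reconstruction of codes from code lengths alone may be sketched as follows, following the procedure given in RFC 1951, section 3.2.2. The function name and the representation of codes as bit strings are assumptions made for this example; a length of zero marks an element that is left out of the tree.

```python
def canonical_codes(lengths: dict) -> dict:
    """Assign codes from {symbol: code length} using the two ordering rules."""
    max_len = max(lengths.values(), default=0)

    # Step 1: count how many codes there are of each length.
    bl_count = [0] * (max_len + 1)
    for n in lengths.values():
        if n:
            bl_count[n] += 1

    # Step 2: find the numerically smallest code for each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code

    # Step 3: hand out consecutive codes in element order within each length.
    codes = {}
    for sym in sorted(lengths):
        n = lengths[sym]
        if n:
            codes[sym] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes
```

For the element set A through H with code lengths (3, 3, 3, 3, 3, 2, 4, 4), which is the example used in RFC 1951, this yields F=00, A=010, B=011, C=100, D=101, E=110, G=1110 and H=1111.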
The methods disclosed herein provide an improved approach for compressing data using the DEFLATE algorithm.