LZ77 is the common name of a lossless data compression algorithm. LZ77 is used as part of the DEFLATE process employed by GNU zip (gzip), as specified in RFC 1951. FIG. 1 illustrates a conventional compression application 10 which uses the DEFLATE process to transform a file 12 into a compressed file 14. An inverse operation, denoted the INFLATE process, is used to decompress the compressed file 14 to recreate the original file 12. In the DEFLATE process, files 12 are first compressed using LZ77, and the resulting LZ77 code is then Huffman coded to further improve compression.
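By way of illustration only, the DEFLATE/INFLATE round trip described above can be sketched with Python's standard `zlib` module, which implements DEFLATE. The function names `deflate_raw` and `inflate_raw` are hypothetical and are not part of any implementation described herein; a negative `wbits` value is used on the assumption that a raw RFC 1951 stream, without a zlib or gzip wrapper, is desired.

```python
import zlib


def deflate_raw(data: bytes, level: int = 9) -> bytes:
    """Compress data into a raw DEFLATE stream (hypothetical helper)."""
    # wbits=-15 requests a raw stream with a 32K window and no header/trailer.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def inflate_raw(stream: bytes) -> bytes:
    """Recreate the original data from a raw DEFLATE stream (hypothetical helper)."""
    return zlib.decompress(stream, -15)


if __name__ == "__main__":
    original = b"abcabcabcabc" * 100
    compressed = deflate_raw(original)
    assert inflate_raw(compressed) == original
    print(len(original), "bytes in,", len(compressed), "bytes out")
```

The round trip is lossless: inflating the compressed stream recreates the original bytes exactly.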
FIG. 2 illustrates a conventional LZ77 process 20. In the conventional LZ77 process 20, the file 12 is read character by character. In FIG. 2, the file 12 is represented by the incoming data stream 22, which is subdivided into bytes. Each byte represents one character. Each character is hashed with the preceding two characters, using a hash table 24, to provide a hash address into a dictionary. In conventional software implementations of gzip, the dictionary contains an index into a linked list 26, which contains a series of addresses (ending with a null address). Each address in the linked list 26 points to a place in the input stream, which is stored in a byte buffer 28, where the same sequence of three characters has occurred previously. In the conventional LZ77 process 20, the previous characters of the input data stream 22 are copied into the byte buffer 28, and the addresses of the linked list 26 point to locations in the byte buffer 28. Typically, these addresses are valid for positional distances up to 32K characters, because the byte buffer 28 stores the previous 32K characters.
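A minimal sketch of the dictionary and linked-list structure described above follows, offered for illustration under stated assumptions. A Python `dict` keyed by the exact three-byte sequence stands in for the hash table 24 (a real hash function may map different sequences to the same address), the list `prev` stands in for the linked list 26, and `None` plays the role of the null address. Each recorded address is taken to be the starting position of the three-character sequence. The names `build_chains` and `walk_chain` are hypothetical.

```python
WINDOW_SIZE = 32 * 1024  # the byte buffer holds the previous 32K characters


def build_chains(data: bytes):
    """Index every three-byte sequence in data by its starting position.

    Returns (head, prev): head maps a three-byte key to its most recent
    position, and prev[pos] gives the next older position with the same
    key, or None at the end of the list.
    """
    head = {}
    prev = [None] * len(data)
    for pos in range(len(data) - 2):
        key = data[pos:pos + 3]
        prev[pos] = head.get(key)  # link to the previous occurrence, if any
        head[key] = pos            # this position becomes the list head
    return head, prev


def walk_chain(head, prev, key: bytes, current_pos: int):
    """List earlier positions of key, newest first, within the 32K window."""
    positions = []
    pos = head.get(key)
    while pos is not None:
        if pos < current_pos:
            if current_pos - pos > WINDOW_SIZE:
                break  # older entries lie outside the byte buffer
            positions.append(pos)
        pos = prev[pos]
    return positions


if __name__ == "__main__":
    head, prev = build_chains(b"abcxabcyabc")
    print(walk_chain(head, prev, b"abc", 8))
```

For the input `abcxabcyabc`, the sequence `abc` starts at positions 0, 4, and 8, so walking the list from position 8 visits position 4 and then position 0.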
In conventional software implementations of the LZ77 process 20, the input data stream is compared to the previous bytes (i.e., the bytes in the byte buffer 28 at the location pointed to by the address in the linked list 26) to determine how many consecutive bytes match. The comparator 30 performs this comparison for each address in the series of addresses corresponding to the hash address until it finds a suitable match. In other words, the comparison is performed serially, one address at a time, for each address in the linked list 26 that corresponds to the hash address. The serial nature of these operations limits the speed of the conventional LZ77 implementation. Additionally, because each additional address requires another comparison, the performance of conventional LZ77 implementations depends on the length of the linked list 26.
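The serial comparison described above may be sketched as follows, for illustration only. The candidate addresses are assumed to be supplied newest first, as if read from the linked list 26. The names `match_length` and `find_match`, and the `good_enough` threshold used to decide when a match is "suitable", are hypothetical; the limits of 3 and 258 bytes are the minimum and maximum match lengths of RFC 1951.

```python
MIN_MATCH = 3    # shortest match DEFLATE can encode
MAX_MATCH = 258  # longest match DEFLATE can encode


def match_length(data: bytes, cand: int, pos: int) -> int:
    """Count how many consecutive bytes at cand equal the bytes at pos."""
    n = 0
    while (n < MAX_MATCH and pos + n < len(data)
           and data[cand + n] == data[pos + n]):
        n += 1
    return n


def find_match(data: bytes, pos: int, candidates, good_enough: int = MAX_MATCH):
    """Try each candidate address in turn; return (distance, length) or None."""
    best_len, best_pos = 0, None
    for cand in candidates:  # one comparison per address, strictly in series
        n = match_length(data, cand, pos)
        if n > best_len:
            best_len, best_pos = n, cand
        if best_len >= good_enough:
            break  # suitable match found; skip the rest of the list
    if best_len < MIN_MATCH:
        return None
    return pos - best_pos, best_len


if __name__ == "__main__":
    data = b"abcdXabcYabcdZ"
    print(find_match(data, 9, [5, 0]))                 # examines both addresses
    print(find_match(data, 9, [5, 0], good_enough=3))  # stops at the first
```

Because the loop visits one address per iteration, the time spent at each input position grows with the number of addresses in the list, which is the cost noted above.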
The LZ77 process 20 then encodes the distance (corresponding to the location in the byte buffer 28) and the length (corresponding to the number of matching bytes starting at that location in the byte buffer 28) of the match to derive part of the LZ77 code stream. If there is no suitable match, the current byte is output as a literal, without further encoding. Hence, the LZ77 code stream is made up of encoded distance/length pairs and literals. The LZ77 code stream is then supplied to a Huffman encoder for further compression.
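The structure of the LZ77 code stream, a mix of literals and distance/length pairs, can be illustrated with the following sketch. To keep the focus on the code stream, a brute-force backward scan of the 32K window is assumed here in place of the hash lookup, and the subsequent Huffman coding is omitted. The names `longest_match`, `lz77_tokens`, and `lz77_expand` are hypothetical; literals are represented as integers and matches as `(distance, length)` tuples.

```python
WINDOW = 32 * 1024
MIN_MATCH = 3
MAX_MATCH = 258


def longest_match(data: bytes, pos: int):
    """Brute-force search of the window; returns (distance, length)."""
    best_dist, best_len = 0, 0
    for cand in range(pos - 1, max(0, pos - WINDOW) - 1, -1):
        n = 0
        while (n < MAX_MATCH and pos + n < len(data)
               and data[cand + n] == data[pos + n]):
            n += 1
        if n > best_len:  # strict '>' keeps the nearest of equal-length matches
            best_dist, best_len = pos - cand, n
    return best_dist, best_len


def lz77_tokens(data: bytes):
    """Turn data into a list of literals (int) and (distance, length) pairs."""
    tokens, pos = [], 0
    while pos < len(data):
        dist, length = longest_match(data, pos)
        if length >= MIN_MATCH:
            tokens.append((dist, length))
            pos += length
        else:
            tokens.append(data[pos])  # no suitable match: emit a literal
            pos += 1
    return tokens


def lz77_expand(tokens) -> bytes:
    """Inverse operation: rebuild the data from the token list."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-wise copy permits overlap
        else:
            out.append(tok)
    return bytes(out)


if __name__ == "__main__":
    tokens = lz77_tokens(b"abcabcabc")
    print(tokens)
    assert lz77_expand(tokens) == b"abcabcabc"
```

For the input `abcabcabc`, the first three bytes are emitted as literals and the remaining six bytes are covered by a single pair with distance 3 and length 6. The match overlaps the bytes being produced, which is why the expansion copies one byte at a time.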