The LZRW1 compression algorithm was proposed by Ross N. Williams to increase the performance of the LZ77 class of compression algorithms. (The basic LZ77 algorithm is described in J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Vol. 23, No. 3, May 1977, pp. 337-343.) The LZRW1 algorithm uses the single pass literal/copy mechanism of the LZ77 class of algorithms to compress an uncompressed data sequence into a compressed data sequence. Bytes of data in the uncompressed data sequence are either directly incorporated into the compressed data sequence as a string (i.e., as "literal items") or, alternatively, are encoded as a pointer to a matching set of data that has already been incorporated into the compressed data sequence (i.e., as "copy items"). The copy items are encoded by offset and length values that require fewer bits than the bytes of data they replace. The offset value specifies the distance, in bytes, from the string being encoded back to its previous occurrence. For example, if a string of three characters occurred six bytes before the occurrence that is being encoded, the offset is six. The length value specifies the length of the matching data sequence in bytes. Compression is realized by representing as much of the uncompressed data sequence as possible as copy items. Literal items are incorporated into the compressed data sequence only when a match of three or more bytes cannot be found.
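The literal/copy mechanism described above can be illustrated from the decoding side. The following is a minimal sketch only, not the LZRW1 item encoding itself: the tuple representation of items and the name `expand_items` are assumptions made for illustration.

```python
def expand_items(items):
    """Rebuild the uncompressed bytes from a list of literal and copy items.

    Each item is a hypothetical tuple, either ("literal", byte_value) or
    ("copy", offset, length); this tuple form is an illustration only.
    """
    out = bytearray()
    for item in items:
        if item[0] == "literal":
            out.append(item[1])
        else:
            _, offset, length = item
            if not 1 <= offset <= len(out):
                raise ValueError("copy offset points outside the data produced so far")
            # Copy one byte at a time so that a copy may overlap its own output.
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)


# Six literals followed by a copy of three bytes that occurred six bytes earlier.
example = [("literal", b) for b in b"abcxyz"] + [("copy", 6, 3)]
restored = expand_items(example)
```

In this example the copy item stands for the second occurrence of "abc", so `restored` holds the nine bytes "abcxyzabc".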
FIG. 1 depicts an example of the operation of the LZRW1 data compression algorithm. The uncompressed data sequence is stored in an input block 10 that is a read-only data structure. The input block 10 includes a history or "Lempel" portion 12 that holds the most recent 4,095 bytes of history that immediately precede the current position, as indicated by the current position pointer 18 in the input block 10. The next 16 bytes of the remaining portion of the input block 10 that is yet to be processed constitute the "Ziv" portion 14 of the input block. The Lempel portion 12 and the Ziv portion 14 are separated by a Lempel/Ziv boundary 33. The current position pointer 18 points to the first character in the bytes that are currently being processed. The portion 16 of the input block 10 that lies to the left of the current position pointer 18 has been fully processed. The LZRW1 compression algorithm uses a hash function 26 and a hash table 28. The role of the hash function 26 and the hash table 28 will be described in more detail below.
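The relationship between the current position and the Lempel and Ziv portions can be expressed as simple index arithmetic. The sketch below assumes the input block is addressed by zero-based byte indices; the function name `window_bounds` and the constant names are hypothetical.

```python
LEMPEL_SIZE = 4095  # bytes of history before the current position
ZIV_SIZE = 16       # bytes considered at and after the current position


def window_bounds(pos, total_length):
    """Return (lempel_start, boundary, ziv_end) for a current position `pos`.

    The Lempel portion is data[lempel_start:boundary] and the Ziv portion
    is data[boundary:ziv_end]; both are clipped to the input block.
    """
    lempel_start = max(0, pos - LEMPEL_SIZE)
    ziv_end = min(total_length, pos + ZIV_SIZE)
    return lempel_start, pos, ziv_end
```

Near the start of the input block the Lempel portion is shorter than 4,095 bytes, and near the end the Ziv portion is shorter than 16 bytes, which is why both ends are clipped.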
FIG. 2 is a flowchart that shows the high level steps that are performed by the LZRW1 data compression algorithm. First, a hash for the next three bytes 22 that are to be processed in the input block 10 is generated using the hash function 26 (step 34 in FIG. 2). The next three bytes 22 are those that immediately follow the current position pointer 18. In the example shown in FIG. 1, the next three bytes 22 are "cab" (assuming that each character is encoded as a single byte). The resulting hash serves as an index into the hash table 28 and is used to locate an entry 30 within the hash table (step 36 in FIG. 2). The pointer 32 held in the entry 30 is remembered temporarily, and the hash table entry 30 is updated to hold a pointer to the beginning of the Ziv portion 14 (step 38 in FIG. 2).
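Steps 34 through 38 can be sketched as follows. The text does not specify the hash function 26 or the size of the hash table 28; the multiplicative hash and the 4,096-entry table below follow the form found in commonly circulated LZRW1 source code and should be read as assumptions, as should the names `hash3` and `fetch_and_update`.

```python
HASH_TABLE_SIZE = 4096  # assumed table size; not stated in the text


def hash3(data, pos):
    """Hash the three bytes starting at `pos` into a table index (assumed form)."""
    key = (data[pos] << 8) ^ (data[pos + 1] << 4) ^ data[pos + 2]
    return ((40543 * key) >> 4) & (HASH_TABLE_SIZE - 1)


def fetch_and_update(table, data, pos):
    """Return the pointer previously stored for this hash, then store `pos`.

    `pos` plays the role of the pointer to the beginning of the Ziv portion.
    The returned value is None if the entry has not been written yet.
    """
    slot = hash3(data, pos)
    previous = table[slot]  # remembered temporarily (step 36)
    table[slot] = pos       # entry now points at the current position (step 38)
    return previous


table = [None] * HASH_TABLE_SIZE
data = b"cabxcab"
first = fetch_and_update(table, data, 0)   # nothing stored yet for "cab"
second = fetch_and_update(table, data, 4)  # finds the earlier "cab" at index 0
```

Because different three-byte strings can hash to the same entry, the fetched pointer is only a candidate and still has to be checked against the actual bytes.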
A determination is then made whether the pointer 32 that was retrieved from the hash table entry 30 points to a location within the Lempel portion 12 and points to a match with the three bytes 22 in the Ziv portion 14 (step 40 in FIG. 2). In the example shown in FIG. 1, the pointer 32 points to a location within the Lempel portion 12, and the bytes at that location match. As such, the three bytes 22 are encoded as a copy item (step 42 in FIG. 2). If, however, the pointer 32 does not point within the Lempel portion 12 or does not point to a match, the first of the three bytes 22 is encoded as a literal item (step 44 in FIG. 2). The Lempel/Ziv boundary 33 and the current position pointer 18 are shifted accordingly (step 46 in FIG. 2). If the three bytes 22 are encoded as a copy item, the Lempel/Ziv boundary 33 is shifted to lie immediately after the last byte that was encoded by the copy item. On the other hand, if the encoding is for a literal item, only a single byte (i.e., the byte pointed to by the current position pointer 18) is encoded, and the Lempel/Ziv boundary 33 is shifted to lie immediately after that byte. For example, if the character "c" were to be encoded as a literal item for the three bytes 22, the Lempel/Ziv boundary 33 would be shifted towards the end of the input block 10 by one character in FIG. 1. The system then checks whether it is done processing input (step 48 in FIG. 2). The algorithm is completed when all of the characters in the input block 10 have been processed.
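The decision made in steps 40 through 46 can be sketched as a single function that takes the candidate pointer fetched from the hash table and returns the item to emit together with the new current position. The tuple form of the items, the name `encode_step`, and the extension of a match beyond three bytes up to the 16-byte Ziv portion are assumptions made for this illustration.

```python
MAX_OFFSET = 4095  # size of the Lempel portion
MIN_MATCH = 3      # shortest match worth encoding as a copy item
MAX_MATCH = 16     # size of the Ziv portion


def encode_step(data, pos, candidate):
    """Encode the data at `pos` as one item and return (item, new_pos).

    `candidate` is the index fetched from the hash table, or None.
    """
    match_len = 0
    in_lempel = candidate is not None and 0 < pos - candidate <= MAX_OFFSET
    if in_lempel:
        limit = min(MAX_MATCH, len(data) - pos)
        while (match_len < limit
               and data[candidate + match_len] == data[pos + match_len]):
            match_len += 1
    if match_len >= MIN_MATCH:
        # Copy item: the boundary moves past the last byte covered by the copy.
        return ("copy", pos - candidate, match_len), pos + match_len
    # Literal item: only the byte at the current position is encoded.
    return ("literal", data[pos]), pos + 1


sample = b"abcabcabc"
copy_result = encode_step(sample, 3, 0)        # candidate matches
literal_result = encode_step(sample, 3, None)  # no candidate in the table
mismatch_result = encode_step(sample, 3, 1)    # candidate does not match
```

Here `copy_result` is a copy item with offset 3 and length 6 that advances the position to 9, while the other two calls each emit the literal byte "a" and advance the position by one.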
FIG. 3 is a block diagram that illustrates the format of the compressed data block 50 that results from applying the LZRW1 compression algorithm. Specifically, the compressed data block 50 is divisible into code words (CW) 52, each of which is followed by literal and copy items 54. Each code word 52 holds 16 bits of flags, and each flag indicates whether an associated item in the items 54 that follow the code word is encoded as a literal item or as a copy item. A zero value for a bit in the code word indicates that the associated item is a literal item. A one value for a bit in the code word indicates that the associated item is a copy item. Thus, it can be seen from FIG. 3 that the compressed data block 50 consists of a sequence of groups, each made up of a 16-bit code word 52 and its 16 associated items 54.
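Reading one code word and its associated items can be sketched as follows. The text does not state the byte order of the code word or which bit corresponds to the first item; the sketch assumes a little-endian code word whose least significant bit describes the first item. The name `read_group` and the `item_count` parameter (useful for a short final group) are hypothetical.

```python
def read_group(block, pos, item_count=16):
    """Split one code word and its items out of `block`, starting at `pos`.

    Returns (items, new_pos), where each item is ("literal", raw_bytes) with
    one raw byte or ("copy", raw_bytes) with two raw bytes.
    """
    # Assumed layout: low byte first, bit 0 describes the first item.
    control = block[pos] | (block[pos + 1] << 8)
    pos += 2
    items = []
    for bit in range(item_count):
        if (control >> bit) & 1:
            items.append(("copy", bytes(block[pos:pos + 2])))
            pos += 2
        else:
            items.append(("literal", bytes(block[pos:pos + 1])))
            pos += 1
    return items, pos


# Three literals followed by one copy item: only flag bit 3 is set.
group = bytes([0b00001000, 0x00]) + b"abc" + bytes([0x05, 0x03])
items, end = read_group(group, 0, item_count=4)
```

A full group therefore occupies between 18 bytes (all literals) and 34 bytes (all copy items), including the two bytes of the code word.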
FIG. 4A illustrates the format of a literal item 56. A literal item holds literal data and is one byte in length. A copy item 57 (FIG. 4B), in contrast, is two bytes in length. Its first byte 58A is divided in half into a four-bit "a" field and a four-bit "b" field, and its second byte 58B holds an eight-bit "c" field. These two bytes 58A and 58B are used to hold values that encode the length and offset. The value of the length is encoded in the "b" field. The length of the matching data sequence is calculated as the value held in the "b" field plus one. The offset is calculated as 256 times the value held in the "a" field plus the value held in the "c" field. The resulting range of offsets is between 1 and 4,095.
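The field arithmetic for a copy item can be sketched as a pair of helper functions. The text does not state which half of the first byte holds the "a" field; the sketch assumes the "a" field occupies the upper four bits and the "b" field the lower four bits. The function names are hypothetical.

```python
def pack_copy_item(offset, length):
    """Encode an offset (1 to 4,095) and a length (1 to 16) into two bytes."""
    if not 1 <= offset <= 4095:
        raise ValueError("offset must be between 1 and 4095")
    if not 1 <= length <= 16:
        raise ValueError("length must fit in the four-bit b field plus one")
    a, c = divmod(offset, 256)  # offset = 256 * a + c
    b = length - 1              # length = b + 1
    return bytes([(a << 4) | b, c])  # assumed: a in the upper half of byte one


def unpack_copy_item(item):
    """Decode two bytes into (offset, length) using the same field layout."""
    first, c = item[0], item[1]
    a, b = first >> 4, first & 0x0F
    return 256 * a + c, b + 1


packed = pack_copy_item(6, 3)           # offset six, length three
offset, length = unpack_copy_item(packed)
```

With this layout, an offset of six and a length of three are stored as the two bytes 0x02 and 0x06, and the largest representable copy item (offset 4,095, length 16) is stored as 0xFF and 0xFF.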